Passes

This document describes the available MLIR passes and their contracts.

General Transformation Passes 

-affine-loop-fusion: Fuse affine loop nests 

Options 

-fusion-compute-tolerance   : Fractional increase in additional computation tolerated while fusing
-fusion-fast-mem-space      : Faster memory space number to promote fusion buffers to
-fusion-local-buf-threshold : Threshold size (KiB) for promoting local buffers to fast memory space
-fusion-maximal             : Enables maximal loop fusion
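
As a sketch of the transformation (illustrative, not verbatim pass output): two affine loop nests that communicate through a memref can be fused into a single nest, and the intermediate memref may additionally be shrunk to a single element once it becomes local to the fused loop. The "use" op below is a hypothetical consumer.

Input

func @producer_consumer() {
  %m = alloc() : memref<10xf32>
  %cf7 = constant 7.000000e+00 : f32
  // Producer loop: writes all of %m.
  affine.for %i0 = 0 to 10 {
    affine.store %cf7, %m[%i0] : memref<10xf32>
  }
  // Consumer loop: reads the values written above.
  affine.for %i1 = 0 to 10 {
    %v0 = affine.load %m[%i1] : memref<10xf32>
    "use"(%v0) : (f32) -> ()
  }
  return
}

Output

func @producer_consumer() {
  // The intermediate buffer is now loop-local and shrinks to one element.
  %m = alloc() : memref<1xf32>
  %cf7 = constant 7.000000e+00 : f32
  affine.for %i0 = 0 to 10 {
    affine.store %cf7, %m[0] : memref<1xf32>
    %v0 = affine.load %m[0] : memref<1xf32>
    "use"(%v0) : (f32) -> ()
  }
  return
}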

-affine-pipeline-data-transfer: Pipeline non-blocking data transfers between explicitly managed levels of the memory hierarchy 

This pass performs a transformation to overlap non-blocking DMA operations in a loop with computations through double buffering. This is achieved by advancing dma_start operations with respect to other operations.

Input

func @pipelinedatatransfer() {
  %0 = alloc() : memref<256xf32>
  %1 = alloc() : memref<32xf32, 1>
  %2 = alloc() : memref<1xf32>
  %c0 = constant 0 : index
  %c128 = constant 128 : index
  affine.for %i0 = 0 to 8 {
    affine.dma_start %0[%i0], %1[%i0], %2[%c0], %c128 : memref<256xf32>, memref<32xf32, 1>, memref<1xf32>
    affine.dma_wait %2[%c0], %c128 : memref<1xf32>
    %3 = affine.load %1[%i0] : memref<32xf32, 1>
    %4 = "compute"(%3) : (f32) -> f32
    affine.store %4, %1[%i0] : memref<32xf32, 1>
  }
  return
}

Output

module {
  func @pipelinedatatransfer() {
    %c8 = constant 8 : index
    %c0 = constant 0 : index
    %0 = alloc() : memref<256xf32>
    %c0_0 = constant 0 : index
    %c128 = constant 128 : index
    %1 = alloc() : memref<2x32xf32, 1>
    %2 = alloc() : memref<2x1xf32>
    affine.dma_start %0[%c0], %1[%c0 mod 2, %c0], %2[%c0 mod 2, symbol(%c0_0)], %c128 : memref<256xf32>, memref<2x32xf32, 1>, memref<2x1xf32>
    affine.for %arg0 = 1 to 8 {
      affine.dma_start %0[%arg0], %1[%arg0 mod 2, %arg0], %2[%arg0 mod 2, symbol(%c0_0)], %c128 : memref<256xf32>, memref<2x32xf32, 1>, memref<2x1xf32>
      %8 = affine.apply #map3(%arg0)
      %9 = affine.apply #map4(%8)
      %10 = affine.apply #map4(%8)
      affine.dma_wait %2[%8 mod 2, symbol(%c0_0)], %c128 : memref<2x1xf32>
      %11 = affine.load %1[%8 mod 2, %8] : memref<2x32xf32, 1>
      %12 = "compute"(%11) : (f32) -> f32
      affine.store %12, %1[%8 mod 2, %8] : memref<2x32xf32, 1>
    }
    %3 = affine.apply #map3(%c8)
    %4 = affine.apply #map4(%3)
    %5 = affine.apply #map4(%3)
    affine.dma_wait %2[%3 mod 2, symbol(%c0_0)], %c128 : memref<2x1xf32>
    %6 = affine.load %1[%3 mod 2, %3] : memref<2x32xf32, 1>
    %7 = "compute"(%6) : (f32) -> f32
    affine.store %7, %1[%3 mod 2, %3] : memref<2x32xf32, 1>
    dealloc %2 : memref<2x1xf32>
    dealloc %1 : memref<2x32xf32, 1>
    return
  }
}

-buffer-placement: Optimizes placement of alloc and dealloc operations 

This pass implements an algorithm to optimize the placement of alloc and dealloc operations; it also automatically inserts missing dealloc operations to reclaim memory.

Input

#map0 = affine_map<(d0) -> (d0)>
module {
  func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
    cond_br %arg0, ^bb1, ^bb2
  ^bb1:
    br ^bb3(%arg1 : memref<2xf32>)
  ^bb2:
    %0 = alloc() : memref<2xf32>
    linalg.generic {args_in = 1 : i64, args_out = 1 : i64, indexing_maps = [#map0, #map0], iterator_types = ["parallel"]} %arg1, %0 {
    ^bb0(%gen1_arg0: f32, %gen1_arg1: f32):
      %tmp1 = exp %gen1_arg0 : f32
      linalg.yield %tmp1 : f32
    }: memref<2xf32>, memref<2xf32>
    br ^bb3(%0 : memref<2xf32>)
  ^bb3(%1: memref<2xf32>):
    "linalg.copy"(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> ()
    return
  }
}

Output

#map0 = affine_map<(d0) -> (d0)>
module {
  func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
    %0 = alloc() : memref<2xf32>
    cond_br %arg0, ^bb1, ^bb2
  ^bb1: // pred: ^bb0
    br ^bb3(%arg1 : memref<2xf32>)
  ^bb2: // pred: ^bb0
    linalg.generic {args_in = 1 : i64, args_out = 1 : i64, indexing_maps = [#map0, #map0], iterator_types = ["parallel"]} %arg1, %0 {
    ^bb0(%arg3: f32, %arg4: f32):       // no predecessors
      %2 = exp %arg3 : f32
      linalg.yield %2 : f32
    }: memref<2xf32>, memref<2xf32>
    br ^bb3(%0 : memref<2xf32>)
  ^bb3(%1: memref<2xf32>):      // 2 preds: ^bb1, ^bb2
    linalg.copy(%1, %arg2) : memref<2xf32>, memref<2xf32>
    dealloc %0 : memref<2xf32>
    return
  }
}

-canonicalize: Canonicalize operations 

This pass performs various types of canonicalizations over a set of operations. See Operation Canonicalization for more details.
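A small illustrative example (a sketch, not exhaustive): canonicalization folds away operations with known algebraic identities, such as addition of zero.

// before:
func @fold_add(%arg0: i32) -> i32 {
  %c0 = constant 0 : i32
  %0 = addi %arg0, %c0 : i32
  return %0 : i32
}

// after -canonicalize:
func @fold_add(%arg0: i32) -> i32 {
  return %arg0 : i32
}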

-copy-removal: Remove redundant copies from the input IR 

-cse: Eliminate common sub-expressions 

This pass implements a generalized algorithm for common sub-expression elimination. This pass relies on information provided by the Memory SideEffect interface to identify when it is safe to eliminate operations. See Common subexpression elimination for more general details on this optimization.
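For example (a minimal sketch), two identical side-effect-free operations are merged into one:

// before:
func @simple_cse(%arg0: f32) -> (f32, f32) {
  %0 = addf %arg0, %arg0 : f32
  %1 = addf %arg0, %arg0 : f32
  return %0, %1 : f32, f32
}

// after -cse:
func @simple_cse(%arg0: f32) -> (f32, f32) {
  %0 = addf %arg0, %arg0 : f32
  return %0, %0 : f32, f32
}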

Statistics 

num-cse'd : Number of operations CSE'd
num-dce'd : Number of operations DCE'd

-inline: Inline function calls 

Options 

-disable-simplify : Disable running simplifications during inlining
-max-iterations   : Maximum number of iterations when inlining within an SCC
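
A minimal sketch of the effect (the now-unused callee can subsequently be removed, e.g. by -symbol-dce, if it is private):

// before:
func @callee(%arg0: i32) -> i32 {
  %0 = addi %arg0, %arg0 : i32
  return %0 : i32
}
func @caller(%arg0: i32) -> i32 {
  %0 = call @callee(%arg0) : (i32) -> i32
  return %0 : i32
}

// after -inline, the call in @caller is replaced by the callee's body:
func @caller(%arg0: i32) -> i32 {
  %0 = addi %arg0, %arg0 : i32
  return %0 : i32
}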

-loop-coalescing: Coalesce nested loops with independent bounds into a single loop 

-loop-invariant-code-motion: Hoist loop invariant instructions outside of the loop 
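
For example (a sketch; the "use" op is a placeholder), a computation whose operands are all defined outside the loop is hoisted above it:

// before:
scf.for %i = %lb to %ub step %step {
  %0 = addf %a, %b : f32       // loop-invariant: does not depend on %i
  "use"(%0, %i) : (f32, index) -> ()
}

// after -loop-invariant-code-motion:
%0 = addf %a, %b : f32
scf.for %i = %lb to %ub step %step {
  "use"(%0, %i) : (f32, index) -> ()
}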

-memref-dataflow-opt: Perform store/load forwarding for memrefs 

This pass performs store-to-load forwarding for memrefs, eliminating memory accesses and potentially the entire memref if all of its accesses are forwarded.

Input

func @store_load_affine_apply() -> memref<10x10xf32> {
  %cf7 = constant 7.0 : f32
  %m = alloc() : memref<10x10xf32>
  affine.for %i0 = 0 to 10 {
    affine.for %i1 = 0 to 10 {
      affine.store %cf7, %m[%i0, %i1] : memref<10x10xf32>
      %v0 = affine.load %m[%i0, %i1] : memref<10x10xf32>
      %v1 = addf %v0, %v0 : f32
    }
  }
  return %m : memref<10x10xf32>
}

Output

module {
  func @store_load_affine_apply() -> memref<10x10xf32> {
    %cst = constant 7.000000e+00 : f32
    %0 = alloc() : memref<10x10xf32>
    affine.for %arg0 = 0 to 10 {
      affine.for %arg1 = 0 to 10 {
        affine.store %cst, %0[%arg0, %arg1] : memref<10x10xf32>
        %1 = addf %cst, %cst : f32
      }
    }
    return %0 : memref<10x10xf32>
  }
}

-normalize-memrefs: Normalize memrefs 

-parallel-loop-collapsing: Collapse parallel loops to use fewer induction variables 

Options 

-collapsed-indices-0 : Which loop indices to combine into the position 0 loop index
-collapsed-indices-1 : Which loop indices to combine into the position 1 loop index
-collapsed-indices-2 : Which loop indices to combine into the position 2 loop index

-print-cfg-graph: Print CFG graph per-Region 

-print-op-graph: Print op graph per-Region 

-print-op-stats: Print statistics of operations 

-sccp: Sparse Conditional Constant Propagation 

This pass implements a general algorithm for sparse conditional constant propagation. This algorithm detects values that are known to be constant and optimistically propagates this information throughout the IR. Any values proven to be constant are replaced, and removed if possible.

This implementation is based on the algorithm described by Wegman and Zadeck in “Constant Propagation with Conditional Branches” (1991).
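
A minimal sketch: because the branch condition below is a known constant, the block argument %arg is proven to be 1 and its use is replaced (folding the now-dead control flow is left to other passes such as -canonicalize):

// before:
func @sccp_example() -> i32 {
  %cond = constant 1 : i1
  %c1 = constant 1 : i32
  %c2 = constant 2 : i32
  cond_br %cond, ^bb1(%c1 : i32), ^bb1(%c2 : i32)
^bb1(%arg: i32):
  return %arg : i32
}

// after -sccp (sketch): only the true edge is reachable, so %arg must be 1
func @sccp_example() -> i32 {
  %cond = constant 1 : i1
  %c1 = constant 1 : i32
  %c2 = constant 2 : i32
  cond_br %cond, ^bb1(%c1 : i32), ^bb1(%c2 : i32)
^bb1(%arg: i32):
  %c1_0 = constant 1 : i32
  return %c1_0 : i32
}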

-snapshot-op-locations: Generate new locations from the current IR 

This pass allows for generating new locations from the IR during any stage of compilation, by snapshotting the IR to a file and using that file to generate new locations for the operations.

Depending on the value of the tag option, different resulting locations may be generated:

  • If unset, the original location of the operation is replaced.

Example:

// old:
... loc("original_source.cpp":1:1)

// new:
... loc("snapshot_source.mlir":10:10)

  • If set, the new location is fused with the original location in the form of a Name Location with the specified tag.

Example:

// old:
... loc("original_source.cpp":1:1)

// new:
... loc(fused["original_source.cpp":1:1, "snapshot"("snapshot_source.mlir":10:10)])

Options 

-filename : The filename to print the generated IR
-tag      : A tag to use when fusing the new locations with the original. If unset, the locations are replaced.

-strip-debuginfo: Strip debug info from all operations 

This pass strips the IR of any location information, by replacing all operation locations with unknown.
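
For example:

// old:
... loc("original_source.cpp":1:1)

// new:
... loc(unknown)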

-symbol-dce: Eliminate dead symbols 

This pass deletes all symbols that are found to be unreachable. This is done by computing the set of operations that are known to be live, propagating that liveness to other symbols, and then deleting all symbols that are not within this live set. Live symbols are those that have a visibility that extends beyond the IR, e.g. public, or those that are referenced by live symbols or other non-Symbol operations.

For example, consider the following input:

func @dead_private_function() attributes { sym_visibility = "private" }
func @live_private_function() attributes { sym_visibility = "private" }

// Note: The `public` isn't necessary here, as this is the default.
func @public_function() attributes { sym_visibility = "public" } {
  "foo.return"() {uses = [@live_private_function]} : () -> ()
}

A known-live function, public_function, contains a reference to the otherwise non-live function live_private_function, keeping it alive. The final symbol, dead_private_function, is not visible outside of the current IR and has no links from known-live operations, so it is dead. After running symbol-dce, we get the expected:

func @live_private_function() attributes { sym_visibility = "private" }

func @public_function() attributes { sym_visibility = "public" } {
  "foo.return"() {uses = [@live_private_function]} : () -> ()
}

See Symbols and SymbolTables for more information on Symbols.

Conversion Passes 

-convert-affine-for-to-gpu: Convert top-level AffineFor Ops to GPU kernels 

Options 

-gpu-block-dims  : Number of GPU block dimensions for mapping
-gpu-thread-dims : Number of GPU thread dimensions for mapping

-convert-avx512-to-llvm: Convert the operations from the avx512 dialect into the LLVM dialect 

-convert-gpu-launch-to-vulkan-launch: Convert gpu.launch_func to vulkanLaunch external call 

-convert-gpu-to-nvvm: Generate NVVM operations for gpu operations 

Options 

-index-bitwidth : Bitwidth of the index type, 0 to use size of machine word

-convert-gpu-to-rocdl: Generate ROCDL operations for gpu operations 

Options 

-index-bitwidth : Bitwidth of the index type, 0 to use size of machine word

-convert-gpu-to-spirv: Convert GPU dialect to SPIR-V dialect 

-convert-linalg-to-llvm: Convert the operations from the linalg dialect into the LLVM dialect 

-convert-linalg-to-spirv: Convert Linalg ops to SPIR-V ops 

-convert-linalg-to-std: Convert the operations from the linalg dialect into the Standard dialect 

-convert-openmp-to-llvm: Convert OpenMP ops to OpenMP ops with LLVM dialect operands and types 

-convert-parallel-loops-to-gpu: Convert mapped scf.parallel ops to gpu launch operations 

-convert-scf-to-std: Convert SCF dialect to Standard dialect, replacing structured control flow with a CFG 

-convert-shape-to-std: Convert operations from the shape dialect into the standard dialect 

-convert-spirv-to-llvm: Convert SPIR-V dialect to LLVM dialect 

-convert-std-to-llvm: Convert scalar and vector operations from the Standard to the LLVM dialect 

Convert standard operations into the LLVM IR dialect operations.

Input invariant 

  • operations including: arithmetic on integers and floats, constants, direct calls, returns and branches;
  • no tensor types;
  • all vectors are one-dimensional;
  • all blocks are reachable by following the successors of the first basic block;

If other operations are present and their results are required by the LLVM IR dialect operations, the pass will fail. Any LLVM IR operations or types already present in the IR will be kept as is.

Output IR 

Functions converted to LLVM IR. Function argument types are converted one-to-one. Function results are converted one-to-one and, if more than one value is returned, packed into an LLVM IR struct type. Function calls and returns are updated accordingly. Block argument types are updated to use LLVM IR types.
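
As a sketch of the result packing (LLVM dialect type syntax has changed across MLIR versions, so the exact spelling may differ):

// before:
func @add_twice(%a: f32) -> (f32, f32) {
  %0 = addf %a, %a : f32
  return %0, %0 : f32, f32
}

// after -convert-std-to-llvm (sketch): the two results are packed into a struct
llvm.func @add_twice(%a: !llvm.float) -> !llvm<"{ float, float }"> {
  %0 = llvm.fadd %a, %a : !llvm.float
  %1 = llvm.mlir.undef : !llvm<"{ float, float }">
  %2 = llvm.insertvalue %0, %1[0] : !llvm<"{ float, float }">
  %3 = llvm.insertvalue %0, %2[1] : !llvm<"{ float, float }">
  llvm.return %3 : !llvm<"{ float, float }">
}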

Options 

-use-aligned-alloc             : Use aligned_alloc in place of malloc for heap allocations
-use-bare-ptr-memref-call-conv : Replace FuncOp's MemRef arguments with bare pointers to the MemRef element types
-emit-c-wrappers               : Emit wrappers for C-compatible pointer-to-struct memref descriptors
-index-bitwidth                : Bitwidth of the index type, 0 to use size of machine word
-data-layout                   : String description (LLVM format) of the data layout that is expected on the produced module

-convert-std-to-spirv: Convert Standard Ops to SPIR-V dialect 

-convert-vector-to-llvm: Lower the operations from the vector dialect into the LLVM dialect 

Options 

-reassociate-fp-reductions  : Allows LLVM to reassociate floating-point reductions for speed
-enable-index-optimizations : Allows compiler to assume indices fit in 32-bit if that yields faster code

-convert-vector-to-rocdl: Lower the operations from the vector dialect into the ROCDL dialect 

-convert-vector-to-scf: Lower the operations from the vector dialect into the SCF dialect 

Options 

-full-unroll : Perform full unrolling when converting vector transfers to SCF

-gpu-to-llvm: Convert GPU dialect to LLVM dialect with GPU runtime calls 

Options 

-gpu-binary-annotation : Annotation attribute string for GPU binary

-launch-func-to-vulkan: Convert vulkanLaunch external call to Vulkan runtime external calls 

-legalize-std-for-spirv: Legalize standard ops for SPIR-V lowering 

-lower-affine: Lower Affine operations to a combination of Standard and SCF operations 

Convert operations from the affine dialect into operations from the SCF and standard dialects.

affine.for operations are converted to scf.for operations that are free of certain structural restrictions (on their bounds and step). affine.if is similarly converted to the scf.if operation. affine.apply operations are converted into sequences of primitive arithmetic operations from the standard dialect that have the same effect, using operands of the index type. Consequently, named maps and sets that are no longer in use may be removed from the module.

For example, %r = affine.apply affine_map<(d0, d1)[s0] -> (d0 + 2*d1 + s0)>(%d0, %d1)[%s0] can be converted into:

%d0 = <...>
%d1 = <...>
%s0 = <...>
%0 = constant 2 : index
%1 = muli %0, %d1
%2 = addi %d0, %1
%r = addi %2, %s0
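
Similarly (sketch), an affine.for loop becomes an scf.for loop with explicitly materialized lower bound, upper bound, and step values; e.g. affine.for %i = 0 to %N { ... } can be rewritten as:

%c0 = constant 0 : index
%c1 = constant 1 : index
scf.for %i = %c0 to %N step %c1 {
  ...
}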

Input invariant 

  • no Tensor types;

These restrictions may be lifted in the future.

Output IR 

Functions with affine.for and affine.if operations eliminated. These functions may contain operations from the Standard dialect in addition to those already present before the pass.

Invariants 

  • Functions without a body are not modified.
  • The semantics of the other functions is preserved.
  • Individual operations other than those mentioned above are not modified if they do not depend on the loop iterator value or on the result of affine.apply.

affine Dialect Passes 

-affine-data-copy-generate: Generate explicit copying for affine memory operations 

Options 

-fast-mem-capacity          : Set fast memory space capacity in KiB (default: unlimited)
-fast-mem-space             : Fast memory space identifier for copy generation (default: 1)
-generate-dma               : Generate DMA instead of point-wise copy
-min-dma-transfer           : Minimum DMA transfer size supported by the target in bytes
-slow-mem-space             : Slow memory space identifier for copy generation (default: 0)
-skip-non-unit-stride-loops : Testing purposes: avoid non-unit stride loop choice depths for copy placement
-tag-mem-space              : Tag memory space identifier for copy generation (default: 0)

-affine-loop-invariant-code-motion: Hoist loop invariant instructions outside of affine loops 

-affine-loop-tile: Tile affine loop nests 

Options 

-cache-size : Set size of cache to tile for in KiB
-separate   : Separate full and partial tiles
-tile-size  : Use this tile size for all loops
-tile-sizes : List of tile sizes for each perfect nest (overridden by -tile-size)
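
A sketch of tiling a 2-D nest with tile size 32 (illustrative; when trip counts do not divide the tile size evenly, the pass emits min expressions in the intra-tile bounds):

Input

affine.for %i = 0 to 256 {
  affine.for %j = 0 to 512 {
    "compute"(%i, %j) : (index, index) -> ()
  }
}

Output

affine.for %i = 0 to 256 step 32 {
  affine.for %j = 0 to 512 step 32 {
    // Intra-tile loops iterate over one 32x32 tile.
    affine.for %ii = affine_map<(d0) -> (d0)>(%i) to affine_map<(d0) -> (d0 + 32)>(%i) {
      affine.for %jj = affine_map<(d0) -> (d0)>(%j) to affine_map<(d0) -> (d0 + 32)>(%j) {
        "compute"(%ii, %jj) : (index, index) -> ()
      }
    }
  }
}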

-affine-loop-unroll: Unroll affine loops 

Options 

-unroll-factor         : Use this unroll factor for all loops being unrolled
-unroll-up-to-factor   : Allow unrolling up to the factor specified
-unroll-full           : Fully unroll loops
-unroll-num-reps       : Unroll innermost loops repeatedly this many times
-unroll-full-threshold : Unroll all loops with trip count less than or equal to this
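
A sketch with -unroll-factor=2: the loop step is doubled and the body is replicated, with the second copy's induction value computed by an affine.apply:

Input

affine.for %i = 0 to 8 {
  "foo"(%i) : (index) -> ()
}

Output

affine.for %i = 0 to 8 step 2 {
  "foo"(%i) : (index) -> ()
  %0 = affine.apply affine_map<(d0) -> (d0 + 1)>(%i)
  "foo"(%0) : (index) -> ()
}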

-affine-loop-unroll-jam: Unroll and jam affine loops 

Options 

-unroll-jam-factor : Use this unroll jam factor for all loops (default 4)

-affine-parallel-normalize: Normalize affine.parallel ops so that lower bounds are 0 and steps are 1 

-affine-parallelize: Convert affine.for ops into 1-D affine.parallel 

-affine-super-vectorize: Vectorize to a target independent n-D vector abstraction 

Options 

-virtual-vector-size  : Specify an n-D virtual vector size for vectorization
-test-fastest-varying : Specify a 1-D, 2-D or 3-D pattern of fastest varying memory dimensions to match. See defaultPatterns in Vectorize.cpp for a description and examples. This is used for testing purposes
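
A sketch with -virtual-vector-size=128: a scalar store of a constant becomes a 128-wide vector transfer, and the loop step is scaled to match (illustrative; actual output may differ):

Input

func @vec(%m: memref<256xf32>) {
  %cf1 = constant 1.000000e+00 : f32
  affine.for %i = 0 to 256 {
    affine.store %cf1, %m[%i] : memref<256xf32>
  }
  return
}

Output

func @vec(%m: memref<256xf32>) {
  // The scalar constant becomes a splat vector constant.
  %cst = constant dense<1.000000e+00> : vector<128xf32>
  affine.for %i = 0 to 256 step 128 {
    vector.transfer_write %cst, %m[%i] : vector<128xf32>, memref<256xf32>
  }
  return
}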

-simplify-affine-structures: Simplify affine expressions in maps/sets and normalize memrefs 

gpu Dialect Passes 

-gpu-kernel-outlining: Outline gpu.launch bodies to kernel functions 

linalg Dialect Passes 

-convert-linalg-on-tensors-to-buffers: Convert the Linalg operations which work on tensor-type operands or results to use buffers instead 

-convert-linalg-to-affine-loops: Lower the operations from the linalg dialect into affine loops 

-convert-linalg-to-loops: Lower the operations from the linalg dialect into loops 

-convert-linalg-to-parallel-loops: Lower the operations from the linalg dialect into parallel loops 

-linalg-fold-unit-extent-dims: Remove unit-extent dimension in Linalg ops on tensors 

Options 

-fold-one-trip-loops-only : Only folds the one-trip loops from Linalg ops on tensors (for testing purposes only)

-linalg-fusion: Fuse operations in the linalg dialect 

-linalg-fusion-for-tensor-ops: Fuse operations on RankedTensorType in linalg dialect 

-linalg-promote-subviews: Promote subview ops to local buffers 

Options 

-test-promote-dynamic : Test generation of dynamic promoted buffers
-test-use-alloca      : Test generation of alloca'ed buffers

-linalg-tile: Tile operations in the linalg dialect 

Options 

-linalg-tile-sizes : List of tile sizes to use

-linalg-tile-to-parallel-loops: Tile operations in the linalg dialect to parallel loops 

Options 

-linalg-tile-sizes : List of tile sizes to use

llvm Dialect Passes 

-llvm-legalize-for-export: Legalize LLVM dialect to be convertible to LLVM IR 

loop Dialect Passes 

quant Dialect Passes 

-quant-convert-const: Converts constants followed by qbarrier to actual quantized values 

-quant-convert-simulated-quantization: Converts training-time simulated quantization ops to corresponding quantize/dequantize casts 

shape Dialect Passes 

-remove-shape-constraints: Replace all cstr_ ops with a true witness 

-shape-to-shape-lowering: Legalize Shape dialect to be convertible to Standard 

spv Dialect Passes 

-decorate-spirv-composite-type-layout: Decorate SPIR-V composite type with layout info 

-spirv-lower-abi-attrs: Lower the ABI attributes specified during SPIR-V lowering 

-spirv-rewrite-inserts: Rewrite sequential chains of spv.CompositeInsert operations into spv.CompositeConstruct operations 

-spirv-update-vce: Deduce and attach minimal (version, capabilities, extensions) requirements to spv.module ops 

standard Dialect Passes 

-expand-atomic: Expand AtomicRMWOp into GenericAtomicRMWOp 