MLIR

Multi-Level IR Compiler Framework

MLIR Passes

This document describes the available MLIR passes and their contracts.

General Transformation Passes

-affine-loop-fusion: Fuse affine loop nests

Options

-fusion-compute-tolerance   : Fractional increase in additional computation tolerated while fusing
-fusion-fast-mem-space      : Faster memory space number to promote fusion buffers to
-fusion-local-buf-threshold : Threshold size (KiB) for promoting local buffers to fast memory space
-fusion-maximal             : Enables maximal loop fusion
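As an illustrative sketch (not the pass's exact output), fusing a producer loop nest into its consumer merges two nests that communicate through a memref into one; the `"use"` op is a placeholder for arbitrary computation:

```mlir
// Before: the first loop writes %m, the second reads it.
func @producer_consumer() {
  %m = alloc() : memref<10xf32>
  %cf7 = constant 7.000000e+00 : f32
  affine.for %i0 = 0 to 10 {
    affine.store %cf7, %m[%i0] : memref<10xf32>
  }
  affine.for %i1 = 0 to 10 {
    %v = affine.load %m[%i1] : memref<10xf32>
    "use"(%v) : (f32) -> ()
  }
  return
}

// After fusion: a single loop; the intermediate value may additionally be
// kept in a small local buffer instead of the full memref.
func @producer_consumer() {
  %m = alloc() : memref<10xf32>
  %cf7 = constant 7.000000e+00 : f32
  affine.for %i0 = 0 to 10 {
    affine.store %cf7, %m[%i0] : memref<10xf32>
    %v = affine.load %m[%i0] : memref<10xf32>
    "use"(%v) : (f32) -> ()
  }
  return
}
```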

-affine-pipeline-data-transfer: Pipeline non-blocking data transfers between explicitly managed levels of the memory hierarchy

This pass performs a transformation to overlap non-blocking DMA operations in a loop with computations through double buffering. This is achieved by advancing dma_start operations with respect to other operations.

Input

func @pipelinedatatransfer() {
  %0 = alloc() : memref<256xf32>
  %1 = alloc() : memref<32xf32, 1>
  %2 = alloc() : memref<1xf32>
  %c0 = constant 0 : index
  %c128 = constant 128 : index
  affine.for %i0 = 0 to 8 {
    affine.dma_start %0[%i0], %1[%i0], %2[%c0], %c128 : memref<256xf32>, memref<32xf32, 1>, memref<1xf32>
    affine.dma_wait %2[%c0], %c128 : memref<1xf32>
    %3 = affine.load %1[%i0] : memref<32xf32, 1>
    %4 = "compute"(%3) : (f32) -> f32
    affine.store %4, %1[%i0] : memref<32xf32, 1>
  }
  return
}

Output

module {
  func @pipelinedatatransfer() {
    %c8 = constant 8 : index
    %c0 = constant 0 : index
    %0 = alloc() : memref<256xf32>
    %c0_0 = constant 0 : index
    %c128 = constant 128 : index
    %1 = alloc() : memref<2x32xf32, 1>
    %2 = alloc() : memref<2x1xf32>
    affine.dma_start %0[%c0], %1[%c0 mod 2, %c0], %2[%c0 mod 2, symbol(%c0_0)], %c128 : memref<256xf32>, memref<2x32xf32, 1>, memref<2x1xf32>
    affine.for %arg0 = 1 to 8 {
      affine.dma_start %0[%arg0], %1[%arg0 mod 2, %arg0], %2[%arg0 mod 2, symbol(%c0_0)], %c128 : memref<256xf32>, memref<2x32xf32, 1>, memref<2x1xf32>
      %8 = affine.apply #map3(%arg0)
      %9 = affine.apply #map4(%8)
      %10 = affine.apply #map4(%8)
      affine.dma_wait %2[%8 mod 2, symbol(%c0_0)], %c128 : memref<2x1xf32>
      %11 = affine.load %1[%8 mod 2, %8] : memref<2x32xf32, 1>
      %12 = "compute"(%11) : (f32) -> f32
      affine.store %12, %1[%8 mod 2, %8] : memref<2x32xf32, 1>
    }
    %3 = affine.apply #map3(%c8)
    %4 = affine.apply #map4(%3)
    %5 = affine.apply #map4(%3)
    affine.dma_wait %2[%3 mod 2, symbol(%c0_0)], %c128 : memref<2x1xf32>
    %6 = affine.load %1[%3 mod 2, %3] : memref<2x32xf32, 1>
    %7 = "compute"(%6) : (f32) -> f32
    affine.store %7, %1[%3 mod 2, %3] : memref<2x32xf32, 1>
    dealloc %2 : memref<2x1xf32>
    dealloc %1 : memref<2x32xf32, 1>
    return
  }
}

-canonicalize: Canonicalize operations

-cse: Eliminate common sub-expressions

Statistics

num-cse'd : Number of operations CSE'd
num-dce'd : Number of operations DCE'd
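For instance (an illustrative sketch, with a hypothetical function name), two identical side-effect-free additions collapse into one:

```mlir
// Before -cse: %1 recomputes exactly the same value as %0.
func @simple_cse(%arg0: f32) -> (f32, f32) {
  %0 = addf %arg0, %arg0 : f32
  %1 = addf %arg0, %arg0 : f32
  return %0, %1 : f32, f32
}

// After -cse: the duplicate operation is removed and both results use %0.
func @simple_cse(%arg0: f32) -> (f32, f32) {
  %0 = addf %arg0, %arg0 : f32
  return %0, %0 : f32, f32
}
```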

-inline: Inline function calls

Options

-disable-simplify : Disable running simplifications during inlining
-max-iterations   : Maximum number of iterations when inlining within an SCC

-loop-coalescing: Coalesce nested loops with independent bounds into a single loop

-loop-invariant-code-motion: Hoist loop invariant instructions outside of the loop
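As a minimal sketch (names are illustrative), an operation whose operands are all defined outside the loop can be hoisted before it:

```mlir
// Before: %t does not depend on %i and is recomputed every iteration.
func @licm(%a: f32, %b: f32, %m: memref<10xf32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c10 = constant 10 : index
  loop.for %i = %c0 to %c10 step %c1 {
    %t = addf %a, %b : f32
    store %t, %m[%i] : memref<10xf32>
  }
  return
}

// After hoisting: the invariant addf is computed once, outside the loop.
func @licm(%a: f32, %b: f32, %m: memref<10xf32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c10 = constant 10 : index
  %t = addf %a, %b : f32
  loop.for %i = %c0 to %c10 step %c1 {
    store %t, %m[%i] : memref<10xf32>
  }
  return
}
```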

-memref-dataflow-opt: Perform store/load forwarding for memrefs

This pass performs store-to-load forwarding for memrefs to eliminate memory accesses and, potentially, the entire memref if all of its accesses are forwarded.

Input

func @store_load_affine_apply() -> memref<10x10xf32> {
  %cf7 = constant 7.0 : f32
  %m = alloc() : memref<10x10xf32>
  affine.for %i0 = 0 to 10 {
    affine.for %i1 = 0 to 10 {
      affine.store %cf7, %m[%i0, %i1] : memref<10x10xf32>
      %v0 = affine.load %m[%i0, %i1] : memref<10x10xf32>
      %v1 = addf %v0, %v0 : f32
    }
  }
  return %m : memref<10x10xf32>
}

Output

module {
  func @store_load_affine_apply() -> memref<10x10xf32> {
    %cst = constant 7.000000e+00 : f32
    %0 = alloc() : memref<10x10xf32>
    affine.for %arg0 = 0 to 10 {
      affine.for %arg1 = 0 to 10 {
        affine.store %cst, %0[%arg0, %arg1] : memref<10x10xf32>
        %1 = addf %cst, %cst : f32
      }
    }
    return %0 : memref<10x10xf32>
  }
}

-parallel-loop-collapsing: Collapse parallel loops to use fewer induction variables

Options

-collapsed-indices-0 : Which loop indices to combine into the position 0 loop index
-collapsed-indices-1 : Which loop indices to combine into the position 1 loop index
-collapsed-indices-2 : Which loop indices to combine into the position 2 loop index

-print-cfg-graph: Print CFG graph per-Region

-print-op-graph: Print op graph per-Region

-print-op-stats: Print statistics of operations

-snapshot-op-locations: Generate new locations from the current IR

Options

-filename : The filename to print the generated IR
-tag      : A tag to use when fusing the new locations with the original. If unset, the locations are replaced.

-strip-debuginfo: Strip debug info from all operations

-symbol-dce: Eliminate dead symbols

Conversion Passes

-convert-avx512-to-llvm: Convert the operations from the avx512 dialect into the LLVM dialect

-convert-gpu-launch-to-vulkan-launch: Convert gpu.launch_func to vulkanLaunch external call

-convert-gpu-to-nvvm: Generate NVVM operations for gpu operations

-convert-gpu-to-rocdl: Generate ROCDL operations for gpu operations

-convert-gpu-to-spirv: Convert GPU dialect to SPIR-V dialect

-convert-linalg-to-llvm: Convert the operations from the linalg dialect into the LLVM dialect

-convert-linalg-to-spirv: Convert Linalg ops to SPIR-V ops

-convert-loop-op-to-gpu: Convert top-level loop::ForOp to GPU kernels

Options

-gpu-num-workgroups : Num workgroups in the GPU launch
-gpu-workgroup-size : Workgroup Size in the GPU launch

-convert-loop-to-std: Convert Loop dialect to Standard dialect, replacing structured control flow with a CFG
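A sketch of the control-flow expansion (block names and the `"body"` op are illustrative): a loop.for becomes a header block that compares the induction variable against the upper bound and branches conditionally:

```mlir
// Before: structured loop.
loop.for %i = %lb to %ub step %step {
  "body"(%i) : (index) -> ()
}

// After: the same loop as an explicit CFG.
  br ^header(%lb : index)
^header(%i: index):
  %cond = cmpi "slt", %i, %ub : index
  cond_br %cond, ^body, ^exit
^body:
  "body"(%i) : (index) -> ()
  %next = addi %i, %step : index
  br ^header(%next : index)
^exit:
```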

-convert-loops-to-gpu: Convert top-level loops to GPU kernels

Options

-gpu-block-dims  : Number of GPU block dimensions for mapping
-gpu-thread-dims : Number of GPU thread dimensions for mapping

-convert-parallel-loops-to-gpu: Convert mapped loop.parallel ops to gpu launch operations

-convert-std-to-llvm: Convert scalar and vector operations from the Standard to the LLVM dialect

Convert standard operations into the LLVM IR dialect operations.

Input invariant

  • operations including: arithmetic on integers and floats, constants, direct calls, returns and branches;
  • no tensor types;
  • all vectors are one-dimensional;
  • all blocks are reachable by following the successors of the first basic block;

If other operations are present and their results are required by the LLVM IR dialect operations, the pass will fail. Any LLVM IR operations or types already present in the IR will be kept as is.

Output IR

Functions converted to LLVM IR. Function argument types are converted one-to-one. Function results are converted one-to-one and, when more than one value is returned, packed into an LLVM IR struct type. Function calls and returns are updated accordingly. Block argument types are updated to use LLVM IR types.

Options

-use-aligned-alloc             : Use aligned_alloc in place of malloc for heap allocations
-use-bare-ptr-memref-call-conv : Replace FuncOp's MemRef arguments with bare pointers to the MemRef element types
-emit-c-wrappers               : Emit wrappers for C-compatible pointer-to-struct memref descriptors
-index-bitwidth                : Bitwidth of the index type, 0 to use size of machine word

-convert-std-to-spirv: Convert Standard Ops to SPIR-V dialect

-convert-vector-to-llvm: Lower the operations from the vector dialect into the LLVM dialect

-launch-func-to-cuda: Convert all launch_func ops to CUDA runtime calls

-launch-func-to-vulkan: Convert vulkanLaunch external call to Vulkan runtime external calls

-legalize-std-for-spirv: Legalize standard ops for SPIR-V lowering

-lower-affine: Lower Affine operations to a combination of Standard and Loop operations

Convert operations from the affine dialect into operations from the loop and standard dialects.

affine.for operations are converted to loop.for operations that are free of certain structural restrictions (on their bounds and step). affine.if is similarly converted to the loop.if operation. affine.apply operations are converted into sequences of primitive arithmetic operations from the standard dialect that have the same effect, using operands of the index type. Consequently, named maps and sets that are no longer in use may be removed from the module.

For example, %r = affine.apply affine_map<(d0, d1)[s0] -> (d0 + 2*d1 + s0)>(%d0, %d1)[%s0] can be converted into:

%d0 = <...>
%d1 = <...>
%s0 = <...>
%0 = constant 2 : index
%1 = muli %0, %d1 : index
%2 = addi %d0, %1 : index
%r = addi %2, %s0 : index
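Similarly, an affine.for with constant bounds might be converted as follows (an illustrative sketch; the `"body"` op is a placeholder):

```mlir
// Before: affine loop with constant bounds and implicit unit step.
affine.for %i = 0 to 10 {
  "body"(%i) : (index) -> ()
}

// After: bounds and step materialized as index constants for loop.for.
%c0 = constant 0 : index
%c10 = constant 10 : index
%c1 = constant 1 : index
loop.for %i = %c0 to %c10 step %c1 {
  "body"(%i) : (index) -> ()
}
```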

Input invariant

  • no Tensor types;

These restrictions may be lifted in the future.

Output IR

Functions with affine.for and affine.if operations eliminated. These functions may contain operations from the Standard dialect in addition to those already present before the pass.

Invariants

  • Functions without a body are not modified.
  • The semantics of the other functions is preserved.
  • Individual operations other than those mentioned above are not modified if they do not depend on the loop iterator value or on the result of affine.apply.

affine Dialect Passes

-affine-data-copy-generate: Generate explicit copying for affine memory operations

Options

-fast-mem-capacity          : Set fast memory space capacity in KiB (default: unlimited)
-fast-mem-space             : Fast memory space identifier for copy generation (default: 1)
-generate-dma               : Generate DMA instead of point-wise copy
-min-dma-transfer           : Minimum DMA transfer size supported by the target in bytes
-slow-mem-space             : Slow memory space identifier for copy generation (default: 0)
-skip-non-unit-stride-loops : Testing purposes: avoid non-unit stride loop choice depths for copy placement
-tag-mem-space              : Tag memory space identifier for copy generation (default: 0)

-affine-loop-invariant-code-motion: Hoist loop invariant instructions outside of affine loops

-affine-loop-tile: Tile affine loop nests

Options

-cache-size : Set size of cache to tile for in KiB
-separate   : Separate full and partial tiles
-tile-size  : Use this tile size for all loops
-tile-sizes : List of tile sizes for each perfect nest (overridden by -tile-size)
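As a rough sketch of tiling with tile size 32 (the exact bound maps the pass emits may differ), a single loop is split into an outer inter-tile loop and an inner intra-tile loop covering the same iteration space:

```mlir
// Before: a single loop over 0..256.
affine.for %i = 0 to 256 {
  "body"(%i) : (index) -> ()
}

// After: the outer loop steps by the tile size; the inner loop sweeps one tile.
affine.for %ii = 0 to 256 step 32 {
  affine.for %i = affine_map<(d0) -> (d0)>(%ii) to affine_map<(d0) -> (d0 + 32)>(%ii) {
    "body"(%i) : (index) -> ()
  }
}
```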

-affine-loop-unroll: Unroll affine loops

Options

-unroll-factor         : Use this unroll factor for all loops being unrolled
-unroll-full           : Fully unroll loops
-unroll-num-reps       : Unroll innermost loops repeatedly this many times
-unroll-full-threshold : Unroll all loops with trip count less than or equal to this
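For example, unrolling by a factor of 2 (a sketch; the `"body"` op is a placeholder) doubles the step and replicates the body, shifting the second copy's induction value by one:

```mlir
// Before: trip count 8, unit step.
affine.for %i = 0 to 8 {
  "body"(%i) : (index) -> ()
}

// After -affine-loop-unroll -unroll-factor=2: two body copies per iteration.
affine.for %i = 0 to 8 step 2 {
  "body"(%i) : (index) -> ()
  %i1 = affine.apply affine_map<(d0) -> (d0 + 1)>(%i)
  "body"(%i1) : (index) -> ()
}
```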

-affine-loop-unroll-jam: Unroll and jam affine loops

Options

-unroll-jam-factor : Use this unroll jam factor for all loops (default 4)

-affine-super-vectorize: Vectorize to a target independent n-D vector abstraction

Options

-virtual-vector-size  : Specify an n-D virtual vector size for vectorization
-test-fastest-varying : Specify a 1-D, 2-D or 3-D pattern of fastest varying memory dimensions to match. See defaultPatterns in Vectorize.cpp for a description and examples. This is used for testing purposes

-simplify-affine-structures: Simplify affine expressions in maps/sets and normalize memrefs

gpu Dialect Passes

-gpu-kernel-outlining: Outline gpu.launch bodies to kernel functions

linalg Dialect Passes

-convert-linalg-to-affine-loops: Lower the operations from the linalg dialect into affine loops

-convert-linalg-to-loops: Lower the operations from the linalg dialect into loops

-convert-linalg-to-parallel-loops: Lower the operations from the linalg dialect into parallel loops

-linalg-fusion: Fuse operations in the linalg dialect

-linalg-fusion-for-tensor-ops: Fuse operations on RankedTensorType in linalg dialect

-linalg-promote-subviews: Promote subview ops to local buffers

Options

-test-promote-dynamic : Test generation of dynamic promoted buffers

-linalg-tile: Tile operations in the linalg dialect

Options

-linalg-tile-sizes : Tile sizes to use when tiling operations in the linalg dialect

-linalg-tile-to-parallel-loops: Tile operations in the linalg dialect to parallel loops

Options

-linalg-tile-sizes : Tile sizes to use when tiling operations in the linalg dialect

llvm Dialect Passes

-llvm-legalize-for-export: Legalize LLVM dialect to be convertible to LLVM IR

loop Dialect Passes

-parallel-loop-fusion: Fuse adjacent parallel loops

-parallel-loop-specialization: Specialize parallel loops for vectorization

-parallel-loop-tiling: Tile parallel loops

Options

-parallel-loop-tile-sizes : Factors to tile parallel loops by

quant Dialect Passes

-quant-convert-const: Converts constants followed by qbarrier to actual quantized values

-quant-convert-simulated-quantization: Converts training-time simulated quantization ops to corresponding quantize/dequantize casts

spv Dialect Passes

-decorate-spirv-composite-type-layout: Decorate SPIR-V composite type with layout info

-spirv-lower-abi-attrs: Lower ABI attributes on SPIR-V entry-point functions and their interface variables

-spirv-update-vce: Deduce and attach minimal (version, capabilities, extensions) requirements to spv.module ops