# Passes

This document describes the available MLIR passes and their contracts.

## General Transformation Passes

`-affine-loop-fusion`

: Fuse affine loop nests

This pass performs fusion of loop nests using a slicing-based approach. It combines two fusion strategies: producer-consumer fusion and sibling fusion. Producer-consumer fusion targets pairs of loops where the first writes to a memref that the second reads. Sibling fusion targets pairs of loops that share no dependences between them but that load from the same memref. The fused loop nests, when possible, are rewritten to access significantly smaller local buffers instead of the original memrefs, and the latter are often either completely optimized away or contracted. This transformation leads to enhanced locality and a lower memory footprint through the elimination or contraction of temporary/intermediate memrefs. These benefits are sometimes achieved at the expense of redundant computation; a cost model evaluates the available choices, such as the depth at which a source slice should be materialized in the destination slice.

Example 1: Producer-consumer fusion. Input:

```
func @producer_consumer_fusion(%arg0: memref<10xf32>, %arg1: memref<10xf32>) {
%0 = alloc() : memref<10xf32>
%1 = alloc() : memref<10xf32>
%cst = constant 0.000000e+00 : f32
affine.for %arg2 = 0 to 10 {
affine.store %cst, %0[%arg2] : memref<10xf32>
affine.store %cst, %1[%arg2] : memref<10xf32>
}
affine.for %arg2 = 0 to 10 {
%2 = affine.load %0[%arg2] : memref<10xf32>
%3 = addf %2, %2 : f32
affine.store %3, %arg0[%arg2] : memref<10xf32>
}
affine.for %arg2 = 0 to 10 {
%2 = affine.load %1[%arg2] : memref<10xf32>
%3 = mulf %2, %2 : f32
affine.store %3, %arg1[%arg2] : memref<10xf32>
}
return
}
```

Output:

```
func @producer_consumer_fusion(%arg0: memref<10xf32>, %arg1: memref<10xf32>) {
%0 = alloc() : memref<1xf32>
%1 = alloc() : memref<1xf32>
%cst = constant 0.000000e+00 : f32
affine.for %arg2 = 0 to 10 {
affine.store %cst, %0[0] : memref<1xf32>
affine.store %cst, %1[0] : memref<1xf32>
%2 = affine.load %1[0] : memref<1xf32>
%3 = mulf %2, %2 : f32
affine.store %3, %arg1[%arg2] : memref<10xf32>
%4 = affine.load %0[0] : memref<1xf32>
%5 = addf %4, %4 : f32
affine.store %5, %arg0[%arg2] : memref<10xf32>
}
return
}
```
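The equivalence of the two forms in Example 1 can be checked with a small Python model. This is only a sketch of the loop semantics, not of how the pass works internally: the function names `unfused` and `fused` are hypothetical, plain lists stand in for memrefs, and a nonzero fill value is assumed so that the two outputs differ.

```python
def unfused(n=10, cst=3.0):
    # Producer loop fills two full-size temporaries.
    t0 = [0.0] * n
    t1 = [0.0] * n
    for i in range(n):
        t0[i] = cst
        t1[i] = cst
    # Two separate consumer loops read the temporaries.
    out0 = [0.0] * n
    out1 = [0.0] * n
    for i in range(n):
        out0[i] = t0[i] + t0[i]
    for i in range(n):
        out1[i] = t1[i] * t1[i]
    return out0, out1


def fused(n=10, cst=3.0):
    # After fusion, each temporary is contracted to a single element,
    # mirroring the memref<1xf32> buffers in the output above.
    t0 = [0.0]
    t1 = [0.0]
    out0 = [0.0] * n
    out1 = [0.0] * n
    for i in range(n):
        t0[0] = cst
        t1[0] = cst
        out1[i] = t1[0] * t1[0]
        out0[i] = t0[0] + t0[0]
    return out0, out1
```

Both functions return the same pair of lists, while `fused` only ever keeps one element of each temporary alive.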

Example 2: Sibling fusion. Input:

```
func @sibling_fusion(%arg0: memref<10x10xf32>, %arg1: memref<10x10xf32>,
%arg2: memref<10x10xf32>, %arg3: memref<10x10xf32>,
%arg4: memref<10x10xf32>) {
affine.for %arg5 = 0 to 3 {
affine.for %arg6 = 0 to 3 {
%0 = affine.load %arg0[%arg5, %arg6] : memref<10x10xf32>
%1 = affine.load %arg1[%arg5, %arg6] : memref<10x10xf32>
%2 = mulf %0, %1 : f32
affine.store %2, %arg3[%arg5, %arg6] : memref<10x10xf32>
}
}
affine.for %arg5 = 0 to 3 {
affine.for %arg6 = 0 to 3 {
%0 = affine.load %arg0[%arg5, %arg6] : memref<10x10xf32>
%1 = affine.load %arg2[%arg5, %arg6] : memref<10x10xf32>
%2 = addf %0, %1 : f32
affine.store %2, %arg4[%arg5, %arg6] : memref<10x10xf32>
}
}
return
}
```

Output:

```
func @sibling_fusion(%arg0: memref<10x10xf32>, %arg1: memref<10x10xf32>,
%arg2: memref<10x10xf32>, %arg3: memref<10x10xf32>,
%arg4: memref<10x10xf32>) {
affine.for %arg5 = 0 to 3 {
affine.for %arg6 = 0 to 3 {
%0 = affine.load %arg0[%arg5, %arg6] : memref<10x10xf32>
%1 = affine.load %arg1[%arg5, %arg6] : memref<10x10xf32>
%2 = mulf %0, %1 : f32
affine.store %2, %arg3[%arg5, %arg6] : memref<10x10xf32>
%3 = affine.load %arg0[%arg5, %arg6] : memref<10x10xf32>
%4 = affine.load %arg2[%arg5, %arg6] : memref<10x10xf32>
%5 = addf %3, %4 : f32
affine.store %5, %arg4[%arg5, %arg6] : memref<10x10xf32>
}
}
return
}
```

#### Options

```
-fusion-compute-tolerance : Fractional increase in additional computation tolerated while fusing
-fusion-fast-mem-space : Faster memory space number to promote fusion buffers to
-fusion-local-buf-threshold : Threshold size (KiB) for promoting local buffers to fast memory space
-fusion-maximal : Enables maximal loop fusion
```

`-affine-pipeline-data-transfer`

: Pipeline non-blocking data transfers between explicitly managed levels of the memory hierarchy

This pass performs a transformation to overlap non-blocking DMA operations in a loop with computations through double buffering. This is achieved by advancing dma_start operations with respect to other operations.

Input

```
func @pipelinedatatransfer() {
%0 = alloc() : memref<256xf32>
%1 = alloc() : memref<32xf32, 1>
%2 = alloc() : memref<1xf32>
%c0 = constant 0 : index
%c128 = constant 128 : index
affine.for %i0 = 0 to 8 {
affine.dma_start %0[%i0], %1[%i0], %2[%c0], %c128 : memref<256xf32>, memref<32xf32, 1>, memref<1xf32>
affine.dma_wait %2[%c0], %c128 : memref<1xf32>
%3 = affine.load %1[%i0] : memref<32xf32, 1>
%4 = "compute"(%3) : (f32) -> f32
affine.store %4, %1[%i0] : memref<32xf32, 1>
}
return
}
```

Output

```
module {
func @pipelinedatatransfer() {
%c8 = constant 8 : index
%c0 = constant 0 : index
%0 = alloc() : memref<256xf32>
%c0_0 = constant 0 : index
%c128 = constant 128 : index
%1 = alloc() : memref<2x32xf32, 1>
%2 = alloc() : memref<2x1xf32>
affine.dma_start %0[%c0], %1[%c0 mod 2, %c0], %2[%c0 mod 2, symbol(%c0_0)], %c128 : memref<256xf32>, memref<2x32xf32, 1>, memref<2x1xf32>
affine.for %arg0 = 1 to 8 {
affine.dma_start %0[%arg0], %1[%arg0 mod 2, %arg0], %2[%arg0 mod 2, symbol(%c0_0)], %c128 : memref<256xf32>, memref<2x32xf32, 1>, memref<2x1xf32>
%8 = affine.apply #map3(%arg0)
%9 = affine.apply #map4(%8)
%10 = affine.apply #map4(%8)
affine.dma_wait %2[%8 mod 2, symbol(%c0_0)], %c128 : memref<2x1xf32>
%11 = affine.load %1[%8 mod 2, %8] : memref<2x32xf32, 1>
%12 = "compute"(%11) : (f32) -> f32
affine.store %12, %1[%8 mod 2, %8] : memref<2x32xf32, 1>
}
%3 = affine.apply #map3(%c8)
%4 = affine.apply #map4(%3)
%5 = affine.apply #map4(%3)
affine.dma_wait %2[%3 mod 2, symbol(%c0_0)], %c128 : memref<2x1xf32>
%6 = affine.load %1[%3 mod 2, %3] : memref<2x32xf32, 1>
%7 = "compute"(%6) : (f32) -> f32
affine.store %7, %1[%3 mod 2, %3] : memref<2x32xf32, 1>
dealloc %2 : memref<2x1xf32>
dealloc %1 : memref<2x32xf32, 1>
return
}
}
```
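The schedule produced above (a prologue transfer, a loop that starts transfer `i` while consuming transfer `i - 1`, and an epilogue) can be modeled in Python. This is a sketch under simplifying assumptions: Python runs sequentially, so a "transfer" here completes immediately and only the ordering and the two-slot buffer indexing are illustrated. The names `naive`, `double_buffered`, and `compute` are hypothetical.

```python
def compute(x):
    # Stand-in for the opaque "compute" op in the example.
    return x * 2.0


def naive(src):
    out = []
    for i in range(len(src)):
        buf = src[i]  # transfer started and immediately waited on
        out.append(compute(buf))
    return out


def double_buffered(src):
    n = len(src)
    if n == 0:
        return []
    bufs = [None, None]  # two slots, selected by iteration number mod 2
    out = []
    bufs[0] = src[0]  # prologue: start the transfer for iteration 0
    for i in range(1, n):
        bufs[i % 2] = src[i]  # start the transfer for iteration i
        prev = i - 1  # wait for, then consume, the previous transfer
        out.append(compute(bufs[prev % 2]))
    out.append(compute(bufs[(n - 1) % 2]))  # epilogue: last iteration
    return out
```

Because the slot written in iteration `i` is never the slot read in that iteration, a real non-blocking transfer could overlap with the computation.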

`-buffer-deallocation`

: Adds all required dealloc operations for all allocations in the input program

This pass implements an algorithm to automatically introduce all required deallocation operations for all buffers in the input program. This ensures that the resulting program does not have any memory leaks.

Input

```
#map0 = affine_map<(d0) -> (d0)>
module {
func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
cond_br %arg0, ^bb1, ^bb2
^bb1:
br ^bb3(%arg1 : memref<2xf32>)
^bb2:
%0 = alloc() : memref<2xf32>
linalg.generic {
args_in = 1 : i64,
args_out = 1 : i64,
indexing_maps = [#map0, #map0],
iterator_types = ["parallel"]} %arg1, %0 {
^bb0(%gen1_arg0: f32, %gen1_arg1: f32):
%tmp1 = exp %gen1_arg0 : f32
linalg.yield %tmp1 : f32
}: memref<2xf32>, memref<2xf32>
br ^bb3(%0 : memref<2xf32>)
^bb3(%1: memref<2xf32>):
"linalg.copy"(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> ()
return
}
}
```

Output

```
#map0 = affine_map<(d0) -> (d0)>
module {
func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
cond_br %arg0, ^bb1, ^bb2
^bb1: // pred: ^bb0
%0 = alloc() : memref<2xf32>
linalg.copy(%arg1, %0) : memref<2xf32>, memref<2xf32>
br ^bb3(%0 : memref<2xf32>)
^bb2: // pred: ^bb0
%1 = alloc() : memref<2xf32>
linalg.generic {
args_in = 1 : i64,
args_out = 1 : i64,
indexing_maps = [#map0, #map0],
iterator_types = ["parallel"]} %arg1, %1 {
^bb0(%arg3: f32, %arg4: f32): // no predecessors
%4 = exp %arg3 : f32
linalg.yield %4 : f32
}: memref<2xf32>, memref<2xf32>
%2 = alloc() : memref<2xf32>
linalg.copy(%1, %2) : memref<2xf32>, memref<2xf32>
dealloc %1 : memref<2xf32>
br ^bb3(%2 : memref<2xf32>)
^bb3(%3: memref<2xf32>): // 2 preds: ^bb1, ^bb2
linalg.copy(%3, %arg2) : memref<2xf32>, memref<2xf32>
dealloc %3 : memref<2xf32>
return
}
}
```

`-buffer-hoisting`

: Optimizes placement of allocation operations by moving them into common dominators and out of nested regions

This pass implements an approach to aggressively move allocations upwards into common dominators and out of nested regions.

`-buffer-loop-hoisting`

: Optimizes placement of allocation operations by moving them out of loop nests

This pass implements an approach to aggressively move allocations upwards out of loop nests. It does not move allocations into common dominators.

`-buffer-results-to-out-params`

: Converts memref-typed function results to out-params

Some calling conventions prefer to pass output memrefs as “out params”. The conversion to this calling convention must be done as an atomic transformation of the entire program (hence this is a module pass).

For example, if a call is rewritten, the callee needs to be rewritten as well; otherwise the IR will end up invalid. Thus, this transformation requires an atomic change to the entire program (e.g. the whole module).

This pass is expected to run immediately after bufferization is finished. At that point, tensor-typed results will have been converted to memref-typed results, and can be consistently converted to out params.

All memref-typed results are appended to the function argument list.

The main issue with this pass (and the out-param calling convention) is that buffers for results need to be allocated in the caller. This currently only works for static shaped memrefs.
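The signature rewrite can be sketched in Python as follows. This is an illustrative model only: `FuncSignature` and `results_to_out_params` are hypothetical names, types are represented as strings, and the sketch assumes non-memref results stay in the result list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FuncSignature:
    name: str
    args: List[str]     # argument types, as strings
    results: List[str]  # result types, as strings


def results_to_out_params(sig: FuncSignature) -> FuncSignature:
    """Append memref-typed results to the argument list."""
    memref_results = [t for t in sig.results if t.startswith("memref<")]
    other_results = [t for t in sig.results if not t.startswith("memref<")]
    return FuncSignature(sig.name, sig.args + memref_results, other_results)
```

A caller of the rewritten function would then have to allocate the buffer for each out-param itself, which is why static shapes matter here.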

`-canonicalize`

: Canonicalize operations

This pass performs various types of canonicalizations over a set of operations. See Operation Canonicalization for more details.

#### Options

```
-top-down : Seed the worklist in general top-down order
-region-simplify : Perform control flow optimizations to the region tree
-max-iterations : Max. iterations between applying patterns / simplifying regions
-disable-patterns : Labels of patterns that should be filtered out during application
-enable-patterns : Labels of patterns that should be used during application, all other patterns are filtered out
```

`-cse`

: Eliminate common sub-expressions

This pass implements a generalized algorithm for common sub-expression elimination. This pass relies on information provided by the `Memory SideEffect` interface to identify when it is safe to eliminate operations. See Common subexpression elimination for more general details on this optimization.

#### Statistics

```
num-cse'd : Number of operations CSE'd
num-dce'd : Number of operations DCE'd
```
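The core idea of common sub-expression elimination can be sketched in Python on a flat list of operations. This is a toy model, not the pass's implementation: it assumes every operation is free of side effects and ignores regions and dominance, both of which the real pass has to account for. The function name `cse` and the tuple encoding of operations are hypothetical.

```python
def cse(ops):
    """ops: list of (result, op_name, operand_names), all assumed pure.

    Returns (new_ops, replacements).
    """
    seen = {}          # (op_name, operands) -> result of the first occurrence
    replacements = {}  # eliminated result -> surviving result
    new_ops = []
    for result, name, operands in ops:
        # Rewrite operands through earlier replacements first.
        operands = tuple(replacements.get(o, o) for o in operands)
        key = (name, operands)
        if key in seen:
            replacements[result] = seen[key]
        else:
            seen[key] = result
            new_ops.append((result, name, operands))
    return new_ops, replacements
```

Given two identical `addi` operations followed by a `muli` of their results, the second `addi` is dropped and the `muli` is rewritten to use the first result twice.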

`-finalizing-bufferize`

: Finalize a partial bufferization

A bufferize pass that finalizes a partial bufferization by removing remaining `memref.tensor_load` and `memref.buffer_cast` operations.

The removal of those operations is only possible if the operations only exist in pairs, i.e., all uses of `memref.tensor_load` operations are `memref.buffer_cast` operations.

This pass will fail if not all operations can be removed or if any operation with tensor typed operands remains.

`-inline`

: Inline function calls

#### Options

```
-default-pipeline : The default optimizer pipeline used for callables
-op-pipelines : Callable operation specific optimizer pipelines (in the form of `dialect.op(pipeline)`)
-max-iterations : Maximum number of iterations when inlining within an SCC
```

`-loop-coalescing`

: Coalesce nested loops with independent bounds into a single loop

`-loop-invariant-code-motion`

: Hoist loop invariant instructions outside of the loop

`-normalize-memrefs`

: Normalize memrefs

This pass transforms memref types with a non-trivial layout map into memref types with an identity layout map, e.g. (i, j) -> (i, j). This pass is inter-procedural, in the sense that it can modify function interfaces and call sites that pass memref types. In order to modify memref types while preserving the original behavior, users of those memref types are also modified to incorporate the resulting layout map. For instance, an [AffineLoadOp](https://mlir.llvm.org/docs/Dialects/Affine/#affineload-affineloadop) will be updated to compose the layout map with the affine expression contained in the op. Operations marked with the [MemRefsNormalizable](https://mlir.llvm.org/docs/Traits/#memrefsnormalizable) trait are expected to be normalizable. Supported operations include affine operations, memref.alloc, memref.dealloc, and std.return.

Given an appropriate layout map specified in the code, this transformation can express tiled or linearized access to multi-dimensional data structures, but will not modify memref types without an explicit layout map.

Currently this pass is limited to modifying only functions where all memref types can be normalized. If a function contains any operations that are not MemRefsNormalizable, then the function and any functions that call it or are called by it will not be modified.

Input

```
#tile = affine_map<(i) -> (i floordiv 4, i mod 4)>
func @matmul(%A: memref<16xf64, #tile>,
%B: index, %C: memref<16xf64>) -> (memref<16xf64, #tile>) {
affine.for %arg3 = 0 to 16 {
%a = affine.load %A[%arg3] : memref<16xf64, #tile>
%p = mulf %a, %a : f64
affine.store %p, %A[%arg3] : memref<16xf64, #tile>
}
%c = alloc() : memref<16xf64, #tile>
%d = affine.load %c[0] : memref<16xf64, #tile>
return %A: memref<16xf64, #tile>
}
```

Output

```
func @matmul(%arg0: memref<4x4xf64>, %arg1: index, %arg2: memref<16xf64>)
-> memref<4x4xf64> {
affine.for %arg3 = 0 to 16 {
%3 = affine.load %arg0[%arg3 floordiv 4, %arg3 mod 4]: memref<4x4xf64>
%4 = mulf %3, %3 : f64
affine.store %4, %arg0[%arg3 floordiv 4, %arg3 mod 4]: memref<4x4xf64>
}
%0 = alloc() : memref<4x4xf64>
%1 = affine.apply #map1()
%2 = affine.load %0[0, 0] : memref<4x4xf64>
return %arg0 : memref<4x4xf64>
}
```
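How a layout map determines both the rewritten indices and the normalized shape can be sketched in Python, using the two layout maps that appear in this section's examples. This is an illustration only: `normalized_shape` is a hypothetical helper that enumerates every logical index, whereas the pass works on the affine maps symbolically.

```python
from itertools import product


def tile(i):
    # Mirrors affine_map<(i) -> (i floordiv 4, i mod 4)>.
    return (i // 4, i % 4)


def linear8(i, j):
    # Mirrors affine_map<(i, j) -> (i * 8 + j)>.
    return (i * 8 + j,)


def normalized_shape(logical_shape, layout):
    """Bounding box of the physical indices reached through the layout map."""
    mapped = [layout(*idx) for idx in product(*(range(n) for n in logical_shape))]
    rank = len(mapped[0])
    return tuple(max(m[d] for m in mapped) + 1 for d in range(rank))
```

For instance, `normalized_shape((16,), tile)` gives `(4, 4)`, matching the `memref<4x4xf64>` type in the output above, and logical index 13 maps to physical index `(3, 1)`.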

Input

```
#linear8 = affine_map<(i, j) -> (i * 8 + j)>
func @linearize(%arg0: memref<8x8xi32, #linear8>,
%arg1: memref<8x8xi32, #linear8>,
%arg2: memref<8x8xi32, #linear8>) {
%c8 = constant 8 : index
%c0 = constant 0 : index
%c1 = constant 1 : index
affine.for %arg3 = %c0 to %c8 {
affine.for %arg4 = %c0 to %c8 {
affine.for %arg5 = %c0 to %c8 {
%0 = affine.load %arg0[%arg3, %arg5] : memref<8x8xi32, #linear8>
%1 = affine.load %arg1[%arg5, %arg4] : memref<8x8xi32, #linear8>
%2 = affine.load %arg2[%arg3, %arg4] : memref<8x8xi32, #linear8>
%3 = muli %0, %1 : i32
%4 = addi %2, %3 : i32
affine.store %4, %arg2[%arg3, %arg4] : memref<8x8xi32, #linear8>
}
}
}
return
}
```

Output

```
func @linearize(%arg0: memref<64xi32>,
%arg1: memref<64xi32>,
%arg2: memref<64xi32>) {
%c8 = constant 8 : index
%c0 = constant 0 : index
affine.for %arg3 = %c0 to %c8 {
affine.for %arg4 = %c0 to %c8 {
affine.for %arg5 = %c0 to %c8 {
%0 = affine.load %arg0[%arg3 * 8 + %arg5] : memref<64xi32>
%1 = affine.load %arg1[%arg5 * 8 + %arg4] : memref<64xi32>
%2 = affine.load %arg2[%arg3 * 8 + %arg4] : memref<64xi32>
%3 = muli %0, %1 : i32
%4 = addi %2, %3 : i32
affine.store %4, %arg2[%arg3 * 8 + %arg4] : memref<64xi32>
}
}
}
return
}
```

`-parallel-loop-collapsing`

: Collapse parallel loops to use fewer induction variables

#### Options

```
-collapsed-indices-0 : Which loop indices to combine into the position 0 loop index
-collapsed-indices-1 : Which loop indices to combine into the position 1 loop index
-collapsed-indices-2 : Which loop indices to combine into the position 2 loop index
```

`-print-cfg-graph`

: Print CFG graph per-Region

`-print-op-stats`

: Print statistics of operations

`-promote-buffers-to-stack`

: Promotes heap-based allocations to automatically managed stack-based allocations

This pass implements a simple algorithm to convert heap-based memory allocations to stack-based ones. It uses a built-in heuristic to decide whether it makes sense to convert an allocation. Furthermore, dynamic shaped buffers that are limited by the rank of the tensor can be converted. They are only transformed if they are considered to be small.

#### Options

```
-max-alloc-size-in-bytes : Maximal size in bytes to promote allocations to stack.
-bitwidth-of-index-type : Bitwidth of the index type. Used for size estimation.
-max-rank-of-allocated-memref : Maximal memref rank to promote dynamic buffers.
```

`-sccp`

: Sparse Conditional Constant Propagation

This pass implements a general algorithm for sparse conditional constant propagation. This algorithm detects values that are known to be constant and optimistically propagates this throughout the IR. Any values proven to be constant are replaced, and removed if possible.

This implementation is based on the algorithm described by Wegman and Zadeck in [“Constant Propagation with Conditional Branches”](https://dl.acm.org/doi/10.1145/103135.103136) (1991).

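The optimistic, reachability-aware propagation can be sketched in Python on a tiny control-flow graph whose blocks take arguments. This is a simplified model under stated assumptions, not the pass's implementation: it iterates to a fixed point instead of using the sparse worklists of the published algorithm, and the IR encoding, the `propagate` function, and the `EXAMPLE` program are all hypothetical.

```python
import operator

UNKNOWN = object()      # optimistic state: no evidence about the value yet
OVERDEFINED = object()  # value cannot be proven to be a single constant

BINARY = {"add": operator.add, "cmp_lt": operator.lt}


def meet(a, b):
    """Combine two lattice values."""
    if a is UNKNOWN:
        return b
    if b is UNKNOWN:
        return a
    if a is OVERDEFINED or b is OVERDEFINED:
        return OVERDEFINED
    return a if a == b else OVERDEFINED


def propagate(blocks, entry):
    values = {}           # SSA name -> lattice value
    executable = {entry}  # blocks proven reachable so far

    def get(name):
        return values.get(name, UNKNOWN)

    def update(name, new):
        old = get(name)
        merged = meet(old, new)
        values[name] = merged
        return merged is not old and merged != old

    changed = True
    while changed:
        changed = False
        for name, block in blocks.items():
            if name not in executable:
                continue  # never look inside blocks not proven reachable
            for kind, dst, *operands in block["ops"]:
                if kind == "const":
                    new = operands[0]
                else:
                    a, b = get(operands[0]), get(operands[1])
                    if a is OVERDEFINED or b is OVERDEFINED:
                        new = OVERDEFINED
                    elif a is UNKNOWN or b is UNKNOWN:
                        new = UNKNOWN
                    else:
                        new = BINARY[kind](a, b)
                changed |= update(dst, new)
            term = block["term"]
            if term[0] == "br":
                targets = [term[1]]
            elif term[0] == "cond_br":
                cond = get(term[1])
                if cond is UNKNOWN:
                    targets = []
                elif cond is OVERDEFINED:
                    targets = [term[2], term[3]]
                else:  # constant condition: only one successor is reachable
                    targets = [term[2] if cond else term[3]]
            else:  # "ret"
                targets = []
            for succ, args in targets:
                if succ not in executable:
                    executable.add(succ)
                    changed = True
                for param, arg in zip(blocks[succ]["params"], args):
                    changed |= update(param, get(arg))
    return values, executable


EXAMPLE = {
    "entry": {"params": [],
              "ops": [("const", "%one", 1), ("const", "%two", 2),
                      ("cmp_lt", "%cond", "%one", "%two")],
              "term": ("cond_br", "%cond", ("then", []), ("else", []))},
    "then": {"params": [], "ops": [("const", "%x", 2)],
             "term": ("br", ("join", ["%x"]))},
    "else": {"params": [], "ops": [("const", "%y", 3)],
             "term": ("br", ("join", ["%y"]))},
    "join": {"params": ["%p"], "ops": [("add", "%r", "%p", "%one")],
             "term": ("ret", "%r")},
}
```

In `EXAMPLE`, the branch condition folds to a constant, so the `else` block is never marked executable and the block argument `%p` is proven constant even though two different values flow toward it syntactically.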
`-snapshot-op-locations`

: Generate new locations from the current IR

This pass allows for generating new locations from the IR during any stage of compilation, by snapshotting the IR to a file and using that file to generate new locations for the operations.

Depending on the value of the `tag` option, different resulting locations may be generated:

- If unset, the original location of the operation is replaced.

Example:

```
// old:
... loc("original_source.cpp":1:1)
// new:
... loc("snapshot_source.mlir":10:10)
```

- If set, the new location is fused with the original location in the form of a `Name Location` with the specified tag.

Example:

```
// old:
... loc("original_source.cpp":1:1)
// new:
... loc(fused["original_source.cpp":1:1, "snapshot"("snapshot_source.mlir":10:10)])
```

#### Options

```
-filename : The filename to print the generated IR
-tag : A tag to use when fusing the new locations with the original. If unset, the locations are replaced.
```

`-strip-debuginfo`

: Strip debug info from all operations

This pass strips the IR of any location information, by replacing all operation locations with `unknown`.

`-symbol-dce`

: Eliminate dead symbols

This pass deletes all symbols that are found to be unreachable. This is done by computing the set of operations that are known to be live, propagating that liveness to other symbols, and then deleting all symbols that are not within this live set. Live symbols are those that have a visibility that extends beyond the IR, e.g. `public`, or those that are referenced by live symbols or other non-Symbol operations.

For example, consider the following input:

```
func private @dead_private_function()
func private @live_private_function()
// Note: The `public` isn't necessary here, as this is the default.
func public @public_function() {
"foo.return"() {uses = [@live_private_function]} : () -> ()
}
```

A known live function, `public_function`, contains a reference to an otherwise non-live function `live_private_function`. After running `symbol-dce`, only these two symbols should remain, as the final symbol `dead_private_function` is not visible outside of the current IR and there are no links to known-live operations. After running, we get the expected:

```
func private @live_private_function()
func public @public_function() {
"foo.return"() {uses = [@live_private_function]} : () -> ()
}
```
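The liveness propagation in this example can be sketched as a worklist traversal in Python. This is a minimal model rather than the pass's implementation: `live_symbols` is a hypothetical function, each symbol is reduced to a visibility string plus the list of symbols referenced from inside it, and references from non-Symbol operations are passed in separately.

```python
def live_symbols(symbols, non_symbol_refs=()):
    """symbols: name -> (visibility, names of symbols referenced inside it)."""
    live = set(non_symbol_refs)
    live.update(name for name, (vis, _) in symbols.items() if vis == "public")
    worklist = list(live)
    while worklist:
        name = worklist.pop()
        # Unknown names (e.g. external references) have nothing to propagate.
        for ref in symbols.get(name, (None, ()))[1]:
            if ref not in live:
                live.add(ref)
                worklist.append(ref)
    return live


SYMBOLS = {
    "dead_private_function": ("private", []),
    "live_private_function": ("private", []),
    "public_function": ("public", ["live_private_function"]),
}
```

Here `live_symbols(SYMBOLS)` contains `public_function` and `live_private_function`, so only `dead_private_function` would be deleted.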

See Symbols and SymbolTables for more information on `Symbols`.

`-view-op-graph`

: Print graphviz view of module

This pass prints a Graphviz graph per block of a module.

- Ops are represented as nodes;
- Uses are represented as edges.

#### Options

```
-title : The prefix of the title of the graph
-short-names : Use short names
```

## Conversion Passes

`-arm-neon-2d-to-intr`

: Convert Arm NEON structured ops to intrinsics

`-convert-affine-for-to-gpu`

: Convert top-level AffineFor Ops to GPU kernels

#### Options

```
-gpu-block-dims : Number of GPU block dimensions for mapping
-gpu-thread-dims : Number of GPU thread dimensions for mapping
```

`-convert-async-to-llvm`

: Convert the operations from the async dialect into the LLVM dialect

Convert `async.execute` operations to LLVM coroutines and use async runtime API to execute them.

`-convert-complex-to-llvm`

: Convert Complex dialect to LLVM dialect

`-convert-complex-to-standard`

: Convert Complex dialect to standard dialect

`-convert-gpu-launch-to-vulkan-launch`

: Convert gpu.launch_func to vulkanLaunch external call

This pass is only intended for the mlir-vulkan-runner.

`-convert-gpu-to-nvvm`

: Generate NVVM operations for gpu operations

#### Options

```
-index-bitwidth : Bitwidth of the index type, 0 to use size of machine word
```

`-convert-gpu-to-rocdl`

: Generate ROCDL operations for gpu operations

#### Options

```
-index-bitwidth : Bitwidth of the index type, 0 to use size of machine word
```

`-convert-gpu-to-spirv`

: Convert GPU dialect to SPIR-V dialect

This pass converts supported GPU device ops to SPIR-V ops. It does not handle GPU host ops.

A `gpu.func` op can have parameters to pass in resources. But in SPIR-V, entry functions cannot take parameters; they use descriptors to access resources. By default, parameters to a `gpu.func` op will be converted to global variables. These global variables will be assigned sequential binding numbers following their order in the original `gpu.func` op, starting from 0, in set 0. One can attach `spv.interface_var_abi` to those parameters to control the set and binding if wanted.
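The default numbering described above amounts to enumerating the parameters in order. A minimal Python sketch, where `default_bindings` is a hypothetical helper and parameters are represented only by their names:

```python
def default_bindings(params):
    """Map each gpu.func parameter to a (descriptor set, binding) pair."""
    # Set 0 for every parameter; bindings count up from 0 in declaration order.
    return {name: (0, index) for index, name in enumerate(params)}
```

For three parameters `%arg0`, `%arg1`, `%arg2`, this yields bindings 0, 1, and 2, all in set 0.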

`-convert-linalg-to-llvm`

: Convert the operations from the linalg dialect into the LLVM dialect

`-convert-linalg-to-spirv`

: Convert Linalg dialect to SPIR-V dialect

This pass converts supported Linalg ops to SPIR-V ops. It is quite experimental and is expected to migrate to other proper conversions.

`-convert-linalg-to-std`

: Convert the operations from the linalg dialect into the Standard dialect

`-convert-math-to-libm`

: Convert Math dialect to libm calls

This pass converts supported Math ops to libm calls.

`-convert-math-to-llvm`

: Convert Math dialect to LLVM dialect

This pass converts supported Math ops to LLVM dialect intrinsics.

`-convert-memref-to-llvm`

: Convert operations from the MemRef dialect to the LLVM dialect

#### Options

```
-use-aligned-alloc : Use aligned_alloc in place of malloc for heap allocations
-index-bitwidth : Bitwidth of the index type, 0 to use size of machine word
```

`-convert-openacc-to-llvm`

: Convert the OpenACC ops to LLVM dialect

`-convert-openacc-to-scf`

: Convert the OpenACC ops to OpenACC with SCF dialect

`-convert-openmp-to-llvm`

: Convert the OpenMP ops to OpenMP ops with LLVM dialect

`-convert-parallel-loops-to-gpu`

: Convert mapped scf.parallel ops to gpu launch operations

`-convert-pdl-to-pdl-interp`

: Convert PDL ops to PDL interpreter ops

`-convert-scf-to-openmp`

: Convert SCF parallel loop to OpenMP parallel + workshare constructs

`-convert-scf-to-spirv`

: Convert SCF dialect to SPIR-V dialect

This pass converts SCF ops into SPIR-V structured control flow ops. SPIR-V structured control flow ops do not support yielding values. So for SCF ops yielding values, SPIR-V variables are created for holding the values and load/store operations are emitted for updating them.

`-convert-scf-to-std`

: Convert SCF dialect to Standard dialect, replacing structured control flow with a CFG

`-convert-shape-constraints`

: Convert shape constraint operations to the standard dialect

This pass eliminates shape constraints from the program, converting them to eager (side-effecting) error handling code.

This pass is separate from the regular `-convert-shape-to-std` pass, despite converting between the same dialects, because converting shape constraints can happen at a different part of the program than general shape computation lowering.

`-convert-shape-to-std`

: Convert operations from the shape dialect into the standard dialect

`-convert-spirv-to-llvm`

: Convert SPIR-V dialect to LLVM dialect

See https://mlir.llvm.org/docs/SPIRVToLLVMDialectConversion/ for more details.

`-convert-std-to-llvm`

: Convert scalar and vector operations from the Standard to the LLVM dialect

Convert standard operations into the LLVM IR dialect operations.

#### Input invariant

- operations including: arithmetic on integers and floats, constants, direct calls, returns and branches;
- no `tensor` types;
- all `vector` types are one-dimensional;
- all blocks are reachable by following the successors of the first basic block;

If other operations are present and their results are required by the LLVM IR dialect operations, the pass will fail. Any LLVM IR operations or types already present in the IR will be kept as is.

#### Output IR

Functions converted to LLVM IR. Function argument types are converted one-to-one. Function results are converted one-to-one and, in case more than one value is returned, packed into an LLVM IR struct type. Function calls and returns are updated accordingly. Block argument types are updated to use LLVM IR types.

#### Options

```
-use-bare-ptr-memref-call-conv : Replace FuncOp's MemRef arguments with bare pointers to the MemRef element types
-emit-c-wrappers : Emit wrappers for C-compatible pointer-to-struct memref descriptors
-index-bitwidth : Bitwidth of the index type, 0 to use size of machine word
-data-layout : String description (LLVM format) of the data layout that is expected on the produced module
```

`-convert-std-to-spirv`

: Convert Standard dialect to SPIR-V dialect

#### Options

```
-emulate-non-32-bit-scalar-types : Emulate non-32-bit scalar types with 32-bit ones if missing native support
```

`-convert-vector-to-gpu`

: Lower the operations from the vector dialect into the GPU dialect

`-convert-vector-to-llvm`

: Lower the operations from the vector dialect into the LLVM dialect

Convert operations from the vector dialect into the LLVM IR dialect operations. The lowering pass provides several options to control the kinds of optimizations that are allowed. It also provides options that enable the use of one or more architecture-specific dialects (AMX, X86Vector, ArmNeon, ArmSVE, etc.) in combination with the architecture-neutral vector dialect lowering.

#### Options

```
-reassociate-fp-reductions : Allows llvm to reassociate floating-point reductions for speed
-enable-index-optimizations : Allows compiler to assume indices fit in 32-bit if that yields faster code
-enable-amx : Enables the use of AMX dialect while lowering the vector dialect.
-enable-arm-neon : Enables the use of ArmNeon dialect while lowering the vector dialect.
-enable-arm-sve : Enables the use of ArmSVE dialect while lowering the vector dialect.
-enable-x86vector : Enables the use of X86Vector dialect while lowering the vector dialect.
```

`-convert-vector-to-rocdl`

: Lower the operations from the vector dialect into the ROCDL dialect

`-convert-vector-to-scf`

: Lower the operations from the vector dialect into the SCF dialect

#### Options

```
-full-unroll : Perform full unrolling when converting vector transfers to SCF
-target-rank : Target vector rank to which transfer ops should be lowered
-lower-permutation-maps : Replace permutation maps with vector transposes/broadcasts before lowering transfer ops
-lower-tensors : Lower transfer ops that operate on tensors
```

`-convert-vector-to-spirv`

: Convert Vector dialect to SPIR-V dialect

`-gpu-to-llvm`

: Convert GPU dialect to LLVM dialect with GPU runtime calls

`-launch-func-to-vulkan`

: Convert vulkanLaunch external call to Vulkan runtime external calls

This pass is only intended for the mlir-vulkan-runner.

`-lower-affine`

: Lower Affine operations to a combination of Standard and SCF operations

Convert operations from the affine dialect into operations from the SCF and standard dialects.

`affine.for` operations are converted to `scf.for` operations that are free of certain structural restrictions (on their bounds and step). `affine.if` is similarly converted to the `scf.if` operation. `affine.apply` operations are converted into sequences of primitive arithmetic operations from the standard dialect that have the same effect, using operands of the `index` type. Consequently, named maps and sets that are no longer in use may be removed from the module.

For example, `%r = affine.apply affine_map<(d0, d1)[s0] -> (d0 + 2*d1 + s0)>(%d0, %d1)[%s0]` can be converted into:

```
%d0 = <...>
%d1 = <...>
%s0 = <...>
%0 = constant 2 : index
%1 = muli %0, %d1
%2 = addi %d0, %1
%r = addi %2, %s0
```

#### Input invariant

- no `Tensor` types;

These restrictions may be lifted in the future.

#### Output IR

Functions with `affine.for` and `affine.if` operations eliminated. These functions may contain operations from the Standard dialect in addition to those already present before the pass.

#### Invariants

- Functions without a body are not modified.
- The semantics of the other functions is preserved.
- Individual operations other than those mentioned above are not modified if they do not depend on the loop iterator value or on the result of `affine.apply`.

`-lower-host-to-llvm`

: Lowers the host module code and `gpu.launch_func` to LLVM

`-tosa-to-linalg-on-tensors`

: Lower TOSA to LinAlg on tensors

Pass that converts TOSA operations to the equivalent operations using the tensor operations in LinAlg.

`-tosa-to-scf`

: Lower TOSA to the SCF dialect

Pass that converts TOSA’s control flow operations to the equivalent SCF operations.

`-tosa-to-standard`

: Lower TOSA to the Standard dialect

Pass that converts TOSA operations to the equivalent operations using the operations in the Standard dialect.

## `async` Dialect Passes

`-async-parallel-for`

: Convert scf.parallel operations to multiple async compute ops executed concurrently for non-overlapping iteration ranges

#### Options

```
-async-dispatch : Dispatch async compute tasks using recursive work splitting. If `false`, async compute tasks will be launched using a simple for loop in the caller thread.
-num-workers : The number of available workers to execute async operations.
-target-block-size : The target block size for sharding parallel operation.
```

`-async-runtime-policy-based-ref-counting`

: Policy based reference counting for Async runtime operations

This pass works at the async runtime abstraction level, after all `async.execute` and `async.await` operations are lowered to the async runtime API calls, and async coroutine operations.

This pass doesn’t rely on reference counted values liveness analysis, and instead uses a simple policy to create reference counting operations. If the program violates any of the assumptions, then this pass might lead to memory leaks or runtime errors.

The default reference counting policy assumptions:

- Async token can be awaited or added to the group only once.
- Async value or group can be awaited only once.

Under these assumptions reference counting only needs to drop a reference:

- After `async.runtime.await` operation for async tokens and groups (as long as error handling is not implemented for the sync await).
- After `async.runtime.is_error` operation for async tokens and groups (this is the last operation in the coroutine resume function).
- After `async.runtime.load` operation for async values.

This pass introduces significantly less runtime overhead compared to the automatic reference counting.

`-async-runtime-ref-counting`

: Automatic reference counting for Async runtime operations

This pass works at the async runtime abstraction level, after all `async.execute` and `async.await` operations are lowered to the async runtime API calls, and async coroutine operations.

It relies on the LLVM coroutines switched-resume lowering semantics for the correct placing of the reference counting operations.

See: https://llvm.org/docs/Coroutines.html#switched-resume-lowering

`-async-runtime-ref-counting-opt`

: Optimize automatic reference counting operations for the Async runtime by removing redundant operations

`-async-to-async-runtime`

: Lower high level async operations (e.g. async.execute) to the explicit async.runtime and async.coro operations

## `affine` Dialect Passes

`-affine-data-copy-generate`

: Generate explicit copying for affine memory operations

#### Options

```
-fast-mem-capacity : Set fast memory space capacity in KiB (default: unlimited)
-fast-mem-space : Fast memory space identifier for copy generation (default: 1)
-generate-dma : Generate DMA instead of point-wise copy
-min-dma-transfer : Minimum DMA transfer size supported by the target in bytes
-slow-mem-space : Slow memory space identifier for copy generation (default: 0)
-skip-non-unit-stride-loops : Testing purposes: avoid non-unit stride loop choice depths for copy placement
-tag-mem-space : Tag memory space identifier for copy generation (default: 0)
```

`-affine-loop-invariant-code-motion`

: Hoist loop invariant instructions outside of affine loops ¶

`-affine-loop-normalize`

: Apply normalization transformations to affine loop-like ops ¶

`-affine-loop-tile`

: Tile affine loop nests ¶

#### Options ¶

```
-cache-size : Set size of cache to tile for in KiB
-separate : Separate full and partial tiles
-tile-size : Use this tile size for all loops
-tile-sizes : List of tile sizes for each perfect nest (overridden by -tile-size)
```
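
For readers unfamiliar with the transformation, the following sketch shows loop tiling on a plain 2-D iteration space in Python. It is a conceptual model, not the pass's implementation; the function name `tiled_2d` and the use of a single tile size for both loops (loosely mirroring `-tile-size`) are assumptions made for illustration.

```python
def tiled_2d(n, m, tile_size):
    """Visit every (i, j) in an n x m space, one tile at a time."""
    order = []
    for ii in range(0, n, tile_size):          # loop over tile origins (rows)
        for jj in range(0, m, tile_size):      # loop over tile origins (columns)
            # min() clamps partial tiles at the edge of the iteration space.
            for i in range(ii, min(ii + tile_size, n)):
                for j in range(jj, min(jj + tile_size, m)):
                    order.append((i, j))
    return order


# Same set of iterations as the untiled nest, visited in a different order.
print(tiled_2d(4, 4, 2)[:4])
```

The `min()` bounds are what allow full and partial tiles to share one loop nest; the `-separate` option instead asks the pass to separate full and partial tiles.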

`-affine-loop-unroll`

: Unroll affine loops ¶

#### Options ¶

```
-unroll-factor : Use this unroll factor for all loops being unrolled
-unroll-up-to-factor : Allow unrolling up to the factor specified
-unroll-full : Fully unroll loops
-unroll-num-reps : Unroll innermost loops repeatedly this many times
-unroll-full-threshold : Unroll all loops with trip count less than or equal to this
```
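
To make the idea of an unroll factor concrete, here is a minimal Python sketch of a reduction loop unrolled by a factor of 4, with a remainder loop for trip counts that are not a multiple of the factor. The function name and the fixed factor are assumptions chosen for the example; the code illustrates the general technique rather than the output of this pass.

```python
def sum_unrolled_by_4(values):
    """Sum a list with the loop body replicated four times per iteration."""
    total = 0
    n = len(values)
    main_end = n - n % 4  # largest multiple of 4 that fits in n
    for i in range(0, main_end, 4):
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
    # Remainder loop: handles the last n % 4 iterations.
    for i in range(main_end, n):
        total += values[i]
    return total


print(sum_unrolled_by_4(list(range(10))))
```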

`-affine-loop-unroll-jam`

: Unroll and jam affine loops ¶

#### Options ¶

```
-unroll-jam-factor : Use this unroll jam factor for all loops (default 4)
```
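
Unroll-and-jam unrolls an outer loop and fuses ("jams") the resulting copies of the inner loop into one. The Python sketch below applies it by hand to a matrix-vector product with a factor of 2, chosen for brevity instead of the default of 4; the function name is hypothetical and the code is a conceptual model only.

```python
def matvec_unroll_jam_2(a, x):
    """Compute y = a @ x with the row loop unrolled by 2 and jammed."""
    rows, cols = len(a), len(x)
    y = [0] * rows
    main_end = rows - rows % 2
    for i in range(0, main_end, 2):
        # One inner loop now serves two consecutive outer iterations,
        # so each x[j] is read once and used for both rows.
        for j in range(cols):
            y[i] += a[i][j] * x[j]
            y[i + 1] += a[i + 1][j] * x[j]
    # Remainder iteration when the row count is odd.
    for i in range(main_end, rows):
        for j in range(cols):
            y[i] += a[i][j] * x[j]
    return y


print(matvec_unroll_jam_2([[1, 2], [3, 4], [5, 6]], [1, 1]))
```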

`-affine-parallelize`

: Convert affine.for ops into 1-D affine.parallel ¶

#### Options ¶

```
-max-nested : Maximum number of nested parallel loops to produce. Defaults to unlimited (UINT_MAX).
-parallel-reductions : Whether to parallelize reduction loops. Defaults to false.
```

`-affine-scalrep`

: Replace affine memref accesses by scalars by forwarding stores to loads and eliminating redundant loads ¶

This pass performs store to load forwarding and redundant load elimination for affine memref accesses and potentially eliminates the entire memref if all its accesses are forwarded.

Input

```
func @store_load_affine_apply() -> memref<10x10xf32> {
%cf7 = constant 7.0 : f32
%m = alloc() : memref<10x10xf32>
affine.for %i0 = 0 to 10 {
affine.for %i1 = 0 to 10 {
affine.store %cf7, %m[%i0, %i1] : memref<10x10xf32>
%v0 = affine.load %m[%i0, %i1] : memref<10x10xf32>
%v1 = addf %v0, %v0 : f32
}
}
return %m : memref<10x10xf32>
}
```

Output

```
module {
func @store_load_affine_apply() -> memref<10x10xf32> {
%cst = constant 7.000000e+00 : f32
%0 = alloc() : memref<10x10xf32>
affine.for %arg0 = 0 to 10 {
affine.for %arg1 = 0 to 10 {
affine.store %cst, %0[%arg0, %arg1] : memref<10x10xf32>
%1 = addf %cst, %cst : f32
}
}
return %0 : memref<10x10xf32>
}
}
```
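
The forwarding step in this example can be modeled in a few lines of Python. The sketch assumes a straight-line list of operations encoded as tuples and compares addresses textually; both are simplifications made for illustration, whereas the actual pass has to reason about affine accesses and the surrounding loop structure. The `forward_stores` helper is hypothetical.

```python
def forward_stores(ops):
    """Forward stored values to later loads of the same address.

    Ops are ("store", value, addr), ("load", result, addr), or
    (name, result, *operands) for anything else.
    """
    last_store = {}  # address -> value most recently stored there
    replaced = {}    # load result -> value forwarded in its place
    out = []
    for op in ops:
        kind = op[0]
        if kind == "store":
            _, value, addr = op
            value = replaced.get(value, value)
            last_store[addr] = value
            out.append(("store", value, addr))
        elif kind == "load":
            _, result, addr = op
            if addr in last_store:
                # Drop the load; later uses read the stored value directly.
                replaced[result] = last_store[addr]
            else:
                out.append(op)
        else:
            name, result, *operands = op
            out.append((name, result, *[replaced.get(o, o) for o in operands]))
    return out


body = [
    ("store", "%cf7", "%m[%i0, %i1]"),
    ("load", "%v0", "%m[%i0, %i1]"),
    ("addf", "%v1", "%v0", "%v0"),
]
for op in forward_stores(body):
    print(op)
```

As in the output above, the store is kept because the memref is returned from the function, while the load disappears and `addf` consumes the constant directly.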

`-affine-super-vectorize`

: Vectorize to a target independent n-D vector abstraction ¶

#### Options ¶

```
-virtual-vector-size : Specify an n-D virtual vector size for vectorization
-test-fastest-varying : Specify a 1-D, 2-D or 3-D pattern of fastest varying memory dimensions to match. See defaultPatterns in Vectorize.cpp for a description and examples. This is used for testing purposes
-vectorize-reductions : Vectorize known reductions expressed via iter_args. Switched off by default.
```

`-simplify-affine-structures`

: Simplify affine expressions in maps/sets and normalize memrefs ¶

## `gpu` Dialect Passes ¶

`-gpu-async-region`

: Make GPU ops async ¶

`-gpu-kernel-outlining`

: Outline gpu.launch bodies to kernel functions ¶

## `linalg` Dialect Passes ¶

`-convert-elementwise-to-linalg`

: Convert ElementwiseMappable ops to linalg ¶

Convert ops with the `ElementwiseMappable` trait to linalg parallel loops.

This pass only converts ops that operate on ranked tensors.

`-convert-linalg-tiled-loops-to-scf`

: Lower linalg tiled loops to SCF loops and parallel loops ¶

`-convert-linalg-to-affine-loops`

: Lower the operations from the linalg dialect into affine loops ¶

`-convert-linalg-to-loops`

: Lower the operations from the linalg dialect into loops ¶

`-convert-linalg-to-parallel-loops`

: Lower the operations from the linalg dialect into parallel loops ¶

`-linalg-bufferize`

: Bufferize the linalg dialect ¶

`-linalg-comprehensive-module-bufferize`

: Bufferize (tensor into memref) for a Module. ¶

This pass implements a cross-dialect bufferization approach and performs an analysis to determine which op operands and results may be bufferized in the same buffers. The analysis is performed on topologically sorted CallOp and FuncOp within a module. It provides analyses and bufferization across function boundaries. Within a function boundary, the analysis is performed on SSA use-def chains starting from function operands that are annotated with the ‘inplaceable’ attribute.

#### Options ¶

```
-test-analysis-only : Only runs inplaceability analysis (for testing purposes only)
```

`-linalg-detensorize`

: Detensorize linalg ops ¶

Detensoring is the process through which a tensor value is converted to one or potentially more primitive value(s). During this process, operations with such detensored operands are also converted to an equivalent form that works on primitives.

The detensoring process is driven by linalg-on-tensor ops. In particular, a
linalg-on-tensor op is checked to see whether *all* its operands can be
detensored. If so, those operands are converted to their primitive
counterparts and the linalg op is replaced by an equivalent op that takes
those new primitive values as operands. Therefore, detensoring an op can be
divided into 2 main logical phases:

- Detect/match an op that can be detensored.
- Detensor the operands of the op and replace it with a primitive equivalent.

In addition to detensoring individual ops, this pass detensors internal control flow inside a function. All blocks except for the entry block are detensored by converting their arguments whenever possible.

`-linalg-fold-reshape-ops-by-linearization`

: Fold TensorReshapeOps with generic/indexed generic ops by linearization ¶

#### Options ¶

```
-allow-folding-unit-dim-reshapes : Allow fusing linalg.tensor_reshape ops that perform unit dimension collapsing
```

`-linalg-fold-unit-extent-dims`

: Remove unit-extent dimension in Linalg ops on tensors ¶

#### Options ¶

```
-fold-one-trip-loops-only : Only folds the one-trip loops from Linalg ops on tensors (for testing purposes only)
```

`-linalg-fuse-elementwise-ops`

: Fuse elementwise operations on tensors ¶

#### Options ¶

```
-allow-folding-unit-dim-reshapes : Allow fusing linalg.tensor_reshape ops that perform unit dimension collapsing
```

`-linalg-generalize-named-ops`

: Convert named ops into generic ops ¶

`-linalg-inline-scalar-operands`

: Inline scalar operands into linalg generic ops ¶

`-linalg-promote-subviews`

: Promote subview ops to local buffers ¶

#### Options ¶

```
-test-promote-dynamic : Test generation of dynamic promoted buffers
-test-use-alloca : Test generation of alloca'ed buffers.
```

`-linalg-tile`

: Tile operations in the linalg dialect ¶

#### Options ¶

```
-linalg-tile-sizes : Tile sizes
```

`-linalg-tile-to-parallel-loops`

: Tile operations in the linalg dialect to parallel loops ¶

#### Options ¶

```
-linalg-tile-sizes : Tile sizes
```

`-linalg-tile-to-tiled-loop`

: Tile operations in the linalg dialect to linalg.tiled_loop ¶

#### Options ¶

```
-linalg-tile-sizes : Tile sizes
-linalg-distribution-types : DistributionTypes
```

## `llvm` Dialect Passes ¶

`-llvm-legalize-for-export`

: Legalize LLVM dialect to be convertible to LLVM IR ¶

## `memref` Dialect Passes ¶

`-fold-memref-subview-ops`

: Fold memref.subview ops into consumer load/store ops ¶

The pass folds loading/storing from/to subview ops to loading/storing from/to the original memref.
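
The index arithmetic behind such a fold can be sketched in Python: an access at index `i` of a subview with a given offset and stride corresponds to index `offset + i * stride` of the original memref, per dimension. The helper name `fold_subview_indices` is hypothetical, and the sketch assumes static offsets and strides and no rank reduction.

```python
def fold_subview_indices(offsets, strides, indices):
    """Map indices into a subview back to indices into the source memref."""
    return [off + idx * stride
            for off, stride, idx in zip(offsets, strides, indices)]


# Cross-check against Python slicing on a small 2-D "memref".
source = [[r * 100 + c for c in range(20)] for r in range(10)]
subview = [row[3::2] for row in source[2:]]  # offsets [2, 3], strides [1, 2]

i, j = fold_subview_indices([2, 3], [1, 2], [4, 5])
print(subview[4][5] == source[i][j])
```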

`-resolve-ranked-shaped-type-result-dims`

: Resolve memref.dim of result values of ranked shape type ¶

The pass resolves memref.dim of the result of operations that implement the `ReifyRankedShapedTypeOpInterface` in terms of the shapes of their operands.

`-resolve-shaped-type-result-dims`

: Resolve memref.dim of result values ¶

The pass resolves memref.dim of the result of operations that implement the `InferShapedTypeOpInterface` or `ReifyRankedShapedTypeOpInterface` in terms of the shapes of their operands.

## `quant` Dialect Passes ¶

`-quant-convert-const`

: Converts constants followed by qbarrier to actual quantized values ¶

`-quant-convert-simulated-quantization`

: Converts training-time simulated quantization ops to corresponding quantize/dequantize casts ¶

## Reducer Passes ¶

`-opt-reduction-pass`

: A wrapper pass that reduces the file with optimization passes ¶

#### Options ¶

```
-opt-pass : The optimization passes used for reduction, e.g., symbol-dce
-test : The location of the tester which tests the file interestingness
-test-arg : arguments of the tester
```

`-reduction-tree`

: Reduce the input with reduction-tree algorithm ¶

#### Options ¶

```
-traversal-mode : The graph traversal mode, the default is single-path mode
-test : The location of the tester which tests the file interestingness
-test-arg : arguments of the tester
```

## `scf` Dialect Passes ¶

`-for-loop-range-folding`

: Fold add/mul ops into loop range ¶
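
One way to picture this kind of folding, sketched in Python under the assumption of a positive step and a positive multiplier: if the induction variable is only ever used as `i * scale + shift`, the multiplication and addition can be absorbed into the loop bounds and step. The function names are hypothetical, and the exact conditions the pass checks are not described here.

```python
def original(lb, ub, step, scale, shift):
    """Loop whose body immediately scales and shifts the induction variable."""
    return [i * scale + shift for i in range(lb, ub, step)]


def folded(lb, ub, step, scale, shift):
    """Same values, with the mul/add folded into the loop range."""
    return list(range(lb * scale + shift, ub * scale + shift, step * scale))


print(original(0, 10, 3, 4, 5) == folded(0, 10, 3, 4, 5))
```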

`-for-loop-specialization`

: Specialize `for` loops for vectorization ¶

`-parallel-loop-fusion`

: Fuse adjacent parallel loops ¶

`-parallel-loop-specialization`

: Specialize parallel loops for vectorization ¶

`-parallel-loop-tiling`

: Tile parallel loops ¶

#### Options ¶

```
-parallel-loop-tile-sizes : Factors to tile parallel loops by
```

`-scf-bufferize`

: Bufferize the scf dialect. ¶

## `shape` Dialect Passes ¶

`-remove-shape-constraints`

: Replace all cstr_ ops with a true witness ¶

`-shape-bufferize`

: Bufferize the shape dialect. ¶

`-shape-to-shape-lowering`

: Legalize Shape dialect to be convertible to Standard ¶

## `sparse_tensor` Dialect Passes ¶

`-sparse-tensor-conversion`

: Apply conversion rules to sparse tensors ¶

`-sparsification`

: Automatically generate sparse tensor code from annotations ¶

#### Options ¶

```
-parallelization-strategy : Set the parallelization strategy
-vectorization-strategy : Set the vectorization strategy
-vl : Set the vector length
-enable-simd-index32 : Enable i32 indexing into vectors (for efficiency)
```

## `spv` Dialect Passes ¶

`-decorate-spirv-composite-type-layout`

: Decorate SPIR-V composite type with layout info ¶

`-spirv-lower-abi-attrs`

: Lower SPIR-V ABI attributes ¶

`-spirv-rewrite-inserts`

: Rewrite sequential chains of spv.CompositeInsert operations into spv.CompositeConstruct operations ¶

`-spirv-update-vce`

: Deduce and attach minimal (version, capabilities, extensions) requirements to spv.module ops ¶

## `standard` Dialect Passes ¶

`-func-bufferize`

: Bufferize func/call/return ops ¶

A bufferize pass that bufferizes std.func and std.call ops.

Because this pass updates std.func ops, it must be a module pass. It is useful to keep this pass separate from other bufferizations so that the other ones can be run at function-level in parallel.

This pass must be done atomically because it changes func op signatures, which requires atomically updating calls as well throughout the entire module.

This pass also changes the type of block arguments, which requires that all successor arguments of predecessors be converted. This is achieved by rewriting terminators based on the information provided by the `BranchOpInterface`. As this pass rewrites function operations, it also rewrites the corresponding return operations. Other return-like operations that implement the `ReturnLike` trait are not rewritten in general, as they require that the corresponding parent operation is also rewritten. Finally, this pass fails for unknown terminators, as we cannot decide whether they need rewriting.

`-std-bufferize`

: Bufferize the std dialect ¶

`-std-expand`

: Legalize std operations to be convertible to LLVM. ¶

`-tensor-constant-bufferize`

: Bufferize tensor constants. ¶

This pass bufferizes tensor constants.

This pass needs to be a module pass because it inserts memref.global ops into the module, which cannot be done safely from a function pass due to multi-threading. Most other bufferization passes can run in parallel at function granularity.

## `tensor` Dialect Passes ¶

`-tensor-bufferize`

: Bufferize the `tensor` dialect ¶

## TOSA Dialect Passes ¶

`-tosa-infer-shapes`

: Propagate shapes across TOSA operations ¶

Pass that uses operand types and propagates shapes to TOSA operations. This includes legalizing rankless and dynamic shapes towards static.

`-tosa-make-broadcastable`

: TOSA rank Reshape to enable Broadcasting ¶

Pass that enables broadcast by making all input arrays have the same number of dimensions. It inserts RESHAPE operations to prepend dimensions of size one until the number of dimensions is equal. It implements an approach similar to step 1 of NumPy 4-step broadcasting: https://numpy.org/doc/stable/reference/ufuncs.html#broadcasting
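
The rank-equalizing step can be sketched on plain shape tuples in Python. The helper name `make_same_rank` is hypothetical and the code only models the shape computation, not the insertion of RESHAPE operations.

```python
def make_same_rank(shape_a, shape_b):
    """Prepend size-one dimensions to the lower-rank shape."""
    rank = max(len(shape_a), len(shape_b))

    def pad(shape):
        return (1,) * (rank - len(shape)) + tuple(shape)

    return pad(shape_a), pad(shape_b)


print(make_same_rank((2, 3, 4), (4,)))
```

After this step both operands have the same number of dimensions, so any remaining broadcasting only has to stretch dimensions of size one.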