MLIR

Multi-Level IR Compiler Framework

Using `mlir-opt`

mlir-opt is a command-line entry point for running passes and lowerings on MLIR code. This tutorial will explain how to use mlir-opt, show some examples of its usage, and mention some useful tips for working with it.

Prerequisites:

  • Build the MLIR project from source (see the MLIR getting-started documentation); the examples below assume the build output is in build/bin.

mlir-opt basics 

The mlir-opt tool loads textual or bytecode IR into an in-memory structure and optionally executes a sequence of passes before serializing the IR back out (in textual form by default). It is intended as a testing and debugging utility.

After building the MLIR project, the mlir-opt binary (located in build/bin) is the entry point for running passes and lowerings, as well as emitting debug and diagnostic data.

Running mlir-opt with no flags will consume textual or bytecode IR from the standard input, parse and run verifiers on it, and write the textual format back to the standard output. This is a good way to test if an input MLIR is well-formed.
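For example, round-tripping one of the test inputs used later in this tutorial looks like the following (a sketch, assuming the llvm-project base directory and a completed build):

```bash
# Parse, verify, and print the IR back out; a nonzero exit code
# means the input failed to parse or verify.
build/bin/mlir-opt mlir/test/Examples/mlir-opt/ctlz.mlir

# The same, reading from standard input:
build/bin/mlir-opt < mlir/test/Examples/mlir-opt/ctlz.mlir
```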

mlir-opt --help shows a complete list of flags (there are nearly 1000). Each pass has its own flag, though it is recommended to use --pass-pipeline to run passes rather than bare flags.
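Because the flag list is so long, filtering it with grep is a practical way to find the flag or options for a particular pass (a usage sketch, not a feature of mlir-opt itself):

```bash
# Find help entries related to loop fusion.
build/bin/mlir-opt --help | grep fusion
```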

Running a pass 

Next we run the convert-math-to-llvm pass, which converts operations from the math dialect to the llvm dialect, on the following IR:

// mlir/test/Examples/mlir-opt/ctlz.mlir
module {
  func.func @main(%arg0: i32) -> i32 {
    %0 = math.ctlz %arg0 : i32
    func.return %0 : i32
  }
}

After building MLIR, and from the llvm-project base directory, run

build/bin/mlir-opt --pass-pipeline="builtin.module(convert-math-to-llvm)" mlir/test/Examples/mlir-opt/ctlz.mlir

which produces

module {
  func.func @main(%arg0: i32) -> i32 {
    %0 = "llvm.intr.ctlz"(%arg0) <{is_zero_poison = false}> : (i32) -> i32
    return %0 : i32
  }
}

Note that llvm here is MLIR’s llvm dialect, which would still need to be processed through mlir-translate to generate LLVM-IR.
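One possible way to finish the job is to lower the remaining func dialect ops with convert-to-llvm and pipe the result through mlir-translate (a sketch; the --mlir-to-llvmir flag is mlir-translate's LLVM IR emitter):

```bash
# Lower everything to the llvm dialect, then emit LLVM IR proper.
build/bin/mlir-opt --pass-pipeline="builtin.module(convert-to-llvm)" \
    mlir/test/Examples/mlir-opt/ctlz.mlir \
  | build/bin/mlir-translate --mlir-to-llvmir
```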

Running a pass with options 

Next we will show how to run a pass that takes configuration options. Consider the following IR containing loops with poor cache locality.

// mlir/test/Examples/mlir-opt/loop_fusion.mlir
module {
  func.func @producer_consumer_fusion(%arg0: memref<10xf32>, %arg1: memref<10xf32>) {
    %0 = memref.alloc() : memref<10xf32>
    %1 = memref.alloc() : memref<10xf32>
    %cst = arith.constant 0.000000e+00 : f32
    affine.for %arg2 = 0 to 10 {
      affine.store %cst, %0[%arg2] : memref<10xf32>
      affine.store %cst, %1[%arg2] : memref<10xf32>
    }
    affine.for %arg2 = 0 to 10 {
      %2 = affine.load %0[%arg2] : memref<10xf32>
      %3 = arith.addf %2, %2 : f32
      affine.store %3, %arg0[%arg2] : memref<10xf32>
    }
    affine.for %arg2 = 0 to 10 {
      %2 = affine.load %1[%arg2] : memref<10xf32>
      %3 = arith.mulf %2, %2 : f32
      affine.store %3, %arg1[%arg2] : memref<10xf32>
    }
    return
  }
}

Running this with the affine-loop-fusion pass produces a fused loop.

build/bin/mlir-opt --pass-pipeline="builtin.module(affine-loop-fusion)" mlir/test/Examples/mlir-opt/loop_fusion.mlir
module {
  func.func @producer_consumer_fusion(%arg0: memref<10xf32>, %arg1: memref<10xf32>) {
    %alloc = memref.alloc() : memref<1xf32>
    %alloc_0 = memref.alloc() : memref<1xf32>
    %cst = arith.constant 0.000000e+00 : f32
    affine.for %arg2 = 0 to 10 {
      affine.store %cst, %alloc[0] : memref<1xf32>
      affine.store %cst, %alloc_0[0] : memref<1xf32>
      %0 = affine.load %alloc_0[0] : memref<1xf32>
      %1 = arith.mulf %0, %0 : f32
      affine.store %1, %arg1[%arg2] : memref<10xf32>
      %2 = affine.load %alloc[0] : memref<1xf32>
      %3 = arith.addf %2, %2 : f32
      affine.store %3, %arg0[%arg2] : memref<10xf32>
    }
    return
  }
}

This pass has options that allow the user to configure its behavior. For example, the fusion-compute-tolerance option is described as the “fractional increase in additional computation tolerated while fusing.” If this value is set to zero on the command line, the pass will not fuse the loops.

build/bin/mlir-opt --pass-pipeline="builtin.module(affine-loop-fusion{fusion-compute-tolerance=0})" \
mlir/test/Examples/mlir-opt/loop_fusion.mlir
module {
  func.func @producer_consumer_fusion(%arg0: memref<10xf32>, %arg1: memref<10xf32>) {
    %alloc = memref.alloc() : memref<10xf32>
    %alloc_0 = memref.alloc() : memref<10xf32>
    %cst = arith.constant 0.000000e+00 : f32
    affine.for %arg2 = 0 to 10 {
      affine.store %cst, %alloc[%arg2] : memref<10xf32>
      affine.store %cst, %alloc_0[%arg2] : memref<10xf32>
    }
    affine.for %arg2 = 0 to 10 {
      %0 = affine.load %alloc[%arg2] : memref<10xf32>
      %1 = arith.addf %0, %0 : f32
      affine.store %1, %arg0[%arg2] : memref<10xf32>
    }
    affine.for %arg2 = 0 to 10 {
      %0 = affine.load %alloc_0[%arg2] : memref<10xf32>
      %1 = arith.mulf %0, %0 : f32
      affine.store %1, %arg1[%arg2] : memref<10xf32>
    }
    return
  }
}

Options passed to a pass are specified via the syntax {option1=value1 option2=value2 ...}, i.e., use space-separated key=value pairs for each option.
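As a sketch of passing multiple options at once: fusion-maximal below is another option of the same pass (consult --help for the authoritative option list of any pass):

```bash
build/bin/mlir-opt \
  --pass-pipeline="builtin.module(affine-loop-fusion{fusion-compute-tolerance=0.5 fusion-maximal=true})" \
  mlir/test/Examples/mlir-opt/loop_fusion.mlir
```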

Building a pass pipeline on the command line 

The --pass-pipeline flag supports combining multiple passes into a pipeline. So far we have used the trivial pipeline with a single pass that is “anchored” on the top-level builtin.module op. Pass anchoring is a way for passes to specify that they only run on particular ops. While many passes are anchored on builtin.module, if you try to run a pass that is anchored on some other op inside --pass-pipeline="builtin.module(pass-name)", it will not run.

Multiple passes can be chained together by providing the pass names in a comma-separated list in the --pass-pipeline string, e.g., --pass-pipeline="builtin.module(pass1,pass2)". The passes will be run sequentially.
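For example, a pipeline that fuses loops and then cleans up the result with cse and canonicalize (both run at the module level here) could look like:

```bash
build/bin/mlir-opt \
  --pass-pipeline="builtin.module(affine-loop-fusion,cse,canonicalize)" \
  mlir/test/Examples/mlir-opt/loop_fusion.mlir
```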

To use passes that have nontrivial anchoring, the appropriate level of nesting must be specified in the pass pipeline. For example, consider the following IR, which contains the same redundant code at two different levels of nesting.

module {
  module {
    func.func @func1(%arg0: i32) -> i32 {
      %0 = arith.addi %arg0, %arg0 : i32
      %1 = arith.addi %arg0, %arg0 : i32
      %2 = arith.addi %0, %1 : i32
      func.return %2 : i32
    }
  }

  gpu.module @gpu_module {
    gpu.func @func2(%arg0: i32) -> i32 {
      %0 = arith.addi %arg0, %arg0 : i32
      %1 = arith.addi %arg0, %arg0 : i32
      %2 = arith.addi %0, %1 : i32
      gpu.return %2 : i32
    }
  }
}

The following pipeline runs cse (common subexpression elimination) and canonicalize on each func.func nested inside the inner builtin.module, then runs convert-to-llvm on that inner module. Assuming the IR above is saved to a file (called nested.mlir here for illustration), run

build/bin/mlir-opt nested.mlir --pass-pipeline='
    builtin.module(
        builtin.module(
            func.func(cse,canonicalize),
            convert-to-llvm
        )
    )'

The output leaves the gpu.module untouched:

module {
  module {
    llvm.func @func1(%arg0: i32) -> i32 {
      %0 = llvm.add %arg0, %arg0 : i32
      %1 = llvm.add %0, %0 : i32
      llvm.return %1 : i32
    }
  }
  gpu.module @gpu_module {
    gpu.func @func2(%arg0: i32) -> i32 {
      %0 = arith.addi %arg0, %arg0 : i32
      %1 = arith.addi %arg0, %arg0 : i32
      %2 = arith.addi %0, %1 : i32
      gpu.return %2 : i32
    }
  }
}

Specifying a pass pipeline with nested anchoring is also beneficial for performance reasons: passes with anchoring can run on IR subsets in parallel, which provides better threaded runtime and cache locality within threads. For example, even if a pass is not restricted to anchor on func.func, running builtin.module(func.func(cse, canonicalize)) is more efficient than builtin.module(cse, canonicalize).

For a spec of the pass-pipeline textual description language, see the docs. For more general information on pass management, see Pass Infrastructure.

Useful CLI flags 

  • --debug prints all debug information produced by LLVM_DEBUG calls.
  • --debug-only="my-tag" prints only the debug information produced by LLVM_DEBUG in files that have the macro #define DEBUG_TYPE "my-tag". This often allows you to print only debug information associated with a specific pass.
    • "greedy-rewriter" only prints debug information for patterns applied with the greedy rewriter engine.
    • "dialect-conversion" only prints debug information for the dialect conversion framework.
  • --emit-bytecode emits MLIR in the bytecode format.
  • --mlir-pass-statistics prints statistics about the passes run. These are generated via pass statistics.
  • --mlir-print-ir-after-all prints the IR after each pass.
    • See also --mlir-print-ir-after-change, --mlir-print-ir-after-failure, and analogous versions of these flags with before instead of after.
    • When using print-ir flags, adding --mlir-print-ir-tree-dir writes the IRs to files in a directory tree, making them easier to inspect versus a large dump to the terminal.
  • --mlir-timing displays execution times of each pass.
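These flags compose. For example, the following combination (an illustrative sketch using the flags above; /tmp/mlir-dumps is an arbitrary output directory) dumps the IR after every pass into a directory tree and reports per-pass timing:

```bash
build/bin/mlir-opt \
  --pass-pipeline="builtin.module(convert-math-to-llvm)" \
  --mlir-print-ir-after-all \
  --mlir-print-ir-tree-dir=/tmp/mlir-dumps \
  --mlir-timing \
  mlir/test/Examples/mlir-opt/ctlz.mlir
```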

Further reading