MLIR

Multi-Level IR Compiler Framework

Dialect 'gpu' definition

Operation definition

gpu.all_reduce (gpu::AllReduceOp)

Reduce values among workgroup.

Description:

The “all_reduce” op reduces the value of every work item across a local workgroup. The result is equal for all work items of a workgroup.

For example, both

  %1 = "gpu.all_reduce"(%0) ({}) { op = "add" } : (f32) -> (f32)
  %2 = "gpu.all_reduce"(%0) ({
  ^bb(%lhs : f32, %rhs : f32):
    %sum = addf %lhs, %rhs : f32
    "gpu.yield"(%sum) : (f32) -> ()
  }) : (f32) -> (f32)

compute the sum of each work item’s %0 value. The first version specifies the accumulation as operation, whereas the second version specifies the accumulation as code region. The accumulation operation must either be add or mul.

Either none or all work items of a workgroup need to execute this op in convergence.

Operands:

  1. value: any type

Attributes:

AttributeMLIR TypeDescription
opStringAttrbuilt-in reduction operations supported by gpu.allreduce. attribute

Results:

  1. «unnamed»: any type

gpu.barrier (gpu::BarrierOp)

Synchronizes all work items of a workgroup.

Description:

The “barrier” op synchronizes all work items of a workgroup. It is used to coordinate communication between the work items of the workgroup.

  gpu.barrier

waits until all work items in the workgroup have reached this point and all memory accesses made by these work items prior to the op are visible to all work items in the workgroup. Data hazards between work items accessing the same memory can be avoided by synchronizing work items in-between these accesses.

Either none or all work items of a workgroup need to execute this op in convergence.

Operands:

Attributes:

Results:

gpu.block_dim (gpu::BlockDimOp)

Description:

Operands:

Attributes:

AttributeMLIR TypeDescription
dimensionStringAttrstring attribute attribute

Results:

  1. «unnamed»: index

gpu.block_id (gpu::BlockIdOp)

Description:

Operands:

Attributes:

AttributeMLIR TypeDescription
dimensionStringAttrstring attribute attribute

Results:

  1. «unnamed»: index

gpu.func (gpu::GPUFuncOp)

Function executable on a GPU

Description:

Defines a function that can be executed on a GPU. This supports memory attribution and its body has a particular execution model.

GPU functions are either kernels (as indicated by the kernel attribute) or regular functions. The former can be launched from the host side, while the latter are device side only.

The memory attribution defines SSA values that correspond to memory buffers allocated in the memory hierarchy of the GPU (see below).

The operation has one attached region that corresponds to the body of the function. The region arguments consist of the function arguments without modification, followed by buffers defined in memory annotations. The body of a GPU function, when launched, is executed by multiple work items. There are no guarantees on the order in which work items execute, or on the connection between them. In particular, work items are not necessarily executed in lock-step. Synchronization ops such as “gpu.barrier” should be used to coordinate work items. Declarations of GPU functions, i.e. not having the body region, are not supported.

Syntax:

op ::= `gpu.func` symbol-ref-id `(` argument-list `)` (`->`
function-result-list)?
       memory-attribution `kernel`? function-attributes? region

memory-attribution ::= (`workgroup` `(` ssa-id-and-type-list `)`)?
                       (`private` `(` ssa-id-and-type-list `)`)?

Example:

gpu.func @foo(%arg0: index)
    workgroup(%workgroup: memref<32xf32, 3>)
    private(%private: memref<1xf32, 5>)
    kernel
    attributes {qux: "quux"} {
  gpu.return
}

The generic form illustrates the concept

"gpu.func"(%arg: index) {sym_name: "foo", kernel, qux: "quux"} ({
^bb0(%arg0: index, %workgroup: memref<32xf32, 3>,
     %private: memref<1xf32, 5>):
  "gpu.return"() : () -> ()
}) : (index) -> ()

Note the non-default memory spaces used in memref types in memory attribution.

Operands:

Attributes:

Results:

gpu.module (gpu::GPUModuleOp)

A top level compilation unit containing code to be run on a GPU.

Description:

GPU module contains code that is intended to be run on a GPU. A host device can launch this code through a gpu.launc_func that creates a fully qualified symbol through the gpu.module’s symbol and a gpu.func symbol contained in the gpu.module.

The module’s top-level scope is modeled by a single region with a single block. GPU modules are required to have a name that is used for symbol resolution by the gpu.launch_func operation.

Using an op with a region to define a GPU module enables “embedding” GPU modules with SIMT execution models in other dialects in a clean manner and allows filtering of code regions to execute passes on only code intended to or not intended to be run on the separate device.

  gpu.module @symbol_name {
  gpu.func {}
    ...
  gpu.module_end
}

Operands:

Attributes:

Results:

gpu.grid_dim (gpu::GridDimOp)

Description:

Operands:

Attributes:

AttributeMLIR TypeDescription
dimensionStringAttrstring attribute attribute

Results:

  1. «unnamed»: index

gpu.launch_func (gpu::LaunchFuncOp)

Launches a function as a GPU kerneel

Description:

Launch a kernel function on the specified grid of thread blocks. gpu.launch operations are lowered to gpu.launch_func operations by outlining the kernel body into a function in a dedicated module, which reflects the separate compilation process. The kernel function is required to have the gpu.kernel attribute. The module containing the kernel function is required to have the gpu.kernel_module attribute and must be named. And finally, the module containing the kernel module (which thus cannot be the top-level module) is required to have the gpu.container_module attribute. The gpu.launch_func operation has a string attribute named kernel to specify the name of the kernel function to launch and an attribute named kernel_module to specify the name of the module containing that kernel function.

The operation takes at least six operands, with the first three operands being grid sizes along x,y,z dimensions and the following three being block sizes along x,y,z dimensions. When a lower-dimensional kernel is required, unused sizes must be explicitly set to 1. The remaining operands are passed as arguments to the kernel function.

A custom syntax for this operation is currently not available.

Example:

module attributes {gpu.container_module} {

  // This module creates a separate compilation unit for the GPU compiler.
  module @kernels attributes {gpu.kernel_module} {
    func @kernel_1(%arg0 : f32, %arg1 : !llvm<"float*">)
        attributes { nvvm.kernel = true } {

      // Operations that produce block/thread IDs and dimensions are
      // injected when outlining the `gpu.launch` body to a function called
      // by `gpu.launch_func`.
      %tIdX = "gpu.thread_id"() {dimension = "x"} : () -> (index)
      %tIdY = "gpu.thread_id"() {dimension = "y"} : () -> (index)
      %tIdZ = "gpu.thread_id"() {dimension = "z"} : () -> (index)

      %bDimX = "gpu.block_dim"() {dimension = "x"} : () -> (index)
      %bDimY = "gpu.block_dim"() {dimension = "y"} : () -> (index)
      %bDimZ = "gpu.block_dim"() {dimension = "z"} : () -> (index)

      %bIdX = "gpu.block_id"() {dimension = "x"} : () -> (index)
      %bIdY = "gpu.block_id"() {dimension = "y"} : () -> (index)
      %bIdZ = "gpu.block_id"() {dimension = "z"} : () -> (index)

      %gDimX = "gpu.grid_dim"() {dimension = "x"} : () -> (index)
      %gDimY = "gpu.grid_dim"() {dimension = "y"} : () -> (index)
      %gDimZ = "gpu.grid_dim"() {dimension = "z"} : () -> (index)

      "some_op"(%bx, %tx) : (index, index) -> ()
      %42 = load %arg1[%bx] : memref<?xf32, 1>
    }
  }

  "gpu.launch_func"(%cst, %cst, %cst,  // Grid sizes.
                    %cst, %cst, %cst,  // Block sizes.
                    %arg0, %arg1)      // Arguments passed to the kernel.
        { kernel_module = @kernels,    // Module containing the kernel.
          kernel = "kernel_1" }        // Kernel function.
        : (index, index, index, index, index, index, f32, !llvm<"float*">)
          -> ()
}

Operands:

  1. gridSizeX: integer, index or LLVM dialect equivalent
  2. gridSizeY: integer, index or LLVM dialect equivalent
  3. gridSizeZ: integer, index or LLVM dialect equivalent
  4. blockSizeX: integer, index or LLVM dialect equivalent
  5. blockSizeY: integer, index or LLVM dialect equivalent
  6. blockSizeZ: integer, index or LLVM dialect equivalent
  7. operands: any type

Attributes:

Results:

gpu.launch (gpu::LaunchOp)

GPU kernel launch operation

Description:

Launch a kernel on the specified grid of thread blocks. The body of the kernel is defined by the single region that this operation contains. The operation takes six operands, with first three operands being grid sizes along x,y,z dimensions and the following three arguments being block sizes along x,y,z dimension. When a lower-dimensional kernel is required, unused sizes must be explicitly set to 1.

The body region has twelve arguments, grouped as follows:

  • three arguments that contain block identifiers along x,y,z dimensions;
  • three arguments that contain thread identifiers along x,y,z dimensions;
  • operands of the gpu.launch operation as is (i.e. the operands for grid and block sizes).

Syntax:

operation ::= `gpu.launch` `block` `(` ssa-id-list `)` `in` ssa-reassignment
                         `threads` `(` ssa-id-list `)` `in` ssa-reassignment
                           region attr-dict?
ssa-reassignment ::= `(` ssa-id `=` ssa-use (`,` ssa-id `=` ssa-use)* `)`

Example:

gpu.launch blocks(%bx, %by, %bz) in (%sz_bx = %0, %sz_by = %1, %sz_bz = %2)
           threads(%tx, %ty, %tz) in (%sz_tx = %3, %sz_ty = %4, %sz_tz = %5) {
  // Block and thread identifiers, as well as block/grid sizes are
  // immediately usable inside body region.
  "some_op"(%bx, %tx) : (index, index) -> ()
  // Assuming %val1 is defined outside the gpu.launch region.
  %42 = load %val1[%bx] : memref<?xf32, 1>
}

// Generic syntax explains how the pretty syntax maps to the IR structure.
"gpu.launch"(%cst, %cst, %c1,  // Grid sizes.
             %cst, %c1, %c1)   // Block sizes.

    {/*attributes*/}
    // All sizes and identifiers have "index" size.
    : (index, index, index, index, index, index) -> () {
// The operation passes block and thread identifiers, followed by grid and
// block sizes.
^bb0(%bx : index, %by : index, %bz : index,
     %tx : index, %ty : index, %tz : index,
     %num_bx : index, %num_by : index, %num_bz : index,
     %num_tx : index, %num_ty : index, %num_tz : index)
  "some_op"(%bx, %tx) : (index, index) -> ()
  %3 = "std.load"(%val1, %bx) : (memref<?xf32, 1>, index) -> f32
}

Rationale: using operation/block arguments gives analyses a clear way of understanding that a value has additional semantics (e.g., we will need to know what value corresponds to threadIdx.x for coalescing). We can recover these properties by analyzing the operations producing values, but it is easier just to have that information by construction.

Operands:

  1. gridSizeX: index
  2. gridSizeY: index
  3. gridSizeZ: index
  4. blockSizeX: index
  5. blockSizeY: index
  6. blockSizeZ: index

Attributes:

Results:

gpu.module_end (gpu::ModuleEndOp)

A pseudo op that marks the end of a gpu.module.

Description:

This op terminates the only block inside the only region of a gpu.module.

Operands:

Attributes:

Results:

gpu.return (gpu::ReturnOp)

Terminator for GPU functions.

Description:

A terminator operation for regions that appear in the body of gpu.func functions. The operands to the gpu.return are the result values returned by an incovation of the gpu.func.

Operands:

  1. operands: any type

Attributes:

Results:

gpu.shuffle (gpu::ShuffleOp)

Shuffles values within a subgroup.

Description:

The “shuffle” op moves values to a different invocation within the same subgroup.

For example

  %1, %2 = gpu.shuffle %0, %offset, %width xor : f32

for lane k returns the value from lane k ^ offset and true if that lane is smaller than %width. Otherwise it returns an unspecified value and false. A lane is the index of an invocation relative to its subgroup.

The width specifies the number of invocations that participate in the shuffle. The width needs to be the same for all invocations that participate in the shuffle. Exactly the first width invocations of a subgroup need to execute this op in convergence.

Operands:

  1. value: any type
  2. offset: 32-bit integer
  3. width: 32-bit integer

Attributes:

AttributeMLIR TypeDescription
modeStringAttrIndexing modes supported by gpu.shuffle. attribute

Results:

  1. result: any type
  2. valid: 1-bit integer

gpu.terminator (gpu::TerminatorOp)

Terminator for GPU launch regions.

Description:

A terminator operation for regions that appear in the body of gpu.launch operation. These regions are not expected to return any value so the terminator takes no operands.

Operands:

Attributes:

Results:

gpu.thread_id (gpu::ThreadIdOp)

Description:

Operands:

Attributes:

AttributeMLIR TypeDescription
dimensionStringAttrstring attribute attribute

Results:

  1. «unnamed»: index

gpu.yield (gpu::YieldOp)

GPU yield operation

Description:

“gpu.yield” is a special terminator operation for blocks inside regions in gpu ops. It returns values to the immediately enclosing gpu op.

Example:

Operands:

  1. values: any type

Attributes:

Results: