'gpu' Dialect
Note: this dialect is more likely to change than others in the near future; use with caution.
This dialect provides middle-level abstractions for launching GPU kernels
following a programming model similar to that of CUDA or OpenCL. It provides
abstractions for kernel invocations (and may eventually provide those for device
management) that are not present at the lower level (e.g., as LLVM IR intrinsics
for GPUs). Its goal is to abstract away device- and driver-specific
manipulations to launch a GPU kernel and provide a simple path towards GPU
execution from MLIR. It may be targeted, for example, by DSLs using MLIR. The
dialect uses `gpu` as its canonical prefix.
Memory attribution ¶
Memory buffers are defined at the function level, either in “gpu.launch” or in “gpu.func” ops. This encoding makes it clear where the memory belongs and makes the lifetime of the memory visible. The memory is only accessible while the kernel is being executed / the function is being invoked. This is stricter than what actual GPU implementations guarantee, but using static memory at the function level is purely a convenience. It is also always possible to pass pointers to the workgroup memory into other functions, provided they expect the correct memory space.
The buffers are considered live throughout the execution of the GPU function
body. The absence of memory attribution syntax means that the function does not
require special buffers. Rationale: although the underlying models declare
memory buffers at the module level, we chose to do it at the function level to
provide some structuring for the lifetime of those buffers; this avoids the
incentive to use the buffers for communicating between different kernels or
launches of the same kernel, which should be done through function arguments
instead; we chose not to use an `alloca`-style approach that would require more
complex lifetime analysis, following the principles of MLIR that promote
structure and representing analysis results in the IR.
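A hedged sketch of these rules in action; the `@helper` function, buffer sizes, and the use of `func.func`/`func.call` for the device-side callee are illustrative assumptions:
gpu.module @kernels {
  // Hypothetical device-side helper that expects a buffer in workgroup
  // memory (memory space 3).
  func.func @helper(%buf : memref<32xf32, 3>) {
    return
  }

  gpu.func @kernel() workgroup(%shared : memref<32xf32, 3>) kernel {
    // %shared is live for the duration of this invocation and may be
    // passed to functions that expect the matching memory space.
    func.call @helper(%shared) : (memref<32xf32, 3>) -> ()
    gpu.return
  }
}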
Operations ¶
gpu.all_reduce
(::mlir::gpu::AllReduceOp) ¶
Reduce values among workgroup.
Syntax:
operation ::= `gpu.all_reduce` custom<AllReduceOperation>($op) $value
(`uniform` $uniform^)? $body attr-dict
`:` functional-type(operands, results)
The `all_reduce` op reduces the value of every work item across a local workgroup. The result is equal for all work items of a workgroup.
For example, both
%1 = gpu.all_reduce add %0 {} : (f32) -> (f32)
%2 = gpu.all_reduce %0 {
^bb(%lhs : f32, %rhs : f32):
  %sum = arith.addf %lhs, %rhs : f32
  "gpu.yield"(%sum) : (f32) -> ()
} : (f32) -> (f32)
compute the sum of each work item’s %0 value. The first version specifies the accumulation as an operation, whereas the second version specifies the accumulation as a code region. The accumulation operation must be one of: `add`, `and`, `max`, `min`, `mul`, `or`, `xor`.
If the `uniform` flag is set, either none or all work items of a workgroup need to execute this op in convergence.
Traits: IsolatedFromAbove, SameOperandsAndResultType
Interfaces: InferTypeOpInterface
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
op | ::mlir::gpu::AllReduceOperationAttr | built-in reduction operations supported by gpu.allreduce. |
uniform | ::mlir::UnitAttr | unit attribute |
Operands: ¶
Operand | Description |
---|---|
value | any type |
Results: ¶
Result | Description |
---|---|
«unnamed» | any type |
gpu.alloc
(::mlir::gpu::AllocOp) ¶
GPU memory allocation operation.
Syntax:
operation ::= `gpu.alloc` custom<AsyncDependencies>(type($asyncToken), $asyncDependencies) (` ` `host_shared` $hostShared^)? ` `
`(` $dynamicSizes `)` (`` `[` $symbolOperands^ `]`)? attr-dict `:` type($memref)
The `gpu.alloc` operation allocates a region of memory on the GPU. It is similar to the `memref.alloc` op, but supports asynchronous GPU execution. The op does not execute before all async dependencies have finished executing.

If the `async` keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it also returns a `!gpu.async.token`.

If the `host_shared` keyword is present, the memory will be allocated in memory accessible both on host and on device.
Example:
%memref, %token = gpu.alloc async [%dep] host_shared (%width) : memref<64x?xf32, 1>
Traits: AttrSizedOperandSegments
Interfaces: GPU_AsyncOpInterface
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
hostShared | ::mlir::UnitAttr | unit attribute |
Operands: ¶
Operand | Description |
---|---|
asyncDependencies | async token type |
dynamicSizes | index |
symbolOperands | index |
Results: ¶
Result | Description |
---|---|
memref | memref of any type values |
asyncToken | async token type |
gpu.barrier
(::mlir::gpu::BarrierOp) ¶
Synchronizes all work items of a workgroup.
Syntax:
operation ::= `gpu.barrier` attr-dict
The “barrier” op synchronizes all work items of a workgroup. It is used to coordinate communication between the work items of the workgroup.
`gpu.barrier` waits until all work items in the workgroup have reached this point and all memory accesses made by these work items prior to the op are visible to all work items in the workgroup. Data hazards between work items accessing the same memory can be avoided by synchronizing work items in-between these accesses.
Either none or all work items of a workgroup need to execute this op in convergence.
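A hedged sketch of the intended usage, assuming %val, %neighbor, and the workgroup buffer %shared (attributed on the enclosing `gpu.func`) are defined earlier:
%tIdX = gpu.thread_id x
// Every work item publishes its value to its own slot of the workgroup buffer.
memref.store %val, %shared[%tIdX] : memref<32xf32, 3>
// Make all prior writes visible before any work item reads another slot.
gpu.barrier
%other = memref.load %shared[%neighbor] : memref<32xf32, 3>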
gpu.block_dim
(::mlir::gpu::BlockDimOp) ¶
Syntax:
operation ::= `gpu.block_dim` $dimension attr-dict
Returns the number of threads in the thread block (aka the block size) along the x, y, or z `dimension`.
Example:
%bDimX = gpu.block_dim x
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, InferIntRangeInterface, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
dimension | ::mlir::gpu::DimensionAttr | a dimension, either ‘x’, ‘y’, or ‘z’ |
Results: ¶
Result | Description |
---|---|
«unnamed» | index |
gpu.block_id
(::mlir::gpu::BlockIdOp) ¶
Syntax:
operation ::= `gpu.block_id` $dimension attr-dict
Returns the block id, i.e. the index of the current block within the grid along the x, y, or z `dimension`.
Example:
%bIdY = gpu.block_id y
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, InferIntRangeInterface, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
dimension | ::mlir::gpu::DimensionAttr | a dimension, either ‘x’, ‘y’, or ‘z’ |
Results: ¶
Result | Description |
---|---|
«unnamed» | index |
gpu.dealloc
(::mlir::gpu::DeallocOp) ¶
GPU memory deallocation operation
Syntax:
operation ::= `gpu.dealloc` custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$memref attr-dict `:` type($memref)
The `gpu.dealloc` operation frees the region of memory referenced by a memref which was originally created by the `gpu.alloc` operation. It is similar to the `memref.dealloc` op, but supports asynchronous GPU execution. The op does not execute before all async dependencies have finished executing.

If the `async` keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a `!gpu.async.token`.
Example:
%token = gpu.dealloc async [%dep] %memref : memref<8x64xf32, 1>
Interfaces: GPU_AsyncOpInterface
Operands: ¶
Operand | Description |
---|---|
asyncDependencies | async token type |
memref | memref of any type values |
Results: ¶
Result | Description |
---|---|
asyncToken | async token type |
gpu.func
(::mlir::gpu::GPUFuncOp) ¶
Function executable on a GPU
Defines a function that can be executed on a GPU. This supports memory attribution and its body has a particular execution model.
GPU functions are either kernels (as indicated by the `kernel` attribute) or regular functions. The former can be launched from the host side, while the latter are device side only.
The memory attribution defines SSA values that correspond to memory buffers allocated in the memory hierarchy of the GPU (see below).
The operation has one attached region that corresponds to the body of the function. The region arguments consist of the function arguments without modification, followed by buffers defined in memory annotations.

The body of a GPU function, when launched, is executed by multiple work items. There are no guarantees on the order in which work items execute, or on the connection between them. In particular, work items are not necessarily executed in lock-step. Synchronization ops such as “gpu.barrier” should be used to coordinate work items.

Declarations of GPU functions, i.e. not having the body region, are not supported.
A function may optionally be annotated with the block and/or grid sizes that will be used when it is launched, using the `gpu.known_block_size` and `gpu.known_grid_size` attributes, respectively. If set, these attributes must be arrays of three 32-bit integers giving the x, y, and z launch dimensions. Launching a kernel that has these annotations, or that calls a function with these annotations, using a block size or grid size other than what is specified is undefined behavior.
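For instance, a hedged sketch of a kernel annotated with a known block size; the dense-array attribute encoding shown here is an assumption that may differ across MLIR versions:
gpu.func @kernel_with_known_size() kernel attributes {
  gpu.known_block_size = array<i32: 128, 1, 1>
} {
  // Launching this kernel with any block size other than 128x1x1 is
  // undefined behavior.
  gpu.return
}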
Syntax:
op ::= `gpu.func` symbol-ref-id `(` argument-list `)` (`->`
function-result-list)?
memory-attribution `kernel`? function-attributes? region
memory-attribution ::= (`workgroup` `(` ssa-id-and-type-list `)`)?
(`private` `(` ssa-id-and-type-list `)`)?
Example:
gpu.func @foo(%arg0: index)
workgroup(%workgroup: memref<32xf32, 3>)
private(%private: memref<1xf32, 5>)
kernel
attributes {qux: "quux"} {
gpu.return
}
The generic form illustrates the concept:
"gpu.func"(%arg: index) {sym_name: "foo", kernel, qux: "quux"} ({
^bb0(%arg0: index, %workgroup: memref<32xf32, 3>,
%private: memref<1xf32, 5>):
"gpu.return"() : () -> ()
}) : (index) -> ()
Note the non-default memory spaces used in memref types in memory attribution.
Traits: AutomaticAllocationScope, HasParent
Interfaces: FunctionOpInterface, Symbol
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
function_type | ::mlir::TypeAttr | type attribute of function type |
arg_attrs | ::mlir::ArrayAttr | Array of dictionary attributes |
res_attrs | ::mlir::ArrayAttr | Array of dictionary attributes |
gpu.module
(::mlir::gpu::GPUModuleOp) ¶
A top level compilation unit containing code to be run on a GPU.
A GPU module contains code that is intended to be run on a GPU. A host device can launch this code through a `gpu.launch_func` operation, which refers to a fully qualified symbol composed of the `gpu.module`’s symbol and the symbol of a `gpu.func` contained in that module.
The module’s top-level scope is modeled by a single region with a single block. GPU modules are required to have a name that is used for symbol resolution by the gpu.launch_func operation.
Using an op with a region to define a GPU module enables “embedding” GPU modules with SIMT execution models in other dialects in a clean manner and allows filtering of code regions to execute passes on only code intended to or not intended to be run on the separate device.
gpu.module @symbol_name {
  gpu.func {}
  ...
  gpu.module_end
}
Traits: HasDefaultDLTIDataLayout, IsolatedFromAbove, SingleBlockImplicitTerminator
Interfaces: DataLayoutOpInterface, Symbol
gpu.global_id
(::mlir::gpu::GlobalIdOp) ¶
Syntax:
operation ::= `gpu.global_id` $dimension attr-dict
Returns the unique global workitem/thread id, i.e., the unique index of the current workitem/thread within all workgroups / grid along the x, y, or z `dimension`.
Example:
%gidX = gpu.global_id x
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, InferIntRangeInterface, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
dimension | ::mlir::gpu::DimensionAttr | a dimension, either ‘x’, ‘y’, or ‘z’ |
Results: ¶
Result | Description |
---|---|
«unnamed» | index |
gpu.grid_dim
(::mlir::gpu::GridDimOp) ¶
Syntax:
operation ::= `gpu.grid_dim` $dimension attr-dict
Returns the number of thread blocks in the grid along the x, y, or z `dimension`.
Example:
%gDimZ = gpu.grid_dim z
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, InferIntRangeInterface, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
dimension | ::mlir::gpu::DimensionAttr | a dimension, either ‘x’, ‘y’, or ‘z’ |
Results: ¶
Result | Description |
---|---|
«unnamed» | index |
gpu.host_register
(::mlir::gpu::HostRegisterOp) ¶
Registers a memref for access from device.
Syntax:
operation ::= `gpu.host_register` $value attr-dict `:` type($value)
This op maps the provided host buffer into the device address space.
This operation may not be supported in every environment; there is currently no way to check at runtime whether this feature is supported.
Writes from the host are guaranteed to be visible to device kernels that are launched afterwards. Writes from the device are guaranteed to be visible on the host after synchronizing with the device kernel completion.
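A minimal hedged sketch, assuming the memref dialect is available to produce the unranked memref this op expects:
%buf = memref.alloc() : memref<64xf32>
%cast = memref.cast %buf : memref<64xf32> to memref<*xf32>
// Map the host allocation into the device address space.
gpu.host_register %cast : memref<*xf32>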
Operands: ¶
Operand | Description |
---|---|
value | unranked.memref of any type values |
gpu.lane_id
(::mlir::gpu::LaneIdOp) ¶
Syntax:
operation ::= `gpu.lane_id` attr-dict
Returns the lane id within the subgroup (warp/wave).
Example:
%laneId = gpu.lane_id
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, InferIntRangeInterface, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Results: ¶
Result | Description |
---|---|
result | index |
gpu.launch_func
(::mlir::gpu::LaunchFuncOp) ¶
Launches a function as a GPU kernel
Syntax:
operation ::= `gpu.launch_func` custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$kernel
`blocks` `in` ` ` `(`$gridSizeX`,` $gridSizeY`,` $gridSizeZ`)`
`threads` `in` ` ` `(`$blockSizeX`,` $blockSizeY`,` $blockSizeZ`)`
(`dynamic_shared_memory_size` $dynamicSharedMemorySize^)?
custom<LaunchFuncOperands>($kernelOperands, type($kernelOperands)) attr-dict
Launch a kernel function on the specified grid of thread blocks.
`gpu.launch` operations are lowered to `gpu.launch_func` operations by outlining the kernel body into a function in a dedicated module, which reflects the separate compilation process. The kernel function is required to have the `gpu.kernel` attribute. The module containing the kernel function is required to be a `gpu.module`. And finally, the module containing the kernel module (which thus cannot be the top-level module) is required to have the `gpu.container_module` attribute. The `gpu.launch_func` operation has a symbol attribute named `kernel` to identify the fully specified kernel function to launch (both the gpu.module and func).

The `gpu.launch_func` supports async dependencies: the kernel does not start executing until the ops producing those async dependencies have completed.

By default, the host implicitly blocks until kernel execution has completed. If the `async` keyword is present, the host does not block; instead, a `!gpu.async.token` is returned. Other async GPU ops can take this token as a dependency.

The operation requires at least the grid and block sizes along the x,y,z dimensions as arguments. When a lower-dimensional kernel is required, unused sizes must be explicitly set to `1`.
The remaining operands are optional. The first optional operand corresponds to the amount of dynamic shared memory a kernel’s workgroup should be allocated; when this operand is not present, a zero size is assumed.
The remaining operands if present are passed as arguments to the kernel function.
Example:
module attributes {gpu.container_module} {
  // This module creates a separate compilation unit for the GPU compiler.
  gpu.module @kernels {
    func.func @kernel_1(%arg0 : f32, %arg1 : memref<?xf32, 1>)
        attributes { nvvm.kernel = true } {
      // Operations that produce block/thread IDs and dimensions are
      // injected when outlining the `gpu.launch` body to a function called
      // by `gpu.launch_func`.
      %tIdX = gpu.thread_id x
      %tIdY = gpu.thread_id y
      %tIdZ = gpu.thread_id z
      %bDimX = gpu.block_dim x
      %bDimY = gpu.block_dim y
      %bDimZ = gpu.block_dim z
      %bIdX = gpu.block_id x
      %bIdY = gpu.block_id y
      %bIdZ = gpu.block_id z
      %gDimX = gpu.grid_dim x
      %gDimY = gpu.grid_dim y
      %gDimZ = gpu.grid_dim z
      "some_op"(%bIdX, %tIdX) : (index, index) -> ()
      %42 = memref.load %arg1[%bIdX] : memref<?xf32, 1>
    }
  }

  %t0 = gpu.wait async
  gpu.launch_func
      async                           // (Optional) Don't block host, return token.
      [%t0]                           // (Optional) Execute only after %t0 has completed.
      @kernels::@kernel_1             // Kernel function.
      blocks in (%cst, %cst, %cst)    // Grid size.
      threads in (%cst, %cst, %cst)   // Block size.
      dynamic_shared_memory_size %s   // (Optional) Amount of dynamic shared
                                      // memory to allocate for a workgroup.
      args(%arg0 : f32,               // (Optional) Kernel arguments.
           %arg1 : memref<?xf32, 1>)
}
Traits: AttrSizedOperandSegments
Interfaces: GPU_AsyncOpInterface
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
kernel | ::mlir::SymbolRefAttr | symbol reference attribute |
Operands: ¶
Operand | Description |
---|---|
asyncDependencies | async token type |
gridSizeX | index |
gridSizeY | index |
gridSizeZ | index |
blockSizeX | index |
blockSizeY | index |
blockSizeZ | index |
dynamicSharedMemorySize | 32-bit signless integer |
kernelOperands | any type |
Results: ¶
Result | Description |
---|---|
asyncToken | async token type |
gpu.launch
(::mlir::gpu::LaunchOp) ¶
GPU kernel launch operation
Launch a kernel on the specified grid of thread blocks. The body of the kernel is defined by the single region that this operation contains. The operation takes an optional list of async dependencies followed by six operands and an optional operand.
The `async` keyword indicates the kernel should be launched asynchronously; the operation returns a new `!gpu.async.token` when the keyword is specified. The kernel launched does not start executing until the ops producing its async dependencies (optional operands) have completed.

The first three operands (following any async dependencies) are grid sizes along the x,y,z dimensions and the following three are block sizes along the x,y,z dimensions. When a lower-dimensional kernel is required, unused sizes must be explicitly set to `1`. The last operand is optional and corresponds to the amount of dynamic shared memory a kernel’s workgroup should be allocated; when this operand is not present, a zero size is assumed.
The body region has twelve arguments, grouped as follows:

- three arguments that contain block identifiers along the x,y,z dimensions;
- three arguments that contain thread identifiers along the x,y,z dimensions;
- operands of the `gpu.launch` operation as is (i.e. the operands for grid and block sizes).
Syntax:
operation ::= `gpu.launch` (`async` (`[` ssa-id-list `]`)? )?
`blocks` `(` ssa-id-list `)` `in` ssa-reassignment
`threads` `(` ssa-id-list `)` `in` ssa-reassignment
(dynamic_shared_memory_size ssa-use)?
region attr-dict?
ssa-reassignment ::= `(` ssa-id `=` ssa-use (`,` ssa-id `=` ssa-use)* `)`
Example:
gpu.launch blocks(%bx, %by, %bz) in (%sz_bx = %0, %sz_by = %1, %sz_bz = %2)
threads(%tx, %ty, %tz) in (%sz_tx = %3, %sz_ty = %4, %sz_tz = %5) {
// Block and thread identifiers, as well as block/grid sizes are
// immediately usable inside body region.
"some_op"(%bx, %tx) : (index, index) -> ()
// Assuming %val1 is defined outside the gpu.launch region.
%42 = memref.load %val1[%bx] : memref<?xf32, 1>
}
// Generic syntax explains how the pretty syntax maps to the IR structure.
"gpu.launch"(%cst, %cst, %c1, // Grid sizes.
%cst, %c1, %c1) // Block sizes.
{/*attributes*/}
// All sizes and identifiers have "index" size.
: (index, index, index, index, index, index) -> () {
// The operation passes block and thread identifiers, followed by grid and
// block sizes.
^bb0(%bx : index, %by : index, %bz : index,
     %tx : index, %ty : index, %tz : index,
     %num_bx : index, %num_by : index, %num_bz : index,
     %num_tx : index, %num_ty : index, %num_tz : index):
"some_op"(%bx, %tx) : (index, index) -> ()
%3 = "memref.load"(%val1, %bx) : (memref<?xf32, 1>, index) -> f32
}
Rationale: using operation/block arguments gives analyses a clear way of understanding that a value has additional semantics (e.g., we will need to know what value corresponds to threadIdx.x for coalescing). We can recover these properties by analyzing the operations producing values, but it is easier just to have that information by construction.
Traits: AttrSizedOperandSegments, AutomaticAllocationScope
Interfaces: GPU_AsyncOpInterface, InferIntRangeInterface
Operands: ¶
Operand | Description |
---|---|
asyncDependencies | async token type |
gridSizeX | index |
gridSizeY | index |
gridSizeZ | index |
blockSizeX | index |
blockSizeY | index |
blockSizeZ | index |
dynamicSharedMemorySize | 32-bit signless integer |
Results: ¶
Result | Description |
---|---|
asyncToken | async token type |
gpu.memcpy
(::mlir::gpu::MemcpyOp) ¶
GPU memcpy operation
Syntax:
operation ::= `gpu.memcpy` custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$dst`,` $src `:` type($dst)`,` type($src) attr-dict
The `gpu.memcpy` operation copies the content of one memref to another. The op does not execute before all async dependencies have finished executing.

If the `async` keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a `!gpu.async.token`.
Example:
%token = gpu.memcpy async [%dep] %dst, %src : memref<?xf32, 1>, memref<?xf32>
Interfaces: GPU_AsyncOpInterface
Operands: ¶
Operand | Description |
---|---|
asyncDependencies | async token type |
dst | memref of any type values |
src | memref of any type values |
Results: ¶
Result | Description |
---|---|
asyncToken | async token type |
gpu.memset
(::mlir::gpu::MemsetOp) ¶
GPU memset operation
Syntax:
operation ::= `gpu.memset` custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$dst`,` $value `:` type($dst)`,` type($value) attr-dict
The `gpu.memset` operation sets the content of a memref to a scalar value. The op does not execute before all async dependencies have finished executing.

If the `async` keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a `!gpu.async.token`.
Example:
%token = gpu.memset async [%dep] %dst, %value : memref<?xf32, 1>, f32
Interfaces: GPU_AsyncOpInterface
Operands: ¶
Operand | Description |
---|---|
asyncDependencies | async token type |
dst | memref of any type values |
value | any type |
Results: ¶
Result | Description |
---|---|
asyncToken | async token type |
gpu.module_end
(::mlir::gpu::ModuleEndOp) ¶
A pseudo op that marks the end of a gpu.module.
Syntax:
operation ::= `gpu.module_end` attr-dict
This op terminates the only block inside the only region of a `gpu.module`.
Traits: HasParent
gpu.num_subgroups
(::mlir::gpu::NumSubgroupsOp) ¶
Syntax:
operation ::= `gpu.num_subgroups` attr-dict `:` type($result)
Returns the number of subgroups within a workgroup.
Example:
%numSg = gpu.num_subgroups : index
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, InferIntRangeInterface, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Results: ¶
Result | Description |
---|---|
result | index |
gpu.printf
(::mlir::gpu::PrintfOp) ¶
Device-side printf, as in CUDA or OpenCL, for debugging
Syntax:
operation ::= `gpu.printf` $format attr-dict ($args^ `:` type($args))?
`gpu.printf` takes a literal format string `format` and an arbitrary number of scalar arguments that should be printed. The format string is a C-style printf string, subject to any restrictions imposed by one’s target platform.
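A minimal hedged example; the `%d` specifier and the cast to `i32` reflect common target expectations rather than requirements of the op itself:
%tIdX = gpu.thread_id x
%id = arith.index_cast %tIdX : index to i32
// Print the current thread's x id from the device.
gpu.printf "Hello from thread %d\n" %id : i32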
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
format | ::mlir::StringAttr | string attribute |
Operands: ¶
Operand | Description |
---|---|
args | integer or index or floating-point |
gpu.return
(::mlir::gpu::ReturnOp) ¶
Terminator for GPU functions.
Syntax:
operation ::= `gpu.return` attr-dict ($operands^ `:` type($operands))?
A terminator operation for regions that appear in the body of `gpu.func` functions. The operands to the `gpu.return` are the result values returned by an invocation of the `gpu.func`.
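A hedged sketch of a non-kernel device function whose result is produced by `gpu.return`:
gpu.func @add_one(%arg0 : f32) -> f32 {
  %cst = arith.constant 1.0 : f32
  %sum = arith.addf %arg0, %cst : f32
  // Return the computed value to the device-side caller.
  gpu.return %sum : f32
}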
Traits: AlwaysSpeculatableImplTrait, HasParent
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands: ¶
Operand | Description |
---|---|
operands | any type |
gpu.set_default_device
(::mlir::gpu::SetDefaultDeviceOp) ¶
Set default GPU for operations after this by index
Syntax:
operation ::= `gpu.set_default_device` attr-dict $devIndex
Operation that sets the current default GPU, using a zero-based index into the set of GPUs on the system. The default GPU setting may be thread-local.
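A minimal hedged sketch:
%c1 = arith.constant 1 : i32
// GPU operations issued after this point (on this thread) target device #1.
gpu.set_default_device %c1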
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
Operands: ¶
Operand | Description |
---|---|
devIndex | 32-bit signless integer |
gpu.shuffle
(::mlir::gpu::ShuffleOp) ¶
Shuffles values within a subgroup.
Syntax:
operation ::= `gpu.shuffle` $mode $value `,` $offset `,` $width attr-dict `:` type($value)
The “shuffle” op moves values to a different invocation within the same subgroup.
Example:
%1, %2 = gpu.shuffle xor %0, %offset, %width : f32
For lane `k`, it returns the value from lane `k ^ offset` and `true` if that lane is smaller than %width. Otherwise it returns an unspecified value and `false`. A lane is the index of an invocation relative to its subgroup.

The width specifies the number of invocations that participate in the shuffle. The width needs to be the same for all invocations that participate in the shuffle. Exactly the first `width` invocations of a subgroup need to execute this op in convergence.
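For illustration, a hedged sketch of a butterfly reduction built from `xor` shuffles over an assumed subgroup width of 32; %val is assumed to be defined earlier, and only the first two steps are shown (offsets 4, 2, and 1 follow the same pattern):
%c32 = arith.constant 32 : i32
%c16 = arith.constant 16 : i32
%c8 = arith.constant 8 : i32
// Exchange with the lane 16 positions away (lane k pairs with k ^ 16).
%v16, %ok16 = gpu.shuffle xor %val, %c16, %c32 : f32
%acc16 = arith.addf %val, %v16 : f32
// Exchange with the lane 8 positions away and keep accumulating.
%v8, %ok8 = gpu.shuffle xor %acc16, %c8, %c32 : f32
%acc8 = arith.addf %acc16, %v8 : f32
// ... offsets 4, 2, and 1 complete the subgroup-wide sum.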
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
mode | ::mlir::gpu::ShuffleModeAttr | Indexing modes supported by gpu.shuffle. |
Operands: ¶
Operand | Description |
---|---|
value | i32 or f32 |
offset | 32-bit signless integer |
width | 32-bit signless integer |
Results: ¶
Result | Description |
---|---|
shuffleResult | i32 or f32 |
valid | 1-bit signless integer |
gpu.subgroup_id
(::mlir::gpu::SubgroupIdOp) ¶
Syntax:
operation ::= `gpu.subgroup_id` attr-dict `:` type($result)
Returns the subgroup id, i.e. the index of the current subgroup within the workgroup.
Example:
%sgId = gpu.subgroup_id : index
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, InferIntRangeInterface, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Results: ¶
Result | Description |
---|---|
result | index |
gpu.subgroup_mma_compute
(::mlir::gpu::SubgroupMmaComputeOp) ¶
GPU warp synchronous matrix multiply accumulate
Syntax:
operation ::= `gpu.subgroup_mma_compute` $opA`,` $opB`,` $opC attr-dict `:` type($opA)`,` type($opB) `->` type($res)
The `gpu.subgroup_mma_compute` operation performs a matrix-multiply accumulate (mma) operation using all the threads in a subgroup.

This operation takes three `!gpu.mma_matrix`s as arguments: these hold the `A`, `B` and `C` operands for the mma operation. The operation performed is represented as `C += A * B`. The op returns a `!gpu.mma_matrix` which contains the result of the operation held by all threads in a subgroup. `a_transpose` or `b_transpose`, if present, signify that the respective operand was loaded in a transposed manner. The transpose attributes are required to map to the correct underlying intrinsics, but they currently do not seem to affect correctness even if they are absent, given that the operands were loaded correctly using the `transpose` attribute in the `gpu.subgroup_mma_load_matrix` op.

This op is meant to be used along with `gpu.subgroup_mma_store_matrix` and `gpu.subgroup_mma_load_matrix` ops.
Example:
%D = gpu.subgroup_mma_compute %A, %B, %C :
  !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">
  -> !gpu.mma_matrix<16x16xf16, "COp">
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
a_transpose | ::mlir::UnitAttr | unit attribute |
b_transpose | ::mlir::UnitAttr | unit attribute |
Operands: ¶
Operand | Description |
---|---|
opA | gpu.mma_matrix of 16-bit float or 32-bit float values |
opB | gpu.mma_matrix of 16-bit float or 32-bit float values |
opC | gpu.mma_matrix of 16-bit float or 32-bit float values |
Results: ¶
Result | Description |
---|---|
res | MMAMatrix type |
gpu.subgroup_mma_constant_matrix
(::mlir::gpu::SubgroupMmaConstantMatrixOp) ¶
GPU warp synchronous constant matrix
Syntax:
operation ::= `gpu.subgroup_mma_constant_matrix` $value attr-dict `:` type($res)
The `gpu.subgroup_mma_constant_matrix` creates a `!gpu.mma_matrix` with constant elements.

The operation takes a scalar input and returns a `!gpu.mma_matrix` where each element is equal to the operand constant. The destination mma_matrix type must have an element type equal to the constant type. Since the layout of `!gpu.mma_matrix` is opaque, this only supports setting all the elements to the same value.

This op is meant to be used along with `gpu.subgroup_mma_compute`.
Example:
%0 = gpu.subgroup_mma_constant_matrix %a :
!gpu.mma_matrix<16x16xf16, "AOp">
%1 = gpu.subgroup_mma_constant_matrix %b :
!gpu.mma_matrix<16x16xf32, "COp">
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands: ¶
Operand | Description |
---|---|
value | 16-bit float or 32-bit float |
Results: ¶
Result | Description |
---|---|
res | MMAMatrix type |
gpu.subgroup_mma_elementwise
(::mlir::gpu::SubgroupMmaElementwiseOp) ¶
GPU warp elementwise operation on a matrix
Syntax:
operation ::= `gpu.subgroup_mma_elementwise` $opType $args attr-dict `:` functional-type($args, $res)
The `gpu.subgroup_mma_elementwise` takes `!gpu.mma_matrix` inputs and computes a new `!gpu.mma_matrix` by applying an elementwise operation to each element.

Since the operation is elementwise and the matrix type must match, the matrix elements are processed independently of the matrix layout.

This op is meant to be used along with `gpu.subgroup_mma_compute`.
Example:
%0 = gpu.subgroup_mma_elementwise addf %A, %B :
  (!gpu.mma_matrix<16x16xf16, "COp">, !gpu.mma_matrix<16x16xf16, "COp">)
  -> !gpu.mma_matrix<16x16xf16, "COp">
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
opType | ::mlir::gpu::MMAElementwiseOpAttr | elementwise operation to apply to mma matrix |
Operands: ¶
Operand | Description |
---|---|
args | MMAMatrix type |
Results: ¶
Result | Description |
---|---|
res | MMAMatrix type |
gpu.subgroup_mma_load_matrix
(::mlir::gpu::SubgroupMmaLoadMatrixOp) ¶
GPU warp synchronous matrix load
Syntax:
operation ::= `gpu.subgroup_mma_load_matrix` $srcMemref`[`$indices`]` attr-dict `:` type($srcMemref) `->` type($res)
The `gpu.subgroup_mma_load_matrix` operation loads a matrix collectively using all the threads in a subgroup.

This operation takes a memref as its first operand: it is the source matrix from which data is to be loaded. The op returns a `!gpu.mma_matrix`. The source memref can be in global memory or shared memory. The load address is determined using `indices`. The matrix being loaded into is the result. The `leadDimension` attribute specifies the leading dimension size of the source matrix, which eventually allows the lowering to determine the size of each row. If the `transpose` attribute is present then the op does a transposed load.

This op is often meant to be used along with `gpu.subgroup_mma_store_matrix` and `gpu.subgroup_mma_compute`.
Example:
%0 = gpu.subgroup_mma_load_matrix %src[%i, %j] {leadDimension = 32 : index}
  : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "AOp">
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
leadDimension | ::mlir::IntegerAttr | index attribute |
transpose | ::mlir::UnitAttr | unit attribute |
Operands: ¶
Operand | Description |
---|---|
srcMemref | memref of 16-bit float or 32-bit float or vector of 16-bit float or 32-bit float values of ranks 1 values |
indices | index |
Results: ¶
Result | Description |
---|---|
res | MMAMatrix type |
gpu.subgroup_mma_store_matrix
(::mlir::gpu::SubgroupMmaStoreMatrixOp) ¶
GPU warp synchronous matrix store
Syntax:
operation ::= `gpu.subgroup_mma_store_matrix` $src`,` $dstMemref`[`$indices`]` attr-dict `:` type($src)`,` type($dstMemref)
The `gpu.subgroup_mma_store_matrix` operation stores a matrix collectively using all the threads in a subgroup.

This operation takes a `!gpu.mma_matrix` and a memref as operands. The `!gpu.mma_matrix` is the source value containing the data to be stored into the destination memref, which can be in global or shared memory. The store address is determined using the indices provided. The `leadDimension` attribute specifies the leading dimension of the destination matrix. If the `transpose` attribute is present then the op does a transposed store.

This op is often meant to be used along with `gpu.subgroup_mma_load_matrix` and `gpu.subgroup_mma_compute`.
Example:
gpu.subgroup_mma_store_matrix %D, %sg[%i, %j] {leadDimension = 32 : index}
  : !gpu.mma_matrix<16x16xf16, "COp">, memref<32x32xf16, 3>
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
leadDimension | ::mlir::IntegerAttr | index attribute |
transpose | ::mlir::UnitAttr | unit attribute |
Operands: ¶
Operand | Description |
---|---|
src | gpu.mma_matrix of 16-bit float or 32-bit float values |
dstMemref | memref of 16-bit float or 32-bit float or vector of 16-bit float or 32-bit float values of ranks 1 values |
indices | index |
gpu.subgroup_reduce
(::mlir::gpu::SubgroupReduceOp) ¶
Reduce values among subgroup.
Syntax:
operation ::= `gpu.subgroup_reduce` custom<AllReduceOperation>($op) $value
(`uniform` $uniform^)? attr-dict
`:` functional-type(operands, results)
The `subgroup_reduce` op reduces the value of every work item across a subgroup. The result is equal for all work items of a subgroup.
Example:
%1 = gpu.subgroup_reduce add %0 : (f32) -> (f32)
If the `uniform` flag is set, either none or all work items of a subgroup need to execute this op in convergence.
Traits: SameOperandsAndResultType
Interfaces: InferTypeOpInterface
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
op | ::mlir::gpu::AllReduceOperationAttr | built-in reduction operations supported by gpu.allreduce. |
uniform | ::mlir::UnitAttr | unit attribute |
Operands: ¶
Operand | Description |
---|---|
value | any type |
Results: ¶
Result | Description |
---|---|
«unnamed» | any type |
gpu.subgroup_size
(::mlir::gpu::SubgroupSizeOp) ¶
Syntax:
operation ::= `gpu.subgroup_size` attr-dict `:` type($result)
Returns the number of threads within a subgroup.
Example:
%sgSz = gpu.subgroup_size : index
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, InferIntRangeInterface, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Results: ¶
Result | Description |
---|---|
result | index |
gpu.terminator
(::mlir::gpu::TerminatorOp) ¶
Terminator for GPU launch regions.
Syntax:
operation ::= `gpu.terminator` attr-dict
A terminator operation for regions that appear in the body of the `gpu.launch` operation. These regions are not expected to return any value, so the terminator takes no operands.
Traits: AlwaysSpeculatableImplTrait, HasParent
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
gpu.thread_id
(::mlir::gpu::ThreadIdOp) ¶
Syntax:
operation ::= `gpu.thread_id` $dimension attr-dict
Returns the thread id, i.e. the index of the current thread within the block along the x, y, or z `dimension`.
Example:
%tIdX = gpu.thread_id x
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, InferIntRangeInterface, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
dimension | ::mlir::gpu::DimensionAttr | a dimension, either ‘x’, ‘y’, or ‘z’ |
Results: ¶
Result | Description |
---|---|
«unnamed» | index |
gpu.wait
(::mlir::gpu::WaitOp) ¶
Wait for async gpu ops to complete.
Syntax:
operation ::= `gpu.wait` custom<AsyncDependencies>(type($asyncToken), $asyncDependencies) attr-dict
This op synchronizes the host or the device with a list of dependent ops.
If the op contains the async
keyword, it returns a new async token which
is synchronized with the op arguments. This new token is merely a shortcut
to the argument list, and one could replace the uses of the result with the
arguments for the same effect. The async version of this op is primarily
used to make each async token have a single use during lowering and
thereby make forks in async execution explicit. Example usage:
%t0 = gpu.foo async : !gpu.async.token
%t1 = gpu.bar async : !gpu.async.token
%t2 = gpu.wait async [%t0, %t1]
// gpu.baz doesn't run until gpu.foo and gpu.bar have both completed, just
// as if the async dependencies were [%t0, %t1].
%t3 = gpu.baz async [%t2]
If the op does not contain the async
keyword, it does not return a new
async token but blocks until all ops producing the async dependency tokens
finished execution. All dependent memory operations are visible to the host
once this op completes. Example usage:
%t0 = gpu.foo async : !gpu.async.token
%t1 = gpu.bar async : !gpu.async.token
// The gpu.wait op blocks until gpu.foo and gpu.bar have completed.
gpu.wait [%t0, %t1]
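Putting the async pieces together, a hedged sketch of a typical token chain; the kernel symbol and the host-side values (%size, %host, %c1, %c128) are illustrative assumptions:
// Allocate device memory; returns immediately with a token.
%dev, %t0 = gpu.alloc async (%size) : memref<?xf32, 1>
// Copy the input to the device once the allocation has completed.
%t1 = gpu.memcpy async [%t0] %dev, %host : memref<?xf32, 1>, memref<?xf32>
// Launch the kernel after the copy.
%t2 = gpu.launch_func async [%t1] @kernels::@kernel_1
    blocks in (%c1, %c1, %c1) threads in (%c128, %c1, %c1)
    args(%dev : memref<?xf32, 1>)
// Free the buffer once the kernel has finished, then block the host.
%t3 = gpu.dealloc async [%t2] %dev : memref<?xf32, 1>
gpu.wait [%t3]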
Interfaces: GPU_AsyncOpInterface
Operands: ¶
Operand | Description |
---|---|
asyncDependencies | async token type |
Results: ¶
Result | Description |
---|---|
asyncToken | async token type |
gpu.yield
(::mlir::gpu::YieldOp) ¶
GPU yield operation
`gpu.yield` is a special terminator operation for blocks inside regions in gpu ops. It returns values to the immediately enclosing gpu op.
Example:
gpu.yield %f0, %f1 : f32, f32
Traits: AlwaysSpeculatableImplTrait, Terminator
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands: ¶
Operand | Description |
---|---|
values | any type |