'nvgpu' Dialect
The NVGPU dialect provides a bridge between the higher-level target-agnostic dialects (GPU and Vector) and the lower-level target-specific dialect (the LLVM IR-based NVVM dialect) for NVIDIA GPUs. This allows representing PTX-specific operations while using MLIR's high-level dialects, such as MemRef and Vector, for memory and target-specific register operands, respectively.
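For example, an op such as nvgpu.ldmatrix (described below) takes its memory operand as a memref and models the per-thread register fragment it produces as a vector:
%frag = nvgpu.ldmatrix %sm[%c0, %c0] {numTiles = 4 : i32, transpose = false} :
  memref<?x?xf16, 3> -> vector<4x2xf16>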
Operation definition
nvgpu.device_async_copy (::mlir::nvgpu::DeviceAsyncCopyOp)
device-side asynchronous copy
Syntax:
operation ::= `nvgpu.device_async_copy` $src `[` $srcIndices `]` `,` $dst `[` $dstIndices `]` `,` $dstElements (`,` $srcElements^)?
attr-dict `:` type($src) `to` type($dst)
The nvgpu.device_async_copy op initiates an asynchronous copy of elements from the source (global memory) to the destination (shared memory) without blocking the thread. The async copy is added to a group.
This op is meant to be used with nvgpu.device_async_create_group and nvgpu.device_async_wait to synchronize copies, as explained in those ops' descriptions.
The bypassL1 attribute is a hint to the hardware to bypass the L1 cache during the async copy; the hardware may ignore it.
The dstElements attribute is the total number of elements written to the destination (shared memory).
The srcElements argument is the total number of elements read from the source (global memory). It is optional; when present, the op reads only srcElements elements from the source and zero-fills the rest of the elements in the destination (shared memory).
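For instance, the following sketch (where %sz is assumed to hold the number of valid source elements as an index value) reads %sz elements, zero-fills the remaining destination elements, and hints that L1 should be bypassed:
%0 = nvgpu.device_async_copy %src[%c0], %dst[%c0], 4, %sz {bypassL1} :
  memref<16xf32> to memref<16xf32, 3>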
To perform a copy and wait for its result, the ops are combined as follows:
// copy 1.
%cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// copy 2.
%cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 1 contains copy 1 and copy 2.
%token1 = nvgpu.device_async_create_group %cp1, %cp2
// copy 3.
%cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 2 contains copy 3.
%token2 = nvgpu.device_async_create_group %cp3
// after the wait copy 1 and copy 2 are complete.
nvgpu.device_async_wait %token1
// after the wait copy 3 is complete.
nvgpu.device_async_wait %token2
Example:
%0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 :
memref<4x5xf32> to memref<2x7x5xf32, 3>
Traits: AttrSizedOperandSegments
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dstElements | ::mlir::IntegerAttr | index attribute |
bypassL1 | ::mlir::UnitAttr | unit attribute |
Operands:
Operand | Description |
---|---|
dst | memref of any type values |
dstIndices | index |
src | memref of any type values |
srcIndices | index |
srcElements | index |
Results:
Result | Description |
---|---|
asyncToken | device async token type |
nvgpu.device_async_create_group (::mlir::nvgpu::DeviceAsyncCreateGroupOp)
device-side asynchronous create group operation
Syntax:
operation ::= `nvgpu.device_async_create_group` $inputTokens attr-dict
The nvgpu.device_async_create_group op creates a group of memory accesses containing all the pending device_async_copy operations associated with the argument tokens. Each token can only be part of one group.
It returns a token that can be used to wait until the group fully completes. This is meant to be used with nvgpu.device_async_wait to synchronize copies, as explained in that op's description.
Groups are executed in the order they are created.
Example:
%0 = nvgpu.device_async_create_group %cp1, %cp2
Operands:
Operand | Description |
---|---|
inputTokens | device async token type |
Results:
Result | Description |
---|---|
asyncToken | device async token type |
nvgpu.device_async_wait (::mlir::nvgpu::DeviceAsyncWaitOp)
Wait for async gpu ops to complete.
Syntax:
operation ::= `nvgpu.device_async_wait` $asyncDependencies attr-dict
The nvgpu.device_async_wait op blocks the execution thread until the group associated with the source token is fully completed.
The optional $numGroups attribute gives an upper bound on the number of groups that may still be uncompleted when the wait unblocks the thread.
Example:
nvgpu.device_async_wait %0
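To unblock before all groups have finished, the numGroups attribute bounds how many groups may remain uncompleted; for example, the following waits until at most one group is still pending:
nvgpu.device_async_wait %0 {numGroups = 1 : i32}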
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
numGroups | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
asyncDependencies | device async token type |
nvgpu.ldmatrix (::mlir::nvgpu::LdMatrixOp)
Syntax:
operation ::= `nvgpu.ldmatrix` $srcMemref`[` $indices `]` attr-dict `:` type($srcMemref) `->` type($res)
The nvgpu.ldmatrix op represents loading a matrix fragment from memory to registers. The source and result types must be compatible with lowering to the nvvm.ldmatrix instruction. This op represents the distributed version of a vector.transfer_read, serving as an intermediate step in lowering from vector.transfer_read to nvvm.ldmatrix.
This operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix
Example:
%0 = nvgpu.ldmatrix %sm[%c0, %c0] {numTiles = 4 : i32, transpose = false} :
memref<?x?xf16, 3> -> vector<4x2xf16>
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
transpose | ::mlir::BoolAttr | bool attribute |
numTiles | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
srcMemref | memref of any type values |
indices | index |
Results:
Result | Description |
---|---|
res | vector of any type values |
nvgpu.mma.sp.sync (::mlir::nvgpu::MmaSparseSyncOp)
Syntax:
operation ::= `nvgpu.mma.sp.sync` `(` $matrixA`,` $matrixB`,` $matrixC `)` `metadata` `(` $sparseMetadata `)` attr-dict
`:` `(` type($matrixA) `,` type($matrixB) `,` type($matrixC) `)` `->` type($res)
The nvgpu.mma.sp.sync operation performs a warp-distributed MMA operation where operand A is “structured sparse”. In this case, the matrixA operand represents the (warp-distributed) non-zero values of operand A, and the sparseMetadata operand provides the indices.
The full description of the sparsity storage format and distribution scheme is given in the PTX docs; this operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-sparse-mma
The way the indices are distributed among the threads in a warp is controlled by the optional sparsitySelector attribute, which is 0 by default. For more information, please consult the PTX documentation linked above.
Example (targeting the f16 16x8x32 mma.sp PTX instruction):
nvgpu.mma.sp.sync (%a, %b, %c) metadata (%meta) {mmaShape = [16, 8, 32]} :
(vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
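A variant of the example above (a sketch; it assumes the target shape supports the alternate thread-to-metadata mapping) selects the second metadata distribution via sparsitySelector:
nvgpu.mma.sp.sync (%a, %b, %c) metadata (%meta) {mmaShape = [16, 8, 32], sparsitySelector = 1 : i32} :
  (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>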
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
mmaShape | ::mlir::ArrayAttr | 64-bit integer array attribute |
sparsitySelector | ::mlir::IntegerAttr | 32-bit signless integer attribute |
tf32Enabled | ::mlir::UnitAttr | unit attribute |
Operands:
Operand | Description |
---|---|
matrixA | vector of any type values |
matrixB | vector of any type values |
matrixC | vector of any type values |
sparseMetadata | fixed-length vector of 16-bit signless integer values of length 2 |
Results:
Result | Description |
---|---|
res | vector of any type values |
nvgpu.mma.sync (::mlir::nvgpu::MmaSyncOp)
Syntax:
operation ::= `nvgpu.mma.sync` `(` $matrixA`,` $matrixB`,` $matrixC `)` attr-dict
`:` `(` type($matrixA) `,` type($matrixB) `,` type($matrixC) `)` `->` type($res)
The nvgpu.mma.sync op represents the warp-level matrix-multiply-and-accumulate (mma) operation that is compatible with nvvm.mma.sync. The vector sizes of the operands and results correspond to each thread's ownership share of the warp-level mma operation shape. The mmaShape attribute holds the warp-level matrix-multiply shape.
The nvgpu.mma.sync op serves as an intermediate step in lowering from vector.contract to nvvm.mma.sync.
This operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma
Example:
%res = nvgpu.mma.sync (%matrixA, %matrixB, %matrixC) {mmaShape = [16, 8, 16]} :
(vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf32>) -> vector<2x2xf32>
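When the inputs are f32 values to be multiplied in tf32 precision, the tf32Enabled unit attribute requests the tf32 variant of the instruction. A sketch, assuming the m16n8k8 shape and its corresponding per-thread fragment sizes:
%res = nvgpu.mma.sync (%matrixA, %matrixB, %matrixC) {mmaShape = [16, 8, 8], tf32Enabled} :
  (vector<4x1xf32>, vector<2x1xf32>, vector<2x2xf32>) -> vector<2x2xf32>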
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
mmaShape | ::mlir::ArrayAttr | 64-bit integer array attribute |
tf32Enabled | ::mlir::UnitAttr | unit attribute |
Operands:
Operand | Description |
---|---|
matrixA | vector of any type values |
matrixB | vector of any type values |
matrixC | vector of any type values |
Results:
Result | Description |
---|---|
res | vector of any type values |
Type definition
DeviceAsyncTokenType
device async token type
Syntax: !nvgpu.device.async.token
nvgpu.device.async.token is a type returned by an asynchronous operation that runs on the GPU (device). It is used to establish an SSA-based link between the async operation (e.g. DeviceAsyncCopyOp) and the operations that group or synchronize such operations (e.g. DeviceAsyncCreateGroupOp, DeviceAsyncWaitOp).
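For instance, reusing the ops above, a token threads from the copy through group creation to the wait:
%cp = nvgpu.device_async_copy %src[%c0], %dst[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// %cp and %g both have type !nvgpu.device.async.token.
%g = nvgpu.device_async_create_group %cp
nvgpu.device_async_wait %g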