'nvgpu' Dialect

The NVGPU dialect provides a bridge between higher-level target-agnostic dialects (GPU and Vector) and the lower-level target-specific dialect (the LLVM IR-based NVVM dialect) for NVIDIA GPUs. This allows representing PTX-specific operations while using MLIR high-level dialects such as MemRef and Vector for memory and target-specific register operands, respectively.

Operation definition 

nvgpu.device_async_copy (::mlir::nvgpu::DeviceAsyncCopyOp) 

device-side asynchronous copy

Syntax:

operation ::= `nvgpu.device_async_copy` $src `[` $srcIndices `]` `,` $dst `[` $dstIndices `]` `,` $dstElements (`,` $srcElements^)?
              attr-dict `:` type($src) `to` type($dst)

The nvgpu.device_async_copy op initiates an asynchronous copy of elements from the source (global memory) to the destination (shared memory) without blocking the thread. The async copy is added to a group.

This op is meant to be used with nvgpu.device_async_create_group and nvgpu.device_async_wait to synchronize copies, as explained in those ops' descriptions.

The bypassL1 attribute is a hint to the hardware to bypass the L1 cache during the async copy; the hardware may ignore this hint.

The dstElements attribute is the total number of elements written to the destination (shared memory).

The srcElements argument is the total number of elements read from the source (global memory).

srcElements is an optional argument; when present, the op reads only srcElements elements from the source (global memory) and zero-fills the remaining elements in the destination (shared memory).
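
A minimal sketch combining both optional pieces, assuming a hypothetical index SSA value %sz that holds the number of valid source elements:

// Reserve space for 4 elements in shared memory, but read only %sz elements
// from global memory; the remaining destination elements are zero-filled.
// The bypassL1 unit attribute hints that the copy may bypass the L1 cache.
%cp = nvgpu.device_async_copy %src[%c0], %dst[%c0], 4, %sz {bypassL1} :
  memref<16xf32> to memref<16xf32, 3>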

In order to do a copy and wait for the result, the following combination is needed:

// copy 1.
%cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// copy 2.
%cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 1 contains copy 1 and copy 2.
%token1 = nvgpu.device_async_create_group %cp1, %cp2
// copy 3.
%cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 2 contains copy 3.
%token2 = nvgpu.device_async_create_group %cp3
// after the wait copy 1 and copy 2 are complete.
nvgpu.device_async_wait %token1
// after the wait copy 3 is complete.
nvgpu.device_async_wait %token2

Example:

%0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 :
  memref<4x5xf32> to memref<2x7x5xf32, 3>

Traits: AttrSizedOperandSegments

Attributes: 

| Attribute   | MLIR Type           | Description     |
| ----------- | ------------------- | --------------- |
| dstElements | ::mlir::IntegerAttr | index attribute |
| bypassL1    | ::mlir::UnitAttr    | unit attribute  |

Operands: 

| Operand     | Description               |
| ----------- | ------------------------- |
| dst         | memref of any type values |
| dstIndices  | index                     |
| src         | memref of any type values |
| srcIndices  | index                     |
| srcElements | index                     |

Results: 

| Result     | Description             |
| ---------- | ----------------------- |
| asyncToken | device async token type |

nvgpu.device_async_create_group (::mlir::nvgpu::DeviceAsyncCreateGroupOp) 

device side asynchronous create group operation

Syntax:

operation ::= `nvgpu.device_async_create_group` $inputTokens attr-dict

The nvgpu.device_async_create_group op creates a group of memory accesses containing all the pending device_async_copy operations associated with argument tokens. Each token can only be part of one group.

It returns a token that can be used to wait until the group fully completes.

This op is meant to be used with nvgpu.device_async_wait to synchronize copies, as explained in that op's description.

Groups are executed in the order they are created.

Example:

%0 = nvgpu.device_async_create_group
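
For illustration, a sketch that groups two previously issued copies, reusing the tokens %cp1 and %cp2 from the combined example above:

// The returned token represents the whole group of pending copies.
%token = nvgpu.device_async_create_group %cp1, %cp2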

Operands: 

| Operand     | Description             |
| ----------- | ----------------------- |
| inputTokens | device async token type |

Results: 

| Result     | Description             |
| ---------- | ----------------------- |
| asyncToken | device async token type |

nvgpu.device_async_wait (::mlir::nvgpu::DeviceAsyncWaitOp) 

Wait for async gpu ops to complete.

Syntax:

operation ::= `nvgpu.device_async_wait` $asyncDependencies attr-dict

The nvgpu.device_async_wait op will block the execution thread until the group associated with the source token is fully completed.

The optional $numGroups attribute gives an upper bound on the number of groups that may still be uncompleted when the wait unblocks the thread; for example, with numGroups = 1 the thread unblocks once at most one group remains in flight.

Example:

nvgpu.device_async_wait %0
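
A minimal sketch using the optional numGroups attribute, assuming %0 is a token produced by nvgpu.device_async_create_group:

// Unblock once at most one async group remains uncompleted.
nvgpu.device_async_wait %0 {numGroups = 1 : i32}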

Attributes: 

| Attribute | MLIR Type           | Description                       |
| --------- | ------------------- | --------------------------------- |
| numGroups | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands: 

| Operand           | Description             |
| ----------------- | ----------------------- |
| asyncDependencies | device async token type |

nvgpu.ldmatrix (::mlir::nvgpu::LdMatrixOp) 

Syntax:

operation ::= `nvgpu.ldmatrix` $srcMemref`[` $indices `]` attr-dict `:` type($srcMemref) `->` type($res)

The nvgpu.ldmatrix op represents loading a matrix fragment from memory to registers. The source and result types must be compatible with lowering to the nvvm.ldmatrix instruction. This op represents the distributed version of a vector.transfer_read and serves as an intermediate step when lowering from vector.transfer_read to nvvm.ldmatrix.

This operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix

Example:

%0 = nvgpu.ldmatrix %sm[%c0, %c0] {numTiles = 4 : i32, transpose = false} :
  memref<?x?xf16, 3> -> vector<4x2xf16>

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}

Attributes: 

| Attribute | MLIR Type           | Description                       |
| --------- | ------------------- | --------------------------------- |
| transpose | ::mlir::BoolAttr    | bool attribute                    |
| numTiles  | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands: 

| Operand   | Description               |
| --------- | ------------------------- |
| srcMemref | memref of any type values |
| indices   | index                     |

Results: 

| Result | Description               |
| ------ | ------------------------- |
| res    | vector of any type values |

nvgpu.mma.sp.sync (::mlir::nvgpu::MmaSparseSyncOp) 

Syntax:

operation ::= `nvgpu.mma.sp.sync` `(` $matrixA`,` $matrixB`,` $matrixC `)` `metadata` `(` $sparseMetadata `)` attr-dict
              `:` `(` type($matrixA) `,` type($matrixB) `,` type($matrixC) `)` `->` type($res)

The nvgpu.mma.sp.sync operation performs a warp-distributed MMA operation where operand A is “structured sparse”. In this case, the matrixA operand represents the (warp-distributed) non-zero values of operand A, and the sparseMetadata operand provides the indices.

The full description of the sparsity storage format and distribution scheme is given in the PTX docs. This operation is meant to follow the semantics described in the PTX documentation here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-sparse-mma

The way the indices are distributed among the threads in a warp is controlled by the optional sparsitySelector attribute, which is 0 by default. For more information, please consult the PTX documentation linked above.

Example (targeting the f16 16x8x32 mma.sp PTX instruction):

nvgpu.mma.sp.sync (%a, %b, %c) metadata (%meta) {mmaShape = [16, 8, 32]} :
  (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
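
As an illustrative sketch, an alternate metadata distribution can be selected through the sparsitySelector attribute in the attribute dictionary (the value 1 here is purely illustrative):

nvgpu.mma.sp.sync (%a, %b, %c) metadata (%meta)
  {mmaShape = [16, 8, 32], sparsitySelector = 1 : i32} :
  (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>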

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

| Attribute        | MLIR Type           | Description                       |
| ---------------- | ------------------- | --------------------------------- |
| mmaShape         | ::mlir::ArrayAttr   | 64-bit integer array attribute    |
| sparsitySelector | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| tf32Enabled      | ::mlir::UnitAttr    | unit attribute                    |

Operands: 

| Operand        | Description                                                       |
| -------------- | ----------------------------------------------------------------- |
| matrixA        | vector of any type values                                         |
| matrixB        | vector of any type values                                         |
| matrixC        | vector of any type values                                         |
| sparseMetadata | fixed-length vector of 16-bit signless integer values of length 2 |

Results: 

| Result | Description               |
| ------ | ------------------------- |
| res    | vector of any type values |

nvgpu.mma.sync (::mlir::nvgpu::MmaSyncOp) 

Syntax:

operation ::= `nvgpu.mma.sync` `(` $matrixA`,` $matrixB`,` $matrixC `)` attr-dict
              `:` `(` type($matrixA) `,` type($matrixB) `,` type($matrixC) `)` `->` type($res)

The nvgpu.mma.sync op represents the warp-level matrix-multiply-and-accumulate (mma) operation that is compatible with nvvm.mma.sync. The vector sizes of the operands and results represent each thread's ownership of the warp-level mma operation shape. The mmaShape attribute holds the warp-level matrix-multiply shape.

The nvgpu.mma.sync op serves as an intermediate point when lowering from vector.contract to nvvm.mma.sync.

This operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma

Example:

%res = nvgpu.mma.sync (%matrixA, %matrixB, %matrixC) {mmaShape = [16, 8, 16]} :
    (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf32>) -> vector<2x2xf32>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

| Attribute   | MLIR Type         | Description                    |
| ----------- | ----------------- | ------------------------------ |
| mmaShape    | ::mlir::ArrayAttr | 64-bit integer array attribute |
| tf32Enabled | ::mlir::UnitAttr  | unit attribute                 |

Operands: 

| Operand | Description               |
| ------- | ------------------------- |
| matrixA | vector of any type values |
| matrixB | vector of any type values |
| matrixC | vector of any type values |

Results: 

| Result | Description               |
| ------ | ------------------------- |
| res    | vector of any type values |

Type definition 

DeviceAsyncTokenType 

device async token type

Syntax: !nvgpu.device.async.token

nvgpu.device.async.token is a type returned by an asynchronous operation that runs on the GPU (device). It is used to establish an SSA-based link between the async operation (e.g. DeviceAsyncCopyOp) and the operations that group or synchronize the async operations (e.g. DeviceAsyncCreateGroupOp, DeviceAsyncWaitOp).