'nvgpu' Dialect

The NVGPU dialect provides a bridge between the target-agnostic GPU and Vector dialects and the lower-level, LLVM-IR-based NVVM dialect. It allows representing PTX-specific operations while using MLIR high-level concepts like memref and 2-D vector.

Type constraint definition 

device async token type 

Operation definition 

nvgpu.device_async_copy (::mlir::nvgpu::DeviceAsyncCopyOp) 

device-side asynchronous copy

Syntax:

operation ::= `nvgpu.device_async_copy` $src `[` $srcIndices `]` `,` $dst `[` $dstIndices `]` `,` $numElements
              attr-dict `:` type($src) `to` type($dst)

The nvgpu.device_async_copy op initiates an asynchronous copy of $numElements elements from the source to the destination without blocking the thread. The destination has to be in shared memory.

The resulting memory access is pending and is expected to be added to a group with nvgpu.device_async_create_group.

This op is meant to be used with nvgpu.device_async_create_group and nvgpu.device_async_wait to synchronize copies, as explained in those ops' descriptions. The bypassL1 attribute is a hint to the backend and hardware that the copy should bypass the L1 cache; this hint may be dropped by the backend or hardware.

In order to do a copy and wait for the result, the following combination is needed:

// copy 1.
%cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// copy 2.
%cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 1 contains copy 1 and copy 2.
%token1 = nvgpu.device_async_create_group %cp1, %cp2
// copy 3.
%cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 2 contains copy 3.
%token2 = nvgpu.device_async_create_group %cp3
// after the wait copy 1 and copy 2 are complete.
nvgpu.device_async_wait %token1
// after the wait copy 3 is complete.
nvgpu.device_async_wait %token2

Example:

%0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 :
  memref<4x5xf32> to memref<2x7x5xf32, 3>
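
The bypassL1 hint described above is a unit attribute; when present it appears in the attribute dictionary before the types. A minimal sketch (reusing the operand values from the example above; the hint may be dropped by the backend or hardware):

// Sketch: same copy as above, with a hint to bypass the L1 cache.
%0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 {bypassL1} :
  memref<4x5xf32> to memref<2x7x5xf32, 3>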

Traits: AttrSizedOperandSegments

Attributes: 

Attribute    MLIR Type            Description
numElements  ::mlir::IntegerAttr  index attribute
bypassL1     ::mlir::UnitAttr     unit attribute

Operands: 

Operand     Description
dst         memref of any type values
dstIndices  index
src         memref of any type values
srcIndices  index

Results: 

Result      Description
asyncToken  device async token type

nvgpu.device_async_create_group (::mlir::nvgpu::DeviceAsyncCreateGroupOp) 

device side asynchronous create group operation

Syntax:

operation ::= `nvgpu.device_async_create_group` $inputTokens attr-dict

The nvgpu.device_async_create_group op creates a group of memory accesses containing all the pending device_async_copy operations associated with the argument tokens. Each token can only be part of one group.

It returns a token that can be used to wait until the group fully completes.

This op is meant to be used with nvgpu.device_async_wait to synchronize copies, as explained in that op's description.

Groups are executed in the order they are created.

Example:

%0 = nvgpu.device_async_create_group
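
For context, a hedged sketch (the token names %cp1 and %cp2 are hypothetical results of earlier nvgpu.device_async_copy ops):

// Sketch: group two previously issued asynchronous copies.
%token = nvgpu.device_async_create_group %cp1, %cp2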

Operands: 

Operand      Description
inputTokens  device async token type

Results: 

Result      Description
asyncToken  device async token type

nvgpu.device_async_wait (::mlir::nvgpu::DeviceAsyncWaitOp) 

Wait for async gpu ops to complete.

Syntax:

operation ::= `nvgpu.device_async_wait` $asyncDependencies attr-dict

The nvgpu.device_async_wait op will block the execution thread until the group associated with the source token is fully completed.

The optional $numGroups attribute gives an upper bound on the number of groups that may still be uncompleted when the wait unblocks the thread.

Example:

nvgpu.device_async_wait %0
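
As a hedged sketch of the optional attribute (the token %0 is assumed to come from nvgpu.device_async_create_group), a partial wait that allows at most one group to remain outstanding could be written as:

// Sketch: unblock the thread once at most one group is still pending.
nvgpu.device_async_wait %0 {numGroups = 1 : i32}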

Attributes: 

Attribute  MLIR Type            Description
numGroups  ::mlir::IntegerAttr  32-bit signless integer attribute

Operands: 

Operand            Description
asyncDependencies  device async token type

nvgpu.ldmatrix (::mlir::nvgpu::LdMatrixOp) 

Syntax:

operation ::= `nvgpu.ldmatrix` $srcMemref`[` $indices `]` attr-dict `:` type($srcMemref) `->` type($res)

The nvgpu.ldmatrix op represents loading a matrix fragment from memory. The load source and result type must be compatible with lowering to the nvvm.ldmatrix instruction. This op is meant to represent the distributed version of a vector.transfer_read, serving as an intermediate step when lowering from vector.transfer_read to nvvm.ldmatrix.

This operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix

Example:

%0 = nvgpu.ldmatrix %sm[%c0, %c0] {numTiles = 4 : i32, transpose = false} :
  memref<?x?xf16, 3> -> vector<4x2xf16>
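
As an illustrative sketch only (not from the original documentation, and subject to the op's verifier constraints), the transpose attribute selects the transposed ldmatrix form; for f16 data the per-thread result shape stays the same:

// Sketch: transposed load of four 8x8 f16 tiles, e.g. for a B operand.
%1 = nvgpu.ldmatrix %sm[%c0, %c0] {numTiles = 4 : i32, transpose = true} :
  memref<?x?xf16, 3> -> vector<4x2xf16>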

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}

Attributes: 

Attribute  MLIR Type            Description
transpose  ::mlir::BoolAttr     bool attribute
numTiles   ::mlir::IntegerAttr  32-bit signless integer attribute

Operands: 

Operand    Description
srcMemref  memref of any type values
indices    index

Results: 

Result  Description
res     vector of any type values

nvgpu.mma.sync (::mlir::nvgpu::MmaSyncOp) 

Syntax:

operation ::= `nvgpu.mma.sync` `(` $matrixA`,` $matrixB`,` $matrixC `)` attr-dict
              `:` `(` type($matrixA) `,` type($matrixB) `,` type($matrixC) `)` `->` type($res)

The nvgpu.mma.sync op represents the distributed form of a collective matrix-multiply-and-accumulate (mma) operation that is compatible with nvvm.mma.sync. The operands and results are fragments of the full matrix operands. The full shape of the distributed mma operation is given by the mmaShape attribute in the form of a list of dimensions [m, n, k].

This operation is meant to be lowered to the nvvm.mma.sync instruction, serving as an intermediate step when lowering from vector.contract to nvvm.mma.sync.

This operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma

Example:

nvgpu.mma.sync (%a, %b, %c) {mmaShape = [16, 8, 16]} :
  (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

Attribute  MLIR Type          Description
mmaShape   ::mlir::ArrayAttr  64-bit integer array attribute

Operands: 

Operand  Description
matrixA  vector of any type values
matrixB  vector of any type values
matrixC  vector of any type values

Results: 

Result  Description
res     vector of any type values