mlir.dialects._nvgpu_ops_gen

Attributes

Classes

_Dialect

DeviceAsyncCopyOp

The nvgpu.device_async_copy op initiates an asynchronous copy operation of

DeviceAsyncCreateGroupOp

The nvgpu.device_async_create_group op creates a group of memory accesses

DeviceAsyncWaitOp

The nvgpu.device_async_wait op will block the execution thread until the group

LdMatrixOp

The nvgpu.ldmatrix op represents loading a matrix fragment from

MBarrierArriveExpectTxOp

A thread executing the Op performs an expect-tx operation on the mbarrier

MBarrierArriveNoCompleteOp

The Op performs an arrive-on operation on the mbarrier object and returns a

MBarrierArriveOp

The Op performs an arrive-on operation on the mbarrier object and returns a

MBarrierCreateOp

The Op generates one or more mbarrier objects, which are barriers created in

MBarrierGetOp

The nvgpu.mbarrier.get operation retrieves a pointer to a specific

MBarrierInitOp

The Op initializes the mbarrier object with the given number of threads.

MBarrierTestWaitOp

Checks whether the mbarrier object has completed the phase. It is a

MBarrierTryWaitParityOp

Checks whether the mbarrier object has completed the phase. It is a

MmaSparseSyncOp

The nvgpu.mma.sp.sync operation performs a warp-distributed MMA operation

MmaSyncOp

The nvgpu.mma.sync op represents the warp-level matrix-multiply-and-

RcpOp

Reciprocal calculation for vector types using nvvm.rcp OPs.

TmaAsyncLoadOp

The Op loads a tile memory region from global memory to shared memory by

TmaAsyncStoreOp

The Op stores a tile memory region from shared memory to global memory by

TmaCreateDescriptorOp

The Op creates a tensor map descriptor object representing tiled memory

TmaFenceOp

The Op fences the given $tmaDescriptor. This is necessary if the tensor map

TmaPrefetchOp

The Op prefetches the cache line containing the given $tmaDescriptor for

WarpgroupGenerateDescriptorOp

This Op builds a nvgpu.warpgroup.descriptor that is used by

WarpgroupMmaInitAccumulatorOp

This Op generates and initializes the accumulator matrix for

WarpgroupMmaOp

The nvgpu.warpgroup.mma op performs the warpgroup-level (4 warps)

WarpgroupMmaStoreOp

The nvgpu.warpgroup.mma.store op performs the store of fragmented result

Functions

device_async_copy(→ _ods_ir)

device_async_create_group(→ _ods_ir)

device_async_wait(→ DeviceAsyncWaitOp)

ldmatrix(→ _ods_ir)

mbarrier_arrive_expect_tx(→ MBarrierArriveExpectTxOp)

mbarrier_arrive_nocomplete(→ _ods_ir)

mbarrier_arrive(→ _ods_ir)

mbarrier_create(→ _ods_ir)

mbarrier_get(→ _ods_ir)

mbarrier_init(→ MBarrierInitOp)

mbarrier_test_wait(→ _ods_ir)

mbarrier_try_wait_parity(→ MBarrierTryWaitParityOp)

mma_sp_sync(→ _ods_ir)

mma_sync(→ _ods_ir)

rcp(→ _ods_ir)

tma_async_load(→ TmaAsyncLoadOp)

tma_async_store(→ TmaAsyncStoreOp)

tma_create_descriptor(→ _ods_ir)

tma_fence_descriptor(→ TmaFenceOp)

tma_prefetch_descriptor(→ TmaPrefetchOp)

warpgroup_generate_descriptor(→ _ods_ir)

warpgroup_mma_init_accumulator(→ _ods_ir)

warpgroup_mma(→ _ods_ir)

warpgroup_mma_store(→ WarpgroupMmaStoreOp)

Module Contents

mlir.dialects._nvgpu_ops_gen._ods_ir
class mlir.dialects._nvgpu_ops_gen._Dialect(descriptor: object)

Bases: _ods_ir

DIALECT_NAMESPACE = 'nvgpu'
class mlir.dialects._nvgpu_ops_gen.DeviceAsyncCopyOp(dst, dstIndices, src, srcIndices, dstElements, *, srcElements=None, bypassL1=None, results=None, loc=None, ip=None)

Bases: _ods_ir

The nvgpu.device_async_copy op initiates an asynchronous copy operation of elements from the source (global memory) to the destination (shared memory) without blocking the thread. The async copy is added to a group.

This op is meant to be used with nvgpu.device_async_create_group and nvgpu.device_async_wait to synchronize copies, as explained in those ops' descriptions.

The bypassL1 attribute is a hint to the hardware to bypass the L1 cache during the async copy; the hint may be ignored by the hardware.

The dstElements attribute is the total number of elements written to the destination (shared memory).

The srcElements operand is the total number of elements read from the source (global memory).

srcElements is optional; when present, the op reads only srcElements elements from the source (global memory) and zero-fills the remaining elements in the destination (shared memory).

In order to do a copy and wait for the result, we need the following combination:

// copy 1.
%cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// copy 2.
%cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 1 contains copy 1 and copy 2.
%token1 = nvgpu.device_async_create_group %cp1, %cp2
// copy 3.
%cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 2 contains copy 3.
%token2 = nvgpu.device_async_create_group %cp3
// after the wait copy 1 and copy 2 are complete.
nvgpu.device_async_wait %token1
// after the wait copy 3 is complete.
nvgpu.device_async_wait %token2

Example:

%0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 :
  memref<4x5xf32> to memref<2x7x5xf32, 3>
OPERATION_NAME = 'nvgpu.device_async_copy'
_ODS_OPERAND_SEGMENTS
_ODS_REGIONS = (0, True)
dst() _ods_ir
dstIndices() _ods_ir
src() _ods_ir
srcIndices() _ods_ir
srcElements() _ods_ir | None
dstElements() _ods_ir
bypassL1() bool
asyncToken() _ods_ir
mlir.dialects._nvgpu_ops_gen.device_async_copy(dst, dst_indices, src, src_indices, dst_elements, *, src_elements=None, bypass_l1=None, results=None, loc=None, ip=None) _ods_ir
class mlir.dialects._nvgpu_ops_gen.DeviceAsyncCreateGroupOp(inputTokens, *, results=None, loc=None, ip=None)

Bases: _ods_ir

The nvgpu.device_async_create_group op creates a group of memory accesses containing all the pending device_async_copy operations associated with argument tokens. Each token can only be part of one group.

It returns a token that can be used to wait until the group fully completes.

This op is meant to be used with nvgpu.device_async_wait to synchronize copies, as explained in that op's description.

Groups are executed in the order they are created.

Example:

%0 = nvgpu.device_async_create_group
OPERATION_NAME = 'nvgpu.device_async_create_group'
_ODS_REGIONS = (0, True)
inputTokens() _ods_ir
asyncToken() _ods_ir
mlir.dialects._nvgpu_ops_gen.device_async_create_group(input_tokens, *, results=None, loc=None, ip=None) _ods_ir
class mlir.dialects._nvgpu_ops_gen.DeviceAsyncWaitOp(asyncDependencies, *, numGroups=None, loc=None, ip=None)

Bases: _ods_ir

The nvgpu.device_async_wait op will block the execution thread until the group associated with the source token is fully completed.

The optional $numGroups attribute gives an upper bound on the number of groups that may still be incomplete when the wait unblocks the thread. For example, if 16 async groups are pushed and $numGroups is set to 12, then the thread will unblock once 12 or fewer groups are in flight (i.e. 4 groups have completed).

Example:

nvgpu.device_async_wait %0
OPERATION_NAME = 'nvgpu.device_async_wait'
_ODS_REGIONS = (0, True)
asyncDependencies() _ods_ir
numGroups() _ods_ir | None
mlir.dialects._nvgpu_ops_gen.device_async_wait(async_dependencies, *, num_groups=None, loc=None, ip=None) DeviceAsyncWaitOp
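
A minimal Python sketch of the copy/group/wait combination described above. It assumes this generated module is reachable through mlir.dialects.nvgpu (its usual re-export) and that the generated builders accept a plain Python int for the dstElements attribute; the memref types mirror the nvgpu.device_async_copy example.

from mlir import ir
from mlir.dialects import arith, func, nvgpu

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        src_ty = ir.Type.parse("memref<16xf32>")     # global memory
        dst_ty = ir.Type.parse("memref<16xf32, 3>")  # shared memory

        @func.FuncOp.from_py_func(dst_ty, src_ty)
        def copy_and_wait(dst, src):
            c0 = arith.ConstantOp(ir.IndexType.get(), 0).result
            # Issue the copy; the builder returns the async token value.
            cp = nvgpu.device_async_copy(dst, [c0], src, [c0], 4)
            # Group the pending copy and wait for the whole group.
            token = nvgpu.device_async_create_group([cp])
            nvgpu.device_async_wait(token)
    print(module)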
class mlir.dialects._nvgpu_ops_gen.LdMatrixOp(res, srcMemref, indices, transpose, numTiles, *, loc=None, ip=None)

Bases: _ods_ir

The nvgpu.ldmatrix op represents loading a matrix fragment from memory to registers. The source and result types must be compatible with lowering to the nvvm.ldmatrix instruction. This op is the distributed version of a vector.transfer_read and serves as an intermediate step when lowering from vector.transfer_read to nvvm.ldmatrix.

This operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix

Example:

%0 = nvgpu.ldmatrix %sm[%c0, %c0] {numTiles = 4 : i32, transpose = false} :
  memref<?x?xf16, 3> -> vector<4x2xf16>
OPERATION_NAME = 'nvgpu.ldmatrix'
_ODS_REGIONS = (0, True)
srcMemref() _ods_ir
indices() _ods_ir
transpose() _ods_ir
numTiles() _ods_ir
res() _ods_ir
mlir.dialects._nvgpu_ops_gen.ldmatrix(res, src_memref, indices, transpose, num_tiles, *, loc=None, ip=None) _ods_ir
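
A hedged sketch of driving the ldmatrix builder from Python, mirroring the textual example above (numTiles = 4, transpose = false); the mlir.dialects.nvgpu re-export and plain Python bool/int handling for the attributes are assumptions.

from mlir import ir
from mlir.dialects import arith, func, nvgpu

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        sm_ty = ir.Type.parse("memref<?x?xf16, 3>")  # shared-memory tile
        res_ty = ir.VectorType.get([4, 2], ir.F16Type.get())

        @func.FuncOp.from_py_func(sm_ty)
        def load_fragment(sm):
            c0 = arith.ConstantOp(ir.IndexType.get(), 0).result
            # Load a 4-tile fragment without transposition.
            return nvgpu.ldmatrix(res_ty, sm, [c0, c0], False, 4)
    print(module)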
class mlir.dialects._nvgpu_ops_gen.MBarrierArriveExpectTxOp(barriers, txcount, mbarId, *, predicate=None, loc=None, ip=None)

Bases: _ods_ir

A thread executing the Op performs an expect-tx operation on the mbarrier object at the location specified by the address operand $barrier. The expect-tx operation, with a $txcount argument, increases the tx-count of the mbarrier object by the value specified by $txcount. This makes the current phase of the mbarrier object expect and track the completion of additional asynchronous transactions.

The $txcount operand specifies the number of elements for the expect-tx operation.

Example:

nvgpu.mbarrier.arrive.expect_tx %barrier, %ic0 : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
OPERATION_NAME = 'nvgpu.mbarrier.arrive.expect_tx'
_ODS_REGIONS = (0, True)
barriers() _ods_ir
txcount() _ods_ir
mbarId() _ods_ir
predicate() _ods_ir | None
mlir.dialects._nvgpu_ops_gen.mbarrier_arrive_expect_tx(barriers, txcount, mbar_id, *, predicate=None, loc=None, ip=None) MBarrierArriveExpectTxOp
class mlir.dialects._nvgpu_ops_gen.MBarrierArriveNoCompleteOp(barriers, mbarId, count, *, results=None, loc=None, ip=None)

Bases: _ods_ir

The Op performs an arrive-on operation on the mbarrier object and returns an nvgpu.mbarrier.token.

The Op does not cause the nvgpu.mbarrier to complete its current phase.

Example:

%token = nvgpu.mbarrier.arrive.nocomplete %barrier, %count : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>> -> !nvgpu.mbarrier.token
OPERATION_NAME = 'nvgpu.mbarrier.arrive.nocomplete'
_ODS_REGIONS = (0, True)
barriers() _ods_ir
mbarId() _ods_ir
count() _ods_ir
token() _ods_ir
mlir.dialects._nvgpu_ops_gen.mbarrier_arrive_nocomplete(barriers, mbar_id, count, *, results=None, loc=None, ip=None) _ods_ir
class mlir.dialects._nvgpu_ops_gen.MBarrierArriveOp(barriers, mbarId, *, results=None, loc=None, ip=None)

Bases: _ods_ir

The Op performs an arrive-on operation on the mbarrier object and returns an nvgpu.mbarrier.token.

For more information, see https://docs.nvidia.com/cuda/parallel-thread-execution/#arrive-on-operation-on-mbarrier-object

Example:

%token = nvgpu.mbarrier.arrive %barrier : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>> -> !nvgpu.mbarrier.token
OPERATION_NAME = 'nvgpu.mbarrier.arrive'
_ODS_REGIONS = (0, True)
barriers() _ods_ir
mbarId() _ods_ir
token() _ods_ir
mlir.dialects._nvgpu_ops_gen.mbarrier_arrive(barriers, mbar_id, *, results=None, loc=None, ip=None) _ods_ir
class mlir.dialects._nvgpu_ops_gen.MBarrierCreateOp(barriers, *, loc=None, ip=None)

Bases: _ods_ir

The Op generates one or more mbarrier objects, which are barriers created in shared memory and support various synchronization behaviors for threads.

The mbarrier object has the following type and alignment requirements: Type: .b64, Alignment: 8, Memory space: .shared

Example:

%barrier = nvgpu.mbarrier.create -> !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
OPERATION_NAME = 'nvgpu.mbarrier.create'
_ODS_REGIONS = (0, True)
barriers() _ods_ir
mlir.dialects._nvgpu_ops_gen.mbarrier_create(barriers, *, loc=None, ip=None) _ods_ir
class mlir.dialects._nvgpu_ops_gen.MBarrierGetOp(mbarrierPointer, barriers, mbarId, *, loc=None, ip=None)

Bases: _ods_ir

The nvgpu.mbarrier.get operation retrieves a pointer to a specific mbarrier object from a group of barriers created by the nvgpu.mbarrier.create operation.

Example:

%mbars = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 10>
%mbar_pointer = nvgpu.mbarrier.get %mbars[%c2] : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>>
OPERATION_NAME = 'nvgpu.mbarrier.get'
_ODS_REGIONS = (0, True)
barriers() _ods_ir
mbarId() _ods_ir
mbarrierPointer() _ods_ir
mlir.dialects._nvgpu_ops_gen.mbarrier_get(mbarrier_pointer, barriers, mbar_id, *, loc=None, ip=None) _ods_ir
class mlir.dialects._nvgpu_ops_gen.MBarrierInitOp(barriers, count, mbarId, *, predicate=None, loc=None, ip=None)

Bases: _ods_ir

The Op initializes the mbarrier object with the given number of threads.

Example:

%num_threads = gpu.block_dim x
%barrier = nvgpu.mbarrier.create -> !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
nvgpu.mbarrier.init %barrier, %num_threads : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
OPERATION_NAME = 'nvgpu.mbarrier.init'
_ODS_REGIONS = (0, True)
barriers() _ods_ir
count() _ods_ir
mbarId() _ods_ir
predicate() _ods_ir | None
mlir.dialects._nvgpu_ops_gen.mbarrier_init(barriers, count, mbar_id, *, predicate=None, loc=None, ip=None) MBarrierInitOp
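
A minimal sketch of creating and initializing an mbarrier group from Python. The textual mbarrier type follows the nvgpu.mbarrier.get example above (with num_barriers = 1) and may need adjusting for other MLIR versions; the mlir.dialects.nvgpu re-export is assumed.

from mlir import ir
from mlir.dialects import arith, func, nvgpu

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        mbar_ty = ir.Type.parse(
            "!nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, "
            "num_barriers = 1>")

        @func.FuncOp.from_py_func()
        def make_barrier():
            c0 = arith.ConstantOp(ir.IndexType.get(), 0).result
            num_threads = arith.ConstantOp(ir.IndexType.get(), 128).result
            barriers = nvgpu.mbarrier_create(mbar_ty)
            # Initialize barrier 0 with the expected number of arriving threads.
            nvgpu.mbarrier_init(barriers, num_threads, c0)
    print(module)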
class mlir.dialects._nvgpu_ops_gen.MBarrierTestWaitOp(barriers, token, mbarId, *, results=None, loc=None, ip=None)

Bases: _ods_ir

Checks whether the mbarrier object has completed the phase. It is a non-blocking instruction which tests for the completion of the phase.

Example:

%isComplete = nvgpu.mbarrier.test.wait %barrier, %token : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>, !nvgpu.mbarrier.token
OPERATION_NAME = 'nvgpu.mbarrier.test.wait'
_ODS_REGIONS = (0, True)
barriers() _ods_ir
token() _ods_ir
mbarId() _ods_ir
waitComplete() _ods_ir
mlir.dialects._nvgpu_ops_gen.mbarrier_test_wait(barriers, token, mbar_id, *, results=None, loc=None, ip=None) _ods_ir
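
A short sketch of the arrive/test-wait pair. The barrier group is taken as a function argument, and the builders are assumed to infer the token and i1 result types (both expose an optional results= override).

from mlir import ir
from mlir.dialects import arith, func, nvgpu

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        mbar_ty = ir.Type.parse(
            "!nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, "
            "num_barriers = 1>")

        @func.FuncOp.from_py_func(mbar_ty)
        def arrive_and_poll(barriers):
            c0 = arith.ConstantOp(ir.IndexType.get(), 0).result
            # Arrive on barrier 0; the mbarrier token type is inferred.
            token = nvgpu.mbarrier_arrive(barriers, c0)
            # Non-blocking poll for phase completion; returns an i1 value.
            return nvgpu.mbarrier_test_wait(barriers, token, c0)
    print(module)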
class mlir.dialects._nvgpu_ops_gen.MBarrierTryWaitParityOp(barriers, phaseParity, ticks, mbarId, *, loc=None, ip=None)

Bases: _ods_ir

Checks whether the mbarrier object has completed the phase. It is a potentially blocking instruction which tests for the completion of the phase. A suspended thread resumes execution when the specified phase completes, or before the phase completes once a system-dependent time limit elapses.

The $phaseParity operand specifies whether to wait on the even phase (0) or the odd phase (1).

Example:

nvgpu.mbarrier.try_wait.parity %barrier, %phaseParity, %ticks : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
OPERATION_NAME = 'nvgpu.mbarrier.try_wait.parity'
_ODS_REGIONS = (0, True)
barriers() _ods_ir
phaseParity() _ods_ir
ticks() _ods_ir
mbarId() _ods_ir
mlir.dialects._nvgpu_ops_gen.mbarrier_try_wait_parity(barriers, phase_parity, ticks, mbar_id, *, loc=None, ip=None) MBarrierTryWaitParityOp
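
A sketch of the expect-tx / try-wait-parity pattern used with TMA loads. The barrier group is a function argument; modeling the phase parity as an i1 constant and the textual mbarrier type are assumptions that may vary across MLIR versions.

from mlir import ir
from mlir.dialects import arith, func, nvgpu

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        mbar_ty = ir.Type.parse(
            "!nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, "
            "num_barriers = 1>")
        i1 = ir.IntegerType.get_signless(1)

        @func.FuncOp.from_py_func(mbar_ty)
        def expect_and_wait(barriers):
            c0 = arith.ConstantOp(ir.IndexType.get(), 0).result
            txcount = arith.ConstantOp(ir.IndexType.get(), 256).result
            phase = arith.ConstantOp(i1, 0).result  # even phase (assumed i1)
            ticks = arith.ConstantOp(ir.IndexType.get(), 10000000).result
            # Expect 256 elements' worth of asynchronous transactions on barrier 0.
            nvgpu.mbarrier_arrive_expect_tx(barriers, txcount, c0)
            # Wait (with a time limit) for the even phase to complete.
            nvgpu.mbarrier_try_wait_parity(barriers, phase, ticks, c0)
    print(module)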
class mlir.dialects._nvgpu_ops_gen.MmaSparseSyncOp(res, matrixA, matrixB, matrixC, sparseMetadata, mmaShape, *, sparsitySelector=None, tf32Enabled=None, loc=None, ip=None)

Bases: _ods_ir

The nvgpu.mma.sp.sync operation performs a warp-distributed MMA operation where operand A is “structured sparse”. In this case, the matrixA operand represents the (warp-distributed) non-zero values of operand A, and the sparse_metadata operand provides the indices.

The full description of the sparsity storage format and distribution scheme is described in the PTX docs. This operation is meant to follow the semantic described in the PTX documentation here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-sparse-mma

The way the indices are distributed among the threads in a warp is controlled by the optional sparsity_selector operand, which is 0 by default. For more information, please consult the PTX documentation linked above.

Example (targeting the f16 16x8x32 mma.sp PTX instruction):

nvgpu.mma.sp.sync (%a, %b, %c) metadata (%meta) {mmaShape = [16, 8, 32]} :
  (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
OPERATION_NAME = 'nvgpu.mma.sp.sync'
_ODS_REGIONS = (0, True)
matrixA() _ods_ir
matrixB() _ods_ir
matrixC() _ods_ir
sparseMetadata() _ods_ir
mmaShape() _ods_ir
sparsitySelector() _ods_ir
tf32Enabled() bool
res() _ods_ir
mlir.dialects._nvgpu_ops_gen.mma_sp_sync(res, matrix_a, matrix_b, matrix_c, sparse_metadata, mma_shape, *, sparsity_selector=None, tf32_enabled=None, loc=None, ip=None) _ods_ir
class mlir.dialects._nvgpu_ops_gen.MmaSyncOp(res, matrixA, matrixB, matrixC, mmaShape, *, tf32Enabled=None, loc=None, ip=None)

Bases: _ods_ir

The nvgpu.mma.sync op represents the warp-level matrix-multiply-and-accumulate (mma) operation that is compatible with nvvm.mma.sync. The operand and result vector sizes represent each thread's ownership fragment of the warp-level mma operation shape. The mmaShape attribute holds the warp-level matrix-multiply shape.

The nvgpu.mma.sync op serves as an intermediate step when lowering from vector.contract to nvvm.mma.sync.

This operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma

Example:

%res = nvgpu.mma.sync (%matrixA, %matrixB, %matrixC) {mmaShape = [16, 8, 16]} :
    (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf32>) -> vector<2x2xf32>
OPERATION_NAME = 'nvgpu.mma.sync'
_ODS_REGIONS = (0, True)
matrixA() _ods_ir
matrixB() _ods_ir
matrixC() _ods_ir
mmaShape() _ods_ir
tf32Enabled() bool
res() _ods_ir
mlir.dialects._nvgpu_ops_gen.mma_sync(res, matrix_a, matrix_b, matrix_c, mma_shape, *, tf32_enabled=None, loc=None, ip=None) _ods_ir
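
A minimal Python sketch of building the f16 16x8x16 nvgpu.mma.sync from the example above. It assumes the generated builder accepts a plain Python list for the mmaShape attribute and that mlir.dialects.nvgpu re-exports this module.

from mlir import ir
from mlir.dialects import func, nvgpu

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        f16 = ir.F16Type.get()
        f32 = ir.F32Type.get()
        a_ty = ir.VectorType.get([4, 2], f16)
        b_ty = ir.VectorType.get([2, 2], f16)
        c_ty = ir.VectorType.get([2, 2], f32)

        @func.FuncOp.from_py_func(a_ty, b_ty, c_ty)
        def mma_16x8x16(a, b, c):
            # mmaShape is the warp-level [M, N, K] shape.
            return nvgpu.mma_sync(c_ty, a, b, c, [16, 8, 16])
    print(module)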
class mlir.dialects._nvgpu_ops_gen.RcpOp(in_, *, rounding=None, ftz=None, results=None, loc=None, ip=None)

Bases: _ods_ir

Reciprocal calculation for vector types using nvvm.rcp OPs.

Currently, only the approx rounding mode and ftz are supported, and only for the f32 type.

The input and output must be of the same vector type and shape.

OPERATION_NAME = 'nvgpu.rcp'
_ODS_REGIONS = (0, True)
in_() _ods_ir
rounding() _ods_ir
ftz() bool
out() _ods_ir
mlir.dialects._nvgpu_ops_gen.rcp(in_, *, rounding=None, ftz=None, results=None, loc=None, ip=None) _ods_ir
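
A minimal sketch of the rcp builder on a vector<4xf32> value. The result type is assumed to be inferred from the operand (the builder exposes an optional results= override), and rounding/ftz are left at their defaults.

from mlir import ir
from mlir.dialects import func, nvgpu

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        vec_ty = ir.VectorType.get([4], ir.F32Type.get())

        @func.FuncOp.from_py_func(vec_ty)
        def reciprocal(v):
            # Element-wise approximate reciprocal of the input vector.
            return nvgpu.rcp(v)
    print(module)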
class mlir.dialects._nvgpu_ops_gen.TmaAsyncLoadOp(dst, barriers, tensorMapDescriptor, coordinates, mbarId, *, multicastMask=None, predicate=None, loc=None, ip=None)

Bases: _ods_ir

The Op loads a tile memory region from global memory to shared memory by Tensor Memory Access (TMA).

$tensorMapDescriptor is a tensor map descriptor that carries information about the tile shape. The descriptor is created by nvgpu.tma.create.descriptor.

The Op uses an mbarrier-based completion mechanism through the $barriers operand.

OPERATION_NAME = 'nvgpu.tma.async.load'
_ODS_OPERAND_SEGMENTS
_ODS_REGIONS = (0, True)
dst() _ods_ir
barriers() _ods_ir
tensorMapDescriptor() _ods_ir
coordinates() _ods_ir
mbarId() _ods_ir
multicastMask() _ods_ir | None
predicate() _ods_ir | None
mlir.dialects._nvgpu_ops_gen.tma_async_load(dst, barriers, tensor_map_descriptor, coordinates, mbar_id, *, multicast_mask=None, predicate=None, loc=None, ip=None) TmaAsyncLoadOp
class mlir.dialects._nvgpu_ops_gen.TmaAsyncStoreOp(src, tensorMapDescriptor, coordinates, *, predicate=None, loc=None, ip=None)

Bases: _ods_ir

The Op stores a tile memory region from shared memory to global memory by Tensor Memory Access (TMA).

$tensorMapDescriptor is a tensor map descriptor that carries information about the tile shape. The descriptor is created by nvgpu.tma.create.descriptor.

OPERATION_NAME = 'nvgpu.tma.async.store'
_ODS_OPERAND_SEGMENTS
_ODS_REGIONS = (0, True)
src() _ods_ir
tensorMapDescriptor() _ods_ir
coordinates() _ods_ir
predicate() _ods_ir | None
mlir.dialects._nvgpu_ops_gen.tma_async_store(src, tensor_map_descriptor, coordinates, *, predicate=None, loc=None, ip=None) TmaAsyncStoreOp
class mlir.dialects._nvgpu_ops_gen.TmaCreateDescriptorOp(tensorMap, tensor, boxDimensions, *, loc=None, ip=None)

Bases: _ods_ir

The Op creates a tensor map descriptor object representing a tiled memory region. To do that, it calls the CUDA Driver's cuTensorMapEncodeTiled. The descriptor is used by Tensor Memory Access (TMA).

The tensor is the source tensor to be tiled.

The boxDimensions operand is the size of the tiled memory region in each dimension.

For more information see below: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

OPERATION_NAME = 'nvgpu.tma.create.descriptor'
_ODS_REGIONS = (0, True)
tensor() _ods_ir
boxDimensions() _ods_ir
tensorMap() _ods_ir
mlir.dialects._nvgpu_ops_gen.tma_create_descriptor(tensor_map, tensor, box_dimensions, *, loc=None, ip=None) _ods_ir
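
A hedged sketch of creating a tensor map descriptor from Python. The exact textual form of !nvgpu.tensormap.descriptor used below is an assumption that may need adjusting for your MLIR version; the unranked-memref operand and per-dimension box sizes follow the description above.

from mlir import ir
from mlir.dialects import arith, func, nvgpu

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        src_ty = ir.Type.parse("memref<*xf32>")  # source tensor in global memory
        desc_ty = ir.Type.parse(
            "!nvgpu.tensormap.descriptor<tensor = memref<64x32xf32, 3>, "
            "swizzle = none, l2promo = none, oob = zero, interleave = none>")

        @func.FuncOp.from_py_func(src_ty)
        def make_descriptor(tensor):
            c64 = arith.ConstantOp(ir.IndexType.get(), 64).result
            c32 = arith.ConstantOp(ir.IndexType.get(), 32).result
            # Encode a 64x32 tile of the source tensor.
            return nvgpu.tma_create_descriptor(desc_ty, tensor, [c64, c32])
    print(module)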
class mlir.dialects._nvgpu_ops_gen.TmaFenceOp(tensorMapDescriptor, *, loc=None, ip=None)

Bases: _ods_ir

The Op fences the given $tmaDescriptor. This is necessary if the tensor map descriptor was modified from the host using cudaMemcpy. In this case, the kernel needs a fence, after which it is safe to use the tensor map.

OPERATION_NAME = 'nvgpu.tma.fence.descriptor'
_ODS_REGIONS = (0, True)
tensorMapDescriptor() _ods_ir
mlir.dialects._nvgpu_ops_gen.tma_fence_descriptor(tensor_map_descriptor, *, loc=None, ip=None) TmaFenceOp
class mlir.dialects._nvgpu_ops_gen.TmaPrefetchOp(tensorMapDescriptor, *, predicate=None, loc=None, ip=None)

Bases: _ods_ir

The Op prefetches the cache line containing the given $tmaDescriptor for subsequent use by the tma.async.load instruction.

OPERATION_NAME = 'nvgpu.tma.prefetch.descriptor'
_ODS_REGIONS = (0, True)
tensorMapDescriptor() _ods_ir
predicate() _ods_ir | None
mlir.dialects._nvgpu_ops_gen.tma_prefetch_descriptor(tensor_map_descriptor, *, predicate=None, loc=None, ip=None) TmaPrefetchOp
class mlir.dialects._nvgpu_ops_gen.WarpgroupGenerateDescriptorOp(descriptor, tensor, tensorMap, *, loc=None, ip=None)

Bases: _ods_ir

This Op builds an nvgpu.warpgroup.descriptor that is used by nvgpu.warpgroup.mma to perform warpgroup-level matrix multiply and accumulate.

The descriptor specifies the properties of the matrix in shared memory that is a multiplicand in the matrix multiply and accumulate operation.

OPERATION_NAME = 'nvgpu.warpgroup.generate.descriptor'
_ODS_REGIONS = (0, True)
tensor() _ods_ir
tensorMap() _ods_ir
descriptor() _ods_ir
mlir.dialects._nvgpu_ops_gen.warpgroup_generate_descriptor(descriptor, tensor, tensor_map, *, loc=None, ip=None) _ods_ir
class mlir.dialects._nvgpu_ops_gen.WarpgroupMmaInitAccumulatorOp(matrixC, *, loc=None, ip=None)

Bases: _ods_ir

This Op generates and initializes the accumulator matrix for the nvgpu.warpgroup.mma op to perform matrix-multiply-and-accumulate.

OPERATION_NAME = 'nvgpu.warpgroup.mma.init.accumulator'
_ODS_REGIONS = (0, True)
matrixC() _ods_ir
mlir.dialects._nvgpu_ops_gen.warpgroup_mma_init_accumulator(matrix_c, *, loc=None, ip=None) _ods_ir
class mlir.dialects._nvgpu_ops_gen.WarpgroupMmaOp(matrixD, descriptorA, descriptorB, matrixC, *, waitGroup=None, transposeA=None, transposeB=None, loc=None, ip=None)

Bases: _ods_ir

The nvgpu.warpgroup.mma op performs the warpgroup-level (4 warps) matrix-multiply-and-accumulate (mma) operation that lowers to nvvm.wgmma.mma_async.

The operands descriptorA and descriptorB are wgmma matrix descriptors that describe the properties of the matrices in shared memory. The result represents the thread-level ownership of the warpgroup-level mma operation shape. The shape is deduced from the descriptor types and the output vector.

The Op encapsulates multiple nvvm.wgmma.mma_async operations to complete the given shape. Since the nvvm.wgmma.mma_async Op, or its corresponding PTX instruction, is asynchronous, this Op groups the nvvm.wgmma.mma_async ops and surrounds them with wgmma.fence.aligned, wgmma.commit.group.sync.aligned, and wgmma.wait.group.sync.aligned Ops.

Example:

%r1,%r2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2:
           !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
           !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
           !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
           !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
           ->
           !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
           !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
OPERATION_NAME = 'nvgpu.warpgroup.mma'
_ODS_REGIONS = (0, True)
descriptorA() _ods_ir
descriptorB() _ods_ir
matrixC() _ods_ir
waitGroup() _ods_ir | None
transposeA() bool
transposeB() bool
matrixD() _ods_ir
mlir.dialects._nvgpu_ops_gen.warpgroup_mma(matrix_d, descriptor_a, descriptor_b, matrix_c, *, wait_group=None, transpose_a=None, transpose_b=None, loc=None, ip=None) _ods_ir
class mlir.dialects._nvgpu_ops_gen.WarpgroupMmaStoreOp(matrixD, dstMemref, *, loc=None, ip=None)

Bases: _ods_ir

The nvgpu.warpgroup.mma.store op stores the fragmented result held in $matrixD to the given memref.

See the details of the register fragment layout for accumulator matrix D: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d

Note that the op must be run by a full warpgroup.

OPERATION_NAME = 'nvgpu.warpgroup.mma.store'
_ODS_REGIONS = (0, True)
matrixD() _ods_ir
dstMemref() _ods_ir
mlir.dialects._nvgpu_ops_gen.warpgroup_mma_store(matrix_d, dst_memref, *, loc=None, ip=None) WarpgroupMmaStoreOp
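
A hedged sketch of the warpgroup flow (init accumulator, mma, store). The descriptor and accumulator type strings are adapted from the nvgpu.warpgroup.mma example above; the 128x128x64 tile shape and the shared-memory output memref are assumptions.

from mlir import ir
from mlir.dialects import func, nvgpu

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        desc_a_ty = ir.Type.parse(
            "!nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>")
        desc_b_ty = ir.Type.parse(
            "!nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>")
        acc_ty = ir.Type.parse(
            "!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>")
        out_ty = ir.Type.parse("memref<128x128xf32, 3>")

        @func.FuncOp.from_py_func(desc_a_ty, desc_b_ty, out_ty)
        def warpgroup_gemm(desc_a, desc_b, out):
            # Create and initialize the accumulator fragments.
            acc = nvgpu.warpgroup_mma_init_accumulator(acc_ty)
            # Perform the warpgroup-level matrix multiply-accumulate.
            acc = nvgpu.warpgroup_mma(acc_ty, desc_a, desc_b, acc)
            # Store the fragmented result to the output memref.
            nvgpu.warpgroup_mma_store(acc, out)
    print(module)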