mlir.dialects._nvgpu_ops_gen¶
Module Contents¶
- mlir.dialects._nvgpu_ops_gen._ods_ir¶
- class mlir.dialects._nvgpu_ops_gen._Dialect(descriptor: object)¶
Bases: _ods_ir.Dialect

- DIALECT_NAMESPACE = 'nvgpu'¶
- class mlir.dialects._nvgpu_ops_gen.DeviceAsyncCopyOp(dst, dstIndices, src, srcIndices, dstElements, *, srcElements=None, bypassL1=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The nvgpu.device_async_copy op initiates an asynchronous copy of elements from the source (global memory) to the destination (shared memory) without blocking the thread. The async copy is added to a group.

This op is meant to be used with nvgpu.device_async_create_group and nvgpu.device_async_wait to synchronize copies, as explained in those ops' descriptions.

The bypassL1 attribute is a hint to the hardware to bypass the L1 cache during the async copy; the hint may be ignored by the hardware.

The dstElements attribute is the total number of elements written to the destination (shared memory).

The srcElements argument is the total number of elements read from the source (global memory). It is optional; when present, the op reads only srcElements elements from the source (global memory) and zero-fills the remaining elements in the destination (shared memory).

In order to do a copy and wait for the result, the following combination is needed:

    // copy 1.
    %cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
    // copy 2.
    %cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
    // group 1 contains copy 1 and copy 2.
    %token1 = nvgpu.device_async_create_group %cp1, %cp2
    // copy 3.
    %cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
    // group 2 contains copy 3.
    %token2 = nvgpu.device_async_create_group %cp3
    // after the wait copy 1 and copy 2 are complete.
    nvgpu.device_async_wait %token1
    // after the wait copy 3 is complete.
    nvgpu.device_async_wait %token2
Example:
%0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>
- OPERATION_NAME = 'nvgpu.device_async_copy'¶
- _ODS_OPERAND_SEGMENTS¶
- _ODS_REGIONS = (0, True)¶
- dst() _ods_ir¶
- dstIndices() _ods_ir¶
- src() _ods_ir¶
- srcIndices() _ods_ir¶
- srcElements() _ods_ir | None¶
- dstElements() _ods_ir¶
- bypassL1() bool¶
- asyncToken() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.device_async_copy(dst, dst_indices, src, src_indices, dst_elements, *, src_elements=None, bypass_l1=None, results=None, loc=None, ip=None) _ods_ir¶
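As a usage illustration, here is a minimal Python sketch of driving this builder; it is not from the source docs and assumes the public mlir.dialects.nvgpu module re-exports the generated builders and that the !nvgpu.device.async.token result type is inferred when results is omitted:

    from mlir import ir
    from mlir.dialects import arith, func, nvgpu

    with ir.Context(), ir.Location.unknown():
        module = ir.Module.create()
        src_ty = ir.Type.parse("memref<16xf32>")     # global memory
        dst_ty = ir.Type.parse("memref<16xf32, 3>")  # shared memory
        with ir.InsertionPoint(module.body):

            @func.FuncOp.from_py_func(src_ty, dst_ty)
            def copy(src, dst):
                c0 = arith.constant(ir.IndexType.get(), 0)
                # Asynchronously copy 4 elements; returns an async token.
                token = nvgpu.device_async_copy(dst, [c0], src, [c0], 4)

        print(module)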
- class mlir.dialects._nvgpu_ops_gen.DeviceAsyncCreateGroupOp(inputTokens, *, results=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The nvgpu.device_async_create_group op creates a group of memory accesses containing all the pending device_async_copy operations associated with the argument tokens. Each token can only be part of one group.

It returns a token that can be used to wait until the group fully completes.

This is meant to be used with nvgpu.device_async_wait to synchronize copies, as explained in that op's description.

Groups are executed in the order they are created.
Example:
%0 = nvgpu.device_async_create_group
- OPERATION_NAME = 'nvgpu.device_async_create_group'¶
- _ODS_REGIONS = (0, True)¶
- inputTokens() _ods_ir¶
- asyncToken() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.device_async_create_group(input_tokens, *, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._nvgpu_ops_gen.DeviceAsyncWaitOp(asyncDependencies, *, numGroups=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The nvgpu.device_async_wait op blocks the execution thread until the group associated with the source token is fully completed.

The optional $numGroups attribute gives an upper bound on the number of groups that may still be uncompleted when the wait unblocks the thread. For example, if 16 async groups are pushed and $numGroups is set to 12, then the thread unblocks when 12 or fewer groups are in flight (i.e. at least 4 groups have completed).

Example:
nvgpu.device_async_wait %0
- OPERATION_NAME = 'nvgpu.device_async_wait'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- numGroups() _ods_ir | None¶
- mlir.dialects._nvgpu_ops_gen.device_async_wait(async_dependencies, *, num_groups=None, loc=None, ip=None) DeviceAsyncWaitOp¶
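Continuing the sketch under nvgpu.device_async_copy above (same function body, same hedges), grouping and waiting look like this; num_groups=0 requests full completion:

    token1 = nvgpu.device_async_copy(dst, [c0], src, [c0], 4)
    token2 = nvgpu.device_async_copy(dst, [c0], src, [c0], 4)
    group = nvgpu.device_async_create_group([token1, token2])
    # Unblock only when no groups remain in flight.
    nvgpu.device_async_wait(group, num_groups=0)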
- class mlir.dialects._nvgpu_ops_gen.LdMatrixOp(res, srcMemref, indices, transpose, numTiles, *, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The nvgpu.ldmatrix op represents loading a matrix fragment from memory to registers. The source and result types must be compatible with lowering to the nvvm.ldmatrix instruction. This op represents the distributed version of a vector.transfer_read as an intermediate step between lowering from vector.transfer_read to nvvm.ldmatrix.

This operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix
Example:
%0 = nvgpu.ldmatrix %sm[%c0, %c0] {numTiles = 4 : i32, transpose = false} : memref<?x?xf16, 3> -> vector<4x2xf16>
- OPERATION_NAME = 'nvgpu.ldmatrix'¶
- _ODS_REGIONS = (0, True)¶
- srcMemref() _ods_ir¶
- indices() _ods_ir¶
- transpose() _ods_ir¶
- numTiles() _ods_ir¶
- res() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.ldmatrix(res, src_memref, indices, transpose, num_tiles, *, loc=None, ip=None) _ods_ir¶
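A hedged sketch of this builder (assumptions: an active Context, Location, and InsertionPoint inside a kernel; sm is an SSA value of type memref<?x?xf16, 3>). Note that the result type is passed explicitly as the leading res argument:

    res_ty = ir.Type.parse("vector<4x2xf16>")
    c0 = arith.constant(ir.IndexType.get(), 0)
    # transpose=False, numTiles=4, matching the example above.
    frag = nvgpu.ldmatrix(res_ty, sm, [c0, c0], False, 4)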
- class mlir.dialects._nvgpu_ops_gen.MBarrierArriveExpectTxOp(barriers, txcount, mbarId, *, predicate=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

A thread executing the Op performs an expect-tx operation on the mbarrier object at the location specified by the address operand $barrier. The expect-tx operation, with a $txcount argument, increases the tx-count of an mbarrier object by the value specified by $txcount. This makes the current phase of the mbarrier object expect and track the completion of additional asynchronous transactions.

The $txcount operand specifies the number of elements for the expect-tx operation.

Example:
nvgpu.mbarrier.arrive.expect_tx %barrier, %ic0 : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
- OPERATION_NAME = 'nvgpu.mbarrier.arrive.expect_tx'¶
- _ODS_REGIONS = (0, True)¶
- barriers() _ods_ir¶
- txcount() _ods_ir¶
- mbarId() _ods_ir¶
- predicate() _ods_ir | None¶
- mlir.dialects._nvgpu_ops_gen.mbarrier_arrive_expect_tx(barriers, txcount, mbar_id, *, predicate=None, loc=None, ip=None) MBarrierArriveExpectTxOp¶
- class mlir.dialects._nvgpu_ops_gen.MBarrierArriveNoCompleteOp(barriers, mbarId, count, *, results=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The Op performs an arrive-on operation on the mbarrier object and returns a nvgpu.mbarrier.token.

The Op does not cause the nvgpu.mbarrier to complete its current phase.

Example:
%token = nvgpu.mbarrier.arrive.noComplete %barrier, %count : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>> -> !nvgpu.mbarrier.token
- OPERATION_NAME = 'nvgpu.mbarrier.arrive.nocomplete'¶
- _ODS_REGIONS = (0, True)¶
- barriers() _ods_ir¶
- mbarId() _ods_ir¶
- count() _ods_ir¶
- token() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.mbarrier_arrive_nocomplete(barriers, mbar_id, count, *, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._nvgpu_ops_gen.MBarrierArriveOp(barriers, mbarId, *, results=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The Op performs an arrive-on operation on the mbarrier object and returns a nvgpu.mbarrier.token.

For more information, see https://docs.nvidia.com/cuda/parallel-thread-execution/#arrive-on-operation-on-mbarrier-object
Example:
%token = nvgpu.mbarrier.arrive %barrier : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>> -> !nvgpu.mbarrier.token
- OPERATION_NAME = 'nvgpu.mbarrier.arrive'¶
- _ODS_REGIONS = (0, True)¶
- barriers() _ods_ir¶
- mbarId() _ods_ir¶
- token() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.mbarrier_arrive(barriers, mbar_id, *, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._nvgpu_ops_gen.MBarrierCreateOp(barriers, *, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The Op generates one or more mbarrier objects. An mbarrier is a barrier created in shared memory that supports various synchronization behaviors for threads.

The mbarrier object has the following type and alignment requirements: Type: .b64, Alignment: 8, Memory space: .shared

Example:
%barrier = nvgpu.mbarrier.create -> !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
- OPERATION_NAME = 'nvgpu.mbarrier.create'¶
- _ODS_REGIONS = (0, True)¶
- barriers() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.mbarrier_create(barriers, *, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._nvgpu_ops_gen.MBarrierGetOp(mbarrierPointer, barriers, mbarId, *, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The nvgpu.mbarrier.get operation retrieves a pointer to a specific mbarrier object from a group of barriers created by the nvgpu.mbarrier.create operation.

Example:

    %mbars = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 10>
    %mbar_pointer = nvgpu.mbarrier.get %mbars[%c2] : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>>
- OPERATION_NAME = 'nvgpu.mbarrier.get'¶
- _ODS_REGIONS = (0, True)¶
- barriers() _ods_ir¶
- mbarId() _ods_ir¶
- mbarrierPointer() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.mbarrier_get(mbarrier_pointer, barriers, mbar_id, *, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._nvgpu_ops_gen.MBarrierInitOp(barriers, count, mbarId, *, predicate=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The Op initializes the mbarrier object with the given number of threads.

Example:

    %num_threads = gpu.block_dim x
    %barrier = nvgpu.mbarrier.create -> !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
    nvgpu.mbarrier.init %barrier, %num_threads : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
- OPERATION_NAME = 'nvgpu.mbarrier.init'¶
- _ODS_REGIONS = (0, True)¶
- barriers() _ods_ir¶
- count() _ods_ir¶
- mbarId() _ods_ir¶
- predicate() _ods_ir | None¶
- mlir.dialects._nvgpu_ops_gen.mbarrier_init(barriers, count, mbar_id, *, predicate=None, loc=None, ip=None) MBarrierInitOp¶
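A hedged sketch of creating and initializing a barrier from Python (assumptions: an active Context/Location/InsertionPoint; the textual type spelling is copied from the examples above and may differ across MLIR versions):

    barrier_ty = ir.Type.parse(
        "!nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>")
    mbar = nvgpu.mbarrier_create(barrier_ty)
    c0 = arith.constant(ir.IndexType.get(), 0)  # barrier id in the group
    num_threads = arith.constant(ir.IndexType.get(), 128)
    nvgpu.mbarrier_init(mbar, num_threads, c0)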
- class mlir.dialects._nvgpu_ops_gen.MBarrierTestWaitOp(barriers, token, mbarId, *, results=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

Checks whether the mbarrier object has completed the phase. It is a non-blocking instruction that tests for the completion of the phase.
Example:
%isComplete = nvgpu.mbarrier.test.wait %barrier, %token : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>, !nvgpu.mbarrier.token
- OPERATION_NAME = 'nvgpu.mbarrier.test.wait'¶
- _ODS_REGIONS = (0, True)¶
- barriers() _ods_ir¶
- token() _ods_ir¶
- mbarId() _ods_ir¶
- waitComplete() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.mbarrier_test_wait(barriers, token, mbar_id, *, results=None, loc=None, ip=None) _ods_ir¶
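Continuing the mbarrier sketch above, an arrive that yields a token followed by a non-blocking test (assumption: the generated builder accepts an explicit result type list via results=):

    token_ty = ir.Type.parse("!nvgpu.mbarrier.token")
    token = nvgpu.mbarrier_arrive(mbar, c0, results=[token_ty])
    # i1 result: true once the tracked phase has completed.
    done = nvgpu.mbarrier_test_wait(mbar, token, c0)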
- class mlir.dialects._nvgpu_ops_gen.MBarrierTryWaitParityOp(barriers, phaseParity, ticks, mbarId, *, loc=None, ip=None)¶
Bases: _ods_ir.OpView

Checks whether the mbarrier object has completed the phase. It is a potentially blocking instruction that tests for the completion of the phase. A suspended thread resumes execution when the specified phase completes, or before the phase completes following a system-dependent time limit.

The $phaseParity operand specifies whether to wait on the even phase (0) or the odd phase (1).

Example:
nvgpu.mbarrier.try_wait.parity %barrier, %phaseParity, %ticks : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
- OPERATION_NAME = 'nvgpu.mbarrier.try_wait.parity'¶
- _ODS_REGIONS = (0, True)¶
- barriers() _ods_ir¶
- phaseParity() _ods_ir¶
- ticks() _ods_ir¶
- mbarId() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.mbarrier_try_wait_parity(barriers, phase_parity, ticks, mbar_id, *, loc=None, ip=None) MBarrierTryWaitParityOp¶
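Continuing the same mbarrier sketch, the transaction-count pattern: expect a number of asynchronous transactions, then wait on the phase parity (the concrete txcount and ticks values here are illustrative only):

    txcount = arith.constant(ir.IndexType.get(), 256)
    nvgpu.mbarrier_arrive_expect_tx(mbar, txcount, c0)
    phase = arith.constant(ir.IntegerType.get_signless(1), 0)  # even phase
    ticks = arith.constant(ir.IndexType.get(), 10000000)       # time limit
    nvgpu.mbarrier_try_wait_parity(mbar, phase, ticks, c0)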
- class mlir.dialects._nvgpu_ops_gen.MmaSparseSyncOp(res, matrixA, matrixB, matrixC, sparseMetadata, mmaShape, *, sparsitySelector=None, tf32Enabled=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The nvgpu.mma.sp.sync operation performs a warp-distributed MMA operation where operand A is "structured sparse". In this case, the matrixA operand represents the (warp-distributed) non-zero values of operand A, and the sparse_metadata operand provides the indices.

The full description of the sparsity storage format and distribution scheme is given in the PTX docs. This operation is meant to follow the semantics described in the PTX documentation here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-sparse-mma

The way the indices are distributed among the threads in a warp is controlled by the optional sparsity_selector operand, which is 0 by default. For more information, please consult the PTX documentation linked above.

Example (targeting the f16 16x8x32 mma.sp PTX instruction):

    nvgpu.mma.sp.sync (%a, %b, %c) metadata (%meta) {mmaShape = [16, 8, 32]} :
      (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
- OPERATION_NAME = 'nvgpu.mma.sp.sync'¶
- _ODS_REGIONS = (0, True)¶
- matrixA() _ods_ir¶
- matrixB() _ods_ir¶
- matrixC() _ods_ir¶
- sparseMetadata() _ods_ir¶
- mmaShape() _ods_ir¶
- sparsitySelector() _ods_ir¶
- tf32Enabled() bool¶
- res() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.mma_sp_sync(res, matrix_a, matrix_b, matrix_c, sparse_metadata, mma_shape, *, sparsity_selector=None, tf32_enabled=None, loc=None, ip=None) _ods_ir¶
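A hedged sketch matching the 16x8x32 f16 example above (assumptions: a, b, c are SSA values of the listed vector types, and meta is the sparse metadata, assumed here to be a vector<2xi16> per the PTX sparse-mma layout):

    res_ty = ir.Type.parse("vector<2x2xf16>")
    d = nvgpu.mma_sp_sync(res_ty, a, b, c, meta, [16, 8, 32])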
- class mlir.dialects._nvgpu_ops_gen.MmaSyncOp(res, matrixA, matrixB, matrixC, mmaShape, *, tf32Enabled=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The nvgpu.mma.sync op represents the warp-level matrix-multiply-and-accumulate (mma) operation that is compatible with nvvm.mma.sync. The operand and result vector sizes represent the thread-level ownership of the warp-level mma operation shape. The mmaShape attribute holds the warp-level matrix-multiply shape.

The nvgpu.mma.sync op serves as an intermediate point between lowering from vector.contract to nvvm.mma.sync.

This operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma
Example:
%res = nvgpu.mma.sync (%matrixA, %matrixB, %matrixC) {mmaShape = [16, 8, 16]} : (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf32>) -> vector<2x2xf32>
- OPERATION_NAME = 'nvgpu.mma.sync'¶
- _ODS_REGIONS = (0, True)¶
- matrixA() _ods_ir¶
- matrixB() _ods_ir¶
- matrixC() _ods_ir¶
- mmaShape() _ods_ir¶
- tf32Enabled() bool¶
- res() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.mma_sync(res, matrix_a, matrix_b, matrix_c, mma_shape, *, tf32_enabled=None, loc=None, ip=None) _ods_ir¶
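A hedged sketch of the example above from Python (assumptions: a, b, c are SSA values of the listed vector types; the result type is passed explicitly as the leading argument):

    res_ty = ir.Type.parse("vector<2x2xf32>")
    d = nvgpu.mma_sync(res_ty, a, b, c, [16, 8, 16])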
- class mlir.dialects._nvgpu_ops_gen.RcpOp(in_, *, rounding=None, ftz=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

Reciprocal calculation for vector types using nvvm.rcp ops.

Currently, only the approx rounding mode and ftz are supported, and only for the f32 type.

The input and output must be of the same vector type and shape.
- OPERATION_NAME = 'nvgpu.rcp'¶
- _ODS_REGIONS = (0, True)¶
- in_() _ods_ir¶
- rounding() _ods_ir¶
- ftz() bool¶
- out() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.rcp(in_, *, rounding=None, ftz=None, results=None, loc=None, ip=None) _ods_ir¶
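A hedged sketch (assumptions: v is an SSA value of some f32 vector type; the result type matches the input and is inferred; rounding is left at its default since the enum spelling for an explicit value is version-dependent):

    # Reciprocal with flush-to-zero; result type mirrors the input.
    r = nvgpu.rcp(v, ftz=True)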
- class mlir.dialects._nvgpu_ops_gen.TmaAsyncLoadOp(dst, barriers, tensorMapDescriptor, coordinates, mbarId, *, multicastMask=None, predicate=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The Op loads a tile memory region from global memory to shared memory by Tensor Memory Access (TMA).

$tensorMapDescriptor is a tensor map descriptor which has information about the tile shape. The descriptor is created by nvgpu.tma.create.descriptor.

The Op uses an mbarrier-based completion mechanism via $barrier. (A Python sketch combining descriptor creation and this load appears after the nvgpu.tma.create.descriptor entry below.)

- OPERATION_NAME = 'nvgpu.tma.async.load'¶
- _ODS_OPERAND_SEGMENTS¶
- _ODS_REGIONS = (0, True)¶
- dst() _ods_ir¶
- barriers() _ods_ir¶
- tensorMapDescriptor() _ods_ir¶
- coordinates() _ods_ir¶
- mbarId() _ods_ir¶
- multicastMask() _ods_ir | None¶
- predicate() _ods_ir | None¶
- mlir.dialects._nvgpu_ops_gen.tma_async_load(dst, barriers, tensor_map_descriptor, coordinates, mbar_id, *, multicast_mask=None, predicate=None, loc=None, ip=None) TmaAsyncLoadOp¶
- class mlir.dialects._nvgpu_ops_gen.TmaAsyncStoreOp(src, tensorMapDescriptor, coordinates, *, predicate=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The Op stores a tile memory region from shared memory to global memory by Tensor Memory Access (TMA).

$tensorMapDescriptor is a tensor map descriptor which has information about the tile shape. The descriptor is created by nvgpu.tma.create.descriptor. (A Python sketch covering store, prefetch, and fence appears after the nvgpu.tma.prefetch.descriptor entry below.)

- OPERATION_NAME = 'nvgpu.tma.async.store'¶
- _ODS_OPERAND_SEGMENTS¶
- _ODS_REGIONS = (0, True)¶
- src() _ods_ir¶
- tensorMapDescriptor() _ods_ir¶
- coordinates() _ods_ir¶
- predicate() _ods_ir | None¶
- mlir.dialects._nvgpu_ops_gen.tma_async_store(src, tensor_map_descriptor, coordinates, *, predicate=None, loc=None, ip=None) TmaAsyncStoreOp¶
- class mlir.dialects._nvgpu_ops_gen.TmaCreateDescriptorOp(tensorMap, tensor, boxDimensions, *, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The Op creates a tensor map descriptor object representing a tiled memory region. To do that, it calls the CUDA Driver's cuTensorMapEncodeTiled. The descriptor is used by Tensor Memory Access (TMA).

The tensor is the source tensor to be tiled.

The boxDimensions is the size of the tiled memory region in each dimension.

For more information, see: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
- OPERATION_NAME = 'nvgpu.tma.create.descriptor'¶
- _ODS_REGIONS = (0, True)¶
- tensor() _ods_ir¶
- boxDimensions() _ods_ir¶
- tensorMap() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.tma_create_descriptor(tensor_map, tensor, box_dimensions, *, loc=None, ip=None) _ods_ir¶
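A hedged sketch combining descriptor creation with the nvgpu.tma.async.load builder documented above (assumptions: the descriptor type spelling and its enum values follow the dialect's textual format and may vary across versions; unranked is a memref<*xf16> value; dst, mbar, and c0 come from the earlier sketches):

    desc_ty = ir.Type.parse(
        "!nvgpu.tensormap.descriptor<tensor = memref<128x64xf16, 3>, "
        "swizzle = none, l2promo = none, oob = zero, interleave = none>")
    c64 = arith.constant(ir.IndexType.get(), 64)
    c128 = arith.constant(ir.IndexType.get(), 128)
    tmap = nvgpu.tma_create_descriptor(desc_ty, unranked, [c128, c64])
    # Load one tile into shared memory, completing on the mbarrier.
    nvgpu.tma_async_load(dst, mbar, tmap, [c0, c0], c0)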
- class mlir.dialects._nvgpu_ops_gen.TmaFenceOp(tensorMapDescriptor, *, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The Op fences the given $tmaDescriptor. This is necessary if the tensor map descriptor was modified from the host using cudaMemcpy. In this case, the kernel needs a fence, after which it is safe to use tensor.map.

- OPERATION_NAME = 'nvgpu.tma.fence.descriptor'¶
- _ODS_REGIONS = (0, True)¶
- tensorMapDescriptor() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.tma_fence_descriptor(tensor_map_descriptor, *, loc=None, ip=None) TmaFenceOp¶
- class mlir.dialects._nvgpu_ops_gen.TmaPrefetchOp(tensorMapDescriptor, *, predicate=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The Op prefetches the cache line containing the given $tmaDescriptor for subsequent use by the tma.async.load instruction.

- OPERATION_NAME = 'nvgpu.tma.prefetch.descriptor'¶
- _ODS_REGIONS = (0, True)¶
- tensorMapDescriptor() _ods_ir¶
- predicate() _ods_ir | None¶
- mlir.dialects._nvgpu_ops_gen.tma_prefetch_descriptor(tensor_map_descriptor, *, predicate=None, loc=None, ip=None) TmaPrefetchOp¶
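Continuing the TMA sketch, the remaining descriptor ops in one hedged fragment: prefetch before first use, fence after the host mutates the descriptor, and the store mirror of the load (src_shared is assumed to be a shared-memory tile matching the descriptor):

    nvgpu.tma_prefetch_descriptor(tmap)
    nvgpu.tma_fence_descriptor(tmap)
    nvgpu.tma_async_store(src_shared, tmap, [c0, c0])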
- class mlir.dialects._nvgpu_ops_gen.WarpgroupGenerateDescriptorOp(descriptor, tensor, tensorMap, *, loc=None, ip=None)¶
Bases: _ods_ir.OpView

This Op builds a nvgpu.warpgroup.descriptor that is used by nvgpu.warpgroup.mma to perform warpgroup-level matrix multiply and accumulate.

The descriptor specifies the properties of the matrix in shared memory that is a multiplicand in the matrix multiply and accumulate operation.
- OPERATION_NAME = 'nvgpu.warpgroup.generate.descriptor'¶
- _ODS_REGIONS = (0, True)¶
- tensor() _ods_ir¶
- tensorMap() _ods_ir¶
- descriptor() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.warpgroup_generate_descriptor(descriptor, tensor, tensor_map, *, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._nvgpu_ops_gen.WarpgroupMmaInitAccumulatorOp(matrixC, *, loc=None, ip=None)¶
Bases: _ods_ir.OpView

This Op generates and initializes the accumulator matrix for the nvgpu.warpgroup.mma op to perform matrix-multiply-and-accumulate.

- OPERATION_NAME = 'nvgpu.warpgroup.mma.init.accumulator'¶
- _ODS_REGIONS = (0, True)¶
- matrixC() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.warpgroup_mma_init_accumulator(matrix_c, *, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._nvgpu_ops_gen.WarpgroupMmaOp(matrixD, descriptorA, descriptorB, matrixC, *, waitGroup=None, transposeA=None, transposeB=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The nvgpu.warpgroup.mma op performs the warpgroup-level (4 warps) matrix-multiply-and-accumulate (mma) operation that results in nvvm.wgmma.mma_async.

The operands are descriptorA and descriptorB, wgmma matrix descriptors that describe the properties of the matrices in shared memory. The results are thread-level ownership of the warpgroup-level mma operation shape. The shape is deduced from the descriptor types and the output vector.

The Op encapsulates multiple nvvm.wgmma.mma_async operations to complete the given shape. As the nvvm.wgmma.mma_async Op, or its corresponding PTX instruction, is asynchronous, this Op groups the nvvm.wgmma.mma_async Ops and surrounds them with wgmma.fence.aligned, wgmma.commit.group.sync.aligned, and wgmma.wait.group.sync.aligned Ops.

Example:

    %r1, %r2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2 :
      !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
      !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
      ->
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
- OPERATION_NAME = 'nvgpu.warpgroup.mma'¶
- _ODS_REGIONS = (0, True)¶
- descriptorA() _ods_ir¶
- descriptorB() _ods_ir¶
- matrixC() _ods_ir¶
- waitGroup() _ods_ir | None¶
- transposeA() bool¶
- transposeB() bool¶
- matrixD() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.warpgroup_mma(matrix_d, descriptor_a, descriptor_b, matrix_c, *, wait_group=None, transpose_a=None, transpose_b=None, loc=None, ip=None) _ods_ir¶
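A hedged sketch of the warpgroup pipeline these ops form together with nvgpu.warpgroup.generate.descriptor and nvgpu.warpgroup.mma.init.accumulator (assumptions: smem_a/smem_b are shared-memory tiles with tmap_a/tmap_b their TMA descriptors; type spellings follow the example above):

    descA_ty = ir.Type.parse(
        "!nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>")
    descB_ty = ir.Type.parse(
        "!nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>")
    acc_ty = ir.Type.parse(
        "!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>")
    desc_a = nvgpu.warpgroup_generate_descriptor(descA_ty, smem_a, tmap_a)
    desc_b = nvgpu.warpgroup_generate_descriptor(descB_ty, smem_b, tmap_b)
    acc = nvgpu.warpgroup_mma_init_accumulator(acc_ty)
    res = nvgpu.warpgroup_mma(acc_ty, desc_a, desc_b, acc)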
- class mlir.dialects._nvgpu_ops_gen.WarpgroupMmaStoreOp(matrixD, dstMemref, *, loc=None, ip=None)¶
Bases: _ods_ir.OpView

The nvgpu.warpgroup.mma.store op stores the fragmented result in $matrixD to the given memref.

[See the details of the register fragment layout for accumulator matrix D] (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d)

Note that the op must be run by a warp group.
- OPERATION_NAME = 'nvgpu.warpgroup.mma.store'¶
- _ODS_REGIONS = (0, True)¶
- matrixD() _ods_ir¶
- dstMemref() _ods_ir¶
- mlir.dialects._nvgpu_ops_gen.warpgroup_mma_store(matrix_d, dst_memref, *, loc=None, ip=None) WarpgroupMmaStoreOp¶
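Completing the warpgroup sketch, the fragmented accumulator is written back to memory (out is assumed to be a memref<128x128xf32> visible to the whole warp group):

    nvgpu.warpgroup_mma_store(res, out)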