mlir.dialects._nvgpu_transform_ops_gen

Attributes

Classes

ApplyNVGPUToNVVMConversionPatternsOp

Collects patterns that convert NVGPU dialect ops to NVVM dialect ops. These

CreateAsyncGroupsOp

Look for global to shared memory copies within the targeted op in the form

PipelineSharedMemoryCopiesOp

Applies software pipelining to a given scf.for loop. The pipelining

RewriteCopyAsTmaOp

Rewrite a copy operation on memref to tma operations that transit through

RewriteMatmulAsMmaSyncOp

Rewrite a matmul operation on memref to an mma.sync operation on vectors.

Functions

Module Contents

mlir.dialects._nvgpu_transform_ops_gen._ods_ir
class mlir.dialects._nvgpu_transform_ops_gen.ApplyNVGPUToNVVMConversionPatternsOp(*, loc=None, ip=None)

Bases: _ods_ir

Collects patterns that convert NVGPU dialect ops to NVVM dialect ops. These patterns require an “LLVMTypeConverter”.

OPERATION_NAME = 'transform.apply_conversion_patterns.nvgpu.nvgpu_to_nvvm'
_ODS_REGIONS = (0, True)
mlir.dialects._nvgpu_transform_ops_gen.apply_conversion_patterns_nvgpu_nvgpu_to_nvvm(*, loc=None, ip=None) → ApplyNVGPUToNVVMConversionPatternsOp
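
A minimal usage sketch, not a definitive recipe: it assumes an active MLIR context and an insertion point already placed inside the patterns region of an enclosing transform.apply_conversion_patterns op (setup not shown).

    from mlir.dialects import _nvgpu_transform_ops_gen as nvgpu_transform

    # Emits the pattern-descriptor op that registers the NVGPU -> NVVM
    # conversion patterns. The op takes no operands and produces no results,
    # so the builder only accepts the optional loc/ip arguments.
    nvgpu_transform.apply_conversion_patterns_nvgpu_nvgpu_to_nvvm()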
class mlir.dialects._nvgpu_transform_ops_gen.CreateAsyncGroupsOp(result, target, *, bypass_l1=None, loc=None, ip=None)

Bases: _ods_ir

Look for global to shared memory copies within the targeted op in the form of vector transfer ops and convert them to async copies when possible. Consecutive copies are put into the same group. A "wait" operation is inserted right at the end of the group.

bypass_l1 specifies whether bypassL1 attributes should be added to the async copies. bypass_l1 is a compiler hint: only 16-byte transfers can bypass the L1 cache, so this attribute is not set for any other transfer sizes.

Return modes

This op consumes the target handle and produces the result handle, which is mapped to the same payload operations as the target handle. The op modifies the payload.

OPERATION_NAME = 'transform.nvgpu.create_async_groups'
_ODS_REGIONS = (0, True)
target() → _ods_ir
bypass_l1() → bool
result() → _ods_ir

Shortcut to get an op result if it has only one (throws an error otherwise).

mlir.dialects._nvgpu_transform_ops_gen.nvgpu_create_async_groups(result, target, *, bypass_l1=None, loc=None, ip=None) → _ods_ir
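
A minimal usage sketch under stated assumptions: an active MLIR context, an insertion point inside a transform sequence, and a pre-existing transform handle named target_handle (a hypothetical name) pointing at the payload op to scan.

    from mlir.dialects import transform
    from mlir.dialects import _nvgpu_transform_ops_gen as nvgpu_transform

    # The first argument is the type of the produced handle; the builder
    # returns the result value, which is mapped to the same payload ops as
    # `target_handle`.
    groups = nvgpu_transform.nvgpu_create_async_groups(
        transform.AnyOpType.get(),  # result handle type
        target_handle,              # handle to the op containing the copies
        bypass_l1=True,             # hint; honored only for 16-byte transfers
    )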
class mlir.dialects._nvgpu_transform_ops_gen.PipelineSharedMemoryCopiesOp(result, for_op, depth, *, peel_epilogue=None, failure_propagation_mode=None, loc=None, ip=None)

Bases: _ods_ir

Applies software pipelining to a given scf.for loop. The pipelining strategy will look for a load into shared memory and pipeline it to overlap it with the rest of the loop.

NOTE: It is the user's responsibility to ensure that there are no dependencies between depth iterations of the loop by using multi-buffering. It is also the user's responsibility to ensure that a sufficient amount of shared memory is allocated to cover eventual writes by the depth-1 speculative iterations.

depth indicates how many stages the software pipeline should have. peel_epilogue allows forcing the epilogue to be peeled out instead of potentially using predicated operations for the epilogue phase.

Return modes

Consumes the operand handle and produces a result handle pointing to the loop, which may or may not have been pipelined. Produces a definite failure if the loop pipeliner mutated the IR before failing to pipeline, in particular if peel_epilogue is not set and the loop body doesn't support predication. If the failure propagation mode is set to "propagate", produces a silenceable failure when pipelining preconditions, e.g., a static loop bound, are not met or when the loop wasn't pipelined due to the lack of loads into shared memory. If the failure propagation mode is set to "suppress" (the default), succeeds in these cases and associates the result handle with the original loop.

TODO: the shared memory part and behavior specific to NVGPU should be made orthogonal to pipelining so that transform.loop.pipeline becomes usable here.

OPERATION_NAME = 'transform.nvgpu.pipeline_shared_memory_copies'
_ODS_REGIONS = (0, True)
for_op() → _ods_ir
depth() → _ods_ir
peel_epilogue() → bool
failure_propagation_mode() → _ods_ir
result() → _ods_ir

Shortcut to get an op result if it has only one (throws an error otherwise).

mlir.dialects._nvgpu_transform_ops_gen.nvgpu_pipeline_shared_memory_copies(result, for_op, depth, *, peel_epilogue=None, failure_propagation_mode=None, loc=None, ip=None) → _ods_ir
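
A minimal usage sketch under stated assumptions: an active MLIR context, an insertion point inside a transform sequence, and a pre-existing handle named for_handle (a hypothetical name) pointing at an scf.for loop that loads into shared memory.

    from mlir.dialects import transform
    from mlir.dialects import _nvgpu_transform_ops_gen as nvgpu_transform

    # `depth` is the number of software-pipeline stages; the builder returns
    # the result value, a handle to the (possibly unchanged) loop.
    pipelined = nvgpu_transform.nvgpu_pipeline_shared_memory_copies(
        transform.AnyOpType.get(),  # result handle type
        for_handle,                 # handle to the scf.for loop
        depth=3,                    # pipeline stages; needs multi-buffering
        peel_epilogue=True,         # peel the epilogue instead of predicating
    )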
class mlir.dialects._nvgpu_transform_ops_gen.RewriteCopyAsTmaOp(target, *, loc=None, ip=None)

Bases: _ods_ir

Rewrite a copy operation on memref to TMA operations that transit through shared memory.

OPERATION_NAME = 'transform.nvgpu.rewrite_copy_as_tma'
_ODS_REGIONS = (0, True)
target() → _ods_ir
mlir.dialects._nvgpu_transform_ops_gen.nvgpu_rewrite_copy_as_tma(target, *, loc=None, ip=None) → RewriteCopyAsTmaOp
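
A minimal usage sketch under stated assumptions: an active MLIR context, an insertion point inside a transform sequence, and a pre-existing handle named copy_handle (a hypothetical name) pointing at the memref copy operations to rewrite.

    from mlir.dialects import _nvgpu_transform_ops_gen as nvgpu_transform

    # The op has no results, so the builder returns the created op itself.
    nvgpu_transform.nvgpu_rewrite_copy_as_tma(copy_handle)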
class mlir.dialects._nvgpu_transform_ops_gen.RewriteMatmulAsMmaSyncOp(target, *, loc=None, ip=None)

Bases: _ods_ir

Rewrite a matmul operation on memref to an mma.sync operation on vectors.

Memory copies with the required access patterns are automatically inserted. Operations that do not have a 1-1 mapping to mma.sync operations are left unchanged.

OPERATION_NAME = 'transform.nvgpu.rewrite_matmul_as_mma_sync'
_ODS_REGIONS = (0, True)
target() → _ods_ir
mlir.dialects._nvgpu_transform_ops_gen.nvgpu_rewrite_matmul_as_mma_sync(target, *, loc=None, ip=None) → RewriteMatmulAsMmaSyncOp
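
A minimal usage sketch under stated assumptions: an active MLIR context, an insertion point inside a transform sequence, and a pre-existing handle named matmul_handle (a hypothetical name) pointing at the memref-based matmul operation to rewrite.

    from mlir.dialects import _nvgpu_transform_ops_gen as nvgpu_transform

    # The rewrite is best-effort: payload ops without a 1-1 mapping to
    # mma.sync are left unchanged. The op has no results, so the builder
    # returns the created op itself.
    nvgpu_transform.nvgpu_rewrite_matmul_as_mma_sync(matmul_handle)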