mlir.dialects._nvgpu_transform_ops_gen

Attributes

Classes

ApplyNVGPUToNVVMConversionPatternsOp

Collects patterns that convert NVGPU dialect ops to NVVM dialect ops. These

CreateAsyncGroupsOp

Look for global to shared memory copies within the targeted op in the form

PipelineSharedMemoryCopiesOp

Applies software pipelining to a given scf.for loop. The pipelining

RewriteCopyAsTmaOp

Rewrite a copy operation on memref to tma operations that transit through

RewriteMatmulAsMmaSyncOp

Rewrite a matmul operation on memref to an mma.sync operation on vectors.

Functions

Module Contents

mlir.dialects._nvgpu_transform_ops_gen._ods_ir
class mlir.dialects._nvgpu_transform_ops_gen.ApplyNVGPUToNVVMConversionPatternsOp(*, loc=None, ip=None)

Bases: _ods_ir

Collects patterns that convert NVGPU dialect ops to NVVM dialect ops. These patterns require an “LLVMTypeConverter”.

OPERATION_NAME = 'transform.apply_conversion_patterns.nvgpu.nvgpu_to_nvvm'
_ODS_REGIONS = (0, True)
mlir.dialects._nvgpu_transform_ops_gen.apply_conversion_patterns_nvgpu_nvgpu_to_nvvm(*, loc=None, ip=None) → ApplyNVGPUToNVVMConversionPatternsOp
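
A minimal usage sketch, not a definitive recipe: it assumes an active MLIR context and an insertion point already placed inside the patterns region of an enclosing transform.apply_conversion_patterns op (setup not shown).

    from mlir.dialects import _nvgpu_transform_ops_gen as nvgpu_transform

    # Emits the pattern-descriptor op that registers the NVGPU -> NVVM
    # conversion patterns. The op takes no operands and produces no results,
    # so the builder only accepts the optional loc/ip arguments.
    nvgpu_transform.apply_conversion_patterns_nvgpu_nvgpu_to_nvvm()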
class mlir.dialects._nvgpu_transform_ops_gen.CreateAsyncGroupsOp(result, target, *, bypass_l1=None, loc=None, ip=None)

Bases: _ods_ir

Look for global to shared memory copies within the targeted op in the form of vector transfer ops and convert them to async copies when possible. Consecutive copies are put into the same group. A "wait" operation is inserted right at the end of the group.

bypass_l1 specifies whether bypassL1 attributes should be added to the async copies. bypass_l1 is a compiler hint: only 16-byte transfers can bypass the L1 cache, so this attribute is not set for any other transfer sizes.

Return modes

This op consumes the target handle and produces the result handle, which is mapped to the same payload operations as the target handle. The op modifies the payload.

OPERATION_NAME = 'transform.nvgpu.create_async_groups'
_ODS_REGIONS = (0, True)
target() → _ods_ir
bypass_l1() → bool
result() → _ods_ir

Shortcut to get an op result if it has only one (throws an error otherwise).

mlir.dialects._nvgpu_transform_ops_gen.nvgpu_create_async_groups(result, target, *, bypass_l1=None, loc=None, ip=None) → _ods_ir
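
A minimal usage sketch under stated assumptions: an active MLIR context, an insertion point inside a transform sequence, and a pre-existing transform handle named target_handle (a hypothetical name) pointing at the payload op to scan.

    from mlir.dialects import transform
    from mlir.dialects import _nvgpu_transform_ops_gen as nvgpu_transform

    # The first argument is the type of the produced handle; the builder
    # returns the result value, which is mapped to the same payload ops as
    # `target_handle`.
    groups = nvgpu_transform.nvgpu_create_async_groups(
        transform.AnyOpType.get(),  # result handle type
        target_handle,              # handle to the op containing the copies
        bypass_l1=True,             # hint; honored only for 16-byte transfers
    )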
class mlir.dialects._nvgpu_transform_ops_gen.PipelineSharedMemoryCopiesOp(result, for_op, depth, *, peel_epilogue=None, failure_propagation_mode=None, loc=None, ip=None)

Bases: _ods_ir

Applies software pipelining to a given scf.for loop. The pipelining strategy will look for a load into shared memory and pipeline it to overlap it with the rest of the loop.

NOTE: It is the user's responsibility to ensure that there are no dependencies between depth iterations of the loop by using multi-buffering. It is also the user's responsibility to ensure that a sufficient amount of shared memory is allocated to cover eventual writes by the depth-1 speculative iterations.

depth indicates how many stages the software pipeline should have. peel_epilogue allows forcing the epilogue to be peeled out instead of potentially using predicated operations for the epilogue phase.

Return modes

Consumes the operand handle and produces a result handle pointing to the loop, which may or may not have been pipelined. Produces a definite failure if the loop pipeliner mutated the IR before failing to pipeline, in particular if peel_epilogue is not set and the loop body doesn't support predication. If the failure propagation mode is set to "propagate", produces a silenceable failure when pipelining preconditions, e.g., a static loop bound, are not met or when the loop wasn't pipelined due to the lack of loads into shared memory. If the failure propagation mode is set to "suppress" (the default), succeeds in these cases and associates the result handle with the original loop.

TODO: the shared memory part and behavior specific to NVGPU should be made orthogonal to pipelining so that transform.loop.pipeline becomes usable here.

OPERATION_NAME = 'transform.nvgpu.pipeline_shared_memory_copies'
_ODS_REGIONS = (0, True)
for_op() → _ods_ir
depth() → _ods_ir
peel_epilogue() → bool
failure_propagation_mode() → _ods_ir
result() → _ods_ir

Shortcut to get an op result if it has only one (throws an error otherwise).

mlir.dialects._nvgpu_transform_ops_gen.nvgpu_pipeline_shared_memory_copies(result, for_op, depth, *, peel_epilogue=None, failure_propagation_mode=None, loc=None, ip=None) → _ods_ir
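
A minimal usage sketch under stated assumptions: an active MLIR context, an insertion point inside a transform sequence, and a pre-existing handle named for_handle (a hypothetical name) pointing at an scf.for loop that loads into shared memory.

    from mlir.dialects import transform
    from mlir.dialects import _nvgpu_transform_ops_gen as nvgpu_transform

    # `depth` is the number of software-pipeline stages; the builder returns
    # the result value, a handle to the (possibly unchanged) loop.
    pipelined = nvgpu_transform.nvgpu_pipeline_shared_memory_copies(
        transform.AnyOpType.get(),  # result handle type
        for_handle,                 # handle to the scf.for loop
        depth=3,                    # pipeline stages; needs multi-buffering
        peel_epilogue=True,         # peel the epilogue instead of predicating
    )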
class mlir.dialects._nvgpu_transform_ops_gen.RewriteCopyAsTmaOp(target, *, loc=None, ip=None)

Bases: _ods_ir

Rewrite a copy operation on memref to TMA operations that transit through shared memory.

OPERATION_NAME = 'transform.nvgpu.rewrite_copy_as_tma'
_ODS_REGIONS = (0, True)
target() → _ods_ir
mlir.dialects._nvgpu_transform_ops_gen.nvgpu_rewrite_copy_as_tma(target, *, loc=None, ip=None) → RewriteCopyAsTmaOp
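
A minimal usage sketch under stated assumptions: an active MLIR context, an insertion point inside a transform sequence, and a pre-existing handle named copy_handle (a hypothetical name) pointing at the memref copy operations to rewrite.

    from mlir.dialects import _nvgpu_transform_ops_gen as nvgpu_transform

    # The op has no results, so the builder returns the created op itself.
    nvgpu_transform.nvgpu_rewrite_copy_as_tma(copy_handle)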
class mlir.dialects._nvgpu_transform_ops_gen.RewriteMatmulAsMmaSyncOp(target, *, loc=None, ip=None)

Bases: _ods_ir

Rewrite a matmul operation on memref to an mma.sync operation on vectors.

Memory copies with the required access patterns are automatically inserted. Operations that do not have a 1-1 mapping to mma.sync operations are left unchanged.

OPERATION_NAME = 'transform.nvgpu.rewrite_matmul_as_mma_sync'
_ODS_REGIONS = (0, True)
target() → _ods_ir
mlir.dialects._nvgpu_transform_ops_gen.nvgpu_rewrite_matmul_as_mma_sync(target, *, loc=None, ip=None) → RewriteMatmulAsMmaSyncOp
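
A minimal usage sketch under stated assumptions: an active MLIR context, an insertion point inside a transform sequence, and a pre-existing handle named matmul_handle (a hypothetical name) pointing at the memref-based matmul operation to rewrite.

    from mlir.dialects import _nvgpu_transform_ops_gen as nvgpu_transform

    # The rewrite is best-effort: payload ops without a 1-1 mapping to
    # mma.sync are left unchanged. The op has no results, so the builder
    # returns the created op itself.
    nvgpu_transform.nvgpu_rewrite_matmul_as_mma_sync(matmul_handle)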