mlir.dialects._gpu_transform_ops_gen¶
Attributes¶
Classes¶
- ApplyGPUPromoteShuffleToAMDGPUPatternsOp: Collects patterns that try to promote ``gpu.shuffle``s to specialized AMDGPU intrinsics.
- ApplyGPURewritePatternsOp: Collects GPU rewrite patterns (all-reduce, global-id, and shuffle rewrites).
- ApplyGPUSubgroupReduceToNVVMConversionPatternsOp: Collects patterns that convert GPU dialect subgroup-reduce ops to NVVM dialect ops.
- ApplyGPUToNVVMConversionPatternsOp: Collects patterns that convert GPU dialect ops to NVVM dialect ops.
- ApplyGPUToROCDLConversionPatternsOp: Collects patterns that convert GPU dialect ops to ROCDL dialect ops.
- ApplyGPUWwmaToNVVMConversionPatternsOp: Collects patterns that convert GPU dialect wmma ops to NVVM dialect ops.
- ApplyUnrollVectorsSubgroupMmaOp: Unrolls contractions to the target native vector sizes.
- EliminateBarriersOp: Removes unnecessary GPU barriers from the function.
- MapForallToBlocks: Targets the gpu_launch op and rewrites its top-level scf.forall to distributed gpu.block_id.
- MapNestedForallToThreads: Targets the gpu.launch op and rewrites all nested scf.forall to distributed gpu.thread_id.
Functions¶
- apply_patterns_gpu_gpu_shuffle_to_amdgpu
- apply_patterns_gpu_gpu_rewrite_patterns
- apply_conversion_patterns_gpu_gpu_subgroup_reduce_to_nvvm
- apply_conversion_patterns_gpu_gpu_to_nvvm
- apply_conversion_patterns_gpu_gpu_to_rocdl
- apply_conversion_patterns_gpu_gpu_wmma_to_nvvm
- apply_patterns_gpu_unroll_vectors_subgroup_mma
- apply_patterns_gpu_eliminate_barriers
- gpu_map_forall_to_blocks
- gpu_map_nested_forall_to_threads
Module Contents¶
- mlir.dialects._gpu_transform_ops_gen._ods_ir¶
- class mlir.dialects._gpu_transform_ops_gen.ApplyGPUPromoteShuffleToAMDGPUPatternsOp(*, chipset=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView
Collects patterns that try to promote ``gpu.shuffle``s to specialized AMDGPU intrinsics.
- OPERATION_NAME = 'transform.apply_patterns.gpu.gpu_shuffle_to_amdgpu'¶
- _ODS_REGIONS = (0, True)¶
- chipset() _ods_ir | None¶
- mlir.dialects._gpu_transform_ops_gen.apply_patterns_gpu_gpu_shuffle_to_amdgpu(*, chipset=None, loc=None, ip=None) ApplyGPUPromoteShuffleToAMDGPUPatternsOp¶
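A minimal usage sketch: pattern-descriptor ops such as this one are created inside a transform.apply_patterns region. It assumes the SequenceOp/ApplyPatternsOp helpers from mlir.dialects.transform behave as exercised in the upstream Python tests, and it only builds the transform IR without running the interpreter.
from mlir import ir
from mlir.dialects import transform
from mlir.dialects import _gpu_transform_ops_gen as gpu_transform

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        # Anonymous transform sequence operating on any payload op.
        seq = transform.SequenceOp(
            transform.FailurePropagationMode.Propagate, [], transform.AnyOpType.get()
        )
        with ir.InsertionPoint(seq.body):
            # apply_patterns.* descriptors live inside a transform.apply_patterns region.
            apply = transform.ApplyPatternsOp(seq.bodyTarget)
            with ir.InsertionPoint(apply.patterns):
                # Optionally pass chipset=... to target a specific AMDGPU chipset.
                gpu_transform.apply_patterns_gpu_gpu_shuffle_to_amdgpu()
            transform.YieldOp()
    print(module)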
- class mlir.dialects._gpu_transform_ops_gen.ApplyGPURewritePatternsOp(*, loc=None, ip=None)¶
Bases: _ods_ir.OpView
Collects GPU rewrite patterns comprising:
- GpuAllReduceRewrite patterns
- GpuGlobalIdRewriter patterns
- GpuShuffleRewriter patterns
- OPERATION_NAME = 'transform.apply_patterns.gpu.gpu_rewrite_patterns'¶
- _ODS_REGIONS = (0, True)¶
- mlir.dialects._gpu_transform_ops_gen.apply_patterns_gpu_gpu_rewrite_patterns(*, loc=None, ip=None) ApplyGPURewritePatternsOp¶
- class mlir.dialects._gpu_transform_ops_gen.ApplyGPUSubgroupReduceToNVVMConversionPatternsOp(*, loc=None, ip=None)¶
Bases: _ods_ir.OpView
Collects patterns that convert GPU dialect subgroup-reduce ops to NVVM dialect ops. These patterns require an "LLVMTypeConverter".
- OPERATION_NAME = 'transform.apply_conversion_patterns.gpu.gpu_subgroup_reduce_to_nvvm'¶
- _ODS_REGIONS = (0, True)¶
- mlir.dialects._gpu_transform_ops_gen.apply_conversion_patterns_gpu_gpu_subgroup_reduce_to_nvvm(*, loc=None, ip=None) ApplyGPUSubgroupReduceToNVVMConversionPatternsOp¶
- class mlir.dialects._gpu_transform_ops_gen.ApplyGPUToNVVMConversionPatternsOp(*, benefit=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView
Collects patterns that convert GPU dialect ops to NVVM dialect ops. These patterns require an "LLVMTypeConverter".
- OPERATION_NAME = 'transform.apply_conversion_patterns.gpu.gpu_to_nvvm'¶
- _ODS_REGIONS = (0, True)¶
- benefit() _ods_ir¶
- mlir.dialects._gpu_transform_ops_gen.apply_conversion_patterns_gpu_gpu_to_nvvm(*, benefit=None, loc=None, ip=None) ApplyGPUToNVVMConversionPatternsOp¶
- class mlir.dialects._gpu_transform_ops_gen.ApplyGPUToROCDLConversionPatternsOp(chipset, *, loc=None, ip=None)¶
Bases: _ods_ir.OpView
Collects patterns that convert GPU dialect ops to ROCDL dialect ops. These patterns require an "LLVMTypeConverter".
- OPERATION_NAME = 'transform.apply_conversion_patterns.gpu.gpu_to_rocdl'¶
- _ODS_REGIONS = (0, True)¶
- chipset() _ods_ir¶
- mlir.dialects._gpu_transform_ops_gen.apply_conversion_patterns_gpu_gpu_to_rocdl(chipset, *, loc=None, ip=None) ApplyGPUToROCDLConversionPatternsOp¶
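A hedged sketch of the builder call only. The "gfx90a" chipset string is illustrative, and in a real transform script this descriptor must be nested inside a transform.apply_conversion_patterns region that also supplies the LLVM type converter the description mentions; that surrounding region is not constructed here.
from mlir import ir
from mlir.dialects import _gpu_transform_ops_gen as gpu_transform

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        # Standalone builder call for illustration; the resulting op only
        # verifies once placed inside transform.apply_conversion_patterns
        # together with an LLVM type converter.
        gpu_transform.apply_conversion_patterns_gpu_gpu_to_rocdl("gfx90a")
    print(module)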
- class mlir.dialects._gpu_transform_ops_gen.ApplyGPUWwmaToNVVMConversionPatternsOp(*, loc=None, ip=None)¶
Bases: _ods_ir.OpView
Collects patterns that convert GPU dialect ops related to wmma ops to NVVM dialect ops. These patterns require an "LLVMTypeConverter".
- OPERATION_NAME = 'transform.apply_conversion_patterns.gpu.gpu_wmma_to_nvvm'¶
- _ODS_REGIONS = (0, True)¶
- mlir.dialects._gpu_transform_ops_gen.apply_conversion_patterns_gpu_gpu_wmma_to_nvvm(*, loc=None, ip=None) ApplyGPUWwmaToNVVMConversionPatternsOp¶
- class mlir.dialects._gpu_transform_ops_gen.ApplyUnrollVectorsSubgroupMmaOp(m, n, k, *, loc=None, ip=None)¶
Bases: _ods_ir.OpView
Unrolls contractions to the target m, n, and k native vector sizes, along with other vector operations, based on expected usage. transfer_read ops unroll based on the extract-slice shape introduced by unrolling the contractions, while elementwise and transfer_write ops unroll to the shape of the C matrix (m x n).
This operation applies to pure vector operations and should be applied before lowering to subgroup_mma ops.
- OPERATION_NAME = 'transform.apply_patterns.gpu.unroll_vectors_subgroup_mma'¶
- _ODS_REGIONS = (0, True)¶
- m() _ods_ir¶
- n() _ods_ir¶
- k() _ods_ir¶
- mlir.dialects._gpu_transform_ops_gen.apply_patterns_gpu_unroll_vectors_subgroup_mma(m, n, k, *, loc=None, ip=None) ApplyUnrollVectorsSubgroupMmaOp¶
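As with the other apply_patterns descriptors, this op is emitted inside a transform.apply_patterns region. A minimal sketch under the same assumptions as the earlier one; the 16x16x16 native sizes are illustrative only.
from mlir import ir
from mlir.dialects import transform
from mlir.dialects import _gpu_transform_ops_gen as gpu_transform

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        seq = transform.SequenceOp(
            transform.FailurePropagationMode.Propagate, [], transform.AnyOpType.get()
        )
        with ir.InsertionPoint(seq.body):
            apply = transform.ApplyPatternsOp(seq.bodyTarget)
            with ir.InsertionPoint(apply.patterns):
                # Unroll to an illustrative 16x16x16 native mma shape.
                gpu_transform.apply_patterns_gpu_unroll_vectors_subgroup_mma(16, 16, 16)
            transform.YieldOp()
    print(module)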
- class mlir.dialects._gpu_transform_ops_gen.EliminateBarriersOp(*, loc=None, ip=None)¶
Bases: _ods_ir.OpView
Removes unnecessary GPU barriers from the function. If a barrier does not enforce any conflicting pair of memory effects, including a pair that is enforced by another barrier, it is unnecessary and can be removed.
The approach is based on “High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs” by Moses, Ivanov, Domke, Endo, Doerfert, and Zinenko in PPoPP 2023. Specifically, it analyzes the memory effects of the operations before and after the given barrier and checks if the barrier enforces any of the memory effect-induced dependencies that aren’t already enforced by another barrier.
For example, in the following code
store %A
barrier // enforces load-after-store
load %A
barrier // load-after-store already enforced by the previous barrier
load %A
the second barrier can be removed.
- OPERATION_NAME = 'transform.apply_patterns.gpu.eliminate_barriers'¶
- _ODS_REGIONS = (0, True)¶
- mlir.dialects._gpu_transform_ops_gen.apply_patterns_gpu_eliminate_barriers(*, loc=None, ip=None) EliminateBarriersOp¶
- class mlir.dialects._gpu_transform_ops_gen.MapForallToBlocks(result, target, *, grid_dims=None, generate_gpu_launch=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView
Targets the gpu_launch op and rewrites its top-level scf.forall to a distributed gpu.block_id attribute. If the generate_gpu_launch attribute is set, a gpu_launch is generated first and the top-level scf.forall is moved inside it.
The operation searches for top-level scf.forall ops under the gpu_launch and maps each such op to GPU blocks. Mapping is one-to-one, and the induction variables of scf.forall are rewritten to gpu.block_id according to the thread_dim_mapping attribute.
Dynamic scf.forall trip counts are currently not supported. Dynamic block dim sizes are currently not supported. Only bufferized scf.forall ops are currently supported, and only scf.forall ops distributed to at most 3 dimensions.
The operation alters the grid size of the given gpu_launch using the grid_dims argument.
Return modes:¶
This operation ignores non-gpu_launch ops and drops them in the return.
If any scf.forall with tensors is found, the transform definitely fails.
If all the scf.forall operations contained within the LaunchOp referred to by the target handle lower to GPU properly, the transform succeeds. Otherwise the transform definitely fails.
The returned handle points to the same LaunchOp operand, consuming it and producing a new SSA value to satisfy chaining and linearity of the IR properties.
- OPERATION_NAME = 'transform.gpu.map_forall_to_blocks'¶
- _ODS_REGIONS = (0, True)¶
- target() _ods_ir¶
- grid_dims() _ods_ir | None¶
- generate_gpu_launch() bool¶
- result() _ods_ir¶
Shortcut to get an op result if it has only one (throws an error otherwise).
- mlir.dialects._gpu_transform_ops_gen.gpu_map_forall_to_blocks(result, target, *, grid_dims=None, generate_gpu_launch=None, loc=None, ip=None) _ods_ir¶
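A minimal sketch of emitting this op from Python, assuming the SequenceOp helper from mlir.dialects.transform; the grid dimensions are illustrative and the sequence's block argument stands in for a real handle to the payload function.
from mlir import ir
from mlir.dialects import transform
from mlir.dialects import _gpu_transform_ops_gen as gpu_transform

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        seq = transform.SequenceOp(
            transform.FailurePropagationMode.Propagate, [], transform.AnyOpType.get()
        )
        with ir.InsertionPoint(seq.body):
            # Generate the gpu.launch first, then map the top-level scf.forall
            # to blocks with an illustrative 8x4x1 grid. The call returns a
            # handle to the resulting launch op.
            launch = gpu_transform.gpu_map_forall_to_blocks(
                transform.AnyOpType.get(),
                seq.bodyTarget,
                grid_dims=[8, 4, 1],
                generate_gpu_launch=True,
            )
            transform.YieldOp()
    print(module)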
- class mlir.dialects._gpu_transform_ops_gen.MapNestedForallToThreads(result, target, *, block_dims=None, sync_after_distribute=None, warp_size=None, loc=None, ip=None)¶
Bases: _ods_ir.OpView
Targets the gpu.launch op and rewrites all scf.forall ops nested in it to a distributed gpu.thread_id attribute.
The operation searches for scf.forall ops nested under target and maps each such op to GPU threads. scf.forall induction variables are rewritten to gpu.thread_id according to the mapping attribute.
Different types of mapping attributes are supported:
- block_dims is a list of integers that specifies the number of threads in each dimension. This is a mandatory attribute used to constrain the number of threads in each dimension. If an scf.forall op is mapped to fewer threads, predication occurs.
- warp_dims is a list of integers that specifies the number of warps in each dimension. This is an optional attribute used to constrain the number of warps in each dimension. When present, it must be specified in a way that is compatible with the block_dims attribute. If an scf.forall op is mapped to fewer warps, predication occurs.
Dynamic scf.forall trip counts are currently not supported. Dynamic block dim sizes are currently not supported.
Only bufferized scf.forall ops are currently supported, and only scf.forall ops distributed to at most 3 dimensions.
The sync_after_distribute attribute controls whether a gpu.barrier is inserted after each scf.forall op. At this time, this is an all-or-nothing choice; it will need to be tightened in the future.
The operation alters the block size of the given gpu_launch using the mandatory block_dims argument.
Return modes:¶
This operation ignores non-gpu_launch ops and drops them in the return. If any scf.forall with tensors is found, the transform definitely fails.
If all the scf.forall operations with gpu.thread mapping contained within the LaunchOp referred to by the target handle lower to GPU properly, the transform succeeds. Otherwise the transform definitely fails. scf.forall operations with mappings other than gpu.thread are ignored.
The returned handle points to the same LaunchOp operand, consuming it and producing a new SSA value to satisfy chaining and linearity of the IR properties.
Example:¶
gpu.launch blocks(%bx, %by, %bz) in (%x = %0, %y = %1, %z = %2)
           threads(%tx, %ty, %tz) in (%tx = %3, %ty = %4, %tz = %5) {
  scf.forall (%i, %j) in (7, 9) {
    ... // body 1
  } {mapping = [#gpu.thread<x>, #gpu.thread<y>, #gpu.thread<z>]}
  scf.forall (%i) in (12) {
    ... // body 2
  } {mapping = [#gpu.thread<x>]}
  gpu.terminator
}
is translated to:
%bdimX = arith.constant 12 : index
%bdimY = arith.constant 9 : index
gpu.launch blocks(%bx, %by, %bz) in (%x = %0, %y = %1, %z = %2)
           threads(%tx, %ty, %tz) in (%tx = %bdimX, %ty = %bdimY, %tz = %5) {
  if (threadIdx.x < 9 && threadIdx.y < 7) {
    ... // body 1
  }
  gpu.barrier
  if (threadIdx.y < 1) {
    ... // body 2
  }
  gpu.barrier
  gpu.terminator
}
- OPERATION_NAME = 'transform.gpu.map_nested_forall_to_threads'¶
- _ODS_REGIONS = (0, True)¶
- target() _ods_ir¶
- block_dims() _ods_ir¶
- sync_after_distribute() _ods_ir¶
- warp_size() _ods_ir¶
- result() _ods_ir¶
Shortcut to get an op result if it has only one (throws an error otherwise).
- mlir.dialects._gpu_transform_ops_gen.gpu_map_nested_forall_to_threads(result, target, *, block_dims=None, sync_after_distribute=None, warp_size=None, loc=None, ip=None) _ods_ir¶
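A minimal sketch under the same assumptions as above; block_dims and warp_size are illustrative, and the target handle would normally come from matching the gpu.launch produced by transform.gpu.map_forall_to_blocks.
from mlir import ir
from mlir.dialects import transform
from mlir.dialects import _gpu_transform_ops_gen as gpu_transform

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        seq = transform.SequenceOp(
            transform.FailurePropagationMode.Propagate, [], transform.AnyOpType.get()
        )
        with ir.InsertionPoint(seq.body):
            # Map nested scf.forall ops to threads of an illustrative 32x4x1 block.
            gpu_transform.gpu_map_nested_forall_to_threads(
                transform.AnyOpType.get(),
                seq.bodyTarget,
                block_dims=[32, 4, 1],
                warp_size=32,
            )
            transform.YieldOp()
    print(module)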