mlir.dialects._nvgpu_transform_ops_gen
======================================

.. py:module:: mlir.dialects._nvgpu_transform_ops_gen


Attributes
----------

.. autoapisummary::

   mlir.dialects._nvgpu_transform_ops_gen._ods_ir


Classes
-------

.. autoapisummary::

   mlir.dialects._nvgpu_transform_ops_gen.ApplyNVGPUToNVVMConversionPatternsOp
   mlir.dialects._nvgpu_transform_ops_gen.CreateAsyncGroupsOp
   mlir.dialects._nvgpu_transform_ops_gen.PipelineSharedMemoryCopiesOp
   mlir.dialects._nvgpu_transform_ops_gen.RewriteCopyAsTmaOp
   mlir.dialects._nvgpu_transform_ops_gen.RewriteMatmulAsMmaSyncOp


Functions
---------

.. autoapisummary::

   mlir.dialects._nvgpu_transform_ops_gen.apply_conversion_patterns_nvgpu_nvgpu_to_nvvm
   mlir.dialects._nvgpu_transform_ops_gen.nvgpu_create_async_groups
   mlir.dialects._nvgpu_transform_ops_gen.nvgpu_pipeline_shared_memory_copies
   mlir.dialects._nvgpu_transform_ops_gen.nvgpu_rewrite_copy_as_tma
   mlir.dialects._nvgpu_transform_ops_gen.nvgpu_rewrite_matmul_as_mma_sync


Module Contents
---------------

.. py:data:: _ods_ir

.. py:class:: ApplyNVGPUToNVVMConversionPatternsOp(*, loc=None, ip=None)

   Bases: :py:obj:`_ods_ir`

   Collects patterns that convert NVGPU dialect ops to NVVM dialect ops. These
   patterns require an "LLVMTypeConverter".

   .. py:attribute:: OPERATION_NAME
      :value: 'transform.apply_conversion_patterns.nvgpu.nvgpu_to_nvvm'

   .. py:attribute:: _ODS_REGIONS
      :value: (0, True)


.. py:function:: apply_conversion_patterns_nvgpu_nvgpu_to_nvvm(*, loc=None, ip=None) -> ApplyNVGPUToNVVMConversionPatternsOp

.. py:class:: CreateAsyncGroupsOp(result, target, *, bypass_l1=None, loc=None, ip=None)

   Bases: :py:obj:`_ods_ir`

   Look for global to shared memory copies within the targeted op in the form
   of vector transfer ops and convert them to async copies when possible.
   Consecutive copies are put into the same group. A "wait" operation is
   inserted right at the end of the group.

   ``bypass_l1`` specifies whether ``bypassL1`` attributes should be added to
   the async copies. ``bypass_l1`` is a compiler hint: only 16-byte transfers
   can bypass the L1 cache, so this attribute is not set for any other
   transfer sizes.

   Return modes
   ------------

   This op consumes the ``target`` handle and produces the ``result`` handle,
   which is mapped to the same payload operations as the ``target`` handle.
   The op modifies the payload.

   .. py:attribute:: OPERATION_NAME
      :value: 'transform.nvgpu.create_async_groups'

   .. py:attribute:: _ODS_REGIONS
      :value: (0, True)

   .. py:method:: target() -> _ods_ir

   .. py:method:: bypass_l1() -> bool

   .. py:method:: result() -> _ods_ir

      Shortcut to get an op result if it has only one (throws an error otherwise).


.. py:function:: nvgpu_create_async_groups(result, target, *, bypass_l1=None, loc=None, ip=None) -> _ods_ir
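
For illustration only, a minimal sketch of emitting this op from Python. It
assumes an active ``mlir.ir`` context, location, and insertion point (e.g.
inside the body of a transform sequence being built), an existing transform
handle ``target``, and that ``transform.AnyOpType`` is an acceptable result
handle type; none of these names are defined by this module.

.. code-block:: python

   # Hypothetical sketch: emit transform.nvgpu.create_async_groups.
   # Assumes an active Context/Location/InsertionPoint and an existing
   # transform handle `target` (e.g. a block argument of a transform sequence).
   from mlir.dialects import transform
   from mlir.dialects._nvgpu_transform_ops_gen import nvgpu_create_async_groups

   any_op = transform.AnyOpType.get()  # type of the produced result handle
   # Consumes `target` and returns a handle mapped to the same payload ops.
   groups = nvgpu_create_async_groups(any_op, target, bypass_l1=True)
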
.. py:class:: PipelineSharedMemoryCopiesOp(result, for_op, depth, *, peel_epilogue=None, failure_propagation_mode=None, loc=None, ip=None)

   Bases: :py:obj:`_ods_ir`

   Applies software pipelining to a given scf.for loop. The pipelining
   strategy will look for a load into shared memory and pipeline it to
   overlap it with the rest of the loop.

   NOTE: It is the user's responsibility to ensure that there are no
   dependencies between ``depth`` iterations of the loop by using
   multi-buffering. It is also the user's responsibility to ensure that a
   sufficient amount of shared memory is allocated to cover eventual writes by
   ``depth-1`` speculative iterations.

   ``depth`` indicates how many stages the software pipeline should have.
   ``peel_epilogue`` forces the epilogue to be peeled out instead of
   potentially using predicated operations for the epilogue phase.

   Return modes
   ------------

   Consumes the operand handle and produces a result handle pointing to the
   loop, which may or may not have been pipelined. Produces a definite failure
   if the loop pipeliner mutated the IR before failing to pipeline, in
   particular if ``peel_epilogue`` is not set and the loop body doesn't
   support predication. If the failure propagation mode is set to "propagate",
   produces a silenceable failure when pipelining preconditions, e.g., the
   loop bound being static, are not met or when the loop wasn't pipelined due
   to the lack of loads into shared memory. If the failure propagation mode is
   set to "suppress" (default), succeeds in these cases and associates the
   result handle with the original loop.

   TODO: the shared memory part and behavior specific to NVGPU should be made
   orthogonal to pipelining so that ``transform.loop.pipeline`` becomes usable
   here.

   .. py:attribute:: OPERATION_NAME
      :value: 'transform.nvgpu.pipeline_shared_memory_copies'

   .. py:attribute:: _ODS_REGIONS
      :value: (0, True)

   .. py:method:: for_op() -> _ods_ir

   .. py:method:: depth() -> _ods_ir

   .. py:method:: peel_epilogue() -> bool

   .. py:method:: failure_propagation_mode() -> _ods_ir

   .. py:method:: result() -> _ods_ir

      Shortcut to get an op result if it has only one (throws an error otherwise).


.. py:function:: nvgpu_pipeline_shared_memory_copies(result, for_op, depth, *, peel_epilogue=None, failure_propagation_mode=None, loc=None, ip=None) -> _ods_ir

.. py:class:: RewriteCopyAsTmaOp(target, *, loc=None, ip=None)

   Bases: :py:obj:`_ods_ir`

   Rewrite a copy operation on memref to TMA operations that transit through
   shared memory.

   .. py:attribute:: OPERATION_NAME
      :value: 'transform.nvgpu.rewrite_copy_as_tma'

   .. py:attribute:: _ODS_REGIONS
      :value: (0, True)

   .. py:method:: target() -> _ods_ir


.. py:function:: nvgpu_rewrite_copy_as_tma(target, *, loc=None, ip=None) -> RewriteCopyAsTmaOp

.. py:class:: RewriteMatmulAsMmaSyncOp(target, *, loc=None, ip=None)

   Bases: :py:obj:`_ods_ir`

   Rewrite a matmul operation on memref to an mma.sync operation on vectors.

   Memory copies with the required access patterns are automatically inserted.
   Operations that do not have a 1-1 mapping to mma.sync operations are left
   unchanged.

   .. py:attribute:: OPERATION_NAME
      :value: 'transform.nvgpu.rewrite_matmul_as_mma_sync'

   .. py:attribute:: _ODS_REGIONS
      :value: (0, True)

   .. py:method:: target() -> _ods_ir


.. py:function:: nvgpu_rewrite_matmul_as_mma_sync(target, *, loc=None, ip=None) -> RewriteMatmulAsMmaSyncOp
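
Similarly, for illustration only, a hedged sketch of the pipelining entry
point under the same assumptions as above; ``for_handle`` and the chosen
``depth`` are placeholders standing in for an existing transform handle that
points at the ``scf.for`` payload loop and are not defined by this module.

.. code-block:: python

   # Hypothetical sketch: pipeline shared-memory copies in a targeted scf.for.
   # Assumes an active Context/Location/InsertionPoint and a transform handle
   # `for_handle` pointing at the scf.for loop to pipeline.
   from mlir.dialects import transform
   from mlir.dialects._nvgpu_transform_ops_gen import (
       nvgpu_pipeline_shared_memory_copies,
   )

   # The user must guarantee there are no dependencies across `depth`
   # iterations of the loop (e.g. via multi-buffering).
   pipelined = nvgpu_pipeline_shared_memory_copies(
       transform.AnyOpType.get(),  # type of the produced result handle
       for_handle,
       depth=3,
       peel_epilogue=True,  # peel the epilogue instead of predicating it
   )
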