mlir.dialects._gpu_ops_gen¶
Module Contents¶
- mlir.dialects._gpu_ops_gen._ods_ir¶
- class mlir.dialects._gpu_ops_gen._Dialect(descriptor: object)¶
Bases: _ods_ir
- DIALECT_NAMESPACE = 'gpu'¶
- class mlir.dialects._gpu_ops_gen.AllReduceOp(value, *, op=None, uniform=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
The all_reduce op reduces the value of every work item across a local workgroup. The result is equal for all work items of a workgroup.
For example, both
%1 = gpu.all_reduce add %0 {} : (f32) -> (f32)
%2 = gpu.all_reduce %0 {
^bb(%lhs : f32, %rhs : f32):
  %sum = arith.addf %lhs, %rhs : f32
  "gpu.yield"(%sum) : (f32) -> ()
} : (f32) -> (f32)
compute the sum of each work item’s %0 value. The first version specifies the accumulation as an operation, whereas the second version specifies the accumulation as a code region. The reduction operation must be one of:
* Integer types: add, mul, minui, minsi, maxui, maxsi, and, or, xor
* Floating point types: add, mul, minnumf, maxnumf, minimumf, maximumf
If the uniform flag is set, either none or all work items of a workgroup need to execute this op in convergence.
- OPERATION_NAME = 'gpu.all_reduce'¶
- _ODS_REGIONS = (1, True)¶
- value() _ods_ir¶
- op() _ods_ir | None¶
- uniform() bool¶
- result() _ods_ir¶
Shortcut to get an op result if it has only one (throws an error otherwise).
- body() _ods_ir¶
- mlir.dialects._gpu_ops_gen.all_reduce(value, *, op=None, uniform=None, results=None, loc=None, ip=None) _ods_ir¶
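The reduction semantics described for gpu.all_reduce can be modelled in plain Python. This is only an illustrative sketch: the function name all_reduce below is a hypothetical stand-in that does not use the MLIR bindings, and a Python list plays the role of the per-work-item values of one workgroup.

```python
from functools import reduce
import operator


# Hypothetical pure-Python model; not part of the MLIR Python bindings.
def all_reduce(values, combine=operator.add):
    """Model gpu.all_reduce: values[i] is the value held by work item i."""
    if not values:
        raise ValueError("a workgroup has at least one work item")
    total = reduce(combine, values)
    # Every work item of the workgroup receives the same reduced value.
    return [total] * len(values)


workgroup = [1.0, 2.0, 3.0, 4.0]
# Accumulation given as a named operation (like `gpu.all_reduce add`).
print(all_reduce(workgroup))
# Accumulation given as a "region" (here: a Python callable).
print(all_reduce(workgroup, lambda lhs, rhs: lhs + rhs))
```

Both calls produce the same list, mirroring the statement that the two MLIR forms compute the same sum.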
- class mlir.dialects._gpu_ops_gen.AllocOp(memref, asyncToken, asyncDependencies, dynamicSizes, symbolOperands, *, hostShared=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.alloc operation allocates a region of memory on the GPU. It is similar to the memref.alloc op, but supports asynchronous GPU execution.
The op does not execute before all async dependencies have finished executing.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it also returns a !gpu.async.token.
If the host_shared keyword is present, the memory will be allocated in a memory accessible both on host and on device.
Example:
%memref, %token = gpu.alloc async [%dep] host_shared (%width) : memref<64x?xf32, 1>
- OPERATION_NAME = 'gpu.alloc'¶
- _ODS_OPERAND_SEGMENTS¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- dynamicSizes() _ods_ir¶
- symbolOperands() _ods_ir¶
- memref() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.alloc(memref, async_token, async_dependencies, dynamic_sizes, symbol_operands, *, host_shared=None, loc=None, ip=None) _ods_ir | _ods_ir | AllocOp¶
- class mlir.dialects._gpu_ops_gen.BarrierOp(*, loc=None, ip=None)¶
Bases: _ods_ir
The barrier op synchronizes all work items of a workgroup. It is used to coordinate communication between the work items of the workgroup.
gpu.barrier waits until all work items in the workgroup have reached this point and all memory accesses made by these work items prior to the op are visible to all work items in the workgroup. Data hazards between work items accessing the same memory can be avoided by synchronizing work items in-between these accesses.
Either none or all work items of a workgroup need to execute this op in convergence.
- OPERATION_NAME = 'gpu.barrier'¶
- _ODS_REGIONS = (0, True)¶
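As a rough host-side analogy for the barrier semantics (not GPU code and not using the MLIR bindings), Python's threading.Barrier shows the same write, synchronize, read pattern. The workgroup size and the shared and received buffers are assumptions made up for this sketch.

```python
import threading

NUM_ITEMS = 4                       # hypothetical workgroup size
barrier = threading.Barrier(NUM_ITEMS)
shared = [0] * NUM_ITEMS            # stands in for workgroup memory
received = [0] * NUM_ITEMS


def work_item(tid):
    shared[tid] = tid * 10          # each work item writes its own slot
    barrier.wait()                  # all writes above are complete past this point
    # Safe to read a slot written by another work item.
    received[tid] = shared[(tid + 1) % NUM_ITEMS]


threads = [threading.Thread(target=work_item, args=(t,)) for t in range(NUM_ITEMS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(received)
```

Without the barrier.wait() call a work item could read its neighbour's slot before it was written, which is the data hazard the description refers to.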
- class mlir.dialects._gpu_ops_gen.BinaryOp(sym_name, objects, *, offloadingHandler=None, loc=None, ip=None)¶
Bases: _ods_ir
GPU binaries provide a semantic mechanism for storing GPU objects, e.g. the result of compiling a GPU module to an object file.
This operation has 3 arguments:
The name of the binary.
An optional attribute implementing the offloading LLVM translation interface.
An array of GPU object attributes.
During translation, the offloading attribute will be called for translating GPU binary and launch_func operations. The default offloading handler is #gpu.select_object; this handler selects the first object from the array and embeds it as a string.
Examples:
// Selects the first object.
gpu.binary @myobject [#gpu.object<...>, #gpu.object<...>]
// Uses the `#foo.my_handler` for handling the binary during translation.
gpu.binary @myobject <#foo.my_handler> [#gpu.object<...>, #gpu.object<...>]
// Selects the object with the `#rocdl.target` target attribute.
gpu.binary @myobject <#gpu.select_object<#rocdl.target>> [#gpu.object<...>, #gpu.object<#rocdl.target, ...>]
- OPERATION_NAME = 'gpu.binary'¶
- _ODS_REGIONS = (0, True)¶
- sym_name() _ods_ir¶
- offloadingHandler() _ods_ir | None¶
- objects() _ods_ir¶
- mlir.dialects._gpu_ops_gen.binary(sym_name, objects, *, offloading_handler=None, loc=None, ip=None) BinaryOp¶
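The selection behaviour shown in the gpu.binary examples can be sketched as a small pure-Python model. Everything here is hypothetical: select_object is not an MLIR API, objects are modelled as (target, payload) tuples, and the integer-index form is assumed from the `#gpu.select_object<1>` spelling rather than confirmed by this page.

```python
# Hypothetical model of object selection; names are illustrative only.
def select_object(objects, selector=None):
    """objects: list of (target, payload) pairs.

    selector: None (default), an integer index, or a target name.
    """
    if not objects:
        raise ValueError("a binary needs at least one object")
    if selector is None:
        return objects[0]           # default handler: first object
    if isinstance(selector, int):
        return objects[selector]    # assumed meaning of an index selector
    for obj in objects:
        if obj[0] == selector:      # select by target attribute
            return obj
    raise LookupError(f"no object for target {selector!r}")


objs = [("nvvm", b"ptx-bytes"), ("rocdl", b"hsaco-bytes")]
print(select_object(objs))
print(select_object(objs, "rocdl"))
```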
- class mlir.dialects._gpu_ops_gen.BlockDimOp(dimension, *, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the number of threads in the thread block (aka the block size) along the x, y, or z dimension.
Example:
%bDimX = gpu.block_dim x
If known_block_size is set on this operation’s enclosing gpu.func, or gpu.known_block_size is set on an enclosing FunctionOpInterface implementor, or if the enclosing gpu.launch specifies a constant size for dimension’s blocks, these contextual facts may be used to infer that this operation has a constant value, though such a transformation will not be performed by canonicalization or the default constant folder. Executions which cause that constant-value assumption to be false incur undefined behavior.
If upper_bound is set, executions where the block size along dimension exceeds upper_bound cause undefined behavior.
There is an implicit upper bound of kMaxDim (currently uint32_t::max).
- OPERATION_NAME = 'gpu.block_dim'¶
- _ODS_REGIONS = (0, True)¶
- dimension() _ods_ir¶
- upper_bound() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.block_dim(dimension, *, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.BlockIdOp(dimension, *, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the block id, i.e. the index of the current block within the grid along the x, y, or z dimension.
Example:
%bIdY = gpu.block_id y
If upper_bound is set, or if one can be inferred from known_grid_size-type annotations in context, executions where the block index in dimension would be greater than or equal to that bound cause undefined behavior. upper_bound takes priority over bounds inferrable from context.
There is an implicit upper bound of kMaxDim (currently uint32_t::max).
- OPERATION_NAME = 'gpu.block_id'¶
- _ODS_REGIONS = (0, True)¶
- dimension() _ods_ir¶
- upper_bound() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.block_id(dimension, *, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.ClusterBlockIdOp(dimension, *, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the block id within the cluster along the x, y, or z dimension.
Example:
%cBlockIdY = gpu.cluster_block_id y
If upper_bound is set, then executing (a lowering of) this operation in an environment where the number of thread blocks per cluster along dimension is greater than upper_bound causes undefined behavior.
There is an implicit upper bound of kMaxClusterDim (currently 8).
- OPERATION_NAME = 'gpu.cluster_block_id'¶
- _ODS_REGIONS = (0, True)¶
- dimension() _ods_ir¶
- upper_bound() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.cluster_block_id(dimension, *, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.ClusterDimBlocksOp(dimension, *, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the number of thread blocks in the cluster along the x, y, or z dimension.
Example:
%cDimBlocksX = gpu.cluster_dim_blocks x
If upper_bound is set, then executing (a lowering of) this operation in an environment where the number of thread blocks per cluster is greater than upper_bound causes undefined behavior.
There is an implicit upper bound of kMaxClusterDim (currently 8).
- OPERATION_NAME = 'gpu.cluster_dim_blocks'¶
- _ODS_REGIONS = (0, True)¶
- dimension() _ods_ir¶
- upper_bound() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.cluster_dim_blocks(dimension, *, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.ClusterDimOp(dimension, *, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the number of cluster identifiers per grid along the x, y, or z dimension.
Example:
%cDimX = gpu.cluster_dim x
If upper_bound is set, then executing (a lowering of) this operation in an environment where the number of clusters per grid is greater than upper_bound causes undefined behavior.
There is an implicit upper bound of kMaxDim (currently uint32_t::max).
- OPERATION_NAME = 'gpu.cluster_dim'¶
- _ODS_REGIONS = (0, True)¶
- dimension() _ods_ir¶
- upper_bound() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.cluster_dim(dimension, *, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.ClusterIdOp(dimension, *, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the cluster id, i.e. the index of the current cluster within the grid along the x, y, or z dimension.
Example:
%cIdY = gpu.cluster_id y
If upper_bound is set, then executing (a lowering of) this operation in an environment where the number of clusters in the grid along dimension is greater than upper_bound causes undefined behavior.
There is an implicit upper bound of kMaxDim (currently uint32_t::max).
- OPERATION_NAME = 'gpu.cluster_id'¶
- _ODS_REGIONS = (0, True)¶
- dimension() _ods_ir¶
- upper_bound() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.cluster_id(dimension, *, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.Create2To4SpMatOp(spMat, asyncToken, asyncDependencies, rows, cols, memref, *, pruneFlag=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.create_2to4_spmat operation initializes a sparse matrix in dense format with 2:4 sparsity. The buffers must already be copied from the host to the device prior to using this operation. The operation returns a handle to the sparse matrix descriptor.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%spmat, %token = gpu.create_2to4_spmat async [%dep] {PRUNE_AND_CHECK} %rows, %cols, %mem: memref<?xf64>
- OPERATION_NAME = 'gpu.create_2to4_spmat'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- rows() _ods_ir¶
- cols() _ods_ir¶
- memref() _ods_ir¶
- pruneFlag() _ods_ir¶
- spMat() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.create_2to4_spmat(sp_mat, async_token, async_dependencies, rows, cols, memref, *, prune_flag=None, loc=None, ip=None) _ods_ir | _ods_ir | Create2To4SpMatOp¶
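For readers unfamiliar with 2:4 sparsity: each aligned group of four consecutive values holds at most two nonzeros. The pure-Python sketch below checks and enforces that property on one row. The helper names are hypothetical, and magnitude-based pruning is an illustrative choice here, not necessarily what the op's prune flag does.

```python
def is_2to4_sparse(row):
    """True if every aligned group of 4 values has at most 2 nonzeros."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    return all(
        sum(1 for v in row[i:i + 4] if v != 0) <= 2
        for i in range(0, len(row), 4)
    )


def prune_2to4(row):
    """Keep the 2 largest-magnitude values in each group of 4 (assumed policy)."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        order = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)
        keep = set(order[:2])
        out.extend(v if j in keep else 0 for j, v in enumerate(group))
    return out


row = [1, -5, 3, 2, 0, 0, 7, 0]
print(is_2to4_sparse(row))          # first group has 4 nonzeros
print(prune_2to4(row))
```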
- class mlir.dialects._gpu_ops_gen.CreateBsrOp(spmat, asyncToken, asyncDependencies, brows, bcols, bnnz, rBlockSize, cBlockSize, bRowPos, bColIdxs, values, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.create_bsr operation initializes a sparse matrix in BSR format with the given sizes for the matrix and blocks from the given position, index, and values buffers. The buffers must already be copied from the host to the device prior to using this operation. The operation returns a handle to the sparse matrix descriptor.
The BSR format is similar to CSR, where the column indices represent two-dimensional blocks instead of a single matrix entry. Note that this operation (currently) only supports storage with square blocks, i.e., rBlockSize == cBlockSize.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%spmat, %token = gpu.create_bsr async [%dep] %brows, %bcols, %bnnz, %rBlockSize, %cBlockSize, %bRowPos, %bColIdxs, %values : memref<?xindex>, memref<?xindex>, memref<?xf64>
- OPERATION_NAME = 'gpu.create_bsr'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- brows() _ods_ir¶
- bcols() _ods_ir¶
- bnnz() _ods_ir¶
- rBlockSize() _ods_ir¶
- cBlockSize() _ods_ir¶
- bRowPos() _ods_ir¶
- bColIdxs() _ods_ir¶
- values() _ods_ir¶
- spmat() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.create_bsr(spmat, async_token, async_dependencies, brows, bcols, bnnz, r_block_size, c_block_size, b_row_pos, b_col_idxs, values, *, loc=None, ip=None) _ods_ir | _ods_ir | CreateBsrOp¶
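To make the BSR buffers concrete, here is a minimal pure-Python sketch that builds block row positions, block column indices, and values from a small dense matrix with square blocks. The function name is hypothetical, and storing each block row-major is an assumption of this sketch, not something this page specifies.

```python
def dense_to_bsr(dense, block_size):
    """Convert a dense matrix (list of rows) to BSR with square blocks."""
    rows, cols = len(dense), len(dense[0])
    if rows % block_size or cols % block_size:
        raise ValueError("matrix dimensions must be multiples of block_size")
    brows, bcols = rows // block_size, cols // block_size
    b_row_pos, b_col_idxs, values = [0], [], []
    for bi in range(brows):
        for bj in range(bcols):
            block = [
                dense[bi * block_size + r][bj * block_size + c]
                for r in range(block_size)
                for c in range(block_size)
            ]
            if any(v != 0 for v in block):
                b_col_idxs.append(bj)   # one column index per stored block
                values.extend(block)    # block stored row-major (assumption)
        b_row_pos.append(len(b_col_idxs))
    bnnz = len(b_col_idxs)
    return brows, bcols, bnnz, b_row_pos, b_col_idxs, values


dense = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 5, 6],
]
print(dense_to_bsr(dense, 2))
```

Compared with CSR, the position and index buffers count blocks rather than single entries, which is why zeros inside a stored block still appear in the values buffer.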
- class mlir.dialects._gpu_ops_gen.CreateCooAoSOp(spmat, asyncToken, asyncDependencies, rows, cols, nnz, idxs, values, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.create_coo_aos operation initializes a sparse matrix in COO format with the given sizes from the given index and values buffers. The buffers must already be copied from the host to the device prior to using this operation. The operation returns a handle to the sparse matrix descriptor. Unlike the default gpu.create_coo operation, this operation builds the COO format from a single index buffer in AoS format (note that this feature has been deprecated in cuSparse 11.2).
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%spmat, %token = gpu.create_coo_aos async [%dep] %rows, %cols, %nnz, %idxs, %values : memref<?xindex>, memref<?xf64>
- OPERATION_NAME = 'gpu.create_coo_aos'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- rows() _ods_ir¶
- cols() _ods_ir¶
- nnz() _ods_ir¶
- idxs() _ods_ir¶
- values() _ods_ir¶
- spmat() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.create_coo_aos(spmat, async_token, async_dependencies, rows, cols, nnz, idxs, values, *, loc=None, ip=None) _ods_ir | _ods_ir | CreateCooAoSOp¶
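The difference between the AoS index buffer used here and the SoA buffers used by gpu.create_coo can be illustrated with a short pure-Python sketch. The helper names are hypothetical, and the interleaved [row0, col0, row1, col1, ...] layout is assumed for the AoS form.

```python
def soa_to_aos(row_idxs, col_idxs):
    """Interleave separate row/column buffers into [r0, c0, r1, c1, ...]."""
    if len(row_idxs) != len(col_idxs):
        raise ValueError("index buffers must have equal length")
    aos = []
    for r, c in zip(row_idxs, col_idxs):
        aos.extend((r, c))
    return aos


def aos_to_soa(idxs):
    """Split an interleaved index buffer back into row and column buffers."""
    if len(idxs) % 2:
        raise ValueError("AoS index buffer must hold (row, col) pairs")
    return idxs[0::2], idxs[1::2]


rows, cols = [0, 0, 2], [1, 3, 2]
aos = soa_to_aos(rows, cols)
print(aos)
print(aos_to_soa(aos))
```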
- class mlir.dialects._gpu_ops_gen.CreateCooOp(spmat, asyncToken, asyncDependencies, rows, cols, nnz, rowIdxs, colIdxs, values, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.create_coo operation initializes a sparse matrix in COO format with the given sizes from the given index and values buffers. The buffers must already be copied from the host to the device prior to using this operation. The operation returns a handle to the sparse matrix descriptor. Note that this operation builds the COO in SoA format.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%spmat, %token = gpu.create_coo async [%dep] %rows, %cols, %nnz, %rowIdx, %colIdx, %values : memref<?xindex>, memref<?xindex>, memref<?xf64>
- OPERATION_NAME = 'gpu.create_coo'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- rows() _ods_ir¶
- cols() _ods_ir¶
- nnz() _ods_ir¶
- rowIdxs() _ods_ir¶
- colIdxs() _ods_ir¶
- values() _ods_ir¶
- spmat() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.create_coo(spmat, async_token, async_dependencies, rows, cols, nnz, row_idxs, col_idxs, values, *, loc=None, ip=None) _ods_ir | _ods_ir | CreateCooOp¶
- class mlir.dialects._gpu_ops_gen.CreateCscOp(spmat, asyncToken, asyncDependencies, rows, cols, nnz, colPos, rowIdxs, values, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.create_csc operation initializes a sparse matrix in CSC format with the given sizes from the given position, index, and values buffers. The buffers must already be copied from the host to the device prior to using this operation. The operation returns a handle to the sparse matrix descriptor.
The CSC format has exactly the same memory layout as its transpose in CSR format (and vice versa).
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%spmat, %token = gpu.create_csc async [%dep] %rows, %cols, %nnz, %colPos, %rowIdx, %values : memref<?xindex>, memref<?xindex>, memref<?xf64>
- OPERATION_NAME = 'gpu.create_csc'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- rows() _ods_ir¶
- cols() _ods_ir¶
- nnz() _ods_ir¶
- colPos() _ods_ir¶
- rowIdxs() _ods_ir¶
- values() _ods_ir¶
- spmat() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.create_csc(spmat, async_token, async_dependencies, rows, cols, nnz, col_pos, row_idxs, values, *, loc=None, ip=None) _ods_ir | _ods_ir | CreateCscOp¶
- class mlir.dialects._gpu_ops_gen.CreateCsrOp(spmat, asyncToken, asyncDependencies, rows, cols, nnz, rowPos, colIdxs, values, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.create_csr operation initializes a sparse matrix in CSR format with the given sizes from the given position, index, and values buffers. The buffers must already be copied from the host to the device prior to using this operation. The operation returns a handle to the sparse matrix descriptor.
The CSR format has exactly the same memory layout as its transpose in CSC format (and vice versa).
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%spmat, %token = gpu.create_csr async [%dep] %rows, %cols, %nnz, %rowPos, %colIdx, %values : memref<?xindex>, memref<?xindex>, memref<?xf64>
- OPERATION_NAME = 'gpu.create_csr'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- rows() _ods_ir¶
- cols() _ods_ir¶
- nnz() _ods_ir¶
- rowPos() _ods_ir¶
- colIdxs() _ods_ir¶
- values() _ods_ir¶
- spmat() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.create_csr(spmat, async_token, async_dependencies, rows, cols, nnz, row_pos, col_idxs, values, *, loc=None, ip=None) _ods_ir | _ods_ir | CreateCsrOp¶
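The statement that CSR and CSC share a memory layout under transposition can be checked with a small pure-Python sketch. The helper functions are hypothetical and operate on dense lists of rows; they only illustrate the buffer contents, not the GPU operations themselves.

```python
def dense_to_csr(dense):
    """Return (row_pos, col_idxs, values) for a dense list-of-rows matrix."""
    row_pos, col_idxs, values = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                col_idxs.append(j)
                values.append(v)
        row_pos.append(len(col_idxs))
    return row_pos, col_idxs, values


def dense_to_csc(dense):
    """Return (col_pos, row_idxs, values) for a dense list-of-rows matrix."""
    col_pos, row_idxs, values = [0], [], []
    for j in range(len(dense[0])):
        for i, row in enumerate(dense):
            if row[j] != 0:
                row_idxs.append(i)
                values.append(row[j])
        col_pos.append(len(row_idxs))
    return col_pos, row_idxs, values


def transpose(dense):
    return [list(col) for col in zip(*dense)]


a = [[1, 0, 2],
     [0, 0, 3]]
# The CSC buffers of A are identical to the CSR buffers of A transposed.
print(dense_to_csc(a) == dense_to_csr(transpose(a)))
```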
- class mlir.dialects._gpu_ops_gen.CreateDnTensorOp(dnTensor, asyncToken, asyncDependencies, memref, dims, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.create_dn_tensor operation initializes a dense tensor from the given values buffer and sizes. The buffer must already be copied from the host to the device prior to using this operation. The operation returns a handle to the dense tensor descriptor.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%dmat, %token = gpu.create_dn_tensor async [%dep] %mem, %dims : index, index into memref<?xf64>
- OPERATION_NAME = 'gpu.create_dn_tensor'¶
- _ODS_OPERAND_SEGMENTS¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- memref() _ods_ir¶
- dims() _ods_ir¶
- dnTensor() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.create_dn_tensor(dn_tensor, async_token, async_dependencies, memref, dims, *, loc=None, ip=None) _ods_ir | _ods_ir | CreateDnTensorOp¶
- class mlir.dialects._gpu_ops_gen.DeallocOp(asyncToken, asyncDependencies, memref, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.dealloc operation frees the region of memory referenced by a memref which was originally created by the gpu.alloc operation. It is similar to the memref.dealloc op, but supports asynchronous GPU execution.
The op does not execute before all async dependencies have finished executing.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token.
Example:
%token = gpu.dealloc async [%dep] %memref : memref<8x64xf32, 1>
- OPERATION_NAME = 'gpu.dealloc'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- memref() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.dealloc(async_token, async_dependencies, memref, *, loc=None, ip=None) _ods_ir | _ods_ir | DeallocOp¶
- class mlir.dialects._gpu_ops_gen.DestroyDnTensorOp(asyncToken, asyncDependencies, dnTensor, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.destroy_dn_tensor operation releases all resources of a dense tensor represented by a handle that was previously created by a gpu.create_dn_tensor operation.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%token = gpu.destroy_dn_tensor async [%dep] %dnTensor
- OPERATION_NAME = 'gpu.destroy_dn_tensor'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- dnTensor() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.destroy_dn_tensor(async_token, async_dependencies, dn_tensor, *, loc=None, ip=None) _ods_ir | _ods_ir | DestroyDnTensorOp¶
- class mlir.dialects._gpu_ops_gen.DestroySpMatOp(asyncToken, asyncDependencies, spmat, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.destroy_sp_mat operation releases all resources of a sparse matrix represented by a handle that was previously created by one of the sparse matrix creation operations.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%token = gpu.destroy_sp_mat async [%dep] %spmat
- OPERATION_NAME = 'gpu.destroy_sp_mat'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- spmat() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.destroy_sp_mat(async_token, async_dependencies, spmat, *, loc=None, ip=None) _ods_ir | _ods_ir | DestroySpMatOp¶
Bases: _ods_ir
This operation provides a memref pointer to the start of dynamic shared memory, often referred to as workgroup memory. It’s important to note that this dynamic shared memory needs to be allocated at kernel launch. One can conveniently utilize the dynamic_shared_memory_size parameter of gpu.launch for this purpose.
Examples:
%0 = gpu.dynamic_shared_memory : memref<?xi8, #gpu.address_space<workgroup>>
%1 = memref.view %0[%c8192][] : memref<?xi8, #gpu.address_space<workgroup>> to memref<32x64xf32, #gpu.address_space<workgroup>>
%2 = memref.view %0[%c16384][] : memref<?xi8, #gpu.address_space<workgroup>> to memref<32x64xf32, #gpu.address_space<workgroup>>
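The byte offsets in the memref.view examples follow from the view sizes: a 32x64 buffer of f32 occupies 32 * 64 * 4 = 8192 bytes. The sketch below computes such offsets for tightly packed views. The helper name is hypothetical, and the starting offset of 8192 is simply taken from the example (what occupies the first 8192 bytes is not stated on this page).

```python
def view_offsets(shapes, elem_bytes, start=0):
    """Byte offsets at which consecutive, tightly packed views would begin."""
    offsets, cursor = [], start
    for shape in shapes:
        offsets.append(cursor)
        size = elem_bytes
        for dim in shape:
            size *= dim
        cursor += size
    # cursor is one past the last byte used, i.e. the minimum allocation size.
    return offsets, cursor


offsets, total = view_offsets([(32, 64), (32, 64)], elem_bytes=4, start=8192)
print(offsets)   # offsets to pass to the two memref.view ops
print(total)     # lower bound for the dynamic shared memory size at launch
```

Under these assumptions the total is the value one would need to cover with the dynamic shared memory size given at kernel launch.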
- class mlir.dialects._gpu_ops_gen.GPUFuncOp(function_type, *, arg_attrs=None, res_attrs=None, workgroup_attrib_attrs=None, private_attrib_attrs=None, known_block_size=None, known_grid_size=None, loc=None, ip=None)¶
Bases: _ods_ir
Defines a function that can be executed on a GPU. This supports memory attribution and its body has a particular execution model.
GPU functions are either kernels (as indicated by the kernel attribute) or regular functions. The former can be launched from the host side, while the latter are device side only.
The memory attribution defines SSA values that correspond to memory buffers allocated in the memory hierarchy of the GPU (see below).
The operation has one attached region that corresponds to the body of the function. The region arguments consist of the function arguments without modification, followed by buffers defined in memory annotations. The body of a GPU function, when launched, is executed by multiple work items. There are no guarantees on the order in which work items execute, or on the connection between them. In particular, work items are not necessarily executed in lock-step. Synchronization ops such as “gpu.barrier” should be used to coordinate work items. Declarations of GPU functions, i.e. not having the body region, are not supported.
A function may optionally be annotated with the block and/or grid sizes that will be used when it is launched using the known_block_size and known_grid_size attributes, respectively. If set, these attributes must be arrays of three 32-bit integers giving the x, y, and z launch dimensions. Launching a kernel that has these annotations, or that calls a function with these annotations, using a block size or grid size other than what is specified is undefined behavior. These attributes may be set on non-gpu.func functions by using gpu.known_block_size or gpu.known_grid_size, but this carries the risk that they will be discarded.
Syntax:
op ::= `gpu.func` symbol-ref-id `(` argument-list `)` (`->` function-result-list)? memory-attribution `kernel`? function-attributes? region
memory-attribution ::= (`workgroup` `(` ssa-id-and-type-list `)`)? (`private` `(` ssa-id-and-type-list `)`)?
Example:
gpu.func @foo(%arg0: index)
    workgroup(%workgroup: memref<32xf32, 3>)
    private(%private: memref<1xf32, 5>)
    kernel
    attributes {qux: "quux"} {
  gpu.return
}
The generic form illustrates the concept:
"gpu.func"(%arg: index) {sym_name: "foo", kernel, qux: "quux"} ({
^bb0(%arg0: index, %workgroup: memref<32xf32, 3>, %private: memref<1xf32, 5>):
  "gpu.return"() : () -> ()
}) : (index) -> ()
Note the non-default memory spaces used in memref types in memory attribution.
- OPERATION_NAME = 'gpu.func'¶
- _ODS_REGIONS = (1, True)¶
- function_type() _ods_ir¶
- arg_attrs() _ods_ir | None¶
- res_attrs() _ods_ir | None¶
- workgroup_attrib_attrs() _ods_ir | None¶
- private_attrib_attrs() _ods_ir | None¶
- known_block_size() _ods_ir | None¶
- known_grid_size() _ods_ir | None¶
- body() _ods_ir¶
- mlir.dialects._gpu_ops_gen.func(function_type, *, arg_attrs=None, res_attrs=None, workgroup_attrib_attrs=None, private_attrib_attrs=None, known_block_size=None, known_grid_size=None, loc=None, ip=None) GPUFuncOp¶
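The constraint on known_block_size and known_grid_size (three 32-bit integers, and a launch must use exactly those sizes) can be sketched as a host-side check in plain Python. Both helpers are hypothetical and not part of the bindings; treating the values as positive signed 32-bit integers is an assumption of this sketch.

```python
I32_MAX = 2**31 - 1  # assumes signed 32-bit storage for each dimension


def validate_known_size(size):
    """Check that size is an (x, y, z) triple of positive 32-bit integers."""
    if len(size) != 3:
        raise ValueError("expected exactly three dimensions (x, y, z)")
    for v in size:
        if not isinstance(v, int) or not 0 < v <= I32_MAX:
            raise ValueError(f"dimension {v!r} is not a positive 32-bit integer")
    return tuple(size)


def launch_matches(known, actual):
    """False means the launch would be undefined behavior per the annotation."""
    return known is None or tuple(known) == tuple(actual)


known_block = validate_known_size([128, 1, 1])
print(launch_matches(known_block, (128, 1, 1)))
print(launch_matches(known_block, (256, 1, 1)))
```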
- class mlir.dialects._gpu_ops_gen.GPUModuleOp(sym_name, *, targets=None, offloadingHandler=None, loc=None, ip=None)¶
Bases: _ods_ir
A GPU module contains code that is intended to be run on a GPU. A host device can launch this code through a gpu.launch_func that creates a fully qualified symbol through the gpu.module’s symbol and a gpu.func symbol contained in the gpu.module.
The module’s top-level scope is modeled by a single region with a single block. GPU modules are required to have a name that is used for symbol resolution by the gpu.launch_func operation.
Using an op with a region to define a GPU module enables “embedding” GPU modules with SIMT execution models in other dialects in a clean manner and allows filtering of code regions to execute passes on only code intended to or not intended to be run on the separate device.
Modules can contain zero or more target attributes. These attributes encode how to transform modules into binary strings and are used by the gpu-module-to-binary pass to transform modules into GPU binaries.
Modules can contain an optional OffloadingTranslationAttr attribute. This attribute will be used during the gpu-module-to-binary pass to specify the OffloadingTranslationAttr used when creating the gpu.binary operation.
gpu.module @symbol_name {
  gpu.func {}
  ...
}
// Module with offloading handler and target attributes.
gpu.module @symbol_name2 <#gpu.select_object<1>> [
    #nvvm.target,
    #rocdl.target<chip = "gfx90a">] {
  gpu.func {}
  ...
}
- OPERATION_NAME = 'gpu.module'¶
- _ODS_REGIONS = (1, True)¶
- sym_name() _ods_ir¶
- targets() _ods_ir | None¶
- offloadingHandler() _ods_ir | None¶
- bodyRegion() _ods_ir¶
- mlir.dialects._gpu_ops_gen.module(sym_name, *, targets=None, offloading_handler=None, loc=None, ip=None) GPUModuleOp¶
- class mlir.dialects._gpu_ops_gen.GlobalIdOp(dimension, *, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the unique global workitem/thread id, i.e., the unique index of the current workitem/thread within all workgroups / grid along the x, y, or z dimension.
Example:
%gidX = gpu.global_id x
%gidX = gpu.global_id x upper_bound 65536
The upper_bound attribute defines an upper bound analogously to the ones on thread_id and block_id. If one is not set, the bound may be inferred from a combination of known_block_size and known_grid_size-type annotations.
- OPERATION_NAME = 'gpu.global_id'¶
- _ODS_REGIONS = (0, True)¶
- dimension() _ods_ir¶
- upper_bound() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.global_id(dimension, *, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
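The relationship between the global id and the block and thread ids, and the way an upper bound could be inferred from block and grid sizes, can be sketched in plain Python. This uses the common block_id * block_dim + thread_id formula along one dimension, which is an assumption about the usual lowering rather than something this page states. The helper names are hypothetical.

```python
def global_id(block_id, block_dim, thread_id):
    """Global index of a thread along one dimension (assumed usual formula)."""
    if not 0 <= thread_id < block_dim:
        raise ValueError("thread_id must lie in [0, block_dim)")
    return block_id * block_dim + thread_id


def inferred_upper_bound(known_block_size, known_grid_size):
    """Exclusive bound on the global id along one dimension."""
    return known_block_size * known_grid_size


print(global_id(block_id=2, block_dim=128, thread_id=5))
print(inferred_upper_bound(known_block_size=128, known_grid_size=512))
```

With a block size of 128 and a grid size of 512 along x, the inferred bound equals the 65536 used in the upper_bound example above.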
- class mlir.dialects._gpu_ops_gen.GridDimOp(dimension, *, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the number of thread blocks in the grid along the x, y, or z dimension.
Example:
%gDimZ = gpu.grid_dim z
If known_grid_size is set on this operation’s enclosing gpu.func, or gpu.known_grid_size is set on an enclosing FunctionOpInterface implementor, or if the enclosing gpu.launch specifies a constant size for dimension’s grid length, these contextual facts may be used to infer that this operation has a constant value, though such a transformation will not be performed by canonicalization or the default constant folder. Executions which cause that constant-value assumption to be false incur undefined behavior.
If upper_bound is set, executions where the grid size in dimension would exceed upper_bound cause undefined behavior.
There is an implicit upper bound of kMaxDim (currently uint32_t::max).
- OPERATION_NAME = 'gpu.grid_dim'¶
- _ODS_REGIONS = (0, True)¶
- dimension() _ods_ir¶
- upper_bound() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.grid_dim(dimension, *, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.HostRegisterOp(value, *, loc=None, ip=None)¶
Bases: _ods_ir
This op maps the provided host buffer into the device address space.
This operation may not be supported in every environment; there is not yet a way to check at runtime whether this feature is supported.
Writes from the host are guaranteed to be visible to device kernels that are launched afterwards. Writes from the device are guaranteed to be visible on the host after synchronizing with the device kernel completion.
- OPERATION_NAME = 'gpu.host_register'¶
- _ODS_REGIONS = (0, True)¶
- value() _ods_ir¶
- mlir.dialects._gpu_ops_gen.host_register(value, *, loc=None, ip=None) HostRegisterOp¶
- class mlir.dialects._gpu_ops_gen.HostUnregisterOp(value, *, loc=None, ip=None)¶
Bases: _ods_ir
This op unmaps the provided host buffer from the device address space.
This operation may not be supported in every environment; there is not yet a way to check at runtime whether this feature is supported.
- OPERATION_NAME = 'gpu.host_unregister'¶
- _ODS_REGIONS = (0, True)¶
- value() _ods_ir¶
- mlir.dialects._gpu_ops_gen.host_unregister(value, *, loc=None, ip=None) HostUnregisterOp¶
- class mlir.dialects._gpu_ops_gen.LaneIdOp(*, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the lane id within the subgroup (warp/wave).
Example:
%laneId = gpu.lane_id
If upper_bound is set, executions with more than upper_bound lanes per subgroup cause undefined behavior. In the absence of upper_bound, the lane id is still assumed to be non-negative and less than the target-independent kMaxSubgroupSize (currently 128).
- OPERATION_NAME = 'gpu.lane_id'¶
- _ODS_REGIONS = (0, True)¶
- upper_bound() _ods_ir | None¶
- result() _ods_ir¶
Shortcut to get an op result if it has only one (throws an error otherwise).
- mlir.dialects._gpu_ops_gen.lane_id(*, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.LaunchFuncOp(asyncToken, asyncDependencies, kernel, gridSizeX, gridSizeY, gridSizeZ, blockSizeX, blockSizeY, blockSizeZ, kernelOperands, *, clusterSizeX=None, clusterSizeY=None, clusterSizeZ=None, dynamicSharedMemorySize=None, asyncObject=None, loc=None, ip=None)¶
Bases: _ods_ir
Launch a kernel function on the specified grid of thread blocks. gpu.launch operations are lowered to gpu.launch_func operations by outlining the kernel body into a function in a dedicated module, which reflects the separate compilation process. The kernel function is required to have the gpu.kernel attribute. The module containing the kernel function is required to be a gpu.module. And finally, the module containing the kernel module (which thus cannot be the top-level module) is required to have the gpu.container_module attribute. The gpu.launch_func operation has a symbol attribute named kernel to identify the fully specified kernel function to launch (both the gpu.module and func).
The gpu.launch_func supports async dependencies: the kernel does not start executing until the ops producing those async dependencies have completed.
By default, the host implicitly blocks until kernel execution has completed. If the async keyword is present, the host does not block but instead a !gpu.async.token is returned. Other async GPU ops can take this token as dependency.
The operation requires at least the grid and block sizes along the x,y,z dimensions as arguments. When a lower-dimensional kernel is required, unused sizes must be explicitly set to 1.
The remaining operands are optional. The first optional operand corresponds to the amount of dynamic shared memory a kernel’s workgroup should be allocated; when this operand is not present, a zero size is assumed.
The remaining operands if present are passed as arguments to the kernel function.
The gpu.launch_func also supports kernel launching with clusters if supported by the target architecture. The cluster size can be set by clusterSizeX, clusterSizeY, and clusterSizeZ arguments. When these arguments are present, the Op launches a kernel that clusters the given thread blocks. This feature is exclusive to certain architectures.
Example:
module attributes {gpu.container_module} {
  // This module creates a separate compilation unit for the GPU compiler.
  gpu.module @kernels {
    func.func @kernel_1(%arg0 : f32, %arg1 : memref<?xf32, 1>)
        attributes { nvvm.kernel = true } {
      // Operations that produce block/thread IDs and dimensions are
      // injected when outlining the `gpu.launch` body to a function called
      // by `gpu.launch_func`.
      %tIdX = gpu.thread_id x
      %tIdY = gpu.thread_id y
      %tIdZ = gpu.thread_id z
      %bDimX = gpu.block_dim x
      %bDimY = gpu.block_dim y
      %bDimZ = gpu.block_dim z
      %bIdX = gpu.block_id x
      %bIdY = gpu.block_id y
      %bIdZ = gpu.block_id z
      %gDimX = gpu.grid_dim x
      %gDimY = gpu.grid_dim y
      %gDimZ = gpu.grid_dim z
      // (Optional) Cluster size only for supported architectures
      %cIdX = gpu.cluster_id x
      %cIdY = gpu.cluster_id y
      %cIdZ = gpu.cluster_id z
      %cDimX = gpu.cluster_dim x
      %cDimY = gpu.cluster_dim y
      %cDimZ = gpu.cluster_dim z
      "some_op"(%bx, %tx) : (index, index) -> ()
      %42 = load %arg1[%bx] : memref<?xf32, 1>
    }
  }
  %t0 = gpu.wait async
  gpu.launch_func
      async                           // (Optional) Don't block host, return token.
      [%t0]                           // (Optional) Execute only after %t0 has completed.
      @kernels::@kernel_1             // Kernel function.
      clusters in (%cst, %cst, %cst)  // (Optional) Cluster size only for supported architectures.
      blocks in (%cst, %cst, %cst)    // Grid size.
      threads in (%cst, %cst, %cst)   // Block size.
      dynamic_shared_memory_size %s   // (Optional) Amount of dynamic shared
                                      // memory to allocate for a workgroup.
      args(%arg0 : f32,               // (Optional) Kernel arguments.
           %arg1 : memref<?xf32, 1>)
}
- OPERATION_NAME = 'gpu.launch_func'¶
- _ODS_OPERAND_SEGMENTS¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- gridSizeX() _ods_ir¶
- gridSizeY() _ods_ir¶
- gridSizeZ() _ods_ir¶
- blockSizeX() _ods_ir¶
- blockSizeY() _ods_ir¶
- blockSizeZ() _ods_ir¶
- clusterSizeX() _ods_ir | None¶
- clusterSizeY() _ods_ir | None¶
- clusterSizeZ() _ods_ir | None¶
- kernelOperands() _ods_ir¶
- asyncObject() _ods_ir | None¶
- kernel() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.launch_func(async_token, async_dependencies, kernel, grid_size_x, grid_size_y, grid_size_z, block_size_x, block_size_y, block_size_z, kernel_operands, *, cluster_size_x=None, cluster_size_y=None, cluster_size_z=None, dynamic_shared_memory_size=None, async_object=None, loc=None, ip=None) _ods_ir | _ods_ir | LaunchFuncOp¶
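The rule that unused grid and block sizes must be explicitly set to 1 can be captured before any operands are built. The helper below is a hypothetical pure-Python sketch (`pad_launch_sizes` is an illustrative name, not part of the bindings); it only normalizes a 1-D, 2-D, or 3-D size tuple to the three values the operation expects.

```python
def pad_launch_sizes(sizes):
    """Pad a 1-, 2-, or 3-element size sequence to (x, y, z) with trailing 1s."""
    sizes = tuple(sizes)
    if not 1 <= len(sizes) <= 3:
        raise ValueError("expected between one and three sizes")
    if any(s < 1 for s in sizes):
        raise ValueError("sizes must be positive")
    # Unused dimensions are set to 1, as the operation requires.
    return sizes + (1,) * (3 - len(sizes))


grid = pad_launch_sizes((256,))   # 1-D grid
block = pad_launch_sizes((8, 4))  # 2-D block
print(grid, block)  # (256, 1, 1) (8, 4, 1)
```

The padded values would then be materialized as index constants and passed as the six size operands.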
- class mlir.dialects._gpu_ops_gen.LaunchOp(asyncToken, asyncDependencies, gridSizeX, gridSizeY, gridSizeZ, blockSizeX, blockSizeY, blockSizeZ, *, clusterSizeX=None, clusterSizeY=None, clusterSizeZ=None, dynamicSharedMemorySize=None, module=None, function=None, loc=None, ip=None)¶
Bases: _ods_ir
Launch a kernel on the specified grid of thread blocks. The body of the kernel is defined by the single region that this operation contains. The operation takes an optional list of async dependencies followed by six operands and an optional operand.
The async keyword indicates the kernel should be launched asynchronously; the operation returns a new !gpu.async.token when the keyword is specified. The kernel launched does not start executing until the ops producing its async dependencies (optional operands) have completed.
The first three operands (following any async dependencies) are grid sizes along the x,y,z dimensions and the following three are block sizes along the x,y,z dimensions. When a lower-dimensional kernel is required, unused sizes must be explicitly set to 1. The last operand is optional and corresponds to the amount of dynamic shared memory a kernel’s workgroup should be allocated; when this operand is not present, a zero size is assumed.
The body region has at least twelve arguments, or eighteen if cluster dimensions are present, grouped as follows:
* three optional arguments that contain cluster identifiers along x,y,z dimensions;
* three arguments that contain block identifiers along x,y,z dimensions;
* three arguments that contain thread identifiers along x,y,z dimensions;
* operands of the gpu.launch operation as is (i.e. the operands for grid and block sizes).
* a variadic number of Workgroup memory attributions.
* a variadic number of Private memory attributions.
The function and module attributes are optional and specify the kernel name and a module in which the kernel should be outlined.
Syntax:
operation ::= `gpu.launch` (`async` (`[` ssa-id-list `]`)? )?
                           ( `clusters` `(` ssa-id-list `)` `in` ssa-reassignment )?
                           `blocks` `(` ssa-id-list `)` `in` ssa-reassignment
                           `threads` `(` ssa-id-list `)` `in` ssa-reassignment
                           (dynamic_shared_memory_size ssa-use)?
                           (`module(` symbol-ref-id `)`)?
                           (`function(` symbol-ref-id `)`)?
                           memory-attribution
                           region attr-dict?
ssa-reassignment ::= `(` ssa-id `=` ssa-use (`,` ssa-id `=` ssa-use)* `)`
memory-attribution ::= (`workgroup` `(` ssa-id-and-type-list `)`)?
                       (`private` `(` ssa-id-and-type-list `)`)?
Example:
gpu.launch blocks(%bx, %by, %bz) in (%sz_bx = %0, %sz_by = %1, %sz_bz = %2)
           threads(%tx, %ty, %tz) in (%sz_tx = %3, %sz_ty = %4, %sz_tz = %5) {
  // Block and thread identifiers, as well as block/grid sizes are
  // immediately usable inside body region.
  "some_op"(%bx, %tx) : (index, index) -> ()
  // Assuming %val1 is defined outside the gpu.launch region.
  %42 = load %val1[%bx] : memref<?xf32, 1>
}
// Generic syntax explains how the pretty syntax maps to the IR structure.
"gpu.launch"(%cst, %cst, %c1,  // Grid sizes.
             %cst, %c1, %c1)   // Block sizes.
    {/*attributes*/}
    // All sizes and identifiers have "index" size.
    : (index, index, index, index, index, index) -> () {
// The operation passes block and thread identifiers, followed by grid and
// block sizes.
^bb0(%bx : index, %by : index, %bz : index,
     %tx : index, %ty : index, %tz : index,
     %num_bx : index, %num_by : index, %num_bz : index,
     %num_tx : index, %num_ty : index, %num_tz : index)
  "some_op"(%bx, %tx) : (index, index) -> ()
  %3 = "memref.load"(%val1, %bx) : (memref<?xf32, 1>, index) -> f32
}
// Launch with memory attributions.
gpu.launch blocks(%bx, %by, %bz) in (%sz_bx = %0, %sz_by = %1, %sz_bz = %2)
           threads(%tx, %ty, %tz) in (%sz_tx = %3, %sz_ty = %4, %sz_tz = %5)
           workgroup(%workgroup: memref<32xf32, 3>)
           private(%private: memref<1xf32, 5>) {
  // Block and thread identifiers, as well as block/grid sizes are
  // immediately usable inside body region.
  "some_op"(%bx, %tx) : (index, index) -> ()
  // Assuming %val1 is defined outside the gpu.launch region.
  %42 = load %workgroup[%bx] : memref<32xf32, 3>
}
// Launch with clusters.
gpu.launch clusters(%cx, %cy, %cz) in (%sz_cx = %0, %sz_cy = %1, %sz_cz = %2)
           blocks(%bx, %by, %bz) in (%sz_bx = %3, %sz_by = %4, %sz_bz = %5)
           threads(%tx, %ty, %tz) in (%sz_tx = %6, %sz_ty = %7, %sz_tz = %8) {
  // Cluster, block and thread identifiers, as well as cluster/block/grid
  // sizes are immediately usable inside body region.
  "some_op"(%cx, %bx, %tx) : (index, index, index) -> ()
}
// Launch with module and function attributes.
gpu.launch blocks(%bx, %by, %bz) in (%sz_bx = %0, %sz_by = %1, %sz_bz = %2)
           threads(%tx, %ty, %tz) in (%sz_tx = %3, %sz_ty = %4, %sz_tz = %5)
           module(@kernel_module) function(@kernel_func) {
  "some_op"(%bx, %tx) : (index, index) -> ()
  %42 = load %val1[%bx] : memref<?xf32, 1>
}
Rationale: using operation/block arguments gives analyses a clear way of understanding that a value has additional semantics (e.g., we will need to know what value corresponds to threadIdx.x for coalescing). We can recover these properties by analyzing the operations producing values, but it is easier just to have that information by construction.
- OPERATION_NAME = 'gpu.launch'¶
- _ODS_OPERAND_SEGMENTS¶
- _ODS_REGIONS = (1, True)¶
- asyncDependencies() _ods_ir¶
- gridSizeX() _ods_ir¶
- gridSizeY() _ods_ir¶
- gridSizeZ() _ods_ir¶
- blockSizeX() _ods_ir¶
- blockSizeY() _ods_ir¶
- blockSizeZ() _ods_ir¶
- clusterSizeX() _ods_ir | None¶
- clusterSizeY() _ods_ir | None¶
- clusterSizeZ() _ods_ir | None¶
- module() _ods_ir | None¶
- function() _ods_ir | None¶
- asyncToken() _ods_ir | None¶
- body() _ods_ir¶
- mlir.dialects._gpu_ops_gen.launch(async_token, async_dependencies, grid_size_x, grid_size_y, grid_size_z, block_size_x, block_size_y, block_size_z, *, cluster_size_x=None, cluster_size_y=None, cluster_size_z=None, dynamic_shared_memory_size=None, module=None, function=None, loc=None, ip=None) _ods_ir | _ods_ir | LaunchOp¶
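The body-region argument count described for gpu.launch (twelve, or eighteen with clusters, plus any memory attributions) is easy to get wrong when building the entry block by hand. The following is a hypothetical pure-Python sketch that only reproduces that arithmetic; `launch_body_arg_count` is an illustrative name, and the sketch deliberately says nothing about the order of the arguments.

```python
def launch_body_arg_count(has_clusters=False,
                          num_workgroup_attributions=0,
                          num_private_attributions=0):
    """Number of block arguments expected in the gpu.launch body region.

    12 = 3 block ids + 3 thread ids + 3 grid sizes + 3 block sizes;
    clusters add 3 cluster ids + 3 cluster sizes, for 18 in total.
    """
    if num_workgroup_attributions < 0 or num_private_attributions < 0:
        raise ValueError("attribution counts must be non-negative")
    base = 18 if has_clusters else 12
    return base + num_workgroup_attributions + num_private_attributions


# The "launch with memory attributions" example has one workgroup and one
# private attribution on top of the twelve id/size arguments.
print(launch_body_arg_count(num_workgroup_attributions=1,
                            num_private_attributions=1))  # 14
```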
- class mlir.dialects._gpu_ops_gen.MemcpyOp(asyncToken, asyncDependencies, dst, src, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.memcpy operation copies the content of one memref to another.
The op does not execute before all async dependencies have finished executing.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token.
Example:
%token = gpu.memcpy async [%dep] %dst, %src : memref<?xf32, 1>, memref<?xf32>
- OPERATION_NAME = 'gpu.memcpy'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- dst() _ods_ir¶
- src() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.memcpy(async_token, async_dependencies, dst, src, *, loc=None, ip=None) _ods_ir | _ods_ir | MemcpyOp¶
- class mlir.dialects._gpu_ops_gen.MemsetOp(asyncToken, asyncDependencies, dst, value, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.memset operation sets the content of a memref to a scalar value.
The op does not execute before all async dependencies have finished executing.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token.
Example:
%token = gpu.memset async [%dep] %dst, %value : memref<?xf32, 1>, f32
- OPERATION_NAME = 'gpu.memset'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- dst() _ods_ir¶
- value() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.memset(async_token, async_dependencies, dst, value, *, loc=None, ip=None) _ods_ir | _ods_ir | MemsetOp¶
- class mlir.dialects._gpu_ops_gen.NumSubgroupsOp(*, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the number of subgroups within a workgroup.
Example:
%numSg = gpu.num_subgroups : index
If upper_bound is set, executions with more than upper_bound subgroups per workgroup cause undefined behavior. There is a default upper bound of kMaxDim (currently uint32_t::max).
- OPERATION_NAME = 'gpu.num_subgroups'¶
- _ODS_REGIONS = (0, True)¶
- upper_bound() _ods_ir | None¶
- result() _ods_ir¶
Shortcut to get an op result if it has only one (throws an error otherwise).
- mlir.dialects._gpu_ops_gen.num_subgroups(*, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.PrintfOp(format, args, *, loc=None, ip=None)¶
Bases: _ods_ir
gpu.printf takes a literal format string format and an arbitrary number of scalar arguments that should be printed.
The format string is a C-style printf string, subject to any restrictions imposed by one’s target platform.
- OPERATION_NAME = 'gpu.printf'¶
- _ODS_REGIONS = (0, True)¶
- args() _ods_ir¶
- format() _ods_ir¶
- class mlir.dialects._gpu_ops_gen.ReturnOp(operands_, *, loc=None, ip=None)¶
Bases: _ods_ir
A terminator operation for regions that appear in the body of gpu.func functions. The operands to the gpu.return are the result values returned by an invocation of the gpu.func.
- OPERATION_NAME = 'gpu.return'¶
- _ODS_REGIONS = (0, True)¶
- operands_() _ods_ir¶
- class mlir.dialects._gpu_ops_gen.RotateOp(value, offset, width, *, results=None, loc=None, ip=None)¶
Bases: _ods_ir
The “rotate” op moves values across lanes (a.k.a. local invocations) within the same subgroup. The width attribute specifies the number of lanes that participate in the rotation, and must be uniform across all participating lanes. Further, the first width lanes of the subgroup must be active.
width must be a power of two, and offset must be in the range [0, width).
Return the rotateResult of the invocation whose id within the group is calculated as follows:
Invocation ID = ((LaneId + offset) & (width - 1)) + (LaneId & ~(width - 1))
Returns the rotateResult and true if the current lane id is smaller than width, and poison value and false otherwise.
Example:
%1, %2 = gpu.rotate %0, 1, 16 : f32
For lane k, returns the value from lane (k + 1) % 16.
- OPERATION_NAME = 'gpu.rotate'¶
- _ODS_REGIONS = (0, True)¶
- value() _ods_ir¶
- offset() _ods_ir¶
- width() _ods_ir¶
- rotateResult() _ods_ir¶
- valid() _ods_ir¶
- mlir.dialects._gpu_ops_gen.rotate(value, offset, width, *, results=None, loc=None, ip=None) _ods_ir¶
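The invocation-id formula given for gpu.rotate can be checked on the host with a small simulation. This is a pure-Python sketch of the documented semantics, not the bindings' API: `rotate_source_lane` and `simulate_rotate` are hypothetical names, and `None` stands in for a poison value.

```python
def rotate_source_lane(lane_id: int, offset: int, width: int) -> int:
    """Lane whose value `lane_id` receives, per the documented formula."""
    if width <= 0 or width & (width - 1):
        raise ValueError("width must be a power of two")
    if not 0 <= offset < width:
        raise ValueError("offset must be in [0, width)")
    return ((lane_id + offset) & (width - 1)) + (lane_id & ~(width - 1))


def simulate_rotate(values, offset, width):
    """values[i] is the value held by lane i; returns (result, valid) per lane."""
    if len(values) < width:
        raise ValueError("the first `width` lanes must be present")
    results = []
    for lane in range(len(values)):
        if lane < width:
            results.append((values[rotate_source_lane(lane, offset, width)], True))
        else:
            results.append((None, False))  # None models poison
    return results


print(simulate_rotate(list(range(8)), offset=1, width=4))
```

For lanes below `width` the source lane reduces to `(lane + offset) % width`, which matches the closing remark of the description.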
- class mlir.dialects._gpu_ops_gen.SDDMMBufferSizeOp(bufferSz, asyncToken, asyncDependencies, dnmatA, dnmatB, spmatC, computeType, *, modeA=None, modeB=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.sddmm_buffer_size operation returns the buffer size required to perform the SDDMM operation on the given sparse and dense matrices. The operation expects handles returned by previous sparse operations to construct an environment and the operands for SDDMM.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%buffersz, %token = gpu.sddmm_buffer_size async [%dep] %dnmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %spmatC into f32
The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
- OPERATION_NAME = 'gpu.sddmm_buffer_size'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- dnmatA() _ods_ir¶
- dnmatB() _ods_ir¶
- spmatC() _ods_ir¶
- modeA() _ods_ir¶
- modeB() _ods_ir¶
- computeType() _ods_ir¶
- bufferSz() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.sddmm_buffer_size(buffer_sz, async_token, async_dependencies, dnmat_a, dnmat_b, spmat_c, compute_type, *, mode_a=None, mode_b=None, loc=None, ip=None) _ods_ir | _ods_ir | SDDMMBufferSizeOp¶
- class mlir.dialects._gpu_ops_gen.SDDMMOp(asyncToken, asyncDependencies, dnmatA, dnmatB, spmatC, computeType, buffer, *, modeA=None, modeB=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.sddmm operation performs the SDDMM operation on the given sparse and dense matrices, and buffer. The operation expects handles returned by previous sparse operations to construct an environment and the operands for SDDMM. The buffer must have been allocated on the device.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%token = gpu.sddmm async [%dep] %dnmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %spmatC, %buffer into f32
The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
- OPERATION_NAME = 'gpu.sddmm'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- dnmatA() _ods_ir¶
- dnmatB() _ods_ir¶
- spmatC() _ods_ir¶
- buffer() _ods_ir¶
- modeA() _ods_ir¶
- modeB() _ods_ir¶
- computeType() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.sddmm(async_token, async_dependencies, dnmat_a, dnmat_b, spmat_c, compute_type, buffer, *, mode_a=None, mode_b=None, loc=None, ip=None) _ods_ir | _ods_ir | SDDMMOp¶
- class mlir.dialects._gpu_ops_gen.SetCsrPointersOp(asyncToken, asyncDependencies, spmat, positions, coordinates, values, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.set_csr_pointers assigns the given positions, coordinates, and values buffers that reside on the device directly to the given sparse matrix descriptor in CSR format.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%token = gpu.set_csr_pointers async [%dep] %positions, %coordinates, %values : memref<?xf32>, memref<?xindex>, memref<?xindex>
- OPERATION_NAME = 'gpu.set_csr_pointers'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- spmat() _ods_ir¶
- positions() _ods_ir¶
- coordinates() _ods_ir¶
- values() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.set_csr_pointers(async_token, async_dependencies, spmat, positions, coordinates, values, *, loc=None, ip=None) _ods_ir | _ods_ir | SetCsrPointersOp¶
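To make the roles of the three buffers concrete, the sketch below builds CSR positions, coordinates, and values arrays from a small dense matrix on the host. It is a pure-Python illustration of the CSR layout only; `dense_to_csr` is a hypothetical helper and does not touch any device memory or MLIR API.

```python
def dense_to_csr(dense):
    """Return (positions, coordinates, values) for a dense row-major matrix.

    positions[r]..positions[r + 1] delimits row r inside coordinates/values;
    coordinates holds column indices; values holds the non-zero entries.
    """
    positions = [0]
    coordinates = []
    values = []
    for row in dense:
        for col, entry in enumerate(row):
            if entry != 0:
                coordinates.append(col)
                values.append(entry)
        positions.append(len(values))
    return positions, coordinates, values


matrix = [
    [1, 0, 2],
    [0, 0, 3],
    [4, 5, 0],
]
positions, coordinates, values = dense_to_csr(matrix)
print(positions)    # [0, 2, 3, 5]
print(coordinates)  # [0, 2, 2, 0, 1]
print(values)       # [1, 2, 3, 4, 5]
```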
- class mlir.dialects._gpu_ops_gen.SetDefaultDeviceOp(devIndex, *, loc=None, ip=None)¶
Bases: _ods_ir
Operation that sets the current default GPU, using a zero-based index into the set of GPUs on the system. The default GPU setting may be thread-local.
- OPERATION_NAME = 'gpu.set_default_device'¶
- _ODS_REGIONS = (0, True)¶
- devIndex() _ods_ir¶
- mlir.dialects._gpu_ops_gen.set_default_device(dev_index, *, loc=None, ip=None) SetDefaultDeviceOp¶
- class mlir.dialects._gpu_ops_gen.ShuffleOp(value, offset, width, mode, *, results=None, loc=None, ip=None)¶
Bases: _ods_ir
The “shuffle” op moves values across lanes (a.k.a. local invocations) within the same subgroup. The width argument specifies the number of lanes that participate in the shuffle, and must be uniform across all lanes. Further, the first width lanes of the subgroup must be active.
The interpretation of the offset argument depends on the selected mode.
Returns the shuffleResult and true if the current lane id is smaller than width, and an unspecified value and false otherwise.
xor example:
%1, %2 = gpu.shuffle xor %0, %offset, %width : f32
For lane k, returns the value %0 from lane k ^ offset. Every lane trades value with exactly one other lane.
down example:
%cst1 = arith.constant 1 : i32
%3, %4 = gpu.shuffle down %0, %cst1, %width : f32
For lane k, returns the value from lane (k + cst1). If (k + cst1) is bigger than or equal to width, the value is poison and valid is false.
up example:
%cst1 = arith.constant 1 : i32
%5, %6 = gpu.shuffle up %0, %cst1, %width : f32
For lane k, returns the value from lane (k - cst1). If (k - cst1) is smaller than 0, the value is poison and valid is false.
idx example:
%cst0 = arith.constant 0 : i32
%7, %8 = gpu.shuffle idx %0, %cst0, %width : f32
Broadcasts the value from lane 0 to all lanes.
- OPERATION_NAME = 'gpu.shuffle'¶
- _ODS_REGIONS = (0, True)¶
- value() _ods_ir¶
- offset() _ods_ir¶
- width() _ods_ir¶
- mode() _ods_ir¶
- shuffleResult() _ods_ir¶
- valid() _ods_ir¶
- mlir.dialects._gpu_ops_gen.shuffle(value, offset, width, mode, *, results=None, loc=None, ip=None) _ods_ir¶
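The four shuffle modes can be compared side by side with a host-side simulation. This is a simplified pure-Python sketch of the semantics described above, not the bindings' API: `simulate_shuffle` is a hypothetical name, `None` stands in for the poison/unspecified value, and treating every out-of-range source lane as invalid (including for xor and idx) is an assumption of the sketch.

```python
def simulate_shuffle(values, offset, width, mode):
    """values[i] is the value held by lane i; returns (result, valid) per lane."""
    if len(values) < width:
        raise ValueError("the first `width` lanes must be present")
    results = []
    for lane in range(len(values)):
        if mode == "xor":
            source = lane ^ offset
        elif mode == "down":
            source = lane + offset
        elif mode == "up":
            source = lane - offset
        elif mode == "idx":
            source = offset
        else:
            raise ValueError(f"unknown shuffle mode: {mode}")
        if lane < width and 0 <= source < width:
            results.append((values[source], True))
        else:
            results.append((None, False))  # None models poison/unspecified
    return results


lanes = [10, 11, 12, 13]
for mode in ("xor", "down", "up", "idx"):
    off = 0 if mode == "idx" else 1
    print(mode, simulate_shuffle(lanes, off, 4, mode))
```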
- class mlir.dialects._gpu_ops_gen.SpGEMMCopyOp(asyncToken, asyncDependencies, desc, spmatA, spmatB, spmatC, computeType, *, modeA=None, modeB=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.spgemm_copy operation copies the sparse matrix result of a SpGEMM computation.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
gpu.spgemm_copy %spmatA, %spmatB, %spmatC, %spgemmDesc: f32
The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
- OPERATION_NAME = 'gpu.spgemm_copy'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- desc() _ods_ir¶
- spmatA() _ods_ir¶
- spmatB() _ods_ir¶
- spmatC() _ods_ir¶
- modeA() _ods_ir¶
- modeB() _ods_ir¶
- computeType() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.spgemm_copy(async_token, async_dependencies, desc, spmat_a, spmat_b, spmat_c, compute_type, *, mode_a=None, mode_b=None, loc=None, ip=None) _ods_ir | _ods_ir | SpGEMMCopyOp¶
- class mlir.dialects._gpu_ops_gen.SpGEMMCreateDescrOp(desc, asyncToken, asyncDependencies, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.spgemm_create_descr creates a descriptor for the SpGEMM operation. The descriptor describes the SpGEMM operation and stores the internal data throughout the computation. It needs to be passed as an argument to spgemm_* operations.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%desc, %token = gpu.spgemm_create_descr async [%dep]
- OPERATION_NAME = 'gpu.spgemm_create_descr'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- desc() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.spgemm_create_descr(desc, async_token, async_dependencies, *, loc=None, ip=None) _ods_ir | _ods_ir | SpGEMMCreateDescrOp¶
- class mlir.dialects._gpu_ops_gen.SpGEMMDestroyDescrOp(asyncToken, asyncDependencies, desc, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.spgemm_destroy_descr destroys the SpGEMM operation descriptor.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%token = gpu.spgemm_destroy_descr async [%dep] %desc
- OPERATION_NAME = 'gpu.spgemm_destroy_descr'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- desc() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.spgemm_destroy_descr(async_token, async_dependencies, desc, *, loc=None, ip=None) _ods_ir | _ods_ir | SpGEMMDestroyDescrOp¶
- class mlir.dialects._gpu_ops_gen.SpGEMMWorkEstimationOrComputeOp(bufferSzNew, asyncToken, asyncDependencies, desc, spmatA, spmatB, spmatC, computeType, bufferSz, buffer, kind, *, modeA=None, modeB=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.spgemm_work_estimation_or_compute is used to call cusparseSpGEMM_workEstimation or cusparseSpGEMM_compute. Both calls serve to determine the buffer size and to perform the actual computation. The operation expects handles returned by previous sparse operations to construct an environment and the operands for SpGEMM. The buffer must have been allocated on the device.
C’ = alpha * op(A) * op(B) + beta * C
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%bufferSz, %token = gpu.spgemm_work_estimation_or_compute async [%dep] {COMPUTE} %desc, %spmatA{NON_TRANSPOSE}, %spmatB{NON_TRANSPOSE}, %spmatC, %spgemmDesc, %c0, %alloc: f32 into memref<0xi8>
The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
- OPERATION_NAME = 'gpu.spgemm_work_estimation_or_compute'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- desc() _ods_ir¶
- spmatA() _ods_ir¶
- spmatB() _ods_ir¶
- spmatC() _ods_ir¶
- bufferSz() _ods_ir¶
- buffer() _ods_ir¶
- modeA() _ods_ir¶
- modeB() _ods_ir¶
- computeType() _ods_ir¶
- kind() _ods_ir¶
- bufferSzNew() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.spgemm_work_estimation_or_compute(buffer_sz_new, async_token, async_dependencies, desc, spmat_a, spmat_b, spmat_c, compute_type, buffer_sz, buffer, kind, *, mode_a=None, mode_b=None, loc=None, ip=None) _ods_ir | _ods_ir | SpGEMMWorkEstimationOrComputeOp¶
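The formula C’ = alpha * op(A) * op(B) + beta * C, together with the NON_TRANSPOSE, TRANSPOSE, and CONJUGATE_TRANSPOSE modes, can be written out as a dense reference computation. The sketch below is a pure-Python illustration on nested lists, useful only for checking small results by hand; `apply_mode` and `gemm_reference` are hypothetical names, and `alpha` and `beta` are parameters of the formula in this sketch, not operands of the op.

```python
def apply_mode(matrix, mode):
    """Return op(matrix) for one of the three documented modes."""
    if mode == "NON_TRANSPOSE":
        return [list(row) for row in matrix]
    if mode == "TRANSPOSE":
        return [list(col) for col in zip(*matrix)]
    if mode == "CONJUGATE_TRANSPOSE":
        return [[entry.conjugate() for entry in col] for col in zip(*matrix)]
    raise ValueError(f"unknown mode: {mode}")


def gemm_reference(a, b, c, alpha=1, beta=0,
                   mode_a="NON_TRANSPOSE", mode_b="NON_TRANSPOSE"):
    """Dense reference for C' = alpha * op(A) * op(B) + beta * C."""
    op_a = apply_mode(a, mode_a)
    op_b = apply_mode(b, mode_b)
    inner = len(op_b)
    if any(len(row) != inner for row in op_a):
        raise ValueError("inner dimensions of op(A) and op(B) do not match")
    return [
        [alpha * sum(op_a[i][k] * op_b[k][j] for k in range(inner)) + beta * c[i][j]
         for j in range(len(op_b[0]))]
        for i in range(len(op_a))
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
print(gemm_reference(a, b, c, alpha=1, beta=1))  # [[20, 23], [44, 51]]
```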
- class mlir.dialects._gpu_ops_gen.SpMMBufferSizeOp(bufferSzs, asyncToken, asyncDependencies, spmatA, dnmatB, dnmatC, computeType, *, modeA=None, modeB=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.spmm_buffer_size operation returns the buffer size required to perform the SpMM operation on the given sparse and dense matrix. The operation expects handles returned by previous sparse operations to construct an environment and the operands for SpMM.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
Example:
%bufferszs, %token = gpu.spmm_buffer_size async [%dep] %spmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %dnmatC : i64 into f32
- OPERATION_NAME = 'gpu.spmm_buffer_size'¶
- _ODS_RESULT_SEGMENTS¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- spmatA() _ods_ir¶
- dnmatB() _ods_ir¶
- dnmatC() _ods_ir¶
- modeA() _ods_ir¶
- modeB() _ods_ir¶
- computeType() _ods_ir¶
- bufferSzs() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.spmm_buffer_size(buffer_szs, async_token, async_dependencies, spmat_a, dnmat_b, dnmat_c, compute_type, *, mode_a=None, mode_b=None, loc=None, ip=None) _ods_ir | _ods_ir | SpMMBufferSizeOp¶
- class mlir.dialects._gpu_ops_gen.SpMMOp(asyncToken, asyncDependencies, spmatA, dnmatB, dnmatC, computeType, buffers, *, modeA=None, modeB=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.spmm operation performs the SpMM operation on the given sparse and dense matrix, and buffer. The operation expects handles returned by previous sparse operations to construct an environment and the operands for SpMM. The buffer must have been allocated on the device.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
Example:
%token = gpu.spmm async [%dep] %spmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %dnmatC, %buffers : type($buffers) into f32
- OPERATION_NAME = 'gpu.spmm'¶
- _ODS_OPERAND_SEGMENTS¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- spmatA() _ods_ir¶
- dnmatB() _ods_ir¶
- dnmatC() _ods_ir¶
- buffers() _ods_ir¶
- modeA() _ods_ir¶
- modeB() _ods_ir¶
- computeType() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.spmm(async_token, async_dependencies, spmat_a, dnmat_b, dnmat_c, compute_type, buffers, *, mode_a=None, mode_b=None, loc=None, ip=None) _ods_ir | _ods_ir | SpMMOp¶
- class mlir.dialects._gpu_ops_gen.SpMVBufferSizeOp(bufferSz, asyncToken, asyncDependencies, spmatA, dnX, dnY, computeType, *, modeA=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.spmv_buffer_size operation returns the buffer size required to perform the SpMV operation on the given sparse matrix and dense vectors. The operation expects handles returned by previous sparse operations to construct an environment and the operands for SpMV.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
Example:
%buffersz, %token = gpu.spmv_buffer_size async [%dep] %spmatA{TRANSPOSE}, %dnX, %dnY into f32
- OPERATION_NAME = 'gpu.spmv_buffer_size'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- spmatA() _ods_ir¶
- dnX() _ods_ir¶
- dnY() _ods_ir¶
- modeA() _ods_ir¶
- computeType() _ods_ir¶
- bufferSz() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.spmv_buffer_size(buffer_sz, async_token, async_dependencies, spmat_a, dn_x, dn_y, compute_type, *, mode_a=None, loc=None, ip=None) _ods_ir | _ods_ir | SpMVBufferSizeOp¶
- class mlir.dialects._gpu_ops_gen.SpMVOp(asyncToken, asyncDependencies, spmatA, dnX, dnY, computeType, buffer, *, modeA=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.spmv operation performs the SpMV operation on the given sparse matrix, dense vectors, and buffer. The operation expects handles returned by previous sparse operations to construct an environment and the operands for SpMV. The buffer must have been allocated on the device.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
Example:
%token = gpu.spmv async [%dep] %spmatA{TRANSPOSE}, %dnX, %dnY : memref<?xf64> into bf16
- OPERATION_NAME = 'gpu.spmv'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- spmatA() _ods_ir¶
- dnX() _ods_ir¶
- dnY() _ods_ir¶
- buffer() _ods_ir¶
- modeA() _ods_ir¶
- computeType() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.spmv(async_token, async_dependencies, spmat_a, dn_x, dn_y, compute_type, buffer, *, mode_a=None, loc=None, ip=None) _ods_ir | _ods_ir | SpMVOp¶
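As a host-side reference for what an SpMV computes, the following pure-Python sketch evaluates y = op(A) x for a matrix stored in CSR form, where op is selected by the three mode names listed above. It is not the MLIR Python API. The function name spmv_reference, the CSR argument layout, and the omission of any scaling factors a vendor library may apply are assumptions made for illustration.

```python
def spmv_reference(rows, cols, row_ptr, col_idx, values, x, mode="NON_TRANSPOSE"):
    """Compute y = op(A) @ x for a CSR matrix A (illustrative model only)."""
    if mode not in ("NON_TRANSPOSE", "TRANSPOSE", "CONJUGATE_TRANSPOSE"):
        raise ValueError(f"unknown mode: {mode}")
    transposed = mode != "NON_TRANSPOSE"
    # op(A) has `cols` rows when A is transposed, `rows` rows otherwise.
    y = [0] * (cols if transposed else rows)
    for r in range(rows):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            c, v = col_idx[k], values[k]
            if mode == "CONJUGATE_TRANSPOSE":
                v = v.conjugate()
            if transposed:
                y[c] += v * x[r]
            else:
                y[r] += v * x[c]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0]]
ROW_PTR, COL_IDX, VALUES = [0, 2, 3], [0, 2, 1], [1, 2, 3]
```

With this layout, rows, cols, and len(VALUES) correspond to the three quantities that gpu.spmat_get_size reports for a sparse matrix handle.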
- class mlir.dialects._gpu_ops_gen.SpMatGetSizeOp(rows, cols, nnz, asyncToken, asyncDependencies, spmat, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.spmat_get_size operation retrieves the number of rows, number of columns, and number of non-zero elements of a sparse matrix.
If the async keyword is present, the op is executed asynchronously (i.e. it does not block until the execution has finished on the device). In that case, it returns a !gpu.async.token in addition to the environment.
Example:
%rows, %cols, %nnz, %token = gpu.spmat_get_size async [%dep] %spmatC
- OPERATION_NAME = 'gpu.spmat_get_size'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- spmat() _ods_ir¶
- rows() _ods_ir¶
- cols() _ods_ir¶
- nnz() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.spmat_get_size(rows, cols, nnz, async_token, async_dependencies, spmat, *, loc=None, ip=None) _ods_ir | _ods_ir | SpMatGetSizeOp¶
- class mlir.dialects._gpu_ops_gen.SubgroupBroadcastOp(src, broadcast_type, *, lane=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Broadcasts a value from one lane to all active lanes in a subgroup. The result is guaranteed to be uniform across the active lanes in the subgroup.
The possible broadcast types are:
* first_active_lane - broadcasts the value from the first active lane in the subgroup.
* specific_lane - broadcasts from the specified lane. The lane index must be uniform and within the subgroup size. The result is poison if the lane index is invalid, non subgroup-uniform, or if the source lane is not active.
- OPERATION_NAME = 'gpu.subgroup_broadcast'¶
- _ODS_REGIONS = (0, True)¶
- src() _ods_ir¶
- lane() _ods_ir | None¶
- broadcast_type() _ods_ir¶
- result() _ods_ir¶
Shortcut to get an op result if it has only one (throws an error otherwise).
- mlir.dialects._gpu_ops_gen.subgroup_broadcast(src, broadcast_type, *, lane=None, results=None, loc=None, ip=None) _ods_ir¶
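The two broadcast types can be modelled on the host with a small pure-Python sketch. It is not part of the bindings. The function name, the POISON sentinel, and the use of None for inactive lanes are assumptions for illustration, and the requirement that the lane index be subgroup-uniform is modelled by passing a single lane value.

```python
POISON = object()  # stand-in for an MLIR poison value (illustrative)


def subgroup_broadcast_model(values, active, broadcast_type, lane=None):
    """Return per-lane results; inactive lanes are reported as None."""
    if broadcast_type == "first_active_lane":
        src = next((i for i, is_active in enumerate(active) if is_active), None)
    elif broadcast_type == "specific_lane":
        in_range = lane is not None and 0 <= lane < len(values)
        # An out-of-range or inactive source lane yields poison.
        src = lane if in_range and active[lane] else None
    else:
        raise ValueError(f"unknown broadcast type: {broadcast_type}")
    result = POISON if src is None else values[src]
    return [result if is_active else None for is_active in active]
```

Every active lane receives the same object, which mirrors the uniformity guarantee stated above.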
- class mlir.dialects._gpu_ops_gen.SubgroupIdOp(*, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the subgroup id, i.e., the index of the current subgroup within the workgroup.
Example:
%sgId = gpu.subgroup_id : index
Executions where there are more than upper_bound subgroups per workgroup cause undefined behavior. There is an implicit upper bound of kMaxDim (currently uint32_t::max).
- OPERATION_NAME = 'gpu.subgroup_id'¶
- _ODS_REGIONS = (0, True)¶
- upper_bound() _ods_ir | None¶
- result() _ods_ir¶
Shortcut to get an op result if it has only one (throws an error otherwise).
- mlir.dialects._gpu_ops_gen.subgroup_id(*, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.SubgroupMmaComputeOp(opA, opB, opC, *, a_transpose=None, b_transpose=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.subgroup_mma_compute operation performs a matrix-multiply accumulate (mma) operation using all the threads in a subgroup.
This operation takes three !gpu.mma_matrix values as arguments: these hold the A, B and C operands for the mma operation. The operation performed is represented as C += A * B. The op returns a !gpu.mma_matrix which contains the result of the operation held by all threads in a subgroup. a_transpose or b_transpose, if present, signify that the respective operand was loaded in a transposed manner. The transpose operands are required to map to the correct underlying intrinsics, but they currently do not seem to affect correctness even if they are absent, given that the operands were loaded correctly using the transpose attribute in the gpu.subgroup_mma_load_matrix op.
For integer types, the A and B matrices carry their signedness with their types. The accumulator type is expected to be signless and imply a signed integer with a greater width than the other two operands.
This op is meant to be used along with the gpu.subgroup_mma_store_matrix and gpu.subgroup_mma_load_matrix ops.
Example:
%D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">
- OPERATION_NAME = 'gpu.subgroup_mma_compute'¶
- _ODS_REGIONS = (0, True)¶
- opA() _ods_ir¶
- opB() _ods_ir¶
- opC() _ods_ir¶
- a_transpose() bool¶
- b_transpose() bool¶
- res() _ods_ir¶
- mlir.dialects._gpu_ops_gen.subgroup_mma_compute(op_a, op_b, op_c, *, a_transpose=None, b_transpose=None, results=None, loc=None, ip=None) _ods_ir¶
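The arithmetic described for gpu.subgroup_mma_compute, C += A * B with the result returned as a new matrix, can be checked against a plain host-side reference. The sketch below is only that: the name mma_reference is hypothetical, matrices are nested Python lists, and neither the transpose flags nor the distribution of elements across threads is modelled.

```python
def mma_reference(a, b, c):
    """Return D = A * B + C for row-major nested lists (illustrative model)."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a) or len(c) != m or any(len(row) != n for row in c):
        raise ValueError("operand shapes do not agree")
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]
```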
- class mlir.dialects._gpu_ops_gen.SubgroupMmaConstantMatrixOp(res, value, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.subgroup_mma_constant_matrix operation creates a !gpu.mma_matrix with constant elements.
The operation takes a scalar input and returns a !gpu.mma_matrix where each element is equal to the operand constant. The destination mma_matrix type must have an element type equal to the constant type. Since the layout of !gpu.mma_matrix is opaque, this only supports setting all the elements to the same value.
This op is meant to be used along with gpu.subgroup_mma_compute.
Example:
%0 = gpu.subgroup_mma_constant_matrix %a : !gpu.mma_matrix<16x16xf16, "AOp">
%1 = gpu.subgroup_mma_constant_matrix %b : !gpu.mma_matrix<16x16xf32, "COp">
- OPERATION_NAME = 'gpu.subgroup_mma_constant_matrix'¶
- _ODS_REGIONS = (0, True)¶
- value() _ods_ir¶
- res() _ods_ir¶
- mlir.dialects._gpu_ops_gen.subgroup_mma_constant_matrix(res, value, *, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.SubgroupMmaElementwiseOp(res, args, opType, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.subgroup_mma_elementwise operation takes !gpu.mma_matrix inputs and computes a new !gpu.mma_matrix by applying an elementwise operation to each element.
Since the operation is elementwise and the matrix types must match, the matrix elements are processed independently of the matrix layout.
This op is meant to be used along with gpu.subgroup_mma_compute.
Example:
%0 = gpu.subgroup_mma_elementwise addf %A, %B : (!gpu.mma_matrix<16x16xf16, "COp">, !gpu.mma_matrix<16x16xf16, "COp">) -> !gpu.mma_matrix<16x16xf16, "COp">
- OPERATION_NAME = 'gpu.subgroup_mma_elementwise'¶
- _ODS_REGIONS = (0, True)¶
- args() _ods_ir¶
- opType() _ods_ir¶
- res() _ods_ir¶
- mlir.dialects._gpu_ops_gen.subgroup_mma_elementwise(res, args, op_type, *, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.SubgroupMmaExtractThreadLocalOp(matrix, indices, *, results=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.subgroup_mma_extract_thread_local operation extracts a value from a !gpu.mma_matrix that is stored at subgroup level.
This operation takes a !gpu.mma_matrix as its first operand. It is the source matrix across a subgroup. The op returns a scalar value stored in the invocation in the subgroup.
Since matrix is packed into the threads within a subgroup, indices are the indices into the values stored by each thread. That is, an index of 0 (or [0, 0]) does not necessarily refer to the first element of the matrix, but the first element that a particular thread holds.
The mapping of matrix elements to threads is not defined by this operation and may not be defined by some lowerings (such as the lowering to SPIR-V). However, if the size of the subgroup is S, then subgroup_mma_extract_thread_local at each index in [0, (M * N) / S) will have the entire matrix extracted across the subgroup.
Example:
%c0 = arith.constant 0 : index
%val = gpu.subgroup_mma_extract_thread_local %m[%c0] : !gpu.mma_matrix<16x16xf32, "AOp"> -> f32
- OPERATION_NAME = 'gpu.subgroup_mma_extract_thread_local'¶
- _ODS_REGIONS = (0, True)¶
- matrix() _ods_ir¶
- indices() _ods_ir¶
- res() _ods_ir¶
- mlir.dialects._gpu_ops_gen.subgroup_mma_extract_thread_local(matrix, indices, *, results=None, loc=None, ip=None) _ods_ir¶
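The index range [0, (M * N) / S) quoted above is simple arithmetic, sketched below in pure Python. The helper names are hypothetical. Because the real element-to-thread mapping is left undefined by the op, covered_elements uses an explicitly made-up linear layout purely to show that S lanes times (M * N) / S indices touch every element exactly once.

```python
def thread_local_index_count(m, n, subgroup_size):
    """Number of valid thread-local indices for an M x N matrix: (M * N) / S."""
    total = m * n
    if total % subgroup_size != 0:
        raise ValueError("matrix elements must divide evenly across the subgroup")
    return total // subgroup_size


def covered_elements(m, n, subgroup_size):
    """Linear element ids reached by all (lane, index) pairs.

    HYPOTHETICAL layout: lane L, index i -> element L * count + i.
    """
    count = thread_local_index_count(m, n, subgroup_size)
    return sorted(
        lane * count + i for lane in range(subgroup_size) for i in range(count)
    )
```

For a 16x16 matrix and a subgroup of 32 lanes, each lane holds 8 values, so valid indices run from 0 to 7.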
- class mlir.dialects._gpu_ops_gen.SubgroupMmaInsertThreadLocalOp(res, value, matrix, indices, *, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.subgroup_mma_insert_thread_local operation inserts a value into a !gpu.mma_matrix that is stored at subgroup level.
This operation takes a scalar value as its first operand and a !gpu.mma_matrix as its second operand. The op inserts the scalar value into the matrix.
Since matrix is packed into the threads within a subgroup, indices are the indices into the values stored by each thread. That is, an index of 0 (or [0, 0]) does not necessarily refer to the first element of the matrix, but the first element that a particular thread holds.
The mapping of matrix elements to threads is not defined by this operation and may not be defined by some lowerings (such as the lowering to SPIR-V). However, if the size of the subgroup is S, then subgroup_mma_insert_thread_local at each index in [0, (M * N) / S) will have the entire matrix inserted across the subgroup.
The op returns a !gpu.mma_matrix with the updated value.
Example:
%c0 = arith.constant 0 : index
%s0 = gpu.subgroup_mma_insert_thread_local %val, %m[%c0] : f16, !gpu.mma_matrix<16x16xf16, "COp"> -> !gpu.mma_matrix<16x16xf16, "COp">
- OPERATION_NAME = 'gpu.subgroup_mma_insert_thread_local'¶
- _ODS_REGIONS = (0, True)¶
- value() _ods_ir¶
- matrix() _ods_ir¶
- indices() _ods_ir¶
- res() _ods_ir¶
- mlir.dialects._gpu_ops_gen.subgroup_mma_insert_thread_local(res, value, matrix, indices, *, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.SubgroupMmaLoadMatrixOp(res, srcMemref, indices, leadDimension, *, transpose=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.subgroup_mma_load_matrix operation loads a matrix collectively using all the threads in a subgroup.
This operation takes a memref as its first operand: it is the source matrix from which data is to be loaded. The op returns a !gpu.mma_matrix. The source memref can be in global memory or shared memory. The load address is determined using indices. The matrix being loaded into is the result. The leadDimension attribute specifies the leading dimension size of the source matrix which eventually allows the lowering to determine the size of each row. If the transpose attribute is present then the op does a transposed load.
For integer types, the resulting !gpu.mma_matrix type needs to specify the signedness of the data if the matrix type is an A or B operand for gpu.subgroup_mma_compute.
This op is often meant to be used along with gpu.subgroup_mma_store_matrix and gpu.subgroup_mma_compute.
Example:
%0 = gpu.subgroup_mma_load_matrix %src[%i, %j] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "AOp">
- OPERATION_NAME = 'gpu.subgroup_mma_load_matrix'¶
- _ODS_REGIONS = (0, True)¶
- srcMemref() _ods_ir¶
- indices() _ods_ir¶
- leadDimension() _ods_ir¶
- transpose() bool¶
- res() _ods_ir¶
- mlir.dialects._gpu_ops_gen.subgroup_mma_load_matrix(res, src_memref, indices, lead_dimension, *, transpose=None, loc=None, ip=None) _ods_ir¶
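To make the role of leadDimension concrete, the sketch below computes the linear offsets that a tile starting at indices (i, j) would cover. It assumes a contiguous row-major buffer and a non-transposed access; both are illustrative assumptions, and tile_offsets is a hypothetical helper, not something the bindings provide.

```python
def tile_offsets(i, j, tile_rows, tile_cols, lead_dimension):
    """Linear offsets of a tile at (i, j) in a row-major buffer.

    Consecutive rows of the tile are `lead_dimension` elements apart.
    """
    if j + tile_cols > lead_dimension:
        raise ValueError("tile does not fit within one row of the source")
    return [
        [(i + r) * lead_dimension + (j + c) for c in range(tile_cols)]
        for r in range(tile_rows)
    ]
```

For example, with a leading dimension of 32, the row starting at (16, 16) begins at offset 528 and the next row begins 32 elements later, at 560.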
- class mlir.dialects._gpu_ops_gen.SubgroupMmaStoreMatrixOp(src, dstMemref, indices, leadDimension, *, transpose=None, loc=None, ip=None)¶
Bases: _ods_ir
The gpu.subgroup_mma_store_matrix operation stores a matrix collectively using all the threads in a subgroup.
This operation takes a !gpu.mma_matrix and a memref as operands. !gpu.mma_matrix is the source value containing the data to be stored into the destination memref which can be in global or shared memory. The store address is determined using the indices provided. The leadDimension attribute specifies the leading dimension of the destination matrix. If the transpose attribute is present then the op does a transposed store.
This op is often meant to be used along with gpu.subgroup_mma_load_matrix and gpu.subgroup_mma_compute.
Example:
gpu.subgroup_mma_store_matrix %D, %sg[%i, %j] {leadDimension = 32 : index} : !gpu.mma_matrix<16x16xf16, "COp">, memref<32x32xf16, 3>
- OPERATION_NAME = 'gpu.subgroup_mma_store_matrix'¶
- _ODS_REGIONS = (0, True)¶
- src() _ods_ir¶
- dstMemref() _ods_ir¶
- indices() _ods_ir¶
- leadDimension() _ods_ir¶
- transpose() bool¶
- mlir.dialects._gpu_ops_gen.subgroup_mma_store_matrix(src, dst_memref, indices, lead_dimension, *, transpose=None, loc=None, ip=None) SubgroupMmaStoreMatrixOp¶
- class mlir.dialects._gpu_ops_gen.SubgroupReduceOp(value, op, *, uniform=None, cluster_size=None, cluster_stride=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
The subgroup_reduce op reduces the values of lanes (work items) across a subgroup.
The subgroup is divided into clusters starting at lane index 0. Within each cluster, there are size lanes, and the lane index advances by stride. A reduction is done for each cluster in parallel: every lane in the cluster is reduced, and the result is equal for all lanes in the cluster. If size is omitted, there is a single cluster covering the entire subgroup. If stride is omitted, the stride is 1 (the cluster’s lanes are contiguous).
When the reduced value is of a vector type, each vector element is reduced independently. Only 1-d vector types are allowed.
Example:
%1 = gpu.subgroup_reduce add %a : (f32) -> f32
%2 = gpu.subgroup_reduce add %b : (vector<4xf16>) -> vector<4xf16>
%3 = gpu.subgroup_reduce add %c cluster(size = 4) : (f32) -> f32
%4 = gpu.subgroup_reduce add %c cluster(size = 4, stride = 2) : (f32) -> f32
If the uniform flag is set, either none or all lanes of a subgroup need to execute this op in convergence.
The reduction operation must be one of:
* Integer types: add, mul, minui, minsi, maxui, maxsi, and, or, xor
* Floating point types: add, mul, minnumf, maxnumf, minimumf, maximumf
- OPERATION_NAME = 'gpu.subgroup_reduce'¶
- _ODS_REGIONS = (0, True)¶
- value() _ods_ir¶
- op() _ods_ir¶
- uniform() bool¶
- cluster_size() _ods_ir | None¶
- cluster_stride() _ods_ir¶
- result() _ods_ir¶
Shortcut to get an op result if it has only one (throws an error otherwise).
- mlir.dialects._gpu_ops_gen.subgroup_reduce(value, op, *, uniform=None, cluster_size=None, cluster_stride=None, results=None, loc=None, ip=None) _ods_ir¶
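The clustering rule can be explored with a host-side model. The sketch below follows one reading of the description above: lanes are grouped in blocks of size * stride, and within each block the lanes that share the same offset modulo stride form a cluster. The function name and the requirement that the subgroup size be a multiple of size * stride are assumptions for illustration. This is not the bindings' API.

```python
from functools import reduce
import operator


def subgroup_reduce_model(values, op=operator.add, cluster_size=None, cluster_stride=1):
    """Return the per-lane result of a clustered reduction (illustrative)."""
    n = len(values)
    size = n if cluster_size is None else cluster_size
    span = size * cluster_stride  # lanes covered by one block of clusters
    if span == 0 or n % span != 0:
        raise ValueError("subgroup size must be a multiple of size * stride")
    out = [None] * n
    for base in range(0, n, span):
        for offset in range(cluster_stride):
            lanes = [base + offset + k * cluster_stride for k in range(size)]
            total = reduce(op, (values[lane] for lane in lanes))
            for lane in lanes:
                out[lane] = total  # every lane in the cluster sees the same result
    return out
```

With eight lanes holding 0 through 7, cluster(size = 4) reduces lanes {0, 1, 2, 3} and {4, 5, 6, 7}, while cluster(size = 4, stride = 2) reduces lanes {0, 2, 4, 6} and {1, 3, 5, 7}.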
- class mlir.dialects._gpu_ops_gen.SubgroupSizeOp(*, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the number of threads within a subgroup.
Example:
%sgSz = gpu.subgroup_size : index
Executions where the number of threads per subgroup exceeds upper_bound cause undefined behavior. When no upper_bound is specified, range analyses and similar machinery assume the default bound of kMaxSubgroupSize, currently 128.
- OPERATION_NAME = 'gpu.subgroup_size'¶
- _ODS_REGIONS = (0, True)¶
- upper_bound() _ods_ir | None¶
- result() _ods_ir¶
Shortcut to get an op result if it has only one (throws an error otherwise).
- mlir.dialects._gpu_ops_gen.subgroup_size(*, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.TerminatorOp(*, loc=None, ip=None)¶
Bases: _ods_ir
A terminator operation for regions that appear in the body of gpu.launch operation. These regions are not expected to return any value so the terminator takes no operands.
- OPERATION_NAME = 'gpu.terminator'¶
- _ODS_REGIONS = (0, True)¶
- mlir.dialects._gpu_ops_gen.terminator(*, loc=None, ip=None) TerminatorOp¶
- class mlir.dialects._gpu_ops_gen.ThreadIdOp(dimension, *, upper_bound=None, results=None, loc=None, ip=None)¶
Bases: _ods_ir
Returns the thread id, i.e. the index of the current thread within the block along the x, y, or z dimension.
Example:
%tIdX = gpu.thread_id x
If upper_bound is set, or if one can be inferred from known_block_size-type annotations in context, executions where the thread index would be greater than or equal to that bound cause undefined behavior.
There is an implicit upper bound of kMaxDim (currently uint32_t::max).
- OPERATION_NAME = 'gpu.thread_id'¶
- _ODS_REGIONS = (0, True)¶
- dimension() _ods_ir¶
- upper_bound() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.thread_id(dimension, *, upper_bound=None, results=None, loc=None, ip=None) _ods_ir¶
- class mlir.dialects._gpu_ops_gen.WaitOp(asyncToken, asyncDependencies, *, loc=None, ip=None)¶
Bases: _ods_ir
This op synchronizes the host or the device with a list of dependent ops.
If the op contains the async keyword, it returns a new async token which is synchronized with the op arguments. This new token is merely a shortcut to the argument list, and one could replace the uses of the result with the arguments for the same effect. The async version of this op is primarily used to make each async token have a single use during lowering and thereby make forks in async execution explicit. Example usage:
%t0 = gpu.foo async : !gpu.async.token
%t1 = gpu.bar async : !gpu.async.token
%t2 = gpu.wait async [%t0, %t1]
// gpu.baz doesn't run until gpu.foo and gpu.bar have both completed, just
// as if the async dependencies were [%t0, %t1].
%t3 = gpu.baz async [%t2]
If the op does not contain the async keyword, it does not return a new async token but blocks until all ops producing the async dependency tokens finished execution. All dependent memory operations are visible to the host once this op completes. Example usage:
%t0 = gpu.foo async : !gpu.async.token
%t1 = gpu.bar async : !gpu.async.token
// The gpu.wait op blocks until gpu.foo and gpu.bar have completed.
gpu.wait [%t0, %t1]
- OPERATION_NAME = 'gpu.wait'¶
- _ODS_REGIONS = (0, True)¶
- asyncDependencies() _ods_ir¶
- asyncToken() _ods_ir | None¶
- mlir.dialects._gpu_ops_gen.wait(async_token, async_dependencies, *, loc=None, ip=None) _ods_ir | _ods_ir | WaitOp¶
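As a loose host-side analogy for the blocking form of gpu.wait, async tokens can be thought of as futures and the wait as a join over them. The sketch below uses Python's standard concurrent.futures module. The mapping of tokens to futures is an analogy chosen for illustration, not how the GPU dialect is implemented, and the function name is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_blocking_wait_model():
    """Model `gpu.wait [%t0, %t1]`: the host resumes only after both ops finish."""
    log = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        t0 = pool.submit(log.append, "foo")  # stands in for: %t0 = gpu.foo async
        t1 = pool.submit(log.append, "bar")  # stands in for: %t1 = gpu.bar async
        wait([t0, t1])                       # stands in for: gpu.wait [%t0, %t1]
        log.append("host")                   # host work after the wait
    return log
```

The two asynchronous entries may appear in either order, but the host entry always comes last, which is the ordering guarantee the blocking form provides.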
- class mlir.dialects._gpu_ops_gen.WarpExecuteOnLane0Op(results_, laneid, warp_size, args, *, loc=None, ip=None)¶
Bases: _ods_ir
warp_execute_on_lane_0 is an operation used to bridge the gap between vector programming and SPMD programming models like GPU SIMT. It makes it trivial to convert a region of vector code meant to run on multiple threads into a valid SPMD region, and then allows incremental transformation to distribute vector operations across the threads.
Any code present in the region would only be executed on the first thread/lane based on the laneid operand. The laneid operand is an integer ID in the range [0, warp_size). The warp_size attribute indicates the number of lanes in a warp.
Operands are vector values distributed on all lanes that may be used by the single lane execution. The matching region argument is a vector of all the values of those lanes available to the single active lane. The distributed dimension is implicit based on the shape of the operand and argument. The properties of the distribution may be described by extra attributes (e.g. affine map).
Return values are distributed on all lanes using laneId as index. The vector is distributed based on the shape ratio between the vector type of the yield and the result type. If the shapes are the same, this means the value is broadcast to all lanes. In the future the distribution can be made more explicit using affine_maps and will support having multiple Ids.
Therefore the warp_execute_on_lane_0 operation allows implicit copies between lane 0 and the lanes of the warp. When distributing a vector from lane 0 to all the lanes, the data are distributed in a block cyclic way. For example, vector<64xf32> gets distributed on 32 threads and maps to vector<2xf32>, where thread 0 contains vector[0] and vector[1].
During lowering, values passed as operands and return values need to be visible to different lanes within the warp. This would usually be done by going through memory.
The region is not isolated from above. For values coming from the parent region that do not go through operands, only the lane 0 value will be accessible, so this generally only makes sense for uniform values.
Example:
// Execute in parallel on all threads/lanes.
gpu.warp_execute_on_lane_0 (%laneid)[32] {
  // Serial code running only on thread/lane 0.
  ...
}
// Execute in parallel on all threads/lanes.
This may be lowered to an scf.if region as below:
// Execute in parallel on all threads/lanes.
%cnd = arith.cmpi eq, %laneid, %c0 : index
scf.if %cnd {
  // Serial code running only on thread/lane 0.
  ...
}
// Execute in parallel on all threads/lanes.
When the region has operands and/or return values:
// Execute in parallel on all threads/lanes.
%0 = gpu.warp_execute_on_lane_0(%laneid)[32]
args(%v0 : vector<4xi32>) -> (vector<1xf32>) {
^bb0(%arg0 : vector<128xi32>) :
  // Serial code running only on thread/lane 0.
  ...
  gpu.yield %1 : vector<32xf32>
}
// Execute in parallel on all threads/lanes.
values at the region boundary would go through memory:
// Execute in parallel on all threads/lanes.
...
// Store the data from each thread into memory and synchronize.
%tmp0 = memref.alloc() : memref<128xf32>
%tmp1 = memref.alloc() : memref<32xf32>
%cnd = arith.cmpi eq, %laneid, %c0 : index
vector.store %v0, %tmp0[%laneid] : memref<128xf32>, vector<4xf32>
some_synchronization_primitive
scf.if %cnd {
  // Serialized code running only on thread 0.
  // Load the data from all the threads into a register from thread 0. This
  // allows thread 0 to access data from all the threads.
  %arg0 = vector.load %tmp0[%c0] : memref<128xf32>, vector<128xf32>
  ...
  // Store the data from thread 0 into memory.
  vector.store %1, %tmp1[%c0] : memref<32xf32>, vector<32xf32>
}
// Synchronize and load the data in a block cyclic way so that the
// vector is distributed on all threads.
some_synchronization_primitive
%0 = vector.load %tmp1[%laneid] : memref<32xf32>, vector<32xf32>
// Execute in parallel on all threads/lanes.
- OPERATION_NAME = 'gpu.warp_execute_on_lane_0'¶
- _ODS_REGIONS = (1, True)¶
- laneid() _ods_ir¶
- args() _ods_ir¶
- warp_size() _ods_ir¶
- results_() _ods_ir¶
- warpRegion() _ods_ir¶
- mlir.dialects._gpu_ops_gen.warp_execute_on_lane_0(results_, laneid, warp_size, args, *, loc=None, ip=None) _ods_ir | _ods_ir | WarpExecuteOnLane0Op¶
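The distribution described above, where vector<64xf32> on 32 threads becomes vector<2xf32> per thread with thread 0 holding vector[0] and vector[1], can be mimicked on the host with plain lists. The sketch below is a model under that one stated example. The helper names distribute and gather are hypothetical, and it assumes the vector length is a multiple of the warp size.

```python
def distribute(vector, warp_size):
    """Split a lane-0 vector into equal contiguous per-lane chunks."""
    if len(vector) % warp_size != 0:
        raise ValueError("vector length must be a multiple of the warp size")
    per_lane = len(vector) // warp_size
    return [vector[lane * per_lane:(lane + 1) * per_lane] for lane in range(warp_size)]


def gather(per_lane_values):
    """Inverse of distribute: rebuild the vector seen by the single active lane."""
    return [element for chunk in per_lane_values for element in chunk]
```

Going from operands to the region argument corresponds to gather, and going from the yielded value to the op results corresponds to distribute.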
- class mlir.dialects._gpu_ops_gen.YieldOp(values, *, loc=None, ip=None)¶
Bases: _ods_ir
gpu.yield is a special terminator operation for blocks inside regions in gpu ops. It returns values to the immediately enclosing gpu op.
Example:
gpu.yield %f0, %f1 : f32, f32
- OPERATION_NAME = 'gpu.yield'¶
- _ODS_REGIONS = (0, True)¶
- values() _ods_ir¶