'amdgpu' Dialect

The AMDGPU dialect provides wrappers around AMD-specific functionality and LLVM intrinsics. These wrappers should be used in conjunction with more generic dialects, such as gpu and vector, when generating LLVM IR that will eventually be executed on AMD hardware.

What goes here?

In many cases, AMD GPU functionality can be accessed either through generic operations (such as those in the gpu, vector, or math dialects) or through the rocdl dialect’s intrinsic wrappers. However, there are instances where AMD-specific functionality benefits from a wrapper around the underlying LLVM intrinsics.

In general terms, operations or types should be added to this dialect when they wrap some AMD-specific functionality in a way that makes it work better with the MLIR ecosystem and its types, or when the underlying builtins would be needlessly complex to work with (such as if they feature magic constants at the LLVM level).

An additional set of operations that belong in this dialect are those that have chipset-specific differences that can be abstracted over in a useful way.

To give some concrete examples:

amdgpu.mfma and amdgpu.wmma exist in order to make a large set of intrinsics more compatible with the MLIR type system (such as by allowing 8-bit float vectors to be passed as vector<N x f8E4M3FN> or vector<N x f8E5M2> instead of as packed 32-bit integers whose element type is controlled by separate operator-level constants). These operations also allow the same amdgpu.mfma operation to be used regardless of the target chip.
amdgpu.swizzle_bitmode provides a wrapper around the ds.swizzle intrinsic, allowing a wider range of types (such as vector<2xf16>) to be used natively and eliminating the need to pack the and, or, and xor components using opaque shifts.
Operations like amdgpu.gather_to_lds provide memref-ized wrappers around intrinsics that take a pointer, and are nontrivial enough to justify inclusion in this dialect.

Note that simple intrinsics like rocdl.sin or rocdl.s.barrier should not receive wrapper operations, as nothing is gained from the duplicate operation. As a rule of thumb, if an operation’s rewrite in AMDGPUToROCDL would be only a replaceOpWithNewOp call, no AMDGPU dialect operation is needed.

Design guidelines

Operations should leverage MLIR’s “standard” types where possible. MLIR has a more extensible type system than LLVM (especially in the area of small floats) and those types should be used to create more ergonomic wrappers. In particular, intrinsics that take pointers should have wrappers in this dialect that take memref arguments and indices.

Operations should use properties or attributes in cases where the underlying intrinsic uses immargs (except in cases where that attribute can be represented in the type system).

If it is possible to generalize the types of an operation, it should be done. For example, the underlying operations for permutations and swizzles always take 32-bit operands. Their AMDGPU wrappers can take any type, and will apply padding and expansion to multiple instructions as needed. This makes these operations easier to target because it hides the bitcasts and extracts until the final lowering.

When the underlying operation uses magic constants, those should be presented in a more programmer-friendly fashion, such as through enums or through separate arguments that are later combined. (For example, see the design of the amdgpu.dpp and amdgpu.fat_raw_buffer_cast operations.) However, in the case where an immediate argument corresponds directly to an enum in the backend compiler, that enum should be in the ROCDL dialect instead.

If sufficiently similar functionality on multiple hardware generations can be encapsulated into a single operation, it should be done. The lowering to intrinsics should either throw an error when an unsupported capability is used or ignore it. Which of these two failure modes is more appropriate depends on the nature of the feature, but errors are a safe default choice.

Documentation guidelines

AMDGPU dialect operations should document how any abstractions they introduce translate to LLVM intrinsics or hardware operations.

While documenting the semantics of the underlying operations is not required, it is preferred to provide an overview of the operation’s functionality, especially in cases where the documentation is widely distributed. Someone looking at an AMDGPU dialect operation should be able to generally understand what it does and have found the keywords they’ll need for more detail.

Operation documentation should include usage examples.

Note that this dialect uses LLVM’s gfx numbers to refer to individual architectures/chipsets and not product names or codenames.

Operations ¶

`amdgpu.dot` (amdgpu::DotOp) ¶

MLIR wrapper for AMDGPU v_dot* intrinsics

Syntax:

operation ::= `amdgpu.dot` $sourceA `*` $sourceB `+` $destC attr-dict
              `:` type($sourceA) `,` type($sourceB) `,` type($destC)

The amdgpu.dot op is an MLIR wrapper over the v_dot* family of intrinsics, which compute D = sum_i A[i] * B[i] + C.

Variants (source, dest, signedness, chipset -> intrinsic).

| A elem   | B elem   | destC | signedness | chipset                   | ROCDL op                     |
|----------|----------|-------|------------|---------------------------|------------------------------|
| f16      | f16      | f32   | n/a        | gfx906+                   | fdot2                        |
| f16      | f16      | f16   | n/a        | gfx11+                    | fdot2.f16.f16                |
| bf16     | bf16     | f32   | n/a        | gfx11+, gfx950+           | fdot2.f32.bf16               |
| bf16     | bf16     | bf16  | n/a        | gfx11+                    | fdot2.bf16.bf16              |
| i16      | i16      | i32   | s / u      | gfx906+, no gfx11+/gfx12+ | sdot2 / udot2                |
| i8       | i8       | i32   | s / u      | gfx906+                   | sdot4 / udot4                |
| i8       | i8       | i32   | mixed      | gfx11+                    | sudot4                       |
| i4       | i4       | i32   | s / u      | gfx906+                   | sdot8 / udot8                |
| i4       | i4       | i32   | mixed      | gfx11+                    | sudot8                       |
| fp8/bf8  | fp8/bf8  | f32   | n/a        | gfx11.7, gfx12+           | dot4.f32.{fp8,bf8}.{fp8,bf8} |

Example:

%r0 = amdgpu.dot %a * %b + %c : vector<4xi8>, vector<4xi8>, i32
%r1 = amdgpu.dot %a * %b + %c {unsignedA, unsignedB, clamp}
    : vector<8xi4>, vector<8xi4>, i32
%r2 = amdgpu.dot %a * %b + %c {unsignedB}
    : vector<4xi8>, vector<4xi8>, i32
%r3 = amdgpu.dot %a * %b + %c : vector<2xf16>, vector<2xf16>, f32
%r4 = amdgpu.dot %a * %b + %c : vector<2xf16>, vector<2xf16>, f16
%r5 = amdgpu.dot %a * %b + %c
    : vector<4xf8E4M3FN>, vector<4xf8E5M2>, f32
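
As a rough illustration of the D = sum_i A[i] * B[i] + C computation for the integer variants, the following Python reference model may help. It is a sketch and not the lowering: the names `to_signed` and `dot_reference` are hypothetical, inputs are treated as raw element bit patterns, and the interpretation of clamp as saturation to the 32-bit accumulator range is an assumption. Without clamp, the model returns an unbounded Python integer rather than modeling any wraparound.

```python
def to_signed(value, bits):
    """Reinterpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def dot_reference(a, b, c, bits, unsigned_a=False, unsigned_b=False, clamp=False):
    """Model D = sum_i A[i] * B[i] + C for integer elements of width `bits`.

    `a` and `b` hold raw element bit patterns. The saturation applied when
    `clamp` is set is an assumption made for illustration.
    """
    if len(a) != len(b):
        raise ValueError("sourceA and sourceB must have the same length")
    mask = (1 << bits) - 1
    acc = c
    for x, y in zip(a, b):
        xa = (x & mask) if unsigned_a else to_signed(x, bits)
        yb = (y & mask) if unsigned_b else to_signed(y, bits)
        acc += xa * yb
    if clamp:
        if unsigned_a and unsigned_b:
            lo, hi = 0, 2**32 - 1
        else:
            lo, hi = -(2**31), 2**31 - 1
        acc = max(lo, min(hi, acc))
    return acc
```

In this model, setting only `unsigned_b` corresponds to the "mixed" signedness rows of the table, where A is read as signed and B as unsigned.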

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`unsignedA`	::mlir::UnitAttr	unit attribute
`unsignedB`	::mlir::UnitAttr	unit attribute
`clamp`	::mlir::UnitAttr	unit attribute

Operands: ¶

Operand	Description
`sourceA`	vector of 16-bit float or bfloat16 type or 16-bit signless integer values of length 2 or vector of 8-bit signless integer or f8E4M3FN type or f8E5M2 type values of length 4 or vector of 4-bit signless integer values of length 8
`sourceB`	vector of 16-bit float or bfloat16 type or 16-bit signless integer values of length 2 or vector of 8-bit signless integer or f8E4M3FN type or f8E5M2 type values of length 4 or vector of 4-bit signless integer values of length 8
`destC`	32-bit float or 16-bit float or bfloat16 type or 32-bit signless integer

Results: ¶

Result	Description
`destD`	32-bit float or 16-bit float or bfloat16 type or 32-bit signless integer

`amdgpu.dpp` (amdgpu::DPPOp) ¶

AMDGPU DPP operation

Syntax:

operation ::= `amdgpu.dpp` $old $src $kind (`(` $permArgument^ `)`)? attr-dict `:` type($result)

The amdgpu.dpp op performs a Data Parallel Primitives (DPP) lane permutation on a source value within a wavefront. Each lane reads its source data from another lane according to the permutation mode specified by kind. DPP operates at dword (32-bit) granularity: sub-32-bit types (e.g., f16, i16) are packed into an i32 during lowering, permuted, and extracted back.

Lanes are organized into rows of 16.
A Wave64 wavefront has 4 rows of 16 lanes each: row 0 = lanes 0-15, row 1 = lanes 16-31, row 2 = lanes 32-47, row 3 = lanes 48-63.
Similarly, a Wave32 wavefront has two rows of 16 lanes each, organized in the same fashion.
Each row is divided into 4 banks of 4 consecutive lanes: bank 0 = lanes 0-3, bank 1 = lanes 4-7, bank 2 = lanes 8-11, bank 3 = lanes 12-15 (lane numbers shown for row 0; add 16/32/48 for other rows).
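
The row and bank layout described above can be sketched as a small Python helper. The function name `lane_position` is hypothetical and exists only for illustration.

```python
def lane_position(lane):
    """Return (row, bank, lane_in_row) for a lane ID.

    Rows hold 16 lanes and each row is split into 4 banks of 4 lanes.
    """
    row = lane // 16
    lane_in_row = lane % 16
    bank = lane_in_row // 4
    return row, bank, lane_in_row
```

For example, lane 37 sits in row 2 (lanes 32-47), at position 5 within that row, which falls in bank 1.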

The kind attribute selects the permutation. Some modes require a permArgument; others take no argument.

Quad permutation:

quad_perm([a, b, c, d]): Full permute within each group of 4 consecutive lanes (a quad). Each element is in [0, 3] and selects which lane within the quad to read from. Lane 4k+i reads from lane 4k+perm[i]. For example, quad_perm([1, 0, 3, 2]) swaps adjacent pairs within every quad.

Row shifts and rotates (operate within each 16-lane row independently):

row_shl(N): Shift left by N (1-15) within the row. Lane n reads from lane (n % 16) + N in the same row. Lanes where the source index exceeds 15 are out of bounds (see bound_ctrl).
row_shr(N): Shift right by N (1-15) within the row. Lane n reads from lane (n % 16) - N in the same row. Lanes where the source index is negative are out of bounds.
row_ror(N): Rotate right by N (1-15) within the row. Lane n reads from lane ((n % 16) - N) mod 16 in the same row. Always in bounds.

Wavefront shifts and rotates (not available on RDNA):

wave_shl: Shift left by 1. Lane n reads from lane n + 1. The last lane in the wavefront is out of bounds.
wave_shr: Shift right by 1. Lane n reads from lane n - 1. Lane 0 is out of bounds.
wave_rol: Rotate left by 1. Lane n reads from lane (n + 1) mod W, where W is the wavefront size.
wave_ror: Rotate right by 1. Lane n reads from lane (n - 1) mod W, where W is the wavefront size.

Row mirrors:

row_mirror: Reverse lanes within each 16-lane row. Lane n reads from lane 15 - (n % 16) within its row.
row_half_mirror: Reverse within each 8-lane half-row. Lane n reads from lane 7 - (n % 8) within its half-row.

Row broadcasts (not available on RDNA):

row_bcast_15: Lane 15 of each row broadcasts to all lanes of the next row. Lanes in row 0 are not affected (retain old).
row_bcast_31: Lane 31 broadcasts to all lanes in rows 2 and 3. Lanes in rows 0 and 1 are not affected (retain old).
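
The per-mode formulas above can be collected into one Python model that returns the lane a given lane reads from. This is a sketch of the documented index arithmetic only, not of the hardware or the lowering. The function name `dpp_source_lane`, the string spellings of the modes, and the use of `None` for "no in-bounds source" (including the rows that the broadcast modes leave untouched) are assumptions made for illustration.

```python
def dpp_source_lane(lane, kind, arg=None, wave_size=64):
    """Return the lane that `lane` reads from, or None if there is no source.

    None covers both out-of-bounds sources (shifts) and lanes that the
    broadcast modes leave unaffected.
    """
    row_base = lane - lane % 16  # first lane of this lane's row
    r = lane % 16                # position within the row
    if kind == "quad_perm":
        return lane - lane % 4 + arg[lane % 4]
    if kind == "row_shl":
        return row_base + r + arg if r + arg <= 15 else None
    if kind == "row_shr":
        return row_base + r - arg if r - arg >= 0 else None
    if kind == "row_ror":
        return row_base + (r - arg) % 16
    if kind == "wave_shl":
        return lane + 1 if lane + 1 < wave_size else None
    if kind == "wave_shr":
        return lane - 1 if lane >= 1 else None
    if kind == "wave_rol":
        return (lane + 1) % wave_size
    if kind == "wave_ror":
        return (lane - 1) % wave_size
    if kind == "row_mirror":
        return row_base + 15 - r
    if kind == "row_half_mirror":
        half_base = lane - lane % 8
        return half_base + 7 - lane % 8
    if kind == "row_bcast_15":
        # Lane 15 of the previous row feeds every lane of this row.
        return row_base - 1 if lane >= 16 else None
    if kind == "row_bcast_31":
        return 31 if lane >= 32 else None
    raise ValueError(f"unknown DPP kind: {kind}")
```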

Example:

// Swap adjacent pairs within each quad (lanes 0<->1, 2<->3, etc.)
%0 = amdgpu.dpp %old %src quad_perm( [1, 0, 3, 2] ) : i32

// Shift right by 1 lane within each 16-lane row.
// bound_ctrl=true -> lanes that would read past the row return 0.
// row_mask=0x5 (0b0101) -> only rows 0 and 2 apply the shift;
// rows 1 and 3 pass through %old unchanged.
%1 = amdgpu.dpp %old %src row_shr( 0x1 : i32 )
  { row_mask = 0x5 : i32, bound_ctrl = true } : f32

// Rotate left across the full wavefront by 1 lane
%2 = amdgpu.dpp %old %src wave_rol : i32

Operands:

$old: Fallback value. Lanes that are masked off by row_mask / bank_mask retain old. For lanes with an out-of-bounds source, behavior depends on bound_ctrl.
$src: Source value to be permuted across lanes.
$kind: A #amdgpu.dpp_perm enum selecting the permutation mode.
$permArgument: Mode-specific argument. Required for quad_perm (array of 4 integers in [0, 3]) and row_shl/row_shr/row_ror (integer in [1, 15]). Absent for all other modes.
$row_mask (default 0xf): 4-bit mask controlling which rows write results. Bit i enables row i (bit 0 = lanes 0-15, bit 1 = lanes 16-31, etc.). Disabled lanes retain old.
$bank_mask (default 0xf): 4-bit mask controlling which banks write results. Bit i enables bank i (bit 0 = lanes 0-3, 16-19, etc. across all rows). Disabled lanes retain old.
$bound_ctrl (default false): When false, out-of-bounds lanes retain old. When true, out-of-bounds lanes receive zero.
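
To show how row_mask, bank_mask, and bound_ctrl interact, here is a small Python simulation of the row_shr mode over a whole wavefront. It is an illustrative model under the semantics stated above, not the actual lowering. The function name `simulate_row_shr` and the representation of a wavefront as a Python list indexed by lane ID are assumptions.

```python
def simulate_row_shr(old, src, n, row_mask=0xF, bank_mask=0xF, bound_ctrl=False):
    """Apply row_shr(n) to per-lane values and return the per-lane results."""
    out = []
    for lane in range(len(src)):
        row, r = divmod(lane, 16)
        bank = r // 4
        enabled = (row_mask >> row) & 1 and (bank_mask >> bank) & 1
        if not enabled:
            # Masked-off lanes always keep the fallback value.
            out.append(old[lane])
        elif r - n < 0:
            # Out-of-bounds source: zero with bound_ctrl, otherwise old.
            out.append(0 if bound_ctrl else old[lane])
        else:
            out.append(src[lane - n])
    return out
```

With `n = 1`, `row_mask = 0x5`, and `bound_ctrl = True`, this reproduces the behavior described in the second example above: rows 0 and 2 shift, their first lanes receive zero, and rows 1 and 3 keep old.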

Traits: AlwaysSpeculatableImplTrait, SameTypeOperands

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`kind`	::mlir::amdgpu::DPPPermAttr	The possible permutations for a DPP operation Enum cases: quad_perm (`quad_perm`) row_shl (`row_shl`) row_shr (`row_shr`) row_ror (`row_ror`) wave_shl (`wave_shl`) wave_shr (`wave_shr`) wave_ror (`wave_ror`) wave_rol (`wave_rol`) row_mirror (`row_mirror`) row_half_mirror (`row_half_mirror`) row_bcast_15 (`row_bcast_15`) row_bcast_31 (`row_bcast_31`)
`permArgument`	::mlir::Attribute	32-bit signless integer attribute or array attribute or unit attribute
`row_mask`	::mlir::IntegerAttr	32-bit signless integer attribute
`bank_mask`	::mlir::IntegerAttr	32-bit signless integer attribute
`bound_ctrl`	::mlir::BoolAttr	bool attribute

Operands: ¶

Operand	Description
`old`	integer or float with element bitwidth <= 64 or fixed-length vector of integer or float with element bitwidth <= 64 values of ranks 1
`src`	integer or float with element bitwidth <= 64 or fixed-length vector of integer or float with element bitwidth <= 64 values of ranks 1

Results: ¶

Result	Description
`result`	any non-token type

`amdgpu.ds_async_barrier_arrive` (amdgpu::DsAsyncBarrierArriveOp) ¶

Asynchronously arrive at an in-LDS barrier.

Syntax:

operation ::= `amdgpu.ds_async_barrier_arrive` $base `[` $indices `]` attr-dict `:` type($base)

Add an arrival to the LDS barrier at base[indices] to the sequence of pending asynchronous memory operations.

The indices must be non-negative and in-bounds for the corresponding dimensions of base.

This will add an “asynchronous memory operation” to the in-order list of pending asynchronous loads from global memory to LDS. When the queue of such operations issued before this operation is complete, the specified barrier will be arrived at, decrementing the pending count by 1 per lane that executes it and rolling over the phase if applicable.

This operation does not return the old barrier state.

Example:

amdgpu.ds_async_barrier_arrive %barrier[] : memref<!amdgpu.ds_barrier_state, #gpu.address_space<workgroup>>

This operation is only available on gfx1250+.

Operands: ¶

Operand	Description
`base`	memref of State of an in-LDS barrier. values
`indices`	variadic of index

`amdgpu.ds_barrier_arrive` (amdgpu::DsBarrierArriveOp) ¶

Arrive at an in-LDS barrier and return old state.

Syntax:

operation ::= `amdgpu.ds_barrier_arrive` $base `[` $indices `]` `,` $count attr-dict `:` type($base) `,` type($count) `->` type($out)

Atomically arrive at the LDS barrier at base[indices] and decrement it by count, rolling over the phase if needed and returning the old barrier state.

The indices must be non-negative and in-bounds for the corresponding dimensions of base.

count is the number of participants that should be subtracted from the barrier’s pending count per lane that executes the operation.

Example:

%old_state = amdgpu.ds_barrier_arrive %barrier[], %c1 : memref<!amdgpu.ds_barrier_state, #gpu.address_space<workgroup>>, i64 -> !amdgpu.ds_barrier_state

This operation is only available on gfx1250+.

Interfaces: InferTypeOpInterface

Operands: ¶

Operand	Description
`base`	memref of State of an in-LDS barrier. values
`indices`	variadic of index
`count`	64-bit signless integer

Results: ¶

Result	Description
`out`	State of an in-LDS barrier.

`amdgpu.ds_barrier_init` (amdgpu::DsBarrierInitOp) ¶

Initialize an in-LDS barrier.

Syntax:

operation ::= `amdgpu.ds_barrier_init` $base `[` $indices `]` `,` $participants attr-dict `:` type($base) `,` type($participants)

Given the location of an !amdgpu.ds_barrier_state in LDS (as specified by base and indices), initialize the barrier structure so that its pending and init counts are both equal to participants - 1 (with the high bits masked off) and its phase is equal to 0.

The indices must be non-negative and in-bounds for the corresponding dimensions of base.

Note that we subtract 1 from participants when constructing the barrier state to provide clearer high-level semantics.

The subtraction means that the phase will change when the participants-th arrival occurs. In practical terms, this means that you can use (for example) the number of subgroups or waves per workgroup as participants, instead of manually needing to remove one.

While the write of the initial state will be performed atomically, no synchronization between waves will be performed by this operation.
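
The effect of the participants - 1 initialization can be illustrated with a simplified Python model of the barrier state. This is a sketch under stated assumptions and not a description of the hardware: the names `BarrierState`, `barrier_init`, and `barrier_arrive` are hypothetical, only arrivals with a count of 1 are modeled, the masking of high bits is ignored, and the rollover rule (an arrival that finds a pending count of 0 advances the phase and reloads the pending count from the init count) is an assumption chosen to match the statement that the phase changes on the participants-th arrival.

```python
from dataclasses import dataclass


@dataclass
class BarrierState:
    init_count: int
    pending_count: int
    phase: int


def barrier_init(participants):
    """Both counts start at participants - 1 and the phase starts at 0."""
    return BarrierState(participants - 1, participants - 1, 0)


def barrier_arrive(state):
    """Model one arrival with count = 1 and return a copy of the old state."""
    old = BarrierState(state.init_count, state.pending_count, state.phase)
    if state.pending_count == 0:
        # Assumed rollover: advance the phase and reload the pending count.
        state.phase += 1
        state.pending_count = state.init_count
    else:
        state.pending_count -= 1
    return old
```

In this model, a barrier initialized with 4 participants stays in phase 0 for the first three arrivals and moves to phase 1 on the fourth, so the caller can pass the number of waves directly.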

Example:

amdgpu.ds_barrier_init %barrier[], %c32 : memref<!amdgpu.ds_barrier_state, #gpu.address_space<workgroup>>, i32

This operation is only available on gfx1250+.

Operands: ¶

Operand	Description
`base`	memref of State of an in-LDS barrier. values
`indices`	variadic of index
`participants`	32-bit signless integer

`amdgpu.ds_barrier_poll_state` (amdgpu::DsBarrierPollStateOp) ¶

Atomically read the state of an in-LDS barrier.

Syntax:

operation ::= `amdgpu.ds_barrier_poll_state` $base `[` $indices `]` attr-dict `:` type($base) `->` type($out)

Atomically read and return the state of the barrier at base[indices...].

This will ultimately act like a memref.load, but this operation will ensure that appropriate atomic orderings and syncscopes are set. The indices must be non-negative and in-bounds for the corresponding dimensions of base.

Example:

%state = amdgpu.ds_barrier_poll_state %barrier[] : memref<!amdgpu.ds_barrier_state, #gpu.address_space<workgroup>> -> !amdgpu.ds_barrier_state

This operation is only available on gfx1250+.

Interfaces: InferTypeOpInterface

Operands: ¶

Operand	Description
`base`	memref of State of an in-LDS barrier. values
`indices`	variadic of index

Results: ¶

Result	Description
`out`	State of an in-LDS barrier.

`amdgpu.ds_barrier_state_init_count` (amdgpu::DsBarrierStateInitCountOp) ¶

Extract the init count of a barrier state.

Syntax:

operation ::= `amdgpu.ds_barrier_state_init_count` $state attr-dict `:` type($state) `->` type($res)

Extract the init count of the !amdgpu.ds_barrier_state state as a 32-bit value.

Example:

%init = amdgpu.ds_barrier_state_init_count %state : !amdgpu.ds_barrier_state -> i32

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands: ¶

Operand	Description
`state`	State of an in-LDS barrier.

Results: ¶

Result	Description
`res`	32-bit signless integer

`amdgpu.ds_barrier_state_pending_count` (amdgpu::DsBarrierStatePendingCountOp) ¶

Extract the pending count of a barrier state.

Syntax:

operation ::= `amdgpu.ds_barrier_state_pending_count` $state attr-dict `:` type($state) `->` type($res)

Extract the pending count of the !amdgpu.ds_barrier_state state as a 32-bit value.

Example:

%pending = amdgpu.ds_barrier_state_pending_count %state : !amdgpu.ds_barrier_state -> i32

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands: ¶

Operand	Description
`state`	State of an in-LDS barrier.

Results: ¶

Result	Description
`res`	32-bit signless integer

`amdgpu.ds_barrier_state_phase` (amdgpu::DsBarrierStatePhaseOp) ¶

Extract the phase of a barrier state.

Syntax:

operation ::= `amdgpu.ds_barrier_state_phase` $state attr-dict `:` type($state) `->` type($res)

Extract the phase of the !amdgpu.ds_barrier_state state as a 32-bit value.

Example:

%phase = amdgpu.ds_barrier_state_phase %state : !amdgpu.ds_barrier_state -> i32

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands: ¶

Operand	Description
`state`	State of an in-LDS barrier.

Results: ¶

Result	Description
`res`	32-bit signless integer

`amdgpu.ds_barrier_state_phase_parity` (amdgpu::DsBarrierStatePhaseParity) ¶

Extract the phase parity of a barrier state.

Syntax:

operation ::= `amdgpu.ds_barrier_state_phase_parity` $state attr-dict `:` type($state) `->` type($res)

Return the parity of the phase of the !amdgpu.ds_barrier_state state.

This is intended to simplify the case where the barrier is being used to repeatedly track completion of a task where the precise value of the phase won’t matter, only that it changed since (or as a result of) the arrival.

Example:

%parity = amdgpu.ds_barrier_state_phase_parity %state : !amdgpu.ds_barrier_state -> i1

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands: ¶

Operand	Description
`state`	State of an in-LDS barrier.

Results: ¶

Result	Description
`res`	1-bit signless integer

`amdgpu.ext_packed_fp8` (amdgpu::ExtPackedFp8Op) ¶

Extend an fp8 value to a float or a vector of packed fp8 values to two floats

Syntax:

operation ::= `amdgpu.ext_packed_fp8` attr-dict $source `[` $index `]` `:` type($source) `to` type($res)

Extend one or two 8-bit floats in source[index] to a 32-bit float or two floats and return them.

This rather unusual signature arises from the fact that AMD GPUs cannot easily work with sub-32-bit quantities, so the compiler intrinsics for extending 8-bit floats (which are, currently, the only way to work with this operation) take packed vectors of 4 such floats.

If the passed-in vector has fewer than four elements, or the input is scalar, the remaining values in the <4 x i8> will be filled with undefined values as needed.
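
To make the extension concrete, the following Python sketch decodes a single 8-bit float from a packed 32-bit word. It is illustrative only: the names `decode_fp8` and `ext_packed_fp8` are hypothetical helpers, only the f8E4M3FN and f8E5M2 encodings are handled (the FNUZ variants the op also accepts are not), and the assumption that element `index` occupies bits 8*index through 8*index + 7 of the word is made for illustration.

```python
import math


def decode_fp8(byte, fmt):
    """Decode one 8-bit float bit pattern ("f8E4M3FN" or "f8E5M2") to a float."""
    sign = -1.0 if byte & 0x80 else 1.0
    if fmt == "f8E4M3FN":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "f8E5M2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError(f"unsupported format: {fmt}")
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    if fmt == "f8E5M2" and exp == exp_max:
        # IEEE-style: all-ones exponent encodes infinity or NaN.
        return sign * math.inf if man == 0 else math.nan
    if fmt == "f8E4M3FN" and exp == exp_max and man == (1 << man_bits) - 1:
        # Finite-only format: the single all-ones pattern is NaN.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def ext_packed_fp8(packed_word, index, fmt):
    """Extend byte `index` (0-3) of a packed 32-bit word (assumed byte order)."""
    if not 0 <= index <= 3:
        raise ValueError("index must be in [0, 3]")
    return decode_fp8((packed_word >> (8 * index)) & 0xFF, fmt)
```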

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`index`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 3

Operands: ¶

Operand	Description
`source`	f8E5M2FNUZ type or f8E4M3FNUZ type or f8E5M2 type or f8E4M3FN type or vector of f8E5M2FNUZ type or f8E4M3FNUZ type or f8E5M2 type or f8E4M3FN type values of length 1/2/3/4

Results: ¶

Result	Description
`res`	32-bit float or fixed-length vector of 32-bit float values of length 2

`amdgpu.fat_raw_buffer_cast` (amdgpu::FatRawBufferCastOp) ¶

Create a raw buffer fat pointer that matches memref

Syntax:

operation ::= `amdgpu.fat_raw_buffer_cast` $source oilist (`validBytes` `(` $validBytes `)`
              | `cacheSwizzleStride` `(` $cacheSwizzleStride `)`
              | `boundsCheck` `(` $boundsCheck `)`
              | `resetOffset` $resetOffset )
              attr-dict `:` type($source) `to` type($result)

Wraps the memory pointed to by source as a raw buffer fat pointer, or, in LLVM terms, a ptr addrspace(7), returning a memref that has the same sizes and layout but the #amdgpu.address_space<fat_raw_buffer> address space.

This memref can be used with standard memref operations like memref.load, memref.store, and memref.atomicrmw, which will be lowered to the relevant buffer intrinsics. (vector.masked_load/store will work once there’s backend support for lowering them, and then this document will be updated)

If validBytes is given, it is the number of bytes that will be valid as an offset into the result. If it is not provided, this will be inferred from the size of the memref during lowering. This size is max_{d = 0 up to rank(source)} (sizes[d] * strides[d]) * sizeof(element type).
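
The inferred size can be sketched in Python as follows. The function name `inferred_valid_bytes` is hypothetical, and treating a rank-0 memref as a single element is an assumption made for illustration. The actual inference happens during lowering.

```python
def inferred_valid_bytes(sizes, strides, element_bytes):
    """Compute max over d of sizes[d] * strides[d], scaled by the element size."""
    if len(sizes) != len(strides):
        raise ValueError("sizes and strides must have the same rank")
    # Rank-0 memrefs are assumed to cover exactly one element.
    extent = max((size * stride for size, stride in zip(sizes, strides)), default=1)
    return extent * element_bytes
```

For a contiguous 128x64 memref of f32 (strides 64 and 1), the largest product is 128 * 64 = 8192 elements, giving 32768 valid bytes.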

The flags of the buffer descriptor will be set up to enable raw usage - for example, stride = 0, add_tid = 0, and so on. The boundsCheck property determines if bounds checking is enabled or not (on architectures where this can be controlled - that is, on RDNA chips).

If cacheSwizzleStride is provided, L1 cache swizzling will be enabled on architectures that support it. This swizzling, unlike the main swizzling mode (whose usage makes a buffer non-raw), does not affect index calculation, but does affect cache behavior. Mixing access between cache-swizzled raw buffers and other forms of memory access, like ordinary pointer loads or unswizzled buffer pointers, can cause incorrect behavior and must be avoided.

This operation preserves the sizes, strides, and offset of the input memref - they’ll be added in by memref.load later. However, if resetOffset is set, that offset will be added to the base pointer. If the value of the memref’s offset is not uniform (independent of the lane/thread ID), this will lead to substantially decreased performance due to the need for a waterfall loop on the base address of the buffer resource.

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), ReifyRankedShapedTypeOpInterface, ViewLikeOpInterface

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`resetOffset`	::mlir::UnitAttr	unit attribute

Operands: ¶

Operand	Description
`source`	memref of any non-token type values
`validBytes`	64-bit signless integer
`cacheSwizzleStride`	14-bit signless integer

Results: ¶

Result	Description
`result`	memref of any non-token type values

`amdgpu.gather_to_lds` (amdgpu::GatherToLDSOp) ¶

MLIR wrapper for CDNA Gather to LDS instructions

Syntax:

operation ::= `amdgpu.gather_to_lds` (`async` $async^)? $src `[` $srcIndices `]` `,` $dst `[` $dstIndices `]` attr-dict `:` $transferType `,` type($src) `,` type($dst)

The amdgpu.gather_to_lds op is a wrapper around the global_load_lds instructions.

Operands:

$src: global memory (including fat buffer) memref to read from.
$srcIndices: indices into $src to read from for this thread. These indices must be non-negative and in-bounds when $src is not a fat raw buffer. Fat raw buffer sources permit out-of-bounds indices with raw buffer semantics.
$dst: LDS memory memref to write to.
$dstIndices: base indices into $dst to write to for the subgroup of this thread. These indices must be non-negative and in-bounds. The elements gathered by the subgroup will be written contiguously in order of lane ID starting at $dst[$dstIndices]. Byte-sized (e.g., i8) or short-sized (e.g., i16) types will be zero-padded/extended to 32 bits before being written. 96-bit types (e.g., vector<3xf32>) will be zero-padded to 128 bits before being written. Only the offsets held by lane 0 are used.
$transferType: type of the data to be transferred by each thread. This is used to determine the size of the data to be transferred and the number of threads in the subgroup. The transfer type must be a scalar type or a vector type with a single element type.
If $async is set, the compiler will not attempt to infer the memory waits needed to ensure that the DMA operation has succeeded before a load that might access the stored-to LDS is performed. Instead, the rocdl.asyncmark and rocdl.wait.asyncmark N operations must be used to explicitly indicate the desired completion behavior. This enables more precise calculation of these waits at the cost of requiring user management of asynchrony.

The $dst, along with its indices, points to the memory location the subgroup of this thread will write to.
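
The padding and lane-ordered placement described for $dstIndices can be sketched in Python. This is an illustrative model, not the instruction's definition: the names `padded_transfer_bits` and `lane_dst_byte_offset` are hypothetical, and the assumption that each lane's slot is exactly the padded width is inferred from the "written contiguously in order of lane ID" wording.

```python
def padded_transfer_bits(transfer_bits):
    """Width each lane's element occupies in LDS after zero-padding."""
    if transfer_bits in (8, 16):
        return 32
    if transfer_bits == 96:
        return 128
    return transfer_bits


def lane_dst_byte_offset(lane, transfer_bits):
    """Byte offset of a lane's write relative to $dst[$dstIndices]."""
    return lane * padded_transfer_bits(transfer_bits) // 8
```

Under these assumptions, lane 3 of an i16 transfer lands 12 bytes past the base, since each 16-bit element is widened to a 32-bit slot.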

Note: only supported on gfx9 and gfx10.

Traits: AttrSizedOperandSegments

Attributes: ¶

Attribute	MLIR Type	Description
`transferType`	::mlir::TypeAttr	any type attribute
`async`	::mlir::UnitAttr	unit attribute

Operands: ¶

Operand	Description
`src`	memref of any non-token type values
`srcIndices`	variadic of index
`dst`	memref of any non-token type values
`dstIndices`	variadic of index

`amdgpu.global_load_async_to_lds` (amdgpu::GlobalLoadAsyncToLDSOp) ¶

MLIR wrapper for async global load to lds instructions

Syntax:

operation ::= `amdgpu.global_load_async_to_lds` $src `[` $srcIndices `]` `,` $dst `[` $dstIndices `]`  (`,` $mask^)?
              attr-dict `:` $transferType `,` type($src) `,` type($dst)

AMDGPU wrapper for the global.load.async.to.lds instructions, which perform an asynchronous load of data from global memory into LDS while bypassing VGPRs.

$src: global memory memref to read from (global addrspace only, no fat buffer).
$srcIndices: indices into $src for this thread’s global read location. These indices must be non-negative and in-bounds.
$dst: LDS memref to write to (workgroup addrspace).
$dstIndices: indices into $dst for this thread’s LDS write location. These indices must be non-negative and in-bounds when $mask is not provided. When $mask is provided, the destination indices are not guaranteed to be in-bounds because masked-off lanes may carry invalid destination indices.
$transferType: type of data to be transferred. Must be an 8-, 32-, 64-, or 128-bit scalar or vector type.
$mask: optional per-thread mask. When false, the thread’s LDS write is masked off. The global read still occurs for all threads regardless of mask.

Note: only supported on gfx1250 and later.

Examples:

  amdgpu.global_load_async_to_lds %src[%i, %j], %dst[%k, %l]
    : f32, memref<128x64xf32, #gpu.address_space<global>>,
      memref<64x64xf32, #gpu.address_space<workgroup>>

  amdgpu.global_load_async_to_lds %src[%i, %j], %dst[%k, %l]
    : vector<4xf32>, memref<128x64xf32, #gpu.address_space<global>>,
      memref<64x64xf32, #gpu.address_space<workgroup>>

  amdgpu.global_load_async_to_lds %src[%i], %dst[%j]
    : i8, memref<512xi8, #gpu.address_space<global>>,
      memref<256xi8, #gpu.address_space<workgroup>>

Traits: AttrSizedOperandSegments

Attributes: ¶

Attribute	MLIR Type	Description
`transferType`	::mlir::TypeAttr	any type attribute

Operands: ¶

Operand	Description
`src`	memref of any non-token type values
`srcIndices`	variadic of index
`dst`	memref of any non-token type values
`dstIndices`	variadic of index
`mask`	1-bit signless integer

`amdgpu.global_prefetch` (amdgpu::GlobalPrefetchOp) ¶

Prefetch data to caches.

Syntax:

operation ::= `amdgpu.global_prefetch` $src `[` $indices `]` $temporalHint $cacheScope (`speculative` $speculative^)? attr-dict `:` qualified(type($src))

Prefetches a cache line to high-level caches using the aligned address of the source memref and an offset provided by the indices of the element containing the cache line. This provides temporal hints (e.g., regular or high-priority). Note that out-of-bounds access is allowed in speculative mode. The provided memref must be in the global address space (#gpu.address_space<global> or 1).

This operation was introduced in gfx1250.

Example:

amdgpu.global_prefetch %src[%i, %j] RT SE speculative : memref<64x64xf16, #gpu.address_space<global>>

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource, MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}

Attributes: ¶

Attribute	MLIR Type	Description
`temporalHint`	::mlir::amdgpu::LoadTemporalHintAttr	AMDGPU-specific prefetch temporal hints for load instructions. RT - regular temporal for both near and far caches; NT - non-temporal for both near and far caches; HT - high-priority temporal for both near and far caches; LU - last-use; NT_RT - non-temporal for near cache(s) and regular for far caches; RT_NT - regular for near cache(s) and non-temporal for far caches; NT_HT - non-temporal for near cache(s) and high-priority temporal for far caches Enum cases: RT (`RT`) NT (`NT`) HT (`HT`) LU (`LU`) NT_RT (`NT_RT`) RT_NT (`RT_NT`) NT_HT (`NT_HT`)
`cacheScope`	::mlir::amdgpu::ScopeAttr	AMDGPU-specific cache scopes. WGP - workgroup processor (CUs); SE - shader engine (GL2); DEV - device; SYS - system Enum cases: WGP (`WGP`) SE (`SE`) DEV (`DEV`) SYS (`SYS`)
`speculative`	::mlir::UnitAttr	unit attribute

Operands: ¶

Operand	Description
`src`	memref of any non-token type values
`indices`	variadic of 64-bit signless integer

`amdgpu.global_transpose_load` (amdgpu::GlobalTransposeLoadOp) ¶

MLIR wrapper for global memory transpose load instructions

Syntax:

operation ::= `amdgpu.global_transpose_load` $src `[` $srcIndices `]` attr-dict `:` type($src) `->` type($result)

The amdgpu.global_transpose_load op is a wrapper around the global_load_tr family of instructions introduced in gfx1200.

Each thread reads a column of a matrix stored in global memory and receives the corresponding row of the transposed matrix in its result register. The subgroup collectively performs a transpose of the tile.

This op is a direct wrapper around the ROCDL global.load.tr family intrinsics. Refer to the ISA manual for exact semantics.

Format example:

%0 = amdgpu.global_transpose_load %src[%i, %j]
       : memref<128x256xf16, #gpu.address_space<global>> -> vector<8xf16>

Operands:

$src: Global address space memref to read from.
$srcIndices: indices into $src for this thread. Indices must be non-negative and in-bounds for the corresponding dimension of $src, matching the constraints of memref.load.
$result: register this transpose load instruction writes to.

Valid (element bits, num elements) pairs:

(4, 16) -> global_load_tr4_b64 (gfx1250+)
(6, 16) -> global_load_tr6_b96 (gfx1250+)
(8, 8) -> global_load_tr_b64 (gfx1200+)
(16, 8) -> global_load_tr_b128 (gfx1200+)

Note: 8-bit and 16-bit element lowering requires gfx1200+. 4-bit and 6-bit element lowering requires gfx1250+.
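The valid pairs above can be captured as a small lookup table. The following is a minimal Python sketch, not part of the dialect; the names `TRANSPOSE_LOAD_VARIANTS` and `select_transpose_load` are hypothetical and exist only for illustration.

```python
# Lookup table transcribed from the list above:
# (element bits, number of elements) -> (instruction name, minimum chipset).
TRANSPOSE_LOAD_VARIANTS = {
    (4, 16): ("global_load_tr4_b64", "gfx1250"),
    (6, 16): ("global_load_tr6_b96", "gfx1250"),
    (8, 8): ("global_load_tr_b64", "gfx1200"),
    (16, 8): ("global_load_tr_b128", "gfx1200"),
}


def select_transpose_load(element_bits: int, num_elements: int) -> tuple[str, str]:
    """Return (instruction, minimum chipset), or raise ValueError for an invalid pair."""
    try:
        return TRANSPOSE_LOAD_VARIANTS[(element_bits, num_elements)]
    except KeyError:
        raise ValueError(
            f"no transpose load for {num_elements} elements of {element_bits} bits"
        ) from None
```

In every row the instruction suffix matches the total result width: element bits times number of elements gives 64, 96, 64, and 128 bits respectively.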

Traits: SameVariadicOperandSize

Operands: ¶

Operand	Description
`src`	memref of any non-token type values
`srcIndices`	variadic of index

Results: ¶

Result	Description
`result`	fixed-length vector of 8-bit signless integer or f8E5M2FNUZ type or f8E4M3FNUZ type or f8E5M2 type or f8E4M3FN type or 16-bit float or bfloat16 type or 16-bit signless integer values of length 8 or fixed-length vector of 4-bit signless integer or f4E2M1FN type or 6-bit signless integer or f6E2M3FN type or f6E3M2FN type values of length 16

`amdgpu.lds_barrier` (amdgpu::LDSBarrierOp) ¶

Barrier that includes a wait for LDS memory operations.

Syntax:

operation ::= `amdgpu.lds_barrier` attr-dict

DEPRECATION NOTICE: Unless you need the inline-assembly-based workaround for gfx908/MI-100, you should represent this pattern with the equivalent

gpu.barrier memfence [#gpu.address_space<workgroup>]

instead.

amdgpu.lds_barrier is both a barrier (all workitems in a workgroup must reach the barrier before any of them may proceed past it) and a wait for all operations that affect the Local Data Store (LDS) issued from that workgroup to complete before the workgroup may continue. Since the LDS is per-workgroup memory, this barrier may be used, for example, to ensure all workitems have written data to LDS before any workitem attempts to read from it.

Note that lds_barrier does not force reads from or writes to global memory to complete before execution continues. Therefore, it should be used when operations on global memory can be issued far in advance of when their results are used (for example, by writing them to LDS).

WARNING: On architectures that do not support the BackOffBarrier feature (those which implement this barrier by emitting inline assembly), use of this operation will impede the usability of memory watches (including breakpoints set on variables) when debugging.

`amdgpu.make_dma_base` (amdgpu::MakeDmaBaseOp) ¶

Pair of base addresses used when moving tiles between LDS and global memory.

Syntax:

operation ::= `amdgpu.make_dma_base` $global `[` $global_indices `]` `,` $lds `[` $lds_indices `]` attr-dict `:` type($global) `,` type($lds) `->` type(results)

This operation creates a pair of addresses that will be used by tensor_load_to_lds and tensor_store_from_lds.

The global and LDS indices must be non-negative and in-bounds for the corresponding dimensions of their memrefs.

This operation creates a value corresponding to the tensor descriptor (D#) group 0 found in TensorLoadToLDSOp and TensorStoreFromLDSOp in the rocdl dialect.

For example:

  %base = amdgpu.make_dma_base %global[%idx0, %idx1], %lds[%idx2, %idx3] : memref<64x64xi32>, memref<64x64xi32, #gpu.address_space<workgroup>> -> !amdgpu.tdm_base<i32>
  %descriptor = amdgpu.make_dma_descriptor %base globalSize [2, 2] globalStride [2, 1] sharedSize [2, 2] : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
  amdgpu.tensor_load_to_lds %descriptor : !amdgpu.tdm_descriptor

  // pseudo-code
  %global_base = llvm.extractvalue %global_memref[1]
  %global_address = llvm.get_element_ptr ...

  %lds_base = llvm.extractvalue %lds_memref[1]
  %lds_address = llvm.get_element_ptr ...

  // Definition of %base
  %undef = llvm.mlir.undef : vector<4xi32>
  %v0 = llvm.insertelement %15, %undef[0] : vector<4xi32>
  %v1 = llvm.insertelement %lds_address, %v0[1] : vector<4xi32>
  %v2 = llvm.insertelement %global_address_low, %v1[2] : vector<4xi32>
  %base = llvm.insertelement %global_address_high, %v2[3] : vector<4xi32>

  rocdl.tensor.load.to.lds %base, %dgroup1, %dgroup2, %dgroup3 cachepolicy 0 : vector<4xi32>, vector<8xi32>

These tensor DMA operations were introduced in gfx1250.
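The pseudo-code above builds the base value from four 32-bit words, with the global address split into a low and a high half. A rough Python sketch of that packing follows. It mirrors only the pseudo-code, not the actual lowering; `pack_dma_base` and `word0` are hypothetical names, and `word0` stands in for the first inserted value, whose meaning is not described here.

```python
def pack_dma_base(word0: int, lds_address: int, global_address: int) -> list[int]:
    """Mirror the pseudo-code: four 32-bit words forming the base value.

    word0 is passed through unchanged because this text does not say what
    the first word contains (assumption: it is an opaque 32-bit value).
    """
    mask32 = 0xFFFFFFFF
    global_low = global_address & mask32           # bits 31:0 of the global address
    global_high = (global_address >> 32) & mask32  # bits 63:32 of the global address
    return [word0 & mask32, lds_address & mask32, global_low, global_high]
```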

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands: ¶

Operand	Description
`global`	memref of any non-token type values
`global_indices`	variadic of index
`lds`	memref of any non-token type values
`lds_indices`	variadic of index

Results: ¶

Result	Description
`base`	Pair of base addresses that move data between LDS and global storage.

`amdgpu.make_dma_descriptor` (amdgpu::MakeDmaDescriptorOp) ¶

Make all descriptor groups needed by TensorLoadToLDS/TensorStoreFromLDS.

Syntax:

operation ::= `amdgpu.make_dma_descriptor` $base
              `globalSize` custom<DynamicIndexList>($global_dynamic_sizes, $global_static_sizes)
              `globalStride` custom<DynamicIndexList>($global_dynamic_strides, $global_static_strides)
              `sharedSize` custom<DynamicIndexList>($shared_dynamic_sizes, $shared_static_sizes)
              ( `padShared` `(` $pad_amount^ `every` $pad_interval `)` )?
              ( `workgroupMask` $workgroup_mask^ ( `earlyTimeout` $early_timeout^)?)?
              ( `atomicBarrier` `(` $atomic_barrier_address^ `[` $atomic_barrier_indices `]`
              `:` type($atomic_barrier_address) `)`)?
              ( `iterate` $global_increment^ `,` $lds_increment `,` $iteration_count )?
              attr-dict `:` qualified(type($base)) `->` type(results)

Make all descriptor groups needed by tensor memory operations.

The $base operand holds the pair of base addresses: one must be an address in LDS and the other must be a global memory location.

$global_{static/dynamic}_sizes determine the size of the tensor. $global_{static/dynamic}_strides determine the strides of the tensor. $shared_{static/dynamic}_sizes determine the size of the tile.

$workgroup_mask broadcasts the load to workgroups inside of a workgroup cluster (0 = do not broadcast the result to the workgroup, 1 = broadcast the result to the workgroup). It is ignored for stores. An all-zeros mask is interpreted as a non-broadcast load.

$early_timeout returns data to requesters as soon as the cache supplies it.

Padding can be applied to the LDS address when copying from memory to LDS, but not when copying from LDS to memory. The values in the padded target addresses remain the same as before the operation was applied. $pad_interval must be a power of two contained in [2, 256]. $pad_amount must be a value contained in [1, 128].
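The two padding ranges can be checked ahead of time when the values are known on the host. The following Python sketch encodes only the constraints stated above; `check_pad_shared` is a hypothetical helper, and since both values are SSA operands of the operation, such a check only applies where they are compile-time constants.

```python
def check_pad_shared(pad_amount: int, pad_interval: int) -> None:
    """Raise ValueError if the padding parameters violate the stated ranges."""
    # A positive integer is a power of two iff it has exactly one bit set.
    is_power_of_two = pad_interval > 0 and (pad_interval & (pad_interval - 1)) == 0
    if not (is_power_of_two and 2 <= pad_interval <= 256):
        raise ValueError(
            f"pad_interval must be a power of two in [2, 256], got {pad_interval}"
        )
    if not 1 <= pad_amount <= 128:
        raise ValueError(f"pad_amount must be in [1, 128], got {pad_amount}")
```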

If an atomic barrier is provided, it will be arrived at once after each load/store using this descriptor is completed. Its indices must be non-negative and in-bounds for the corresponding dimensions of the barrier memref.

2D and 3D tensors may be iterated over by setting $global_increment, $lds_increment, and $iteration_count. $global_increment determines how much to increment the starting global memory address per iteration in units of the $base’s element type. $lds_increment determines how much to increment the starting LDS address per iteration in units of the $base’s element type. $iteration_count determines how many times to iterate; it must be a value in the inclusive interval [1, 256].
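As a rough illustration of the iteration parameters, the Python sketch below lists the byte offsets at which each iteration would start. It assumes that iteration i begins i increments past the starting address and that element units convert to bytes by multiplying by the element size; `iteration_offsets` is a hypothetical name, not part of the dialect.

```python
def iteration_offsets(
    global_increment: int,
    lds_increment: int,
    iteration_count: int,
    element_bytes: int,
) -> list[tuple[int, int]]:
    """Return (global byte offset, LDS byte offset) for each iteration.

    Assumption: iteration i starts i * increment elements past the base address.
    """
    if not 1 <= iteration_count <= 256:
        raise ValueError(f"iteration_count must be in [1, 256], got {iteration_count}")
    return [
        (i * global_increment * element_bytes, i * lds_increment * element_bytes)
        for i in range(iteration_count)
    ]
```

For example, with i32 elements (4 bytes), `iteration_offsets(64, 32, 3, 4)` yields `[(0, 0), (256, 128), (512, 256)]`.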

 // Example of moving a two-dimensional tensor to LDS.
 %base = amdgpu.make_dma_base %global[0, 0], %lds[0, 0] : memref<64x64xi32>, memref<64x64xi32, #gpu.address_space<workgroup>> -> !amdgpu.tdm_base<i32>
 %descriptor = amdgpu.make_dma_descriptor %base globalSize [64, 64] globalStride [64, 1] sharedSize [64, 64] : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
 amdgpu.tensor_load_to_lds %descriptor : !amdgpu.tdm_descriptor

 // Example of moving a two-dimensional tensor to LDS where padding is applied after every integer.
 %base = amdgpu.make_dma_base %global[0, 0], %lds[0, 0] : memref<32x32xi32>, memref<64x64xi32, #gpu.address_space<workgroup>> -> !amdgpu.tdm_base<i32>
 %descriptor = amdgpu.make_dma_descriptor %base globalSize [32, 32] globalStride [32, 1] sharedSize [64, 64] padShared(%pad_amount every %pad_interval) : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
 amdgpu.tensor_load_to_lds %descriptor : !amdgpu.tdm_descriptor

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`global_static_sizes`	::mlir::DenseI64ArrayAttr	i64 dense array attribute
`global_static_strides`	::mlir::DenseI64ArrayAttr	i64 dense array attribute
`shared_static_sizes`	::mlir::DenseI64ArrayAttr	i64 dense array attribute

Operands: ¶

Operand	Description
`base`	Pair of base addresses that move data between LDS and global storage.
`global_dynamic_sizes`	variadic of index
`global_dynamic_strides`	variadic of index
`shared_dynamic_sizes`	variadic of index
`workgroup_mask`	fixed-length vector of 1-bit signless integer values of length 16
`early_timeout`	1-bit signless integer
`pad_amount`	32-bit signless integer
`pad_interval`	32-bit signless integer
`atomic_barrier_address`	memref of in-LDS barrier state values
`atomic_barrier_indices`	variadic of index
`global_increment`	index
`lds_increment`	32-bit signless integer
`iteration_count`	index

Results: ¶

Result	Description
`desc`	Descriptors used in tensor store/load operations.

`amdgpu.make_gather_dma_base` (amdgpu::MakeGatherDmaBaseOp) ¶

Pair of base addresses used when moving tiles between LDS and global memory.

Syntax:

operation ::= `amdgpu.make_gather_dma_base` $global `[` $global_indices `]` `,` $lds `[` $lds_indices `]` attr-dict `:` type($global) `,` type($lds) `->` type(results)

This operation creates a pair of addresses that will be used by tensor_load_to_lds and tensor_store_from_lds.

The global and LDS indices must be non-negative and in-bounds for the corresponding dimensions of their memrefs.

This operation creates a value corresponding to the tensor descriptor (D#) group 0 found in TensorLoadToLDSOp and TensorStoreFromLDSOp in the rocdl dialect.

Unlike make_dma_base, this operation returns !amdgpu.tdm_gather_base<$element_type, $index_type> which is only compatible with make_gather_dma_descriptor. Using the descriptor returned by make_gather_dma_descriptor will set the tensor_load_to_lds and tensor_store_from_lds to gather mode.

  %base = amdgpu.make_gather_dma_base %global[%idx0, %idx1], %lds[%idx2, %idx3] : memref<64x64xi32>, memref<64x64xi32, #gpu.address_space<workgroup>> -> !amdgpu.tdm_gather_base<i32, i16>
  // %indices : i16
  %descriptor = amdgpu.make_gather_dma_descriptor %base[%indices] globalSize [2, 2] globalStride [2, 1] sharedSize [2, 2] : !amdgpu.tdm_gather_base<i32, i16>, i16 -> !amdgpu.tdm_descriptor
  amdgpu.tensor_load_to_lds %descriptor : !amdgpu.tdm_descriptor

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands: ¶

Operand	Description
`global`	memref of any non-token type values
`global_indices`	variadic of index
`lds`	memref of any non-token type values
`lds_indices`	variadic of index

Results: ¶

Result	Description
`base`	Pair of base addresses that move data between LDS and global storage.

`amdgpu.make_gather_dma_descriptor` (amdgpu::MakeGatherDmaDescriptorOp) ¶

Make all descriptor groups needed by TensorLoadToLDS/TensorStoreFromLDS.

Syntax:

operation ::= `amdgpu.make_gather_dma_descriptor` $base `[` $indices `]`
              `globalSize` custom<DynamicIndexList>($global_dynamic_sizes, $global_static_sizes)
              `globalStride` custom<DynamicIndexList>($global_dynamic_strides, $global_static_strides)
              `sharedSize` custom<DynamicIndexList>($shared_dynamic_sizes, $shared_static_sizes)
              ( `padShared` `(` $pad_amount^ `every` $pad_interval `)` )?
              ( `workgroupMask` $workgroup_mask^ ( `earlyTimeout` $early_timeout^)?)?
              ( `atomicBarrier` `(` $atomic_barrier_address^ `[` $atomic_barrier_indices `]`
              `:` type($atomic_barrier_address) `)`)?
              ( `iterate` $global_increment^ `,` $lds_increment `,` $iteration_count )?
              attr-dict `:` qualified(type($base)) `,` type($indices) `->` type(results)

Make all descriptor groups needed by tensor memory operations in gather mode.

If an atomic barrier is provided, its indices must be non-negative and in-bounds for the corresponding dimensions of the barrier memref.

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`global_static_sizes`	::mlir::DenseI64ArrayAttr	i64 dense array attribute
`global_static_strides`	::mlir::DenseI64ArrayAttr	i64 dense array attribute
`shared_static_sizes`	::mlir::DenseI64ArrayAttr	i64 dense array attribute

Operands: ¶

Operand	Description
`base`	Pair of base addresses that move data between LDS and global storage.
`indices`	vector of 32-bit signless integer values of at least length 1 of at most length 8 or vector of 16-bit signless integer values of at least length 1 of at most length 16
`global_dynamic_sizes`	variadic of index
`global_dynamic_strides`	variadic of index
`shared_dynamic_sizes`	variadic of index
`workgroup_mask`	fixed-length vector of 1-bit signless integer values of length 16
`early_timeout`	1-bit signless integer
`pad_amount`	32-bit signless integer
`pad_interval`	32-bit signless integer
`atomic_barrier_address`	memref of in-LDS barrier state values
`atomic_barrier_indices`	variadic of index
`global_increment`	index
`lds_increment`	32-bit signless integer
`iteration_count`	index

Results: ¶

Result	Description
`desc`	Descriptors used in tensor store/load operations.

`amdgpu.memory_counter_wait` (amdgpu::MemoryCounterWaitOp) ¶

Wait for specified hardware counters

Syntax:

operation ::= `amdgpu.memory_counter_wait` oilist( `load` `(` $load `)` | `store` `(` $store `)` | `ds` `(` $ds `)` | `exp` `(` $exp `)` | `tensor` `(` $tensor `)` ) attr-dict

Wait for the specified counters to be less than or equal to the provided values before continuing.

Counters can lower to different instructions on different architectures, which may include clamping to some hardware-supported maximum value or combining multiple counters into one.
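The wait condition and the clamping step can be modelled in a few lines of Python. This is only a sketch of the semantics described above: `may_proceed` and `clamp_counters` are hypothetical names, and the per-counter maxima must be supplied by the caller because this text does not list them.

```python
def may_proceed(outstanding: dict[str, int], limits: dict[str, int]) -> bool:
    """True when every counter named in limits is <= its requested value."""
    return all(outstanding[name] <= limit for name, limit in limits.items())


def clamp_counters(requested: dict[str, int], hw_max: dict[str, int]) -> dict[str, int]:
    """Clamp each requested wait value to the maximum the target can encode.

    hw_max is an assumption supplied by the caller; real limits depend on the chip.
    """
    return {name: min(value, hw_max[name]) for name, value in requested.items()}


# Made-up limits, for illustration only.
example = clamp_counters({"load": 100, "ds": 0}, {"load": 63, "ds": 15})
```

Counters that are not mentioned in `limits` are not waited on, matching the optional `load`, `store`, `ds`, `exp`, and `tensor` clauses of the syntax.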

Attributes: ¶

Attribute	MLIR Type	Description
`load`	::mlir::IntegerAttr	32-bit signless integer attribute
`store`	::mlir::IntegerAttr	32-bit signless integer attribute
`ds`	::mlir::IntegerAttr	32-bit signless integer attribute
`exp`	::mlir::IntegerAttr	32-bit signless integer attribute
`tensor`	::mlir::IntegerAttr	32-bit signless integer attribute

`amdgpu.mfma` (amdgpu::MFMAOp) ¶

MLIR wrapper for CDNA mfma instructions

Syntax:

operation ::= `amdgpu.mfma` custom<MNKDimensionList>($m, $n, $k) $sourceA `*` $sourceB `+` $destC
              attr-dict
              `blgp` `=` $blgp
              `:` type($sourceA) `,` type($sourceB) `,` type($destC)

The amdgpu.mfma op is an MLIR wrapper around intrinsics for various mfma instructions in the CDNA architecture, which perform multiple outer products in order to allow fast matrix multiplication.

The wrapper will select an appropriate mfma instruction, if one is available, based on the provided m, n, k, and blocks attributes, along with the types of the source and destination arguments.

For information on the layouts of the input and output matrices (which are stored in sourceA, sourceB, destC, and destD), see the CDNA ISA documentation.

The cbsz, abid, and blgp parameters control how the lanes of the wave are permuted when matrix data is being loaded: blgp selects one of a number of fixed permutations, cbsz specifies the log_2 of the number of chunks the lanes holding sourceA are split into, and abid selects one of those chunks.
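The relationship between cbsz and abid can be sketched in Python as follows. The function names are hypothetical, and the range check on abid is inferred from the description (abid selects one of the chunks, assumed to be numbered from zero) rather than stated as a verifier rule.

```python
def mfma_chunk_count(cbsz: int) -> int:
    """cbsz is the log2 of the number of chunks, so the count is 2**cbsz."""
    if cbsz < 0:
        raise ValueError("cbsz must be non-negative")
    return 1 << cbsz


def abid_selects_valid_chunk(cbsz: int, abid: int) -> bool:
    """Assumption: chunks are numbered 0 .. 2**cbsz - 1 and abid picks one of them."""
    return 0 <= abid < mfma_chunk_count(cbsz)
```

With `cbsz = 1` there are two chunks, so `abid = 1` selects the second one under this numbering.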

Note, this wrapper allows specifying vector<4Kxi8> arguments to MFMA intrinsics that take an integer type of width 4K. For example, one can provide a vector<4xi8> as an argument to an MFMA instruction that logically takes 4 i8s but whose intrinsics are specified to take an i32. In these cases, the bytes in the vector will be concatenated in little-endian order (that is, v[0] will go to arg[7:0], v[1] to arg[15:8] and so on).
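The little-endian concatenation described above can be illustrated with a short Python sketch. `pack_bytes_little_endian` is a hypothetical helper used only to show the bit placement; it is not part of the lowering.

```python
def pack_bytes_little_endian(values: list[int]) -> int:
    """Concatenate 8-bit values so that values[0] lands in bits 7:0,
    values[1] in bits 15:8, and so on."""
    packed = 0
    for position, value in enumerate(values):
        # Mask to 8 bits so negative i8 values use their two's-complement byte.
        packed |= (value & 0xFF) << (8 * position)
    return packed
```

For a `vector<4xi8>` holding `[0x11, 0x22, 0x33, 0x44]`, the packed i32 is `0x44332211`.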

The negateA, negateB, and negateC flags are only supported for double-precision operations on gfx94x.

Example:

  %0 = amdgpu.mfma 16x16x16 %matA * %matB + %matC
    : vector<4xf16>, vector<4xf16>, vector<4xf32>

  %1 = amdgpu.mfma 32x32x1 %matD * %matE + %matF
    { abid = 1 : i32, cbsz = 1 : i32, blocks = 2 : i32 }
    blgp = bcast_second_32 : f32, f32, vector<32xf32>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`m`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {4, 16, 32}
`n`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {4, 16, 32}
`k`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {1, 2, 4, 8, 16, 32, 64, 128}
`blocks`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {1, 2, 4, 16}
`cbsz`	::mlir::IntegerAttr	32-bit signless integer attribute
`abid`	::mlir::IntegerAttr	32-bit signless integer attribute
`blgp`	::mlir::ROCDL::MFMAPermBAttr	permutations of the lanes storing B in an MFMA Enum cases: none (`none`) bcast_first_32 (`bcast_first_32`) bcast_second_32 (`bcast_second_32`) rotate_16_right (`rotate_16_right`) bcast_first_16 (`bcast_first_16`) bcast_second_16 (`bcast_second_16`) bcast_third_16 (`bcast_third_16`) bcast_fourth_16 (`bcast_fourth_16`)
`reducePrecision`	::mlir::UnitAttr	unit attribute
`negateA`	::mlir::UnitAttr	unit attribute
`negateB`	::mlir::UnitAttr	unit attribute
`negateC`	::mlir::UnitAttr	unit attribute

Operands: ¶

Operand	Description
`sourceA`	32-bit float or 64-bit float or 32-bit signless integer or 64-bit signless integer or vector of 32-bit float values of length 2 or vector of 16-bit float values of length 4/8 or vector of bfloat16 type values of length 2/4/8 or vector of 8-bit signless integer values of length 4/8/16 or vector of f8E5M2FNUZ type or f8E4M3FNUZ type values of length 8 or vector of f8E5M2 type or f8E4M3FN type values of length 8/32 or vector of f6E2M3FN type or f6E3M2FN type or f4E2M1FN type values of length 32
`sourceB`	32-bit float or 64-bit float or 32-bit signless integer or 64-bit signless integer or vector of 32-bit float values of length 2 or vector of 16-bit float values of length 4/8 or vector of bfloat16 type values of length 2/4/8 or vector of 8-bit signless integer values of length 4/8/16 or vector of f8E5M2FNUZ type or f8E4M3FNUZ type values of length 8 or vector of f8E5M2 type or f8E4M3FN type values of length 8/32 or vector of f6E2M3FN type or f6E3M2FN type or f4E2M1FN type values of length 32
`destC`	64-bit float or vector of 32-bit float values of length 4/16/32 or vector of 32-bit signless integer values of length 4/16/32 or vector of 64-bit float values of length 4

Results: ¶

Result	Description
`destD`	64-bit float or vector of 32-bit float values of length 4/16/32 or vector of 32-bit signless integer values of length 4/16/32 or vector of 64-bit float values of length 4

`amdgpu.packed_scaled_trunc` (amdgpu::PackedScaledTruncOp) ¶

Round two floats into a packed vector of floats

Syntax:

operation ::= `amdgpu.packed_scaled_trunc` attr-dict $source `into` ($existing^):(`undef`)? `[` $index `]`
              `,` $scale
              `:` type($source) `to` type($res) (`into` type($existing)^)?

Scale and round the elements of source into the elements of the returned vector selected by index, keeping the other elements of existing unchanged if present (or undefined if it was not passed in).

The reason for this odd signature is that AMD GPUs cannot easily work with sub-registers, and so the conversion intrinsics take 32-bit wide packed vectors of float values.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`index`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 7

Operands: ¶

Operand	Description
`source`	vector of 32-bit float or 16-bit float or bfloat16 type values of length 1/2
`scale`	32-bit float
`existing`	fixed-length vector of f8E5M2 type or f8E4M3FN type values of length 4 or fixed-length vector of f4E2M1FN type values of length 8

Results: ¶

Result	Description
`res`	fixed-length vector of f8E5M2 type or f8E4M3FN type values of length 4 or fixed-length vector of f4E2M1FN type values of length 8

`amdgpu.packed_stoch_round_fp8` (amdgpu::PackedStochRoundFp8Op) ¶

Round float stochastically into a packed vector of 8-bit floats

Syntax:

operation ::= `amdgpu.packed_stoch_round_fp8` attr-dict $source `+` $stochiasticParam
              `into` ($existing^):(`undef`)? `[` $storeIndex `]`
              `:` type($source) `to` type($res) (`into` type($existing)^)?

Round the input source, adding in stochiasticParam, and place it into the element of res at index storeIndex.

If existing is passed in, elements of res other than the one at storeIndex are copied from existing.

The reason for this odd signature is that AMD GPUs cannot easily work with sub-registers, and so the conversion intrinsics (which are currently the only way to work with 8-bit float types) take packed vectors of 4 8-bit values.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`storeIndex`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 3

Operands: ¶

Operand	Description
`source`	32-bit float
`stochiasticParam`	32-bit signless integer
`existing`	fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4

Results: ¶

Result	Description
`res`	fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4

`amdgpu.packed_trunc_2xfp8` (amdgpu::PackedTrunc2xFp8Op) ¶

Round two floats into a packed vector of 8-bit floats

Syntax:

operation ::= `amdgpu.packed_trunc_2xfp8` attr-dict $sourceA `,` ($sourceB^):(`undef`)?
              `into` ($existing^):(`undef`)? `[` `word` $wordIndex `]`
              `:` type($sourceA) `to` type($res) (`into` type($existing)^)?

Round the inputs sourceA and sourceB (which is undefined if not specified) into the low or high word (bottom two or top two) elements of the returned vector, keeping the other two elements of existing unchanged if present (or undefined if it was not passed in).
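The placement of the two rounded values can be modelled with a plain Python list standing in for the 4-element result. The sketch below models only where the values go, not the fp8 rounding itself. `place_word` is a hypothetical name, `None` stands for an undefined element, and placing sourceA in the lower element of the word is an assumption.

```python
def place_word(rounded_a, rounded_b, word_index: int, existing=None) -> list:
    """Write two already-rounded values into the low (0) or high (1) word.

    Elements outside the selected word are copied from existing, or left as
    None (modelling "undefined") when existing is not provided.
    """
    if word_index not in (0, 1):
        raise ValueError("word_index must be 0 or 1")
    if existing is not None and len(existing) != 4:
        raise ValueError("existing must have exactly 4 elements")
    result = list(existing) if existing is not None else [None] * 4
    result[2 * word_index] = rounded_a      # assumption: sourceA takes the lower slot
    result[2 * word_index + 1] = rounded_b  # assumption: sourceB takes the upper slot
    return result
```

Calling the operation twice, once per word and passing the first result as `existing`, fills all four elements.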

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`wordIndex`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 1

Operands: ¶

Operand	Description
`sourceA`	32-bit float
`sourceB`	32-bit float
`existing`	fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4

Results: ¶

Result	Description
`res`	fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4

`amdgpu.permlane_swap` (amdgpu::PermlaneSwapOp) ¶

AMDGPU permlane swap op

Syntax:

operation ::= `amdgpu.permlane_swap` $src $row_length attr-dict `:` type($result)

High-level wrapper on rocdl.permlane{16,32}.swap variants for permutations on rows of lanes in a subgroup.

Supports arbitrary int/float/vector types, which will be repacked to i32 and one or more rocdl.permlane_swap ops during lowering. Supported lane permutations:

Swap the data between odd and even rows of 16 lanes
Swap the data between the first 32 lanes and the last 32 lanes

Example:

%0 = amdgpu.permlane_swap %src 16 : f16
%1 = amdgpu.permlane_swap %src 32 { fetch_inactive = true, bound_ctrl = true } : f16

Operands:

$src: Vector register to permute across lanes of the subgroup.
$row_length: The length of a row to permute in number of lanes (valid values are 16 and 32).
$fetch_inactive: Optional. Used to determine the behavior of a fetch from a disabled lane. fetch_inactive = false: if the source lane is disabled, use bound_ctrl to determine the source value. fetch_inactive = true: if the source lane is disabled, fetch the source value anyway (ignoring bound_ctrl).
$bound_ctrl: Optional. Used to determine what a thread should do if its source operand is from a disabled lane: use the value zero, or disable the write. bound_ctrl = false: do not write when the source is from a disabled lane. bound_ctrl = true: use zero as input if the source is from a disabled lane.

Note: Lowering is only supported on gfx950 and up.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`row_length`	::mlir::IntegerAttr	32-bit signless integer attribute
`fetch_inactive`	::mlir::BoolAttr	bool attribute
`bound_ctrl`	::mlir::BoolAttr	bool attribute

Operands: ¶

Operand	Description
`src`	Integer or Float or fixed-length vector of Integer or Float values of ranks 1

Results: ¶

Result	Description
`result`	Integer or Float or fixed-length vector of Integer or Float values of ranks 1

`amdgpu.permlane_var` (amdgpu::PermlaneVarOp) ¶

AMDGPU variable-selector permlane op (GFX12+)

Syntax:

operation ::= `amdgpu.permlane_var` $src `,` $selector attr-dict `:` type($result)

High-level wrapper on rocdl.permlane16.var and rocdl.permlanex16.var for per-lane variable-selector permutations within a wave32 subgroup.

Supports arbitrary int/float/vector types, which will be repacked to i32 and one or more ROCDL intrinsic calls during lowering.

cross = false: intra-row permutation (each lane in a 16-lane row reads from a lane in the same row, selected by $selector). Maps to rocdl.permlane16.var.
cross = true: cross-row permutation (each lane reads from the opposite 16-lane row, selected by $selector). Maps to rocdl.permlanex16.var.

$selector is an i32 VGPR providing the per-lane source-lane index.
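The two modes can be modelled on a 32-element list, one entry per lane. This Python sketch follows the prose above and is not a statement about the instruction encoding: `permlane_var` here is a hypothetical model function, the selector is assumed to be reduced modulo 16, and disabled lanes, fetch_inactive, and bound_ctrl are not modelled.

```python
ROW = 16  # lanes per row


def permlane_var(src: list, selector: list[int], cross: bool) -> list:
    """Model of a per-lane variable-selector permutation in a wave32 subgroup."""
    if len(src) != 32 or len(selector) != 32:
        raise ValueError("expected one value and one selector per lane (32 lanes)")
    result = []
    for lane in range(32):
        row_base = (lane // ROW) * ROW  # first lane of this lane's own row
        if cross:
            row_base = ROW - row_base   # switch to the opposite row: 0 <-> 16
        # Assumption: only the selector value modulo 16 picks the source lane.
        result.append(src[row_base + selector[lane] % ROW])
    return result
```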

Example:

%0 = amdgpu.permlane_var %src, %sel { cross = false } : f16
%1 = amdgpu.permlane_var %src, %sel { cross = true } : f32

Note: Lowering is only supported on GFX12+.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`cross`	::mlir::BoolAttr	bool attribute
`fetch_inactive`	::mlir::BoolAttr	bool attribute
`bound_ctrl`	::mlir::BoolAttr	bool attribute

Operands: ¶

Operand	Description
`src`	Integer or Float or fixed-length vector of Integer or Float values of ranks 1
`selector`	32-bit signless integer

Results: ¶

Result	Description
`result`	Integer or Float or fixed-length vector of Integer or Float values of ranks 1

`amdgpu.raw_buffer_atomic_cmpswap` (amdgpu::RawBufferAtomicCmpswapOp) ¶

Raw Buffer Atomic compare-and-swap

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_cmpswap` attr-dict $src `,` $cmp `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_cmpswap op is a wrapper around the buffer-based atomic compare-and-swap available on AMD GPUs.

The index into the buffer is computed as for memref.store with the addition of indexOffset (which is used to aid in emitting vectorized code) and, if present, sgprOffset (which is added after bounds checks and includes any non-zero offset on the memref type).

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.
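A simplified Python sketch of the index arithmetic follows. It assumes the memref layout is given as explicit per-dimension strides in elements and that conversion to bytes is a multiplication by the element size; `raw_buffer_offsets` is a hypothetical helper, and the split into a bounds-checked part and a part added afterwards follows the description of sgprOffset above.

```python
def raw_buffer_offsets(
    indices: list[int],
    strides: list[int],
    element_bytes: int,
    index_offset: int = 0,
    sgpr_offset: int = 0,
) -> tuple[int, int]:
    """Return (bounds-checked byte offset, byte offset added after the check).

    All inputs are in units of elements; only the results are in bytes.
    """
    if len(indices) != len(strides):
        raise ValueError("need one stride per index")
    # Linearize as memref.store would, then add the static indexOffset.
    element_index = sum(i * s for i, s in zip(indices, strides)) + index_offset
    return element_index * element_bytes, sgpr_offset * element_bytes
```

For a row-major `memref<64x64xi32>` (strides `[64, 1]`, 4-byte elements), indices `[2, 3]` with `indexOffset = 1` and `sgprOffset = 8` give `(528, 32)`.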

Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Interfaces: InferTypeOpInterface

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`src`	any non-token type
`cmp`	any non-token type
`memref`	memref of any non-token type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

Results: ¶

Result	Description
`value`	any non-token type

`amdgpu.raw_buffer_atomic_fadd` (amdgpu::RawBufferAtomicFaddOp) ¶

Raw Buffer Floating-point Atomic Add (MI-* only)

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_fadd` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_fadd op is a wrapper around the buffer-based atomic floating point addition available on the MI-* series of AMD GPUs.

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

The op returns the value observed before the atomic update.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Interfaces: InferTypeOpInterface

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`value`	32-bit float or vector of 16-bit float or bfloat16 type values of length 2
`memref`	memref of any non-token type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

Results: ¶

Result	Description
`oldValue`	32-bit float or vector of 16-bit float or bfloat16 type values of length 2

`amdgpu.raw_buffer_atomic_fmax` (amdgpu::RawBufferAtomicFmaxOp) ¶

Raw Buffer Floating-point Atomic Max (non-GFX9)

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_fmax` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_fmax op is a wrapper around the buffer-based atomic floating point max available on AMD GPUs (except GFX9).

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

The op returns the value observed before the atomic update. Out of bounds atomic operations are ignored in hardware.
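The fetch-and-update behavior described above can be illustrated with a toy, single-threaded Python model. This is only a sketch: the function name is hypothetical, NaN handling is not modelled, and the value returned for an out-of-bounds access (`None` here) is an assumption of the sketch, since the text only says that such atomics are ignored.

```python
from typing import List, Optional


def raw_buffer_atomic_fmax(buffer: List[float], index: int,
                           value: float) -> Optional[float]:
    """Toy model of a fetch-and-max on one element of a buffer."""
    if not 0 <= index < len(buffer):
        # Out-of-bounds atomics are ignored: memory is left untouched.
        # Returning None is an assumption of this sketch, not documented
        # behavior of the op.
        return None
    old = buffer[index]
    buffer[index] = max(old, value)
    # The op yields the value observed before the update.
    return old
```

The same shape applies to the other atomics in this family, with `max` replaced by the corresponding update.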

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Interfaces: InferTypeOpInterface

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`value`	32-bit float or 64-bit float
`memref`	memref of any non-token type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

Results: ¶

Result	Description
`oldValue`	32-bit float or 64-bit float

`amdgpu.raw_buffer_atomic_smax` (amdgpu::RawBufferAtomicSmaxOp) ¶

Raw Buffer Signed Integer Atomic Max

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_smax` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_smax op is a wrapper around the buffer-based atomic signed integer max available on AMD GPUs.

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

The op returns the value observed before the atomic update. Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Interfaces: InferTypeOpInterface

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`value`	32-bit signless integer
`memref`	memref of any non-token type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

Results: ¶

Result	Description
`oldValue`	32-bit signless integer

`amdgpu.raw_buffer_atomic_umin` (amdgpu::RawBufferAtomicUminOp) ¶

Raw Buffer Unsigned Integer Atomic Min

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_umin` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_umin op is a wrapper around the buffer-based atomic unsigned integer min available on AMD GPUs.

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

The op returns the value observed before the atomic update. Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Interfaces: InferTypeOpInterface

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`value`	32-bit signless integer
`memref`	memref of any non-token type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

Results: ¶

Result	Description
`oldValue`	32-bit signless integer

`amdgpu.raw_buffer_load` (amdgpu::RawBufferLoadOp) ¶

Raw Buffer load, exposing GCN features

Syntax:

operation ::= `amdgpu.raw_buffer_load` attr-dict $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($memref) (`,` type($indices)^)? `->` type($value)

The amdgpu.raw_buffer_load op is a wrapper around the buffer load intrinsics available on AMD GPUs, including extensions in newer GPUs.

The index into the buffer is computed as for memref.load with the addition of indexOffset and sgprOffset (which may or may not be considered in bounds checks and includes any offset present on the memref type if it’s non-zero).

All indices and offsets are in units of the memref’s data type and are converted to bytes during lowering.
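As a rough illustration of the element-to-byte conversion, the following Python sketch computes a byte offset from element-unit quantities. The function and parameter names are hypothetical, and the sketch deliberately ignores which terms take part in bounds checking.

```python
from typing import Sequence


def buffer_byte_offset(indices: Sequence[int], strides: Sequence[int],
                       memref_offset: int, index_offset: int,
                       sgpr_offset: int, element_bytes: int) -> int:
    """Byte offset of an access whose components are given in elements."""
    # Linearize the indices as memref.load would, in units of elements.
    element_index = sum(i * s for i, s in zip(indices, strides))
    # Add the extra offsets, which are also expressed in elements.
    element_index += memref_offset + index_offset + sgpr_offset
    # Scale to bytes only at the very end.
    return element_index * element_bytes
```

For example, indices `(2, 3)` into a row-major `memref<4x8xf32>` with `indexOffset = 1` give element 20, that is, byte offset 80.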

When a load is out of bounds, the instruction returns zero. Partially out-of-bounds loads have chipset-dependent behavior: whether reading 2 elements starting at index 7 of a memref<8xf32> returns the last element in the first vector component depends on the architecture.

The memref struct is converted into a buffer resource (a V#) and the arguments are translated to intrinsic arguments as follows:

The base address of the buffer is the base address of the memref
The stride is 0 to enable raw mode
The number of records is the size of the memref, in bytes. In the case of dynamically-shaped memrefs, this is computed at runtime as max_d (size(d) * stride(d)) * sizeof(elementType(memref))
The offset enable bit is 1, the index enable bit is 0.
The thread ID addition bit is off
If boundsCheck is false and the target chipset is RDNA, OOB_SELECT is set to 2 to disable bounds checks, otherwise it is 3
The cache coherency bits are off
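Two of the descriptor fields listed above follow simple rules that can be sketched in Python. The helper names are hypothetical, the memref is assumed to have rank 1 or higher, and the sketch does not build a real buffer resource.

```python
from typing import Sequence


def num_records_bytes(sizes: Sequence[int], strides: Sequence[int],
                      element_bytes: int) -> int:
    """max_d(size(d) * stride(d)) * sizeof(element), as in the text above."""
    return max(size * stride for size, stride in zip(sizes, strides)) * element_bytes


def oob_select(bounds_check: bool, is_rdna: bool) -> int:
    """OOB_SELECT is 2 only when checks are disabled on an RDNA target."""
    return 2 if (not bounds_check and is_rdna) else 3
```

For a contiguous `memref<4x8xf32>` (strides 8 and 1), the outermost dimension dominates and the number of records is 4 * 8 * 4 = 128 bytes.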

Traits: AttrSizedOperandSegments

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`memref`	memref of any non-token type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

Results: ¶

Result	Description
`value`	any non-token type

`amdgpu.raw_buffer_store` (amdgpu::RawBufferStoreOp) ¶

Raw Buffer Store, exposing GCN features

Syntax:

operation ::= `amdgpu.raw_buffer_store` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) (`,` type($indices)^)?

The amdgpu.raw_buffer_store op is a wrapper around the buffer store intrinsics available on AMD GPUs, including extensions in newer GPUs.

The store index is computed as in memref.store with the addition of indexOffset (which is included for uniformity with atomics and may be useful when writing vectorized code) and sgprOffset (which is added after bounds checks and implicitly includes the offset of the memref type if non-zero). All index components are in terms of the elements of the memref, not bytes, and are scaled up appropriately.

Out of bounds stores are ignored in hardware. Whether a vector write that includes some in-bounds and some out-of-bounds components is partially completed is chipset-dependent.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`value`	any non-token type
`memref`	memref of any non-token type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

`amdgpu.scaled_ext_packed` (amdgpu::ScaledExtPackedOp) ¶

Extend a vector of packed floating point values

Syntax:

operation ::= `amdgpu.scaled_ext_packed` attr-dict $source `[` $index `]` `,` $scale `:` type($source) `to` type($res)

Extend and scale two packed floats in source[index] to two floats and return them.

If the passed-in vector has fewer than two elements, or the input is scalar, the remaining values in the <2 x i8> will be filled with undefined values as needed.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`index`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 7

Operands: ¶

Operand	Description
`source`	vector of f8E5M2 type or f8E4M3FN type values of length 1/2/3/4 or vector of f4E2M1FN type values of length 1/2/3/4/5/6/7/8
`scale`	32-bit float

Results: ¶

Result	Description
`res`	fixed-length vector of 32-bit float values of length 2 or fixed-length vector of 16-bit float values of length 2 or fixed-length vector of bfloat16 type values of length 2

`amdgpu.scaled_ext_packed_matrix` (amdgpu::ScaledExtPackedMatrixOp) ¶

Extend a wave-wide matrix of packed floating point values

Syntax:

operation ::= `amdgpu.scaled_ext_packed_matrix` attr-dict $source
              `scale` `(` $scale `)`
              `blockSize` `(` $blockSize `)`
              `firstScaleLane` `(` $firstScaleLane`)`
              `firstScaleByte` `(` $firstScaleByte `)`
              `:` type($source) `,` type($scale) `->` type($res)

Extend a matrix of microfloats (8 or 16 elements per lane) using a set of scales that may be stored on other lanes.

The scales applied to the input microfloats are stored in bytes which come from the scales input provided in a half of the wave identified by firstScaleLane. The byte used is selected by firstScaleByte and depends on the type of source. The 16 vectors in consecutive lanes starting from firstScaleLane (which we’ll call the scale vectors) will be used by both halves of the wave (with lane L reading from scale vector number L % 16).

When source is either F4E2M1FN, F6E2M3FN, or F6E3M2FN and the block size is 32, each half of the wave uses a different byte: firstScaleByte can be either 0 or 2, selecting halves of the scale vectors. Lanes 0-15 read from firstScaleByte and lanes 16-31 read from firstScaleByte + 1.

For example:

// Input: 8-element vector of F8E4M3FN, converting to F32
// Lanes 0-15 read from byte 0, lanes 16-31 read from byte 1
%result = amdgpu.scaled_ext_packed_matrix %source scale(%scales)
  blockSize(32) firstScaleLane(0) firstScaleByte(0)
  : vector<8xf8E4M3FN>, vector<4xf8E8M0FNU> -> vector<8xf32>

// Input: 16-element vector of F6E2M3FN, converting to F16
// Lanes 0-15 read from byte 2, lanes 16-31 read from byte 3
%result = amdgpu.scaled_ext_packed_matrix %source scale(%scales)
  blockSize(32) firstScaleLane(16) firstScaleByte(2)
  : vector<16xf6E2M3FN>, vector<4xf8E8M0FNU> -> vector<16xf16>

When source is either F4E2M1FN, F6E2M3FN, or F6E3M2FN and the block size is 16, firstScaleByte can be 0 or 1. Lanes 0-15 read from element firstScaleByte of the scale vectors, while lanes 16-31 read from element firstScaleByte + 2. For example:

// Input: 8-element vector of F8E5M2, converting to BF16
// Lanes 0-15 read from byte 0, lanes 16-31 read from byte 2 (0+2)
%result = amdgpu.scaled_ext_packed_matrix %source scale(%scales)
  blockSize(16) firstScaleLane(0) firstScaleByte(0)
  : vector<8xf8E5M2>, vector<4xf8E8M0FNU> -> vector<8xbf16>

// Input: 16-element vector of F6E3M2FN, converting to F32
// Lanes 0-15 read from byte 1, lanes 16-31 read from byte 3 (1+2)
%result = amdgpu.scaled_ext_packed_matrix %source scale(%scales)
  blockSize(16) firstScaleLane(16) firstScaleByte(1)
  : vector<16xf6E3M2FN>, vector<4xf8E8M0FNU> -> vector<16xf32>

Note: the layout for the scales generally mirrors the layout that the WMMA instructions use for matrix scales. These selection operands allow one to choose portions of the matrix to convert.

When source is either F8E4M3FN or F8E5M2 and blockSize is 32, then the same byte will be used by both halves of the wave. In this case, firstScaleByte can be any value from 0 to 3.

When source is either F8E4M3FN or F8E5M2 and blockSize is 16, only the following combinations are allowed:

firstScaleLane(0), firstScaleByte(0)
firstScaleLane(16), firstScaleByte(2)

All other combinations are reserved.

Available on gfx1250+.
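The selection rules stated in the prose above can be summarized as a small Python checker. This is a sketch under assumptions: the function names are hypothetical, element types are passed as strings, and the per-lane byte is only computed for the F4/F6 cases, where the text spells it out for both block sizes.

```python
FP4_FP6 = {"f4E2M1FN", "f6E2M3FN", "f6E3M2FN"}
FP8 = {"f8E4M3FN", "f8E5M2"}


def scale_vector_lane(lane: int, first_scale_lane: int) -> int:
    """Lane holding the scale vector used by `lane` (lane L uses vector L % 16)."""
    return first_scale_lane + lane % 16


def is_valid_selection(source_elem: str, block_size: int,
                       first_scale_lane: int, first_scale_byte: int) -> bool:
    """Check a (blockSize, firstScaleLane, firstScaleByte) combination."""
    if block_size not in (16, 32) or first_scale_lane not in (0, 16):
        return False
    if source_elem in FP4_FP6:
        allowed = (0, 2) if block_size == 32 else (0, 1)
        return first_scale_byte in allowed
    if source_elem in FP8:
        if block_size == 32:
            return 0 <= first_scale_byte <= 3
        # Block size 16: only two combinations are not reserved.
        return (first_scale_lane, first_scale_byte) in {(0, 0), (16, 2)}
    return False


def fp4_fp6_scale_byte(block_size: int, first_scale_byte: int, lane: int) -> int:
    """Byte read by a lane (0-31) for F4/F6 sources."""
    step = 1 if block_size == 32 else 2
    return first_scale_byte + (step if lane >= 16 else 0)
```

A checker of this kind could back a verifier-style diagnostic, but the authoritative constraints are those enforced by the op itself.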

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`blockSize`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {16, 32}
`firstScaleLane`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {0, 16}
`firstScaleByte`	::mlir::IntegerAttr	32-bit signless integer attribute whose minimum value is 0 whose maximum value is 3

Operands: ¶

Operand	Description
`source`	vector<8xF4E2M1FN> of f4E2M1FN type values or vector<8xF8E4M3FN> of f8E4M3FN type values or vector<8xF8E5M2> of f8E5M2 type values or vector<16xF6E2M3FN> of f6E2M3FN type values or vector<16xF6E3M2FN> of f6E3M2FN type values
`scale`	vector<4xF8E8M0FNU> of f8E8M0FNU type values

Results: ¶

Result	Description
`res`	vector<8xF32> of 32-bit float values or vector<8xF16> of 16-bit float values or vector<8xBF16> of bfloat16 type values or vector<16xF32> of 32-bit float values or vector<16xF16> of 16-bit float values or vector<16xBF16> of bfloat16 type values

`amdgpu.scaled_mfma` (amdgpu::ScaledMFMAOp) ¶

MLIR wrapper for CDNA scaled mfma instructions

Syntax:

operation ::= `amdgpu.scaled_mfma` custom<MNKDimensionList>($m, $n, $k) ` `
              `(` $scalesA `[` $scalesIdxA `]` `*` $sourceA `)` `*`
              `(` $scalesB `[` $scalesIdxB `]` `*` $sourceB `)` `+` $destC
              attr-dict
              `:` type($scalesA) `,` type($sourceA) `,` type($scalesB) `,` type($sourceB) `,` type($destC)

The amdgpu.scaled_mfma op is an MLIR wrapper around intrinsics for various scaled versions of mfma instructions in the CDNA architecture, which perform multiple outer products in order to allow fast matrix multiplication.

Note, this wrapper allows specifying vector<4Kxi8> arguments to MFMA intrinsics that take an integer type of width 4K. For example, one can provide a vector<4xi8> as an argument to an MFMA instruction that logically takes 4 i8s but whose intrinsics are specified to take an i32. In these cases, the bytes in the vector will be concatenated in little-endian order (that is, v[0] will go to arg[7:0], v[1] to arg[15:8] and so on).
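The little-endian concatenation described above can be sketched in Python as follows; `pack_i8_vector` is a hypothetical helper used only for illustration.

```python
from typing import Sequence


def pack_i8_vector(values: Sequence[int]) -> int:
    """Concatenate 8-bit lanes so that v[0] lands in bits [7:0], v[1] in [15:8], ..."""
    packed = 0
    for i, v in enumerate(values):
        # Mask to 8 bits so negative (two's complement) inputs are handled.
        packed |= (v & 0xFF) << (8 * i)
    return packed
```

With this convention, the vector `[0x11, 0x22, 0x33, 0x44]` becomes the 32-bit value `0x44332211`.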

This wrapper takes inspiration from amdgpu.mfma, but has some key differences:

amdgpu.scaled_mfma operates on fp4 (f4E2M1FN), fp6 (f6E2M3FN and f6E3M2FN) and fp8 (f8E4M3FN and f8E5M2) types using either M=N=16, K=128 or M=N=32, K=64 as their tile size.
amdgpu.scaled_mfma does not support broadcasting. So, cbsz, abid, and blgp are omitted from this wrapper.
The negateA, negateB, and negateC flags in amdgpu.mfma are only supported for double-precision operations on gfx94x and so are not included here.

Example:

  %0 = amdgpu.scaled_mfma 32x32x64 (%arg0[0] * %arg1) * (%arg0[1] * %arg1) + %arg2
    : vector<4xf8E8M0FNU>, vector<32xf6E2M3FN>, f8E8M0FNU, vector<32xf6E2M3FN>, vector<16xf32>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`m`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {16, 32}
`n`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {16, 32}
`k`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {64, 128}
`scalesIdxA`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 3
`scalesIdxB`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 3

Operands: ¶

Operand	Description
`sourceA`	vector of f8E5M2 type or f8E4M3FN type values of length 32 or vector of f6E2M3FN type or f6E3M2FN type or f4E2M1FN type values of length 32
`sourceB`	vector of f8E5M2 type or f8E4M3FN type values of length 32 or vector of f6E2M3FN type or f6E3M2FN type or f4E2M1FN type values of length 32
`destC`	vector of 32-bit float values of length 4/16
`scalesA`	f8E8M0FNU type or fixed-length vector of f8E8M0FNU type values of length 4
`scalesB`	f8E8M0FNU type or fixed-length vector of f8E8M0FNU type values of length 4

Results: ¶

Result	Description
`destD`	vector of 32-bit float values of length 4/16

`amdgpu.scaled_wmma` (amdgpu::ScaledWMMAOp) ¶

MLIR wrapper for scaled wmma instructions

Syntax:

operation ::= `amdgpu.scaled_wmma` custom<MNKDimensionList>($m, $n, $k) ` `
              `(` $scaleA `*` $sourceA `)` `*`
              `(` $scaleB `*` $sourceB `)` `+` $destC
              attr-dict
              `:` type($scaleA) `,` type($sourceA) `,` type($scaleB) `,` type($sourceB) `,` type($destC)

The amdgpu.scaled_wmma op is an MLIR wrapper around intrinsics for scaled wmma instructions. These instructions perform matrix multiplication with per-block scaling of inputs, supporting fp4, fp6, and fp8 data formats.

The scale instructions support a block size of 16 or 32 and two tile sizes:

16x16x128 with mixed f8/f6/f4 formats (output: vector<8xf32>)
32x16x128 with f4 format only (output: vector<16xf32>)

Scale parameters (scaleA, scaleB) are small vectors of f8 scale values (either f8E8M0FNU or f8E4M3FN) that are packed into i32/i64 values during lowering. Each lane can operate on 4 bytes (4 scale values), and the number of scales required for each matrix is determined by:

num_scales_A = (M × K) / block_size
num_scales_B = (N × K) / block_size

The index attributes (a_first_scale_lane, b_first_scale_lane) select which lane to start reading scale values from (0 or 16):

For block size 32, 32 lanes across a single wave are used for the scale values. If the number of scales (num_scales_A or num_scales_B) can fit into half of the available lanes (i.e., num_scales / scales_per_lane == 16 (num_lanes)), then first_scale_lane can be either 0 or 16. If all lanes are required for storing the scale values (num_scales / scales_per_lane == 32 (num_lanes)), then first_scale_lane must be 0.
For block size 16, the same rules apply as above except that there are 64 lanes across two waves that are used for the scale values. When num_scales / scales_per_lane == 32 (num lanes), then 16 lanes from each wave are used. first_scale_lane of 0 or 16 will decide which lanes are used for this. When num_scales / scales_per_lane == 64 (num_lanes), then first_scale_lane must be set to 0.
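The lane-count arithmetic in the two rules above can be worked through with a short Python sketch. The function names are hypothetical, and the sketch takes the "4 scale values per lane" figure from the text as a fixed constant, which is an assumption about how the rule generalizes.

```python
from typing import Tuple

SCALES_PER_LANE = 4  # "Each lane can operate on 4 bytes (4 scale values)"


def num_scales(dim: int, k: int, block_size: int) -> int:
    """(M x K) / block_size for matrix A, or (N x K) / block_size for B."""
    return (dim * k) // block_size


def allowed_first_scale_lanes(dim: int, k: int, block_size: int) -> Tuple[int, ...]:
    """Permitted first_scale_lane values for one operand."""
    # 32 lanes (one wave) for block size 32, 64 lanes (two waves) for 16.
    available = 32 if block_size == 32 else 64
    lanes_needed = num_scales(dim, k, block_size) // SCALES_PER_LANE
    if lanes_needed == available // 2:
        return (0, 16)  # scales fit in half of the lanes
    if lanes_needed == available:
        return (0,)     # every lane is needed
    raise ValueError("combination not covered by the rules in the text")
```

For instance, a 16x128 operand with block size 32 needs 64 scales, hence 16 lanes, so either starting lane is acceptable; a 32x128 operand with the same block size needs all 32 lanes.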

Example:

  // 16x16x128: fp8 inputs
  %0 = amdgpu.scaled_wmma 16x16x128 (%scaleVecA * %matA) * (%scaleVecB * %matB) + %matC
    {a_first_scale_lane = 0 : i32, b_first_scale_lane = 0 : i32}
    : vector<4xf8E8M0FNU>, vector<64xf8E4M3FN>,
    vector<4xf8E8M0FNU>, vector<64xf8E4M3FN>, vector<8xf32>

  // 32x16x128: fp4 inputs with different scale lanes
  %1 = amdgpu.scaled_wmma 32x16x128 (%scaleVecD * %matD) * (%scaleVecE * %matE) + %matF
    {a_first_scale_lane = 0 : i32, b_first_scale_lane = 16 : i32}
    : vector<8xf8E4M3FN>, vector<128xf4E2M1FN>,
    vector<8xf8E4M3FN>, vector<64xf4E2M1FN>, vector<16xf32>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`m`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {16, 32}
`n`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {16}
`k`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {128}
`a_first_scale_lane`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {0, 16}
`b_first_scale_lane`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {0, 16}

Operands: ¶

Operand	Description
`sourceA`	vector of f8E5M2 type or f8E4M3FN type values of length 64 or vector of f6E2M3FN type or f6E3M2FN type values of length 64 or vector of f4E2M1FN type values of length 64/128
`sourceB`	vector of f8E5M2 type or f8E4M3FN type values of length 64 or vector of f6E2M3FN type or f6E3M2FN type values of length 64 or vector of f4E2M1FN type values of length 64/128
`destC`	vector of 32-bit float values of length 8/16
`scaleA`	vector of f8E8M0FNU type or f8E4M3FN type values of length 4/8
`scaleB`	vector of f8E8M0FNU type or f8E4M3FN type values of length 4/8

Results: ¶

Result	Description
`destD`	vector of 32-bit float values of length 8/16

`amdgpu.sched_barrier` (amdgpu::SchedBarrierOp) ¶

Barrier that limits instruction movement by the backend scheduler

Syntax:

operation ::= `amdgpu.sched_barrier` `allow` `=` $opts attr-dict

amdgpu.sched_barrier serves as a barrier that can be configured to restrict the movement of instructions across it, as defined by the ROCDL scheduling group mask enum.

Attributes: ¶

Attribute	MLIR Type	Description
`opts`	::mlir::ROCDL::SchedGroupMaskAttr	instruction type mask for scheduling barriers Enum cases: none (`none`) non_mem_non_sideeffect (`non_mem_non_sideeffect`) valu (`valu`) salu (`salu`) mfma_wmma (`mfma_wmma`) all_vmem (`all_vmem`) vmem_read (`vmem_read`) vmem_write (`vmem_write`) all_ds (`all_ds`) ds_read (`ds_read`) ds_write (`ds_write`) transcendental (`transcendental`) ldsdma (`ldsdma`) all (`all`)

`amdgpu.sparse_mfma` (amdgpu::SparseMFMAOp) ¶

MLIR wrapper for CDNA sparse mfma (smfmac) instructions

Syntax:

operation ::= `amdgpu.sparse_mfma` custom<MNKDimensionList>($m, $n, $k) $sourceA `*` $sourceB `+` $destC
              `sparse` `(` $sparseIdx `:` type($sparseIdx) `)`
              attr-dict
              `:` type($sourceA) `,` type($sourceB) `,` type($destC)

The amdgpu.sparse_mfma op is an MLIR wrapper around intrinsics for various smfmac instructions in the AMDGPU architecture, which perform matrix multiply-accumulate operations using 2:4 structured sparsity on matrix A with dense matrices B, C, and D.

On gfx942, smfmac intrinsics support:

M=N=16, K=32 and M=N=32, K=16 for f16 and bf16 sources
M=N=16, K=64 and M=N=32, K=32 for i8 and fp8 sources

On gfx950, smfmac intrinsics additionally support:

M=N=16, K=64 and M=N=32, K=32 for f16 and bf16 sources
M=N=16, K=128 and M=N=32, K=64 for i8 and fp8 sources

The sparseIdx parameter contains packed 2-bit indices identifying which of every 4 dense-K positions are non-zero in the 2:4 sparse matrix A. The required sparseIdx type depends on the variant:

gfx942 16-bit ((m,k) in {(16,32), (32,16)}): 8 bits per lane, carried as vector<4xi8> (one 8-bit set per i8 element).
gfx942 8-bit ((m,k) in {(16,64), (32,32)}) and gfx950 16-bit ((m,k) in {(16,64), (32,32)}): 16 bits per lane, carried as vector<2xi16> (one 16-bit set per i16 element).
gfx950 8-bit ((m,k) in {(16,128), (32,64)}): 32 bits per lane (a full VGPR with no internal set structure), carried as i32.
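To make the idea of packed 2-bit indices concrete, the following Python sketch compresses a 2:4-sparse row into its kept values and their positions. The helper names are hypothetical, and the bit order used when packing (first index in the lowest bits) is an assumption of the sketch; consult the ISA documentation for the layout the hardware actually expects.

```python
from typing import List, Sequence, Tuple


def compress_2_of_4(row: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Keep two values per group of four and record their positions (0-3)."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    kept, positions = [], []
    for base in range(0, len(row), 4):
        group = row[base:base + 4]
        nonzero = [i for i, v in enumerate(group) if v != 0]
        if len(nonzero) > 2:
            raise ValueError("group is not 2:4 sparse")
        # Pad with unused (zero) positions so exactly two are recorded.
        for i in range(4):
            if len(nonzero) == 2:
                break
            if i not in nonzero:
                nonzero.append(i)
        for i in sorted(nonzero):
            kept.append(group[i])
            positions.append(i)
    return kept, positions


def pack_positions(positions: Sequence[int]) -> int:
    """Pack 2-bit positions into an integer, first position in the low bits."""
    packed = 0
    for n, p in enumerate(positions):
        packed |= (p & 0b11) << (2 * n)
    return packed
```

Eight dense K positions yield four kept values and four 2-bit positions, that is, one 8-bit index set, which matches the 8-bits-per-lane figure for the gfx942 16-bit variants.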

The cbsz and abid parameters select which index set within the VGPR is used:

gfx942 16-bit: cbsz == 0 selects one of four 8-bit sets via abid[1:0] (range [0, 3]); cbsz != 0 selects the first set.
gfx942 8-bit and gfx950 16-bit: cbsz == 0 selects one of two 16-bit sets via abid[0] (range [0, 1]); cbsz != 0 selects the first set.
gfx950 8-bit: hardware ignores both cbsz and abid; both must be 0.
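The two lists above can be condensed into a lookup keyed on (m, k), since each listed shape maps to exactly one index format. The Python sketch below uses hypothetical names; it does not validate abid when cbsz is non-zero, because the text does not state a constraint for that case.

```python
# (m, k) -> (sparseIdx type, largest abid usable when cbsz == 0)
SPARSE_IDX_RULES = {
    (16, 32): ("vector<4xi8>", 3),
    (32, 16): ("vector<4xi8>", 3),
    (16, 64): ("vector<2xi16>", 1),
    (32, 32): ("vector<2xi16>", 1),
    (16, 128): ("i32", 0),
    (32, 64): ("i32", 0),
}


def sparse_idx_type(m: int, k: int) -> str:
    """Type that sparseIdx is expected to have for the given shape."""
    return SPARSE_IDX_RULES[(m, k)][0]


def selected_index_set(m: int, k: int, cbsz: int, abid: int) -> int:
    """Index set within the VGPR selected by cbsz/abid."""
    idx_type, max_abid = SPARSE_IDX_RULES[(m, k)]
    if idx_type == "i32":
        # A full VGPR with no internal sets: both fields must be zero.
        if cbsz != 0 or abid != 0:
            raise ValueError("cbsz and abid must both be 0 for this shape")
        return 0
    if cbsz != 0:
        return 0  # a non-zero cbsz selects the first set
    if not 0 <= abid <= max_abid:
        raise ValueError("abid out of range for this shape")
    return abid
```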

Example:

  %0 = amdgpu.sparse_mfma 16x16x32 %matA * %matB + %matC sparse(%idx : vector<4xi8>)
    : vector<4xf16>, vector<8xf16>, vector<4xf32>

  %1 = amdgpu.sparse_mfma 16x16x64 %matA * %matB + %matC sparse(%idx : vector<2xi16>)
    : vector<8xi8>, vector<16xi8>, vector<4xi32>

  %2 = amdgpu.sparse_mfma 16x16x64 %matA * %matB + %matC sparse(%idx : vector<2xi16>)
    { cbsz = 0 : i32, abid = 1 : i32 }
    : vector<8xf8E4M3FNUZ>, vector<16xf8E4M3FNUZ>, vector<4xf32>

  %3 = amdgpu.sparse_mfma 16x16x128 %matA * %matB + %matC sparse(%idx : i32)
    : vector<16xf8E4M3FN>, vector<32xf8E4M3FN>, vector<4xf32>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`m`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {16, 32}
`n`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {16, 32}
`k`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {16, 32, 64, 128}
`cbsz`	::mlir::IntegerAttr	32-bit signless integer attribute
`abid`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`sourceA`	vector of 16-bit float values of length 4/8 or vector of bfloat16 type values of length 4/8 or vector of 8-bit signless integer values of length 8/16 or vector of f8E4M3FN type or f8E5M2 type values of length 8/16 or vector of f8E4M3FNUZ type or f8E5M2FNUZ type values of length 8/16
`sourceB`	vector of 16-bit float values of length 8/16 or vector of bfloat16 type values of length 8/16 or vector of 8-bit signless integer values of length 16/32 or vector of f8E4M3FN type or f8E5M2 type values of length 16/32 or vector of f8E4M3FNUZ type or f8E5M2FNUZ type values of length 16/32
`destC`	vector of 32-bit float values of length 4/16 or vector of 32-bit signless integer values of length 4/16
`sparseIdx`	fixed-length vector of 8-bit signless integer values of length 4 or fixed-length vector of 16-bit signless integer values of length 2 or 32-bit signless integer

Results: ¶

Result	Description
`destD`	vector of 32-bit float values of length 4/16 or vector of 32-bit signless integer values of length 4/16

`amdgpu.sparse_wmma` (amdgpu::SparseWMMAOp) ¶

MLIR wrapper for gfx12+ sparse wmma instructions

Syntax:

operation ::= `amdgpu.sparse_wmma` custom<MNKDimensionList>($m, $n, $k) $sourceA `*` $sourceB `+` $destC
              `sparse` `(` $sparseIdx `:` type($sparseIdx) `)`
              attr-dict
              `:` type($sourceA) `,` type($sourceB) `,` type($destC)

The amdgpu.sparse_wmma op is an MLIR wrapper around intrinsics for various swmmac instructions in the AMDGPU architecture, which perform matrix multiply-accumulate operations using 2:4 structured sparsity on matrix A with dense matrices B, C, and D.

On gfx12, swmmac intrinsics support:

M=N=16, K=32 and M=N=32, K=16 for f16, bf16, i8 and i4 sources
M=N=16, K=64 for i4 sources

On gfx1250, swmmac intrinsics additionally support:

M=N=16, K=64 for f16 and bf16 sources
M=N=16, K=128 for f16, bf16 and i8 sources

The sparseIdx parameter contains packed indices identifying the positions of non-zero elements in the 2:4 sparse matrix A. For 16-bit source data, use vector<4xi8> (four 8-bit indices). For 8-bit source data, use vector<2xi16> (two 16-bit indices).

The unsignedA and unsignedB flags indicate that the int8 LLVM inputs are unsigned.

The clamp flag is used to saturate the output of type T to numeric_limits<T>::max() in case of overflow.

Example:

  %0 = amdgpu.sparse_wmma 16x16x32 %matA * %matB + %matC sparse(%idx : vector<4xi8>)
    : vector<4xf16>, vector<8xf16>, vector<4xf32>

  %1 = amdgpu.sparse_wmma 16x16x64 %matA * %matB + %matC sparse(%idx : vector<2xi16>)
    : vector<8xi8>, vector<16xi8>, vector<4xi32>

  %2 = amdgpu.sparse_wmma 16x16x64 %matA * %matB + %matC sparse(%idx : vector<2xi16>)
    { unsignedA = 0 : i1, unsignedB = 1 : i1, clamp = 0 : i1 }
    : vector<8xf8E4M3FNUZ>, vector<16xf8E4M3FNUZ>, vector<4xf32>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`m`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {16}
`n`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {16}
`k`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {32, 64, 128}
`unsignedA`	::mlir::UnitAttr	unit attribute
`unsignedB`	::mlir::UnitAttr	unit attribute
`reuseA`	::mlir::UnitAttr	unit attribute
`reuseB`	::mlir::UnitAttr	unit attribute
`clamp`	::mlir::UnitAttr	unit attribute
`wave64`	::mlir::UnitAttr	unit attribute

Operands: ¶

Operand	Description
`sourceA`	vector of 16-bit float values of length 4/8/16 or vector of bfloat16 type values of length 4/8/16 or vector of 8-bit signless integer values of length 4/8/32 or vector of 4-bit signless integer values of length 8/16 or vector of f8E4M3FN type or f8E5M2 type values of length 4/8/16/32 or vector of f8E4M3FNUZ type or f8E5M2FNUZ type values of length 4/8/16/32
`sourceB`	vector of 16-bit float values of length 8/16/32 or vector of bfloat16 type values of length 8/16/32 or vector of 8-bit signless integer values of length 4/8/16/64 or vector of 4-bit signless integer values of length 8/16/32 or vector of f8E4M3FN type or f8E5M2 type values of length 4/8/16/64 or vector of f8E4M3FNUZ type or f8E5M2FNUZ type values of length 4/8/16/64
`destC`	vector of 32-bit float values of length 4/8/16 or vector of 16-bit float values of length 4/8 or vector of bfloat16 type values of length 4/8 or vector of 32-bit signless integer values of length 4/8
`sparseIdx`	fixed-length vector of 8-bit signless integer values of length 4

Results: ¶

Result	Description
`destD`	vector of 32-bit float values of length 4/8/16 or vector of 16-bit float values of length 4/8 or vector of bfloat16 type values of length 4/8 or vector of 32-bit signless integer values of length 4/8

`amdgpu.swizzle_bitmode` (amdgpu::SwizzleBitModeOp) ¶

AMDGPU ds_swizzle op, bitmode variant

Syntax:

operation ::= `amdgpu.swizzle_bitmode` $src $and_mask $or_mask $xor_mask attr-dict `:` type($result)

High-level wrapper around the bitmode rocdl.ds_swizzle op. The masks are represented as separate fields so that the user does not need to do manual bitpacking.

Supports arbitrary int/float/vector types, which will be repacked to i32 and one or more rocdl.ds_swizzle ops during lowering.
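The following Python sketch illustrates what the separate mask fields save the user from writing by hand. It is based on the commonly documented bitmask-mode encoding of ds_swizzle (5-bit and, or, and xor fields applied within groups of 32 lanes); that encoding and the helper names are assumptions of this sketch, not statements taken from this dialect's documentation.

```python
def pack_bitmode_offset(and_mask: int, or_mask: int, xor_mask: int) -> int:
    """Pack the three masks into one swizzle offset (assumed 5 bits each)."""
    for mask in (and_mask, or_mask, xor_mask):
        if not 0 <= mask < 32:
            raise ValueError("each mask must fit in 5 bits")
    return and_mask | (or_mask << 5) | (xor_mask << 10)


def source_lane(lane: int, and_mask: int, or_mask: int, xor_mask: int) -> int:
    """Lane whose value `lane` receives, within its group of 32 lanes."""
    low = (((lane & 0x1F) & and_mask) | or_mask) ^ xor_mask
    return (lane & ~0x1F) | low


def num_i32_words(bit_width: int) -> int:
    """Number of 32-bit pieces (one swizzle each) needed for a value."""
    return (bit_width + 31) // 32
```

For example, `and_mask = 0x1F`, `or_mask = 0`, `xor_mask = 1` swaps adjacent lanes, and a `vector<2xf16>` fits in a single 32-bit word.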

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`and_mask`	::mlir::IntegerAttr	32-bit signless integer attribute
`or_mask`	::mlir::IntegerAttr	32-bit signless integer attribute
`xor_mask`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`src`	Integer or Float or fixed-length vector of Integer or Float values of ranks 1

Results: ¶

Result	Description
`result`	Integer or Float or fixed-length vector of Integer or Float values of ranks 1

`amdgpu.tensor_load_to_lds` (amdgpu::TensorLoadToLDSOp) ¶

Load tensors from global memory to LDS.

Syntax:

operation ::= `amdgpu.tensor_load_to_lds` $desc attr-dict `:` qualified(type($desc))

Load tensors of up to five dimensions from global memory to LDS.

This operation was introduced in gfx1250.

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource, MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}

Operands: ¶

Operand	Description
`desc`	Descriptors used in tensor store/load operations.

`amdgpu.tensor_store_from_lds` (amdgpu::TensorStoreFromLDSOp) ¶

Store tensors from LDS to global memory.

Syntax:

operation ::= `amdgpu.tensor_store_from_lds` $desc attr-dict `:` qualified(type($desc))

Store tensors of up to five dimensions from LDS to global memory.

This operation was introduced in gfx1250.

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource, MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}

Operands: ¶

Operand	Description
`desc`	Descriptors used in tensor store/load operations.

`amdgpu.transpose_load` (amdgpu::TransposeLoadOp) ¶

MLIR wrapper for CDNA transpose Load instructions

Syntax:

operation ::= `amdgpu.transpose_load` $src `[` $srcIndices `]` attr-dict `:` type($src) `->` type($result)

The amdgpu.transpose_load op is a wrapper around the ds_read_tr instructions on gfx9 and the ds_load_tr family of instructions on gfx1250.

The transpose load op represents a subgroup load from LDS memory, where the subgroup of threads collectively reads a matrix from the source memref, with each thread reading a vector of the matrix, and receives the transposed matrix as the result. That is, each thread reads a vector of the column-major matrix at different indices, and the thread’s read result is a vector of the corresponding row of the transposed matrix.

This op is a direct wrapper around the ROCDL ds_read_tr family intrinsics on gfx950 and the ds_load_tr family of instructions on gfx1250. Please refer to the respective ISA documentation for more details about its exact semantics.

Format example:

%0 = amdgpu.transpose_load %src[%srcIndices] : memref<128x256xf16> -> vector<4xf16>

Operands:

$src: LDS memref to read from.
$srcIndices: indices into $src to read from for this thread. Indices must be non-negative and in-bounds for the corresponding dimension of $src, matching the constraints of memref.load.
$result: target register this transpose load instruction will write to.

Note: Lowering is only supported on gfx950 and gfx1250, with different permitted load types.

Traits: SameVariadicOperandSize

Operands: ¶

Operand	Description
`src`	memref of any non-token type values
`srcIndices`	variadic of index

Results: ¶

Result	Description
`result`	vector of any non-token type values

`amdgpu.wmma` (amdgpu::WMMAOp) ¶

MLIR wrapper for wmma instructions

Syntax:

operation ::= `amdgpu.wmma` custom<MNKDimensionList>($m, $n, $k) $sourceA `*` $sourceB `+` $destC
              attr-dict
              `:` type($sourceA) `,` type($sourceB) `,` type($destC)

The amdgpu.wmma op is an MLIR wrapper around intrinsics for various wmma instructions in the AMDGPU architecture, which perform matrix multiplication.

On gfx11/RDNA3, wmma intrinsics have M=N=K=16 dimensions.

On gfx12/RDNA4, wmma intrinsics have M=N=16 dimensions and support K=16 for all element types, and K=32 for i4 sources.

On gfx1250, wmma intrinsics have M=N=16 and K dimensions of 4, 32, 64, or 128, depending on the element types.

On gfx11/RDNA3, when emitting an f16->f16 (or bf16->bf16) wmma, the output is a 16xf16 (or 16xbf16) vector containing only 8 valid values:

If subwordOffset is 0, then the output is stored at indices 0, 2, 4, …, 14.
If subwordOffset is 1, then the output is stored at indices 1, 3, 5, …, 15.

On gfx12/RDNA4 and gfx1250, the result is instead returned as a vector in which all the values are valid, and subwordOffset must be 0 because it is not used on those targets.
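As a small illustration of the gfx11 packing described above, the following Python sketch lists which slots of the 16-element result hold valid values for each subwordOffset. The helper name `valid_output_indices` is hypothetical and exists only for this example; it is not part of MLIR or the lowering.

```python
def valid_output_indices(subword_offset: int, length: int = 16) -> list:
    """Slots of a gfx11 f16/bf16 WMMA result vector that carry valid values."""
    if subword_offset not in (0, 1):
        raise ValueError("subwordOffset must be 0 or 1")
    # Every other slot holds a result; subwordOffset selects even or odd slots.
    return list(range(subword_offset, length, 2))


if __name__ == "__main__":
    print(valid_output_indices(0))
    print(valid_output_indices(1))
```

Either setting yields 8 valid slots out of 16, which matches the "only 8 valid values" statement.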

The unsignedA and unsignedB flags indicate that the corresponding int8 inputs are to be treated as unsigned at the LLVM level.

The clamp flag is used to saturate the output of type T to numeric_limits<T>::max() in case of overflow.

The wave64 attribute indicates whether an op is designed for a 64-thread wavefront.

Example:

  %0 = amdgpu.wmma 16x16x16 %matA * %matB + %matC : vector<8xf16>, vector<8xf16>, vector<8xf16>

  %1 = amdgpu.wmma 16x16x64 %matD * %matE + %matF : vector<32xi8>, vector<32xi8>, vector<8xi32>

  %2 = amdgpu.wmma 16x16x128 %matG * %matH + %matI : vector<64xf4E2M1FN>, vector<64xf4E2M1FN>, vector<8xf32>

  %3 = amdgpu.wmma 16x16x4 %matJ * %matK + %matL : vector<2xf32>, vector<2xf32>, vector<8xf32>
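The operand vector lengths in these examples are consistent with splitting each matrix tile evenly across the lanes of a wave. The Python sketch below is only a sanity check under that assumption: `elements_per_lane` is a hypothetical helper, the wave size of 32 is assumed rather than taken from the op definition, and targets or modes that distribute operands differently are not modelled.

```python
def elements_per_lane(rows: int, cols: int, wave_size: int = 32) -> int:
    """Elements each lane holds if a rows x cols tile is split evenly."""
    total = rows * cols
    if total % wave_size != 0:
        raise ValueError("tile does not divide evenly across the wave")
    return total // wave_size


if __name__ == "__main__":
    # A operand (M x K) of each example, then the shared 16x16 accumulator.
    for m, k in [(16, 16), (16, 64), (16, 128), (16, 4)]:
        print(f"A tile {m}x{k}: {elements_per_lane(m, k)} elements per lane")
    print(f"C tile 16x16: {elements_per_lane(16, 16)} elements per lane")
```

Under these assumptions, the 16x128 tile gives 64 elements per lane and the 16x4 tile gives 2, matching the vector<64xf4E2M1FN> and vector<2xf32> operand types shown above.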

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`m`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {16}
`n`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {16}
`k`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {4, 16, 32, 64, 128}
`subwordOffset`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is one of {0, 1}
`unsignedA`	::mlir::UnitAttr	unit attribute
`unsignedB`	::mlir::UnitAttr	unit attribute
`clamp`	::mlir::UnitAttr	unit attribute

Operands: ¶

Operand	Description
`sourceA`	vector of 32-bit float values of length 2 or vector of 16-bit float or bfloat16 type values of length 4/8/16 or vector of 8-bit signless integer or 8-bit signed integer or 8-bit unsigned integer values of length 4/8/16/32 or vector of f8E4M3FN type or f8E5M2 type values of length 4/8/32/64 or vector of 4-bit signless integer or 4-bit signed integer or 4-bit unsigned integer values of length 4/8/16
`sourceB`	vector of 32-bit float values of length 2 or vector of 16-bit float or bfloat16 type values of length 4/8/16 or vector of 8-bit signless integer or 8-bit signed integer or 8-bit unsigned integer values of length 4/8/16/32 or vector of f8E4M3FN type or f8E5M2 type values of length 4/8/32/64 or vector of 4-bit signless integer or 4-bit signed integer or 4-bit unsigned integer values of length 4/8/16
`destC`	vector of 32-bit float or 32-bit signless integer values of length 4/8 or vector of 16-bit float or bfloat16 type values of length 4/8/16

Results: ¶

Result	Description
`destD`	vector of 32-bit float or 32-bit signless integer values of length 4/8 or vector of 16-bit float or bfloat16 type values of length 4/8/16

Attributes ¶

AddressSpaceAttr ¶

AMDGPU-specific address spaces

Syntax:

#amdgpu.address_space<
  `fat_raw_buffer` | `buffer_rsrc` | `fat_structured_buffer`   # value
>

AMDGPU-specific memory spaces that may not have exact analogues on other GPU targets or backends.

fat_raw_buffer is the memory space used when a memref is stored as a “buffer fat pointer” - that is, a buffer resource (that is set up to use raw byte-level indexing) along with its offset. The AMDGPU backend implements ptr addrspace(7) to represent these fat pointers so that buffer resources (which allow advanced features like bounds checking or cache swizzling) can be used like ordinary LLVM pointers or memrefs. See also the fat_raw_buffer_cast operation.
buffer_rsrc is the memory space for ptr addrspace(8), representing a buffer resource. It should not be used for memrefs, since it does not support indexing.
fat_structured_buffer represents ptr addrspace(9), a buffer resource that carries both an index and offset field, which are used for complex structured indexing that is primarily seen in graphics applications. This is also incompatible with the simple indexing model supported by memref.
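The correspondence between these memory spaces and LLVM address spaces can be summarised in a small lookup table. The Python sketch below is illustrative only: the enum values are copied from the AddressSpace table in the Enums section, while the names `LLVM_ADDRSPACE` and `usable_as_memref_space` are hypothetical and do not exist in MLIR.

```python
from enum import IntEnum


class AddressSpace(IntEnum):
    # Values as listed in the AddressSpace enum table.
    FatRawBuffer = 0
    BufferRsrc = 1
    FatStructuredBuffer = 2


# LLVM pointer address space that each case represents, per the text above.
LLVM_ADDRSPACE = {
    AddressSpace.FatRawBuffer: 7,
    AddressSpace.BufferRsrc: 8,
    AddressSpace.FatStructuredBuffer: 9,
}


def usable_as_memref_space(space: AddressSpace) -> bool:
    """Only fat raw buffers fit the simple indexing model of memref."""
    return space is AddressSpace.FatRawBuffer


if __name__ == "__main__":
    for space in AddressSpace:
        print(space.name, LLVM_ADDRSPACE[space], usable_as_memref_space(space))
```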

Parameters: ¶

Parameter	C++ type	Description
value	`::mlir::amdgpu::AddressSpace`	an enum of type AddressSpace

ScopeAttr ¶

AMDGPU-specific cache scopes. WGP - workgroup processor (CUs); SE - shader engine (GL2); DEV - device; SYS - system

Syntax:

#amdgpu.cache_scope<
  `WGP` | `SE` | `DEV` | `SYS`   # value
>

Enum cases:

WGP (WGP)
SE (SE)
DEV (DEV)
SYS (SYS)

Parameters: ¶

Parameter	C++ type	Description
value	`::mlir::amdgpu::Scope`	an enum of type Scope

DPPPermAttr ¶

The possible permutations for a DPP operation

Syntax:

#amdgpu.dpp_perm<
  `quad_perm` | `row_shl` | `row_shr` | `row_ror` | `wave_shl` | `wave_shr` | `wave_ror` | `wave_rol` | `row_mirror` | `row_half_mirror` | `row_bcast_15` | `row_bcast_31`   # value
>

Enum cases:

quad_perm (quad_perm)
row_shl (row_shl)
row_shr (row_shr)
row_ror (row_ror)
wave_shl (wave_shl)
wave_shr (wave_shr)
wave_ror (wave_ror)
wave_rol (wave_rol)
row_mirror (row_mirror)
row_half_mirror (row_half_mirror)
row_bcast_15 (row_bcast_15)
row_bcast_31 (row_bcast_31)

Parameters: ¶

Parameter	C++ type	Description
value	`::mlir::amdgpu::DPPPerm`	an enum of type DPPPerm

LoadTemporalHintAttr ¶

AMDGPU-specific prefetch temporal hints for load instructions. RT - regular temporal for both near and far caches; NT - non-temporal for both near and far caches; HT - high-priority temporal for both near and far caches; LU - last-use; NT_RT - non-temporal for near cache(s) and regular for far caches; RT_NT - regular for near cache(s) and non-temporal for far caches; NT_HT - non-temporal for near cache(s) and high-priority temporal for far caches

Syntax:

#amdgpu.load_temporal_hint<
  `RT` | `NT` | `HT` | `LU` | `NT_RT` | `RT_NT` | `NT_HT`   # value
>

Enum cases:

RT (RT)
NT (NT)
HT (HT)
LU (LU)
NT_RT (NT_RT)
RT_NT (RT_NT)
NT_HT (NT_HT)

Parameters: ¶

Parameter	C++ type	Description
value	`::mlir::amdgpu::LoadTemporalHint`	an enum of type LoadTemporalHint

Types ¶

DsBarrierStateType ¶

State of an in-LDS barrier.

Syntax: !amdgpu.ds_barrier_state

Type that encodes the state of an in-LDS barrier as used by the atomic barrier instructions introduced on gfx1250.

It consists of a 29-bit count of the number of pending arrivals at the barrier (the pending count) in bits [28:0], a 3-bit phase in bits [31:29], and the 32-bit count to re-initialize the pending count to on phase change (the init count) in bits [63:32].
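To make the bit layout concrete, here is a minimal Python sketch that packs and unpacks a 64-bit barrier state using the field positions given above. The names `BarrierState`, `encode_barrier_state`, and `decode_barrier_state` are hypothetical helpers for illustration and are not part of the dialect.

```python
from typing import NamedTuple


class BarrierState(NamedTuple):
    pending_count: int  # bits [28:0]
    phase: int          # bits [31:29]
    init_count: int     # bits [63:32]


def decode_barrier_state(raw: int) -> BarrierState:
    """Split a 64-bit barrier state word into its three fields."""
    if not 0 <= raw < (1 << 64):
        raise ValueError("barrier state must fit in 64 bits")
    pending = raw & ((1 << 29) - 1)
    phase = (raw >> 29) & 0x7
    init = (raw >> 32) & 0xFFFFFFFF
    return BarrierState(pending, phase, init)


def encode_barrier_state(pending: int, phase: int, init: int) -> int:
    """Pack the three fields back into a 64-bit word."""
    if not 0 <= pending < (1 << 29):
        raise ValueError("pending count must fit in 29 bits")
    if not 0 <= phase < 8:
        raise ValueError("phase must fit in 3 bits")
    if not 0 <= init < (1 << 32):
        raise ValueError("init count must fit in 32 bits")
    return (init << 32) | (phase << 29) | pending


if __name__ == "__main__":
    raw = encode_barrier_state(pending=5, phase=3, init=8)
    print(hex(raw), decode_barrier_state(raw))
```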

When an instruction (either one of the explicit arrival primitives or tensor data movement) arrives at such a barrier, the pending count is decremented. If this decrement would cause the pending count to underflow, the count is instead reset to the init count and the phase is incremented (wrapping back to 0). When the phase is incremented, sleeping waves are woken up so they can check the barrier.

The barrier state resides in LDS, but an old barrier state can be returned from atomic arrival instructions or through atomic loads.

This feature is not available prior to gfx1250.

TDMBaseType ¶

Pair of base addresses that move data between LDS and global storage.

Syntax:

!amdgpu.tdm_base<
  Type   # elementType
>

This type is opaque and represents a struct of two addresses: one in LDS and the other in global memory.

Values of this type are only intended to be used by amdgpu.make_dma_descriptor.

Parameters: ¶

Parameter	C++ type	Description
elementType	`Type`

TDMDescriptorType ¶

Descriptors used in tensor store/load operations.

Syntax: !amdgpu.tdm_descriptor

This type is opaque and corresponds to the two or four descriptor groups used in tensor_load_to_lds or tensor_store_from_lds.

TDMGatherBaseType ¶

Pair of base addresses that move data between LDS and global storage.

Syntax:

!amdgpu.tdm_gather_base<
  Type,   # elementType
  Type   # indexType
>

This type is opaque and represents a struct of two addresses: one in LDS and the other in global memory.

This type is similar to !amdgpu.tdm_base but is intended to be used in gather mode.

Values of this type are only intended to be used by amdgpu.make_gather_dma_descriptor.

Parameters: ¶

Parameter	C++ type	Description
elementType	`Type`
indexType	`Type`

Enums ¶

AddressSpace ¶

AMDGPU-specific address spaces

Cases: ¶

Symbol	Value	String
FatRawBuffer	`0`	fat_raw_buffer
BufferRsrc	`1`	buffer_rsrc
FatStructuredBuffer	`2`	fat_structured_buffer

Scope ¶

AMDGPU-specific cache scopes. WGP - workgroup processor (CUs); SE - shader engine (GL2); DEV - device; SYS - system

Cases: ¶

Symbol	Value	String
WGP	`0`	WGP
SE	`1`	SE
DEV	`2`	DEV
SYS	`3`	SYS

DPPPerm ¶

The possible permutations for a DPP operation

Cases: ¶

Symbol	Value	String
quad_perm	`0`	quad_perm
row_shl	`1`	row_shl
row_shr	`2`	row_shr
row_ror	`3`	row_ror
wave_shl	`4`	wave_shl
wave_shr	`5`	wave_shr
wave_ror	`6`	wave_ror
wave_rol	`7`	wave_rol
row_mirror	`8`	row_mirror
row_half_mirror	`9`	row_half_mirror
row_bcast_15	`10`	row_bcast_15
row_bcast_31	`11`	row_bcast_31

LoadTemporalHint ¶

Cases: ¶

Symbol	Value	String
RT	`0`	RT
NT	`1`	NT
HT	`2`	HT
LU	`3`	LU
NT_RT	`4`	NT_RT
RT_NT	`5`	RT_NT
NT_HT	`6`	NT_HT

BufferOOBMode ¶

ROCDL buffer out-of-bounds mode

Cases: ¶

Symbol	Value	String
Any	`0`	any
Relaxed	`1`	relaxed
Strict	`2`	strict

Gfx12AtomicCachePolicy ¶

Gfx12 atomic cache policy bits

Cases: ¶

Symbol	Value	String
none	`0`	none
keep_return	`1`	return
nt	`2`	nt
cascade	`4`	cascade
scope_se	`8`	scope_se
scope_dev	`16`	scope_dev
scope_sys	`24`	scope_sys
nv	`32`	nv
volatile_op	`2147483648`	volatile

Gfx12CachePolicy ¶

Gfx12 cache policy bits

Cases: ¶

Symbol	Value	String
none	`0`	none
nt	`1`	nt
ht	`2`	ht
lu	`3`	lu
nt_rt	`4`	nt_rt
rt_nt	`5`	rt_nt
nt_ht	`6`	nt_ht
nt_wb	`7`	nt_wb
scope_se	`8`	scope_se
scope_dev	`16`	scope_dev
scope_sys	`24`	scope_sys
nv	`32`	nv
swz	`64`	swz
scal	`2048`	scal
volatile_op	`2147483648`	volatile
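The values in this table read like a packed word rather than a set of independent flags: the temporal hints run from 0 to 7, the scopes are multiples of 8, and the remaining entries are single bits. The Python sketch below decodes a policy value under that reading. The field split (low three bits for the temporal hint, the next two for the scope) is inferred from the table values and is an assumption, and `describe_gfx12_policy` is a hypothetical helper.

```python
# Names and values copied from the Gfx12CachePolicy table.
TEMPORAL_HINTS = {1: "nt", 2: "ht", 3: "lu", 4: "nt_rt",
                  5: "rt_nt", 6: "nt_ht", 7: "nt_wb"}
SCOPES = {8: "scope_se", 16: "scope_dev", 24: "scope_sys"}
FLAGS = {32: "nv", 64: "swz", 2048: "scal", 2147483648: "volatile_op"}


def describe_gfx12_policy(policy: int) -> list:
    """List the table entries that make up a combined policy value."""
    names = []
    hint = policy & 0x7       # assumed temporal-hint field, bits [2:0]
    if hint:
        names.append(TEMPORAL_HINTS[hint])
    scope = policy & 0x18     # assumed scope field, bits [4:3]
    if scope:
        names.append(SCOPES[scope])
    for bit, name in FLAGS.items():
        if policy & bit:
            names.append(name)
    return names or ["none"]


if __name__ == "__main__":
    print(describe_gfx12_policy(4 | 16))   # nt_rt combined with scope_dev
    print(describe_gfx12_policy(0))
```

One consequence of this reading is that two temporal hints cannot simply be OR-ed together: for example, 1 | 2 equals 3, which the table names lu rather than "nt and ht".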

Gfx942CachePolicy ¶

Gfx942 buffer cache policy bits

Cases: ¶

Symbol	Value	String
none	`0`	none
sc0	`1`	sc0
nt	`2`	nt
swz	`8`	swz
sc1	`16`	sc1
volatile_op	`2147483648`	volatile

MFMANegModifier ¶

Negation modifier bitfield for gfx94x double-precision MFMA

Cases: ¶

Symbol	Value	String
none	`0`	none
neg_a	`1`	neg_a
neg_b	`2`	neg_b
neg_c	`4`	neg_c

MFMAPermB ¶

Permutations of the lanes storing B in an MFMA

Cases: ¶

Symbol	Value	String
none	`0`	none
bcast_first_32	`1`	bcast_first_32
bcast_second_32	`2`	bcast_second_32
rotate_16_right	`3`	rotate_16_right
bcast_first_16	`4`	bcast_first_16
bcast_second_16	`5`	bcast_second_16
bcast_third_16	`6`	bcast_third_16
bcast_fourth_16	`7`	bcast_fourth_16

MatrixFormat ¶

Matrix operand formats selected by scaled MFMA/WMMA format fields

Cases: ¶

Symbol	Value	String
fp8_e4m3	`0`	fp8_e4m3
fp8_e5m2	`1`	fp8_e5m2
fp6_e2m3	`2`	fp6_e2m3
fp6_e3m2	`3`	fp6_e3m2
fp4_e2m1	`4`	fp4_e2m1

PreGfx12CachePolicy ¶

Pre-gfx12 buffer cache policy bits

Cases: ¶

Symbol	Value	String
none	`0`	none
glc	`1`	glc
slc	`2`	slc
dlc	`4`	dlc
swz	`8`	swz
scc	`16`	scc
all	`23`	all
volatile_op	`2147483648`	volatile
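One detail in this table is easy to miss: `all` (23) is not the OR of every listed bit. Since 23 = 1 + 2 + 4 + 16, it covers glc, slc, dlc, and scc but leaves out swz and volatile. The Python sketch below makes that arithmetic explicit with `enum.IntFlag`; the class is an illustrative stand-in, not the generated C++ enum.

```python
from enum import IntFlag


class PreGfx12CachePolicy(IntFlag):
    # Single-bit cases copied from the table above.
    none = 0
    glc = 1
    slc = 2
    dlc = 4
    swz = 8
    scc = 16
    volatile_op = 1 << 31


# The combination that reproduces the table's `all` value.
ALL = (PreGfx12CachePolicy.glc | PreGfx12CachePolicy.slc
       | PreGfx12CachePolicy.dlc | PreGfx12CachePolicy.scc)

if __name__ == "__main__":
    print(int(ALL))
    print(bool(ALL & PreGfx12CachePolicy.swz))
```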

SchedGroupMask ¶

Instruction type mask for scheduling barriers

Cases: ¶

Symbol	Value	String
none	`0`	none
non_mem_non_sideeffect	`1`	non_mem_non_sideeffect
valu	`2`	valu
salu	`4`	salu
mfma_wmma	`8`	mfma_wmma
all_vmem	`16`	all_vmem
vmem_read	`32`	vmem_read
vmem_write	`64`	vmem_write
all_ds	`128`	all_ds
ds_read	`256`	ds_read
ds_write	`512`	ds_write
transcendental	`1024`	transcendental
ldsdma	`2048`	ldsdma
all	`4095`	all
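Because every case other than `none` and `all` is a distinct power of two, masks can be built by OR-ing cases together, and `all` is the OR of every individual bit. The Python sketch below checks this against the table; `SCHED_GROUP_BITS` and `combine_sched_groups` are hypothetical names used only for this example.

```python
# Single-bit cases copied from the SchedGroupMask table.
SCHED_GROUP_BITS = {
    "non_mem_non_sideeffect": 1,
    "valu": 2,
    "salu": 4,
    "mfma_wmma": 8,
    "all_vmem": 16,
    "vmem_read": 32,
    "vmem_write": 64,
    "all_ds": 128,
    "ds_read": 256,
    "ds_write": 512,
    "transcendental": 1024,
    "ldsdma": 2048,
}


def combine_sched_groups(*names: str) -> int:
    """OR together the bits for the named instruction groups."""
    mask = 0
    for name in names:
        mask |= SCHED_GROUP_BITS[name]
    return mask


if __name__ == "__main__":
    print(combine_sched_groups("valu", "salu"))
    print(combine_sched_groups(*SCHED_GROUP_BITS))
```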

WMMACModifier ¶

WMMA C operand modifiers

Cases: ¶

Symbol	Value	String
none	`0`	none
neg	`1`	neg
abs	`2`	abs
neg_abs	`3`	neg_abs

WMMAMatrixScale ¶

Matrix scale row selector

Cases: ¶

Symbol	Value	String
row0	`0`	row0
row1	`1`	row1

WMMAMatrixScaleFormat ¶

Matrix scale exponent formats

Cases: ¶

Symbol	Value	String
e8	`0`	e8
e5m3	`1`	e5m3
e4m3	`2`	e4m3
