'amdgpu' Dialect

The AMDGPU dialect provides wrappers around AMD-specific functionality and LLVM intrinsics. These wrappers should be used in conjunction with more generic dialects, such as gpu and vector, when generating LLVM IR that will eventually be executed on AMD hardware.

Operations ¶

source

`amdgpu.dpp` (amdgpu::DPPOp) ¶

AMDGPU DPP operation

Syntax:

operation ::= `amdgpu.dpp` $old $src $kind (`(` $permArgument^ `)`)? attr-dict `:` type($result)

This operation represents DPP functionality in a GPU program. DPP provides the following operations:

Full crossbar in a group of four (quad_perm)
Wavefront shift left by one lane (wave_shl)
Wavefront shift right by one lane (wave_shr)
Wavefront rotate right by one lane (wave_ror)
Wavefront rotate left by one lane (wave_rol)
Row shift left by 1–15 lanes (row_shl)
Row shift right by 1–15 lanes (row_shr)
Row rotate right by 1–15 lanes (row_ror)
Reverse within a row (row_mirror)
Reverse within a half-row (row_half_mirror)
Broadcast the 15th lane of each row to the next row (row_bcast)
Broadcast lane 31 to rows 2 and 3 (row_bcast)

Traits: AlwaysSpeculatableImplTrait, SameTypeOperands

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`kind`	::mlir::amdgpu::DPPPermAttr	The possible permutations for a DPP operation
`permArgument`	::mlir::Attribute	32-bit signless integer attribute or array attribute or unit attribute
`row_mask`	::mlir::IntegerAttr	32-bit signless integer attribute
`bank_mask`	::mlir::IntegerAttr	32-bit signless integer attribute
`bound_ctrl`	::mlir::BoolAttr	bool attribute

Operands: ¶

Operand	Description
`old`	any type
`src`	any type

Results: ¶

Result	Description
`result`	any type

`amdgpu.ext_packed_fp8` (amdgpu::ExtPackedFp8Op) ¶

Extend a fp8 value to a float or a vector of packed fp8 values to two floats

Syntax:

operation ::= `amdgpu.ext_packed_fp8` attr-dict $source `[` $index `]` `:` type($source) `to` type($res)

Extend one or two 8-bit floats in source[index] to a 32-bit float or two floats and return them.

This rather unusual signature arises from the fact that AMD GPUs cannot easily work with sub 32-bit quantities, so the compiler intrinsics for extending 8-bit floats (which are, currently, the only way to work with this operation) take packed vectors of 4 such floats.

If the passed-in vector has fewer than four elements, or the input is scalar, the remaining values in the <4 x i8> will be filled with undefined values as needed.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`index`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 3

Operands: ¶

Operand	Description
`source`	f8E5M2FNUZ type or f8E4M3FNUZ type or f8E5M2 type or f8E4M3FN type or vector of f8E5M2FNUZ type or f8E4M3FNUZ type or f8E5M2 type or f8E4M3FN type values of length 1/2/3/4

Results: ¶

Result	Description
`res`	32-bit float or fixed-length vector of 32-bit float values of length 2

`amdgpu.fat_raw_buffer_cast` (amdgpu::FatRawBufferCastOp) ¶

Create a raw buffer fat pointer that matches memref

Syntax:

operation ::= `amdgpu.fat_raw_buffer_cast` $source oilist (`validBytes` `(` $validBytes `)`
              | `cacheSwizzleStride` `(` $cacheSwizzleStride `)`
              | `boundsCheck` `(` $boundsCheck `)`
              | `resetOffset` $resetOffset )
              attr-dict `:` type($source) `to` type($result)

Wraps the memory pointed to by source as a raw buffer fat pointer, or, in LLVM terms, a ptr addrspace(7), returning a memref that has the same sizes and layout but the #amdgpu.address_space<fat_raw_buffer> address space.

This memref can be used with standard memref operations like memref.load, memref.store, and memref.atomicrmw, which will be lowered to the relevant buffer intrinsics. (vector.masked_load/store will work once there’s backend support for lowering them, and then this document will be updated)

If validBytes is given, it is the number of bytes that will be valid as an offset to out. If it is not provided, this will be inferred from the size of the memref during lowering. This size is max_{d = 0 upto rank(source)} (sizes[d] * strides[d]) * sizeof(element type).

The flags of the buffer descriptor will be set up to enable raw usage - for example, stride = 0, add_tid = 0, and so on. The boundsCheck property determines if bounds checking is enabled or not (on architectures where this can be controlled - that is, on RDNA chips).

If cacheSwizzleStride is provided, L1 cache swizzling will be enabled on architectures that support it. This swizzling, unlike the main swizzling mode (whose usage makes a buffer non-raw) does not affect index calculation, but does affect cache behavior. Mixing access between cache-swizzled raw buffers and other forms of memory access, like ordinary pointer loads or unswizzled buffer pointers can cause incorrect behavior and must be avoided.

This operation preserves the sizes, strides, and offset of the input memref - they’ll be added in by memref.load later. However, if resetOffset is set, that offset will be added to the base pointer. If the value of the memref’s offset is not uniform (independent of the lane/thread ID), this will lead to substantially decreased performance due to the need for a waterfall loop on the base address of the buffer resource.

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), ViewLikeOpInterface

Effects: MemoryEffects::Effect{}

Operands: ¶

Operand	Description
`source`	memref of any type values
`validBytes`	32-bit signless integer
`cacheSwizzleStride`	14-bit signless integer

Results: ¶

Result	Description
`result`	memref of any type values

`amdgpu.gather_to_lds` (amdgpu::GatherToLDSOp) ¶

MLIR wrapper for CDNA Gather to LDS instructions

Syntax:

operation ::= `amdgpu.gather_to_lds` $src `[` $srcIndices `]` `,` $dst `[` $dstIndices `]` attr-dict `:` $transferType `,` type($src) `,` type($dst)

The amdgpu.gather_to_lds op is a wrapper around the global_load_lds instructions.

Operands:

$src: global memory (including fat buffer) memref to read from.
$srcIndices: indices into $src to read from for this thread.
$dst: LDS memory memref to write to.
$dstIndices: base indices into $dst to write to for the subgroup of this thread. The elements gathered by the subgroup will be written contiguously in order of lane ID starting at $dst[$dstIndices]. Byte-sized (ex. i8) or short-sized (ex. i16) types will be zero-padded/extended to 32 bits before being written. 96-bit types (ex. vector<3xf32>) will be zero-padded to 128 bits before being written.
$transferType: type of the data to be transferred by each thread. This is used to determine the size of the data to be transferred and the number of threads in the subgroup. The transfer type must be a scalar type or a vector type with a single element type.

The $dst, along with its indices, points to the memory location the subgroup of this thread will write to.

Note: only supported on gfx9 and gfx10.

Traits: SameVariadicOperandSize

Attributes: ¶

Attribute	MLIR Type	Description
`transferType`	::mlir::TypeAttr	any type attribute

Operands: ¶

Operand	Description
`src`	memref of any type values
`srcIndices`	variadic of index
`dst`	memref of any type values
`dstIndices`	variadic of index

`amdgpu.lds_barrier` (amdgpu::LDSBarrierOp) ¶

Barrier that includes a wait for LDS memory operations.

Syntax:

operation ::= `amdgpu.lds_barrier` attr-dict

amdgpu.lds_barrier is both a barrier (all workitems in a workgroup must reach the barrier before any of them may proceed past it) and a wait for all operations that affect the Local Data Store (LDS) issued from that wrokgroup to complete before the workgroup may continue. Since the LDS is per-workgroup memory, this barrier may be used, for example, to ensure all workitems have written data to LDS before any workitem attempts to read from it.

Note that lds_barrier does not force reads to or from global memory to complete before execution continues. Therefore, it should be used when operations on global memory can be issued far in advance of when their results are used (for example, by writing them to LDS).

WARNING: On architectures that do not support the BackOffBarrier feature, (those which will implement this barrier by emitting inline assembly), use of this operation will impede the usabiliity of memory watches (including breakpoints set on variables) when debugging.

`amdgpu.mfma` (amdgpu::MFMAOp) ¶

MLIR wrapper for CDNA mfma instructions

Syntax:

operation ::= `amdgpu.mfma` $sourceA `*` $sourceB `+` $destC
              attr-dict
              `blgp` `=` $blgp
              `:` type($sourceA) `,` type($sourceB) `,` type($destC)

The amdgpu.mfma op is an MLIR wrapper around intrinsics for various mfma instructions in the CDNA architecture, which perform multiple outer products in order to allow fast matrix multiplication.

The wrapper will select an appropriate mfma instruction, if one is available, based on the provided m, k, n, and nBlks attributes, along with the types of the source and destination arguments.

For information on the layouts of the input and output matrces (which are stored in sourceA, sourceB, destC, and destD), see the CDNA ISA documentation.

The cbsz, abid, and blgp parameters control how the lanes of the wave are permuted when matrix data is being loaded: blgp can be any number of fixed permutations, cbsz specifies the log_2 of the number of chunks the lanes holding sourceA are split into, and abid selects one of those chunks.

Note, this wrapper allows specifying vector<4Kxi8> arguments to MFMA intrinsics that take an integer type of width 4K. For example, one can provide a vector<4xi8> as an argument to an MFMA instruction that logically takes 4 i8s but whose intrinsics are specified to take an i32. In these cases, the bytes in the vector will be concatenated in little-endian order (that is, v[0] will go to arg[7:0], v[1] to arg[15:8] and so on).

The negateA, negateB, and negateC flags are only supported for double-precision operations on gfx94x.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`m`	::mlir::IntegerAttr	32-bit signless integer attribute
`n`	::mlir::IntegerAttr	32-bit signless integer attribute
`k`	::mlir::IntegerAttr	32-bit signless integer attribute
`blocks`	::mlir::IntegerAttr	32-bit signless integer attribute
`cbsz`	::mlir::IntegerAttr	32-bit signless integer attribute
`abid`	::mlir::IntegerAttr	32-bit signless integer attribute
`blgp`	::mlir::amdgpu::MFMAPermBAttr	The possible permutations of the lanes storing B available in an MFMA
`reducePrecision`	::mlir::UnitAttr	unit attribute
`negateA`	::mlir::UnitAttr	unit attribute
`negateB`	::mlir::UnitAttr	unit attribute
`negateC`	::mlir::UnitAttr	unit attribute

Operands: ¶

Operand	Description
`sourceA`	32-bit float or 64-bit float or 32-bit signless integer or 64-bit signless integer or vector of 32-bit float values of length 2 or vector of 16-bit float values of length 4/8 or vector of bfloat16 type values of length 2/4/8 or vector of 8-bit signless integer values of length 4/8/16 or vector of f8E5M2FNUZ type or f8E4M3FNUZ type values of length 8 or vector of f8E5M2 type or f8E4M3FN type values of length 8/32 or vector of f6E2M3FN type or f6E3M2FN type or f4E2M1FN type values of length 32
`sourceB`	32-bit float or 64-bit float or 32-bit signless integer or 64-bit signless integer or vector of 32-bit float values of length 2 or vector of 16-bit float values of length 4/8 or vector of bfloat16 type values of length 2/4/8 or vector of 8-bit signless integer values of length 4/8/16 or vector of f8E5M2FNUZ type or f8E4M3FNUZ type values of length 8 or vector of f8E5M2 type or f8E4M3FN type values of length 8/32 or vector of f6E2M3FN type or f6E3M2FN type or f4E2M1FN type values of length 32
`destC`	64-bit float or vector of 32-bit float values of length 4/16/32 or vector of 32-bit signless integer values of length 4/16/32 or vector of 64-bit float values of length 4

Results: ¶

Result	Description
`destD`	64-bit float or vector of 32-bit float values of length 4/16/32 or vector of 32-bit signless integer values of length 4/16/32 or vector of 64-bit float values of length 4

`amdgpu.packed_scaled_trunc` (amdgpu::PackedScaledTruncOp) ¶

Round two floats into a packed vector of floats

Syntax:

operation ::= `amdgpu.packed_scaled_trunc` attr-dict $source `into` ($existing^):(`undef`)? `[` $index `]`
              `,` $scale
              `:` type($source) `to` type($res) (`into` type($existing)^)?

Scale and round the inputs source (which is undefined if not specified) into the low or high word (bottom two or top two) elements of the returned vector, keeping the other two elements of existing unchanged if present (or undefined if it was not passed in).

The reason for this odd signature is that AMD GPUs cannot easily work with sub-registers, and so the conversion intrinsics take 32-bit wide packed vectors of float values.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`index`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 7

Operands: ¶

Operand	Description
`source`	vector of 32-bit float or 16-bit float or bfloat16 type values of length 1/2
`scale`	32-bit float
`existing`	fixed-length vector of f8E5M2 type or f8E4M3FN type values of length 4 or fixed-length vector of f4E2M1FN type values of length 8

Results: ¶

Result	Description
`res`	fixed-length vector of f8E5M2 type or f8E4M3FN type values of length 4 or fixed-length vector of f4E2M1FN type values of length 8

`amdgpu.packed_stoch_round_fp8` (amdgpu::PackedStochRoundFp8Op) ¶

Round float stochiastically into a packed vector of 8-bit floats

Syntax:

operation ::= `amdgpu.packed_stoch_round_fp8` attr-dict $source `+` $stochiasticParam
              `into` ($existing^):(`undef`)? `[` $storeIndex `]`
              `:` type($source) `to` type($res) (`into` type($existing)^)?

Round the input source, adding in stochiasticParam, and place it into the storeIndexth element of res.

If existing is passed in, elements of res other than the one at storeIndex are copied from existing.

The reason for this odd signature is that AMD GPUs cannot easily work with sub-registers, and so the conversion intrinsics (which are currently the only way to work with 8-bit float types) take packed vectors of 4 8-bit values.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`storeIndex`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 3

Operands: ¶

Operand	Description
`source`	32-bit float
`stochiasticParam`	32-bit signless integer
`existing`	fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4

Results: ¶

Result	Description
`res`	fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4

`amdgpu.packed_trunc_2xfp8` (amdgpu::PackedTrunc2xFp8Op) ¶

Round two floats into a packed vector of 8-bit floats

Syntax:

operation ::= `amdgpu.packed_trunc_2xfp8` attr-dict $sourceA `,` ($sourceB^):(`undef`)?
              `into` ($existing^):(`undef`)? `[` `word` $wordIndex `]`
              `:` type($sourceA) `to` type($res) (`into` type($existing)^)?

Round the inputs sourceA and sourceB (which is undefined if not specified) into the low or high word (bottom two or top two) elements of the returned vector, keeping the other two elements of existing unchanged if present (or undefined if it was not passed in).

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`wordIndex`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 1

Operands: ¶

Operand	Description
`sourceA`	32-bit float
`sourceB`	32-bit float
`existing`	fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4

Results: ¶

Result	Description
`res`	fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4

`amdgpu.raw_buffer_atomic_cmpswap` (amdgpu::RawBufferAtomicCmpswapOp) ¶

Raw Buffer Atomic compare-and-swap

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_cmpswap` attr-dict $src `,` $cmp `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_cmpswap op is a wrapper around the buffer-based atomic compare-and-swap min available on AMD GPUs.

The index into the buffer is computed as for memref.store with the addition of indexOffset (which is used to aid in emitting vectorized code) and, if present sgprOffset (which is added after bounds checks and includes any non-zero offset on the memref type).

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Interfaces: InferTypeOpInterface

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`src`	any type
`cmp`	any type
`memref`	memref of any type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

Results: ¶

Result	Description
`value`	any type

`amdgpu.raw_buffer_atomic_fadd` (amdgpu::RawBufferAtomicFaddOp) ¶

Raw Buffer Floating-point Atomic Add (MI-* only)

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_fadd` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_fadd op is a wrapper around the buffer-based atomic floating point addition available on the MI-* series of AMD GPUs.

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`value`	32-bit float or vector of 16-bit float or bfloat16 type values of length 2
`memref`	memref of any type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

`amdgpu.raw_buffer_atomic_fmax` (amdgpu::RawBufferAtomicFmaxOp) ¶

Raw Buffer Floating-point Atomic Max (non-GFX9)

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_fmax` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_fmax op is a wrapper around the buffer-based atomic floating point max available on AMD GPUs (except GFX9).

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`value`	32-bit float or 64-bit float
`memref`	memref of any type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

`amdgpu.raw_buffer_atomic_smax` (amdgpu::RawBufferAtomicSmaxOp) ¶

Raw Buffer Signed Integer Atomic Max

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_smax` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_smax op is a wrapper around the buffer-based atomic signed integer max available on AMD GPUs.

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`value`	32-bit signless integer
`memref`	memref of any type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

`amdgpu.raw_buffer_atomic_umin` (amdgpu::RawBufferAtomicUminOp) ¶

Raw Buffer Unsigned Integer Atomic Min

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_umin` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_umin op is a wrapper around the buffer-based atomic signed integer min available on AMD GPUs.

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`value`	32-bit signless integer
`memref`	memref of any type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

`amdgpu.raw_buffer_load` (amdgpu::RawBufferLoadOp) ¶

Raw Buffer load, exposing GCN features

Syntax:

operation ::= `amdgpu.raw_buffer_load` attr-dict $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($memref) (`,` type($indices)^)? `->` type($value)

The amdgpu.raw_buffer_load op is a wrapper around the buffer load intrinsics available on AMD GPUs, including extensions in newer GPUs.

The index into the buffer is computed as for memref.load with the additon of indexOffset and sgprOffset (which may or may not be considered in bounds checks and includes any offset present on the memref type if it’s non-zero).

All indices and offsets are in units of the memref’s data type and are converted to bytes during lowering.

When a load is out of bounds, the instruction returns zero. Partially-out of bounds have chipset-dependent behavior: whether reading 2 elements starting at index 7 of a memref<8xf32> returns the last element in the first vector component depends on the architecture.

The memref struct is converted into a buffer resource (a V#) and the arguments are translated to intrinsic arguments as follows:

The base address of the buffer is the base address of the memref
The stride is 0 to enable raw mode
The number of records is the size of the memref, in bytes In the case of dynamically-shaped memrefs, this is computed at runtime as max_d (size(d) * stride(d)) * sizeof(elementType(memref))
The offset enable bit is 1, the index enable bit is 0.
The thread ID addition bit is off
If boundsCheck is false and the target chipset is RDNA, OOB_SELECT is set to 2 to disable bounds checks, otherwise it is 3
The cache coherency bits are off

Traits: AttrSizedOperandSegments

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`memref`	memref of any type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

Results: ¶

Result	Description
`value`	any type

`amdgpu.raw_buffer_store` (amdgpu::RawBufferStoreOp) ¶

Raw Buffer Store, exposing GCN features

Syntax:

operation ::= `amdgpu.raw_buffer_store` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) (`,` type($indices)^)?

The amdgpu.raw_buffer_store op is a wrapper around the buffer store intrinsics available on AMD GPUs, including extensions in newer GPUs.

The store index is computed as in memref.store with the addition of indexOffset (which is included for uniformity with atomics and may be useful when writing vectorized code) and sgprOffset (which is added after bounds checks and implicitly includes the offset of the memref type if non-zero). All index components are in terms of the elements of the memref, not bytes, and are scaled up appropriately.

Out of bounds stores are ignored in hardware. Wthether a vector write that includes some in-bounds and soeme out-of-bounds components is partically completed is chipset-dependent.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Attributes: ¶

Attribute	MLIR Type	Description
`boundsCheck`	::mlir::BoolAttr	bool attribute
`indexOffset`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`value`	any type
`memref`	memref of any type values
`indices`	variadic of 32-bit signless integer
`sgprOffset`	32-bit signless integer

`amdgpu.scaled_ext_packed` (amdgpu::ScaledExtPackedOp) ¶

Extend a vector of packed floating point values

Syntax:

operation ::= `amdgpu.scaled_ext_packed` attr-dict $source `[` $index `]` `,` $scale `:` type($source) `to` type($res)

Extend and scale two packed floats in source[index] to two floats and return them.

If the passed-in vector has fewer than two elements, or the input is scalar, the remaining values in the <2 x i8> will be filled with undefined values as needed.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`index`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 7

Operands: ¶

Operand	Description
`source`	vector of f8E5M2 type or f8E4M3FN type values of length 1/2/3/4 or vector of f4E2M1FN type values of length 1/2/3/4/5/6/7/8
`scale`	32-bit float

Results: ¶

Result	Description
`res`	fixed-length vector of 32-bit float values of length 2 or fixed-length vector of 16-bit float values of length 2 or fixed-length vector of bfloat16 type values of length 2

`amdgpu.scaled_mfma` (amdgpu::ScaledMFMAOp) ¶

MLIR wrapper for CDNA scaled mfma instructions

Syntax:

operation ::= `amdgpu.scaled_mfma` `(` $scalesA `[` $scalesIdxA `]` `*` $sourceA `)` `*` `(` $scalesB `[` $scalesIdxB `]` `*` $sourceB `)` `+` $destC
              attr-dict
              `:` type($scalesA) `,` type($sourceA) `,` type($scalesB) `,` type($sourceB) `,` type($destC)

The amdgpu.scaled_mfma op is an MLIR wrapper around intrinsics for various scaled versions of mfma instructions in the CDNA architecture, which perform multiple outer products in order to allow fast matrix multiplication.

Note, this wrapper allows specifying vector<4Kxi8> arguments to MFMA intrinsics that take an integer type of width 4K. For example, one can provide a vector<4xi8> as an argument to an MFMA instruction that logically takes 4 i8s but whose intrinsics are specified to take an i32. In these cases, the bytes in the vector will be concatenated in little-endian order (that is, v[0] will go to arg[7:0], v[1] to arg[15:8] and so on).

This wrapper takes inspiration from amdgpu.mfma, but has some key differences:

amdgpu.scaled_mfma operates on fp4 (f4E2M1FN), fp6 (f6E2M3FN and f6E3M2FN) and fp8 (f8E4M3FN and f8E5M2) types using either M=N=16, K=128 or M=N=32, K=64 as their tile size.
amdgpu.scaled_mfma does not support broadcasting. So, cbsz, abid, and blgp are omitted from this wrapper.
The negateA, negateB, and negateC flags in amdgpu.mfma are only supported for double-precision operations on gfx94x and so are not included here.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`m`	::mlir::IntegerAttr	32-bit signless integer attribute
`n`	::mlir::IntegerAttr	32-bit signless integer attribute
`k`	::mlir::IntegerAttr	32-bit signless integer attribute
`scalesIdxA`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 3
`scalesIdxB`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 3

Operands: ¶

Operand	Description
`sourceA`	vector of f8E5M2 type or f8E4M3FN type values of length 32 or vector of f6E2M3FN type or f6E3M2FN type or f4E2M1FN type values of length 32
`sourceB`	vector of f8E5M2 type or f8E4M3FN type values of length 32 or vector of f6E2M3FN type or f6E3M2FN type or f4E2M1FN type values of length 32
`destC`	vector of 32-bit float values of length 4/16
`scalesA`	f8E8M0FNU type or fixed-length vector of f8E8M0FNU type values of length 4
`scalesB`	f8E8M0FNU type or fixed-length vector of f8E8M0FNU type values of length 4

Results: ¶

Result	Description
`destD`	vector of 32-bit float values of length 4/16

`amdgpu.sched_barrier` (amdgpu::SchedBarrierOp) ¶

Barrier that limits the backend scheduler of instruction movement

Syntax:

operation ::= `amdgpu.sched_barrier` `allow` `=` $opts attr-dict

amdgpu.sched_barrier serves as a barrier that could be configured to restrict movements of instructions through it as defined by sched_barrier_opts.

Attributes: ¶

Attribute	MLIR Type	Description
`opts`	::mlir::amdgpu::sched_barrier_opt_enumAttr	The possible options for scheduling barriers

`amdgpu.swizzle_bitmode` (amdgpu::SwizzleBitModeOp) ¶

AMDGPU ds_swizzle op, bitmode variant

Syntax:

operation ::= `amdgpu.swizzle_bitmode` $src $and_mask $or_mask $xor_mask attr-dict `:` type($result)

High-level wrapper on bitmode rocdl.ds_swizzle op, masks are represented as separate fields so user won’t need to do manual bitpacking.

Supports arbitrary int/float/vector types, which will be repacked to i32 and one or more rocdl.ds_swizzle ops during lowering.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`and_mask`	::mlir::IntegerAttr	32-bit signless integer attribute
`or_mask`	::mlir::IntegerAttr	32-bit signless integer attribute
`xor_mask`	::mlir::IntegerAttr	32-bit signless integer attribute

Operands: ¶

Operand	Description
`src`	Integer or Float or fixed-length vector of Integer or Float values of ranks 1

Results: ¶

Result	Description
`result`	Integer or Float or fixed-length vector of Integer or Float values of ranks 1

`amdgpu.wmma` (amdgpu::WMMAOp) ¶

MLIR wrapper for RDNA3 wmma instructions

Syntax:

operation ::= `amdgpu.wmma` $sourceA `*` $sourceB `+` $destC
              attr-dict
              `:` type($sourceA) `,` type($sourceB) `,` type($destC)

The amdgpu.wmma op is an MLIR wrapper around intrinsics for various wmma instructions in the RDNA3 or RDNA4 architecture, which perform a 16x16 * 16x16 matrix multiplication for different data types. Note that in gfx12/RDNA4, there is also a 16x32 * 32x16 instruction for 4-bit integer inputs.

On gfx11/RDNA3, emitting f16->f16 (or bf16->bf16) wmma the output is a 16xf16 (or 16xbf16) vector containing only 8 valid values:

If subwordOffset is 0, then the output is stored at indices 0, 2, 4, …, 14.
If subwordOffset is 1, then the output is stored at indices 1, 3, 5, …, 15. On gfx12/RDNA4, the result is instead returned as a vector<8 x f16/bf16> where all values are valid and the subwordOffset must be 0, as it cannot be used.

unsignedA and unsignedB flag that the int8 LLVM inputs are unsigned.

The clamp flag is used to saturate the output of type T to numeric_limits::max() in case of overflow.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: ¶

Attribute	MLIR Type	Description
`subwordOffset`	::mlir::IntegerAttr	32-bit signless integer attribute whose minimum value is 0 whose maximum value is 1
`unsignedA`	::mlir::UnitAttr	unit attribute
`unsignedB`	::mlir::UnitAttr	unit attribute
`clamp`	::mlir::UnitAttr	unit attribute

Operands: ¶

Operand	Description
`sourceA`	vector of 16-bit float or bfloat16 type or 8-bit signless integer or 8-bit signed integer or 8-bit unsigned integer or 4-bit signless integer or 4-bit signed integer or 4-bit unsigned integer or f8E4M3FN type or f8E5M2 type values of length 4/8/16
`sourceB`	vector of 16-bit float or bfloat16 type or 8-bit signless integer or 8-bit signed integer or 8-bit unsigned integer or 4-bit signless integer or 4-bit signed integer or 4-bit unsigned integer or f8E4M3FN type or f8E5M2 type values of length 4/8/16
`destC`	vector of 32-bit float or 32-bit signless integer values of length 4/8 or vector of 16-bit float or bfloat16 type values of length 4/8/16

Results: ¶

Result	Description
`destD`	vector of 32-bit float or 32-bit signless integer values of length 4/8 or vector of 16-bit float or bfloat16 type values of length 4/8/16

Attributes ¶

AddressSpaceAttr ¶

AMDGPU-specific address spaces

Syntax:

#amdgpu.address_space<
  ::mlir::amdgpu::AddressSpace   # value
>

AMDGPU-specific memory spaces that may not have exact analogues on other GPU targets or backends.

fat_raw_buffer is the memory space used when a memref is stored as as a “buffer fat pointer” - that is, a buffer resource (that is set up to use raw byte-level indexing) along with its offset. The AMDGPU backend implements ptr addrspace(7) to represent these fat pointers so that buffer resources (which allow advanced features like bounds checking or cache swizzling) can be used like ordinary LLVM pointers or memrefs. See also the fat_raw_buffer_cast operation
buffer_rsrc is the memory space for ptr addrspace(8), representing a buffer resource. It should not be used for memrefs, since it does not support indexing
fat_structured_buffer represents ptr addrspace(9), a buffer resource that carries both an index and offset field, which are used for complex structured indexing that is primarily seen in graphics applications. This is also incompatible with the simple indexing model supported by memref.

Parameters: ¶

Parameter	C++ type	Description
value	`::mlir::amdgpu::AddressSpace`	an enum of type AddressSpace

DPPPermAttr ¶

The possible permutations for a DPP operation

Syntax:

#amdgpu.dpp_perm<
  ::mlir::amdgpu::DPPPerm   # value
>

Parameters: ¶

Parameter	C++ type	Description
value	`::mlir::amdgpu::DPPPerm`	an enum of type DPPPerm

MFMAPermBAttr ¶

The possible permutations of the lanes storing B available in an MFMA

Syntax:

#amdgpu.mfma_perm_b<
  ::mlir::amdgpu::MFMAPermB   # value
>

Parameters: ¶

Parameter	C++ type	Description
value	`::mlir::amdgpu::MFMAPermB`	an enum of type MFMAPermB

sched_barrier_opt_enumAttr ¶

The possible options for scheduling barriers

Syntax:

#amdgpu.sched_barrier_opt<
  ::mlir::amdgpu::sched_barrier_opt_enum   # value
>

Parameters: ¶

Parameter	C++ type	Description
value	`::mlir::amdgpu::sched_barrier_opt_enum`	an enum of type sched_barrier_opt_enum

Enums ¶

AddressSpace ¶

AMDGPU-specific address spaces

Cases: ¶

Symbol	Value	String
FatRawBuffer	`0`	fat_raw_buffer
BufferRsrc	`1`	buffer_rsrc
FatStructuredBuffer	`2`	fat_structured_buffer

DPPPerm ¶

The possible permutations for a DPP operation

Cases: ¶

Symbol	Value	String
quad_perm	`0`	quad_perm
row_shl	`1`	row_shl
row_shr	`2`	row_shr
row_ror	`3`	row_ror
wave_shl	`4`	wave_shl
wave_shr	`5`	wave_shr
wave_ror	`6`	wave_ror
wave_rol	`7`	wave_rol
row_mirror	`8`	row_mirror
row_half_mirror	`9`	row_half_mirror
row_bcast_15	`10`	row_bcast_15
row_bcast_31	`11`	row_bcast_31

MFMAPermB ¶

The possible permutations of the lanes storing B available in an MFMA

Cases: ¶

Symbol	Value	String
none	`0`	none
bcast_first_32	`1`	bcast_first_32
bcast_second_32	`2`	bcast_second_32
rotate_16_right	`3`	rotate_16_right
bcast_first_16	`4`	bcast_first_16
bcast_second_16	`5`	bcast_second_16
bcast_third_16	`6`	bcast_third_16
bcast_fourth_16	`7`	bcast_fourth_16

sched_barrier_opt_enum ¶

The possible options for scheduling barriers

Cases: ¶

Symbol	Value	String
none	`0`	none
non_mem_non_sideffect	`1`	non_mem_non_sideffect
valu	`2`	valu
salu	`4`	salu
mfma_wmma	`8`	mfma_wmma
all_vmem	`16`	all_vmem
vmem_read	`32`	vmem_read
vmem_write	`64`	vmem_write
all_ds	`128`	all_ds
ds_read	`256`	ds_read
ds_write	`512`	ds_write
transcendental	`1024`	transcendental

'amdgpu' Dialect

Operations ¶

amdgpu.dpp (amdgpu::DPPOp) ¶

Attributes: ¶

Operands: ¶

Results: ¶

amdgpu.ext_packed_fp8 (amdgpu::ExtPackedFp8Op) ¶

Attributes: ¶

Operands: ¶

Results: ¶

amdgpu.fat_raw_buffer_cast (amdgpu::FatRawBufferCastOp) ¶

Operands: ¶

Results: ¶

amdgpu.gather_to_lds (amdgpu::GatherToLDSOp) ¶

Attributes: ¶

Operands: ¶

amdgpu.lds_barrier (amdgpu::LDSBarrierOp) ¶

amdgpu.mfma (amdgpu::MFMAOp) ¶

Attributes: ¶

Operands: ¶

Results: ¶

amdgpu.packed_scaled_trunc (amdgpu::PackedScaledTruncOp) ¶

Attributes: ¶

Operands: ¶

Results: ¶

amdgpu.packed_stoch_round_fp8 (amdgpu::PackedStochRoundFp8Op) ¶

Attributes: ¶

Operands: ¶

Results: ¶

amdgpu.packed_trunc_2xfp8 (amdgpu::PackedTrunc2xFp8Op) ¶

Attributes: ¶

Operands: ¶

Results: ¶

amdgpu.raw_buffer_atomic_cmpswap (amdgpu::RawBufferAtomicCmpswapOp) ¶

Attributes: ¶

Operands: ¶

Results: ¶

amdgpu.raw_buffer_atomic_fadd (amdgpu::RawBufferAtomicFaddOp) ¶

Attributes: ¶

Operands: ¶

amdgpu.raw_buffer_atomic_fmax (amdgpu::RawBufferAtomicFmaxOp) ¶

Attributes: ¶

Operands: ¶

amdgpu.raw_buffer_atomic_smax (amdgpu::RawBufferAtomicSmaxOp) ¶

Attributes: ¶

Operands: ¶

amdgpu.raw_buffer_atomic_umin (amdgpu::RawBufferAtomicUminOp) ¶

Attributes: ¶

Operands: ¶

amdgpu.raw_buffer_load (amdgpu::RawBufferLoadOp) ¶

Attributes: ¶

Operands: ¶

Results: ¶

amdgpu.raw_buffer_store (amdgpu::RawBufferStoreOp) ¶

Attributes: ¶

Operands: ¶

amdgpu.scaled_ext_packed (amdgpu::ScaledExtPackedOp) ¶

Attributes: ¶

Operands: ¶

Results: ¶

amdgpu.scaled_mfma (amdgpu::ScaledMFMAOp) ¶

Attributes: ¶

Operands: ¶

Results: ¶

amdgpu.sched_barrier (amdgpu::SchedBarrierOp) ¶

Attributes: ¶

amdgpu.swizzle_bitmode (amdgpu::SwizzleBitModeOp) ¶

Attributes: ¶

Operands: ¶

Results: ¶

amdgpu.wmma (amdgpu::WMMAOp) ¶

Attributes: ¶

Operands: ¶

Results: ¶

Attributes ¶

AddressSpaceAttr ¶

Parameters: ¶

DPPPermAttr ¶

Parameters: ¶

MFMAPermBAttr ¶

`amdgpu.dpp` (amdgpu::DPPOp) ¶

`amdgpu.ext_packed_fp8` (amdgpu::ExtPackedFp8Op) ¶

`amdgpu.fat_raw_buffer_cast` (amdgpu::FatRawBufferCastOp) ¶

`amdgpu.gather_to_lds` (amdgpu::GatherToLDSOp) ¶

`amdgpu.lds_barrier` (amdgpu::LDSBarrierOp) ¶

`amdgpu.mfma` (amdgpu::MFMAOp) ¶

`amdgpu.packed_scaled_trunc` (amdgpu::PackedScaledTruncOp) ¶

`amdgpu.packed_stoch_round_fp8` (amdgpu::PackedStochRoundFp8Op) ¶

`amdgpu.packed_trunc_2xfp8` (amdgpu::PackedTrunc2xFp8Op) ¶

`amdgpu.raw_buffer_atomic_cmpswap` (amdgpu::RawBufferAtomicCmpswapOp) ¶

`amdgpu.raw_buffer_atomic_fadd` (amdgpu::RawBufferAtomicFaddOp) ¶

`amdgpu.raw_buffer_atomic_fmax` (amdgpu::RawBufferAtomicFmaxOp) ¶

`amdgpu.raw_buffer_atomic_smax` (amdgpu::RawBufferAtomicSmaxOp) ¶

`amdgpu.raw_buffer_atomic_umin` (amdgpu::RawBufferAtomicUminOp) ¶

`amdgpu.raw_buffer_load` (amdgpu::RawBufferLoadOp) ¶

`amdgpu.raw_buffer_store` (amdgpu::RawBufferStoreOp) ¶

`amdgpu.scaled_ext_packed` (amdgpu::ScaledExtPackedOp) ¶

`amdgpu.scaled_mfma` (amdgpu::ScaledMFMAOp) ¶

`amdgpu.sched_barrier` (amdgpu::SchedBarrierOp) ¶

`amdgpu.swizzle_bitmode` (amdgpu::SwizzleBitModeOp) ¶

`amdgpu.wmma` (amdgpu::WMMAOp) ¶