MLIR

Multi-Level IR Compiler Framework

'amdgpu' Dialect

The AMDGPU dialect provides wrappers around AMD-specific functionality and LLVM intrinsics. These wrappers should be used in conjunction with more generic dialects, such as gpu and vector, when generating LLVM IR that will eventually be executed on AMD hardware.

Operations 

amdgpu.ext_packed_fp8 (amdgpu::ExtPackedFp8Op) 

Extend one of a vector of packed fp8 values to a float

Syntax:

operation ::= `amdgpu.ext_packed_fp8` attr-dict $source `[` $index `]` `:` type($source) `to` type($res)

Extend the value source[index] to a 32-bit float and return it.

This rather unusual signature arises from the fact that AMD GPUs cannot easily work with sub-32-bit quantities, so the compiler intrinsics for extending 8-bit floats (which are, currently, the only way to work with this operation) take packed vectors of 4 such floats.

If the passed-in vector has fewer than four elements, or the input is scalar, the remaining values in the <4 x i8> will be filled with undefined values as needed.
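To make the element selection and padding concrete, here is a minimal software sketch in Python. It is not the hardware intrinsic: the function names, the use of raw bytes as input, and the use of None for undefined padding are all assumptions made for illustration. The exponent biases (8 for f8E4M3FNUZ, 16 for f8E5M2FNUZ) and the use of 0x80 as the NaN pattern follow the usual definition of the FNUZ formats.

```python
import math

# (exponent bits, mantissa bits, exponent bias) for the two FNUZ formats.
FP8_FORMATS = {
    "f8E4M3FNUZ": (4, 3, 8),
    "f8E5M2FNUZ": (5, 2, 16),
}


def decode_fp8_fnuz(byte, fmt):
    """Decode one FNUZ fp8 bit pattern (0..255) to a Python float."""
    exp_bits, man_bits, bias = FP8_FORMATS[fmt]
    if byte == 0x80:  # FNUZ: the "negative zero" pattern encodes NaN
        return math.nan
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    if exponent == 0:  # zero or subnormal: no implicit leading 1
        return sign * (mantissa / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + mantissa / (1 << man_bits)) * 2.0 ** (exponent - bias)


def ext_packed_fp8(source, index, fmt):
    """Hypothetical model of extending source[index]; source holds 1 to 4 raw bytes."""
    if not 0 <= index <= 3:
        raise ValueError("index must be in [0, 3]")
    if not 1 <= len(source) <= 4:
        raise ValueError("source must hold between 1 and 4 elements")
    packed = list(source) + [None] * (4 - len(source))  # None models undefined padding
    byte = packed[index]
    if byte is None:
        raise ValueError("selected lane is undefined padding")
    return decode_fp8_fnuz(byte, fmt)
```

For example, ext_packed_fp8([0x40, 0x7F], 1, "f8E4M3FNUZ") selects the second byte of a two-element input. Selecting index 2 or 3 of that input would read undefined padding, which this model reports as an error.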

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| index | ::mlir::IntegerAttr | 32-bit signless integer attribute whose value is non-negative whose maximum value is 3 |

Operands: 

| Operand | Description |
| --- | --- |
| source | f8E5M2FNUZ type or f8E4M3FNUZ type or vector of f8E5M2FNUZ type or f8E4M3FNUZ type values of length 1/2/3/4 |

Results: 

| Result | Description |
| --- | --- |
| res | 32-bit float |

amdgpu.lds_barrier (amdgpu::LDSBarrierOp) 

Barrier that includes a wait for LDS memory operations.

Syntax:

operation ::= `amdgpu.lds_barrier` attr-dict

amdgpu.lds_barrier is both a barrier (all workitems in a workgroup must reach the barrier before any of them may proceed past it) and a wait for all operations that affect the Local Data Store (LDS) issued from that workgroup to complete before the workgroup may continue. Since the LDS is per-workgroup memory, this barrier may be used, for example, to ensure all workitems have written data to LDS before any workitem attempts to read from it.
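The write-then-read pattern described above can be imitated on a CPU with Python threads. This is only an analogy: threading.Barrier stands in for amdgpu.lds_barrier, a Python list stands in for the LDS, and the workgroup size of 4 is an arbitrary choice for illustration.

```python
import threading

WORKGROUP_SIZE = 4  # illustrative only; real workgroups are much larger

lds = [None] * WORKGROUP_SIZE        # stands in for the per-workgroup LDS
results = [None] * WORKGROUP_SIZE
barrier = threading.Barrier(WORKGROUP_SIZE)


def workitem(tid):
    lds[tid] = tid * 10              # phase 1: every workitem writes its own slot
    barrier.wait()                   # stands in for amdgpu.lds_barrier
    neighbour = (tid + 1) % WORKGROUP_SIZE
    results[tid] = lds[neighbour]    # phase 2: reading another slot is now safe


threads = [threading.Thread(target=workitem, args=(t,)) for t in range(WORKGROUP_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the barrier, a workitem could read its neighbour's slot before the neighbour had written it.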

Note that lds_barrier does not force reads from or writes to global memory to complete before execution continues. Therefore, it should be used when operations on global memory can be issued far in advance of when their results are used (for example, by writing them to LDS).

WARNING: On architectures that do not support the BackOffBarrier feature (those which will implement this barrier by emitting inline assembly), use of this operation will impede the usability of memory watches (including breakpoints set on variables) when debugging.

amdgpu.mfma (amdgpu::MFMAOp) 

MLIR wrapper for CDNA mfma instructions

Syntax:

operation ::= `amdgpu.mfma` $sourceA `*` $sourceB `+` $destC
              attr-dict
              `blgp` `=` $blgp
              `:` type($sourceA) `,` type($sourceB) `,` type($destC)

The amdgpu.mfma op is an MLIR wrapper around intrinsics for various mfma instructions in the CDNA architecture, which perform multiple outer products in order to allow fast matrix multiplication.

The wrapper will select an appropriate mfma instruction, if one is available, based on the provided m, k, n, and blocks attributes, along with the types of the source and destination arguments.

For information on the layouts of the input and output matrices (which are stored in sourceA, sourceB, destC, and destD), see the CDNA ISA documentation.

The cbsz, abid, and blgp parameters control how the lanes of the wave are permuted when matrix data is being loaded: blgp selects one of a fixed set of permutations, cbsz specifies the log_2 of the number of chunks the lanes holding sourceA are split into, and abid selects one of those chunks.

Note that this wrapper allows specifying vector<4Kxi8> arguments to MFMA intrinsics that take an integer type 4K bytes wide (i32 for K = 1, i64 for K = 2). For example, one can provide a vector<4xi8> as an argument to an MFMA instruction that logically takes four i8 values but whose intrinsic is specified to take an i32. In these cases, the bytes in the vector will be concatenated in little-endian order (that is, v[0] will go to arg[7:0], v[1] to arg[15:8], and so on).
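The little-endian concatenation can be sketched in a few lines of Python. The helper name pack_i8_vector is hypothetical and exists only for this illustration; it models the bit layout, not the lowering itself.

```python
def pack_i8_vector(values):
    """Concatenate 8-bit values into one integer, element 0 in the low byte."""
    packed = 0
    for i, v in enumerate(values):
        # v[i] lands in bits [8*i + 7 : 8*i]; masking keeps negative i8 values to 8 bits.
        packed |= (v & 0xFF) << (8 * i)
    return packed
```

With this model, a four-element input [0x11, 0x22, 0x33, 0x44] becomes the 32-bit value 0x44332211, and an eight-element input fills a 64-bit value the same way.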

The negateA, negateB, and negateC flags are only supported for double-precision operations on gfx940+.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| m | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| n | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| k | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| blocks | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| cbsz | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| abid | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| blgp | ::mlir::amdgpu::MFMAPermBAttr | The possible permutations of the lanes storing B available in an MFMA. Enum cases: none, bcast_first_32, bcast_second_32, rotate_16_right, bcast_first_16, bcast_second_16, bcast_third_16, bcast_fourth_16 |
| reducePrecision | ::mlir::UnitAttr | unit attribute |
| negateA | ::mlir::UnitAttr | unit attribute |
| negateB | ::mlir::UnitAttr | unit attribute |
| negateC | ::mlir::UnitAttr | unit attribute |

Operands: 

| Operand | Description |
| --- | --- |
| sourceA | 32-bit float or 64-bit float or 32-bit signless integer or 64-bit signless integer or vector of 32-bit float values of length 2 or vector of 16-bit float values of length 4 or vector of bfloat16 type values of length 2/4 or vector of 8-bit signless integer values of length 4/8 or vector of f8E5M2FNUZ type or f8E4M3FNUZ type values of length 8 |
| sourceB | 32-bit float or 64-bit float or 32-bit signless integer or 64-bit signless integer or vector of 32-bit float values of length 2 or vector of 16-bit float values of length 4 or vector of bfloat16 type values of length 2/4 or vector of 8-bit signless integer values of length 4/8 or vector of f8E5M2FNUZ type or f8E4M3FNUZ type values of length 8 |
| destC | 64-bit float or vector of 32-bit float values of length 4/16/32 or vector of 32-bit signless integer values of length 4/16/32 or vector of 64-bit float values of length 4 |

Results: 

| Result | Description |
| --- | --- |
| destD | 64-bit float or vector of 32-bit float values of length 4/16/32 or vector of 32-bit signless integer values of length 4/16/32 or vector of 64-bit float values of length 4 |

amdgpu.packed_stoch_round_fp8 (amdgpu::PackedStochRoundFp8Op) 

Round float stochastically into a packed vector of 8-bit floats

Syntax:

operation ::= `amdgpu.packed_stoch_round_fp8` attr-dict $source `+` $stochiasticParam
              `into` ($existing^):(`undef`)? `[` $storeIndex `]`
              `:` type($source) `to` type($res) (`into` type($existing)^)?

Round the input source, adding in stochiasticParam, and place it into the element of res at index storeIndex.

If existing is passed in, elements of res other than the one at storeIndex are copied from existing.

The reason for this odd signature is that AMD GPUs cannot easily work with sub-registers, and so the conversion intrinsics (which are currently the only way to work with 8-bit float types) take packed vectors of 4 8-bit values.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| storeIndex | ::mlir::IntegerAttr | 32-bit signless integer attribute whose value is non-negative whose maximum value is 3 |

Operands: 

| Operand | Description |
| --- | --- |
| source | 32-bit float |
| stochiasticParam | 32-bit signless integer |
| existing | fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type values of length 4 |

Results: 

| Result | Description |
| --- | --- |
| res | fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type values of length 4 |

amdgpu.packed_trunc_2xfp8 (amdgpu::PackedTrunc2xFp8Op) 

Round two floats into a packed vector of 8-bit floats

Syntax:

operation ::= `amdgpu.packed_trunc_2xfp8` attr-dict $sourceA `,` ($sourceB^):(`undef`)?
              `into` ($existing^):(`undef`)? `[` `word` $wordIndex `]`
              `:` type($sourceA) `to` type($res) (`into` type($existing)^)?

Round the inputs sourceA and sourceB (which is undefined if not specified) into the low or high word (bottom two or top two) elements of the returned vector, keeping the other two elements of existing unchanged if present (or undefined if it was not passed in).
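The placement rule can be sketched as a small Python model. It covers only where the two results land, not the rounding itself: the inputs are assumed to be already-rounded fp8 bytes, None stands in for an undefined lane, and the choice that sourceA fills the lower lane of the selected word is an assumption of this sketch rather than something stated above.

```python
UNDEF = None  # placeholder for an undefined lane


def packed_trunc_2xfp8(rounded_a, rounded_b=UNDEF, existing=None, word_index=0):
    """Place two already-rounded fp8 bytes into word `word_index` of a 4-lane vector."""
    if word_index not in (0, 1):
        raise ValueError("wordIndex must be 0 or 1")
    # Start from a copy of `existing`, or from four undefined lanes.
    result = list(existing) if existing is not None else [UNDEF] * 4
    if len(result) != 4:
        raise ValueError("existing must have exactly 4 lanes")
    base = 2 * word_index         # word 0 -> lanes 0 and 1; word 1 -> lanes 2 and 3
    result[base] = rounded_a      # assumption: sourceA fills the lower lane of the word
    result[base + 1] = rounded_b  # stays undefined when sourceB is omitted
    return result
```

Two calls with word_index 0 and 1, the second passing the first result as existing, would fill all four lanes.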

The reason for this odd signature is that AMD GPUs cannot easily work with sub-registers, and so the conversion intrinsics (which are currently the only way to work with 8-bit float types) take packed vectors of 4 8-bit values.

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| wordIndex | ::mlir::IntegerAttr | 32-bit signless integer attribute whose value is non-negative whose maximum value is 1 |

Operands: 

| Operand | Description |
| --- | --- |
| sourceA | 32-bit float |
| sourceB | 32-bit float |
| existing | fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type values of length 4 |

Results: 

| Result | Description |
| --- | --- |
| res | fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type values of length 4 |

amdgpu.raw_buffer_atomic_cmpswap (amdgpu::RawBufferAtomicCmpswapOp) 

Raw Buffer Atomic compare-and-swap

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_cmpswap` attr-dict $src `,` $cmp `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_cmpswap op is a wrapper around the buffer-based atomic compare-and-swap available on AMD GPUs.

The index into the buffer is computed as for memref.store with the addition of indexOffset (which is used to aid in emitting vectorized code) and, if present, sgprOffset (which is added after bounds checks and includes any non-zero offset on the memref type).

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.
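As a rough illustration of this indexing rule, the sketch below linearizes the indices in elements, adds the two offsets, and only then scales to bytes. The function name and parameters are hypothetical, bounds checking is deliberately left out, and the distinction that sgprOffset is applied after the bounds check is not modeled.

```python
def buffer_byte_offset(indices, strides, element_bytes, index_offset=0, sgpr_offset=0):
    """Linearize indices in elements, add the extra offsets, then scale to bytes."""
    if len(indices) != len(strides):
        raise ValueError("need one stride per index")
    # Same linearization as a strided memref access, in units of elements.
    element_index = sum(i * s for i, s in zip(indices, strides))
    element_index += index_offset + sgpr_offset
    return element_index * element_bytes
```

For a row-major memref<4x8xf32> (strides 8 and 1, 4-byte elements), indices [2, 3] give element 19 and therefore byte offset 76.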

Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Attributes: 

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| boundsCheck | ::mlir::BoolAttr | bool attribute |
| indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands: 

| Operand | Description |
| --- | --- |
| src | 32-bit signless integer or 64-bit signless integer or 32-bit float or 64-bit float |
| cmp | any type |
| memref | memref of any type values |
| indices | variadic of 32-bit signless integer |
| sgprOffset | 32-bit signless integer |

Results: 

| Result | Description |
| --- | --- |
| value | any type |

amdgpu.raw_buffer_atomic_fadd (amdgpu::RawBufferAtomicFaddOp) 

Raw Buffer Floating-point Atomic Add (MI-* only)

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_fadd` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_fadd op is a wrapper around the buffer-based atomic floating point addition available on the MI-* series of AMD GPUs.

The index into the buffer is computed as for memref.store with the addition of indexOffset (which is used to aid in emitting vectorized code) and, if present, sgprOffset (which is added after bounds checks and includes any non-zero offset on the memref type).

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Attributes: 

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| boundsCheck | ::mlir::BoolAttr | bool attribute |
| indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands: 

| Operand | Description |
| --- | --- |
| value | 32-bit float |
| memref | memref of any type values |
| indices | variadic of 32-bit signless integer |
| sgprOffset | 32-bit signless integer |

amdgpu.raw_buffer_atomic_fmax (amdgpu::RawBufferAtomicFmaxOp) 

Raw Buffer Floating-point Atomic Max (non-GFX9)

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_fmax` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_fmax op is a wrapper around the buffer-based atomic floating point max available on AMD GPUs (except GFX9).

The index into the buffer is computed as for memref.store with the addition of indexOffset (which is used to aid in emitting vectorized code) and, if present, sgprOffset (which is added after bounds checks and includes any non-zero offset on the memref type).

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Attributes: 

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| boundsCheck | ::mlir::BoolAttr | bool attribute |
| indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands: 

| Operand | Description |
| --- | --- |
| value | 32-bit float or 64-bit float |
| memref | memref of any type values |
| indices | variadic of 32-bit signless integer |
| sgprOffset | 32-bit signless integer |

amdgpu.raw_buffer_atomic_smax (amdgpu::RawBufferAtomicSmaxOp) 

Raw Buffer Signed Integer Atomic Max

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_smax` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_smax op is a wrapper around the buffer-based atomic signed integer max available on AMD GPUs.

The index into the buffer is computed as for memref.store with the addition of indexOffset (which is used to aid in emitting vectorized code) and, if present, sgprOffset (which is added after bounds checks and includes any non-zero offset on the memref type).

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Attributes: 

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| boundsCheck | ::mlir::BoolAttr | bool attribute |
| indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands: 

| Operand | Description |
| --- | --- |
| value | 32-bit signless integer |
| memref | memref of any type values |
| indices | variadic of 32-bit signless integer |
| sgprOffset | 32-bit signless integer |

amdgpu.raw_buffer_atomic_umin (amdgpu::RawBufferAtomicUminOp) 

Raw Buffer Unsigned Integer Atomic Min

Syntax:

operation ::= `amdgpu.raw_buffer_atomic_umin` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) `,` type($indices)

The amdgpu.raw_buffer_atomic_umin op is a wrapper around the buffer-based atomic unsigned integer min available on AMD GPUs.
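Because the value operand is a signless 32-bit integer, the difference between this op and its signed sibling lies entirely in how the same bits are interpreted. The sketch below, with hypothetical helper names, contrasts the two interpretations on raw bit patterns; it models only the comparison, not the atomic memory update.

```python
MASK32 = 0xFFFFFFFF


def as_signed32(bits):
    """Reinterpret a 32-bit pattern as a two's-complement signed value."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits


def umin32(a, b):
    """Minimum of two 32-bit patterns compared as unsigned integers."""
    return min(a & MASK32, b & MASK32)


def smax32(a, b):
    """Maximum of two 32-bit patterns compared as signed integers, returned as bits."""
    return max(as_signed32(a), as_signed32(b)) & MASK32
```

The pattern 0xFFFFFFFF is the largest unsigned value but equals -1 when read as signed, so umin32(0xFFFFFFFF, 1) picks 1 while a signed minimum of the same inputs would pick 0xFFFFFFFF.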

The index into the buffer is computed as for memref.store with the addition of indexOffset (which is used to aid in emitting vectorized code) and, if present, sgprOffset (which is added after bounds checks and includes any non-zero offset on the memref type).

All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.

Out of bounds atomic operations are ignored in hardware.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Attributes: 

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| boundsCheck | ::mlir::BoolAttr | bool attribute |
| indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands: 

| Operand | Description |
| --- | --- |
| value | 32-bit signless integer |
| memref | memref of any type values |
| indices | variadic of 32-bit signless integer |
| sgprOffset | 32-bit signless integer |

amdgpu.raw_buffer_load (amdgpu::RawBufferLoadOp) 

Raw Buffer load, exposing GCN features

Syntax:

operation ::= `amdgpu.raw_buffer_load` attr-dict $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($memref) (`,` type($indices)^)? `->` type($value)

The amdgpu.raw_buffer_load op is a wrapper around the buffer load intrinsics available on AMD GPUs, including extensions in newer GPUs.

The index into the buffer is computed as for memref.load with the addition of indexOffset and sgprOffset (which may or may not be considered in bounds checks and includes any offset present on the memref type if it’s non-zero).

All indices and offsets are in units of the memref’s data type and are converted to bytes during lowering.

When a load is out of bounds, the instruction returns zero. Partially out-of-bounds loads have chipset-dependent behavior: whether reading 2 elements starting at index 7 of a memref<8xf32> returns the last element in the first vector component depends on the architecture.
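For scalar loads, the out-of-bounds rule amounts to the following Python model. The function name is hypothetical, the buffer is a plain list, and vector loads are intentionally not modeled because their partially out-of-bounds behavior depends on the chipset.

```python
def raw_buffer_load_scalar(buffer, element_index):
    """Return buffer[element_index], or 0 when the index is out of bounds."""
    if 0 <= element_index < len(buffer):
        return buffer[element_index]
    return 0  # an out-of-bounds load yields zero instead of faulting
```

This models the behavior with bounds checks in effect; what happens when boundsCheck is false is outside the scope of the sketch.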

The memref struct is converted into a buffer resource (a V#) and the arguments are translated to intrinsic arguments as follows:

  • The base address of the buffer is the base address of the memref
  • The stride is 0 to enable raw mode
  • The number of records is the size of the memref, in bytes. In the case of dynamically-shaped memrefs, this is computed at runtime as max_d (size(d) * stride(d)) * sizeof(elementType(memref))
  • The offset enable bit is 1, the index enable bit is 0.
  • The thread ID addition bit is off
  • If boundsCheck is false and the target chipset is RDNA, OOB_SELECT is set to 2 to disable bounds checks, otherwise it is 3
  • The cache coherency bits are off
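The number-of-records formula from the list above can be written out as a short Python sketch. The function name is hypothetical, and treating a rank-0 memref as a single element is an assumption of this sketch.

```python
def num_records_bytes(sizes, strides, element_bytes):
    """Compute max over dimensions d of size(d) * stride(d), scaled to bytes."""
    # default=1 is an assumption: a rank-0 memref is treated as one element.
    extent = max((size * stride for size, stride in zip(sizes, strides)), default=1)
    return extent * element_bytes
```

For a row-major memref<8x16xf32> with strides 16 and 1, the largest product is 8 * 16 = 128 elements, giving 512 bytes.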

Traits: AttrSizedOperandSegments

Attributes: 

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| boundsCheck | ::mlir::BoolAttr | bool attribute |
| indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands: 

| Operand | Description |
| --- | --- |
| memref | memref of any type values |
| indices | variadic of 32-bit signless integer |
| sgprOffset | 32-bit signless integer |

Results: 

| Result | Description |
| --- | --- |
| value | bfloat16 type or 16-bit float or 32-bit float or 32-bit signless integer or 8-bit signless integer or f8E5M2FNUZ type or f8E4M3FNUZ type or vector of 32-bit float or 32-bit signless integer values of length 2/4 or vector of 16-bit float or bfloat16 type values of length 2/4/8 or vector of 8-bit signless integer or f8E5M2FNUZ type or f8E4M3FNUZ type values of length 2/4/8/16 |

amdgpu.raw_buffer_store (amdgpu::RawBufferStoreOp) 

Raw Buffer Store, exposing GCN features

Syntax:

operation ::= `amdgpu.raw_buffer_store` attr-dict $value `->` $memref `[` $indices `]`
              (`sgprOffset` $sgprOffset^)? `:`
              type($value) `->` type($memref) (`,` type($indices)^)?

The amdgpu.raw_buffer_store op is a wrapper around the buffer store intrinsics available on AMD GPUs, including extensions in newer GPUs.

The store index is computed as in memref.store with the addition of indexOffset (which is included for uniformity with atomics and may be useful when writing vectorized code) and sgprOffset (which is added after bounds checks and implicitly includes the offset of the memref type if non-zero). All index components are in terms of the elements of the memref, not bytes, and are scaled up appropriately.

Out of bounds stores are ignored in hardware. Whether a vector write that includes some in-bounds and some out-of-bounds components is partially completed is chipset-dependent.

See amdgpu.raw_buffer_load for a description of how the underlying instruction is constructed.

Traits: AttrSizedOperandSegments

Attributes: 

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| boundsCheck | ::mlir::BoolAttr | bool attribute |
| indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands: 

| Operand | Description |
| --- | --- |
| value | bfloat16 type or 16-bit float or 32-bit float or 32-bit signless integer or 8-bit signless integer or f8E5M2FNUZ type or f8E4M3FNUZ type or vector of 32-bit float or 32-bit signless integer values of length 2/4 or vector of 16-bit float or bfloat16 type values of length 2/4/8 or vector of 8-bit signless integer or f8E5M2FNUZ type or f8E4M3FNUZ type values of length 2/4/8/16 |
| memref | memref of any type values |
| indices | variadic of 32-bit signless integer |
| sgprOffset | 32-bit signless integer |

amdgpu.wmma (amdgpu::WMMAOp) 

MLIR wrapper for RDNA3 wmma instructions

Syntax:

operation ::= `amdgpu.wmma` $sourceA `*` $sourceB `+` $destC
              attr-dict
              `:` type($sourceA) `,` type($sourceB) `,` type($destC)

The amdgpu.wmma op is an MLIR wrapper around intrinsics for various wmma instructions in the RDNA3 architecture, which perform a 16x16 matrix multiplication for different data types.

When emitting f16->f16 (or bf16->bf16) wmma the output is a 16xf16 (or 16xbf16) vector containing only 8 valid values:

  • If subwordOffset is 0, then the output is stored at indices 0, 2, 4, …, 14.
  • If subwordOffset is 1, then the output is stored at indices 1, 3, 5, …, 15.
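The two cases above reduce to a stride-2 walk over the 16-element output, starting at subwordOffset. A minimal Python sketch, with a hypothetical function name:

```python
def valid_output_indices(subword_offset):
    """Indices of the 8 valid values in the 16-element f16 or bf16 output vector."""
    if subword_offset not in (0, 1):
        raise ValueError("subwordOffset must be 0 or 1")
    # Start at the offset and take every second element.
    return list(range(subword_offset, 16, 2))
```

Offset 0 selects the even indices and offset 1 the odd ones, so the two choices never overlap.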

unsignedA and unsignedB flag that the int8 LLVM inputs are unsigned.

The clamp flag is used to saturate the output of type T to numeric_limits<T>::max() in case of overflow.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| subwordOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute whose minimum value is 0 whose maximum value is 1 |
| unsignedA | ::mlir::UnitAttr | unit attribute |
| unsignedB | ::mlir::UnitAttr | unit attribute |
| clamp | ::mlir::UnitAttr | unit attribute |

Operands: 

| Operand | Description |
| --- | --- |
| sourceA | vector of 16-bit float or bfloat16 type or 8-bit signless integer or 8-bit signed integer or 8-bit unsigned integer values of length 16 |
| sourceB | vector of 16-bit float or bfloat16 type or 8-bit signless integer or 8-bit signed integer or 8-bit unsigned integer values of length 16 |
| destC | vector of 32-bit float or 32-bit signless integer values of length 4/8 or vector of 16-bit float or bfloat16 type values of length 8/16 |

Results: 

| Result | Description |
| --- | --- |
| destD | vector of 32-bit float or 32-bit signless integer values of length 4/8 or vector of 16-bit float or bfloat16 type values of length 8/16 |

Attributes 

MFMAPermBAttr 

The possible permutations of the lanes storing B available in an MFMA

Syntax:

#amdgpu.mfma_perm_b<
  ::mlir::amdgpu::MFMAPermB   # value
>

Enum cases:

  • none (none)
  • bcast_first_32 (bcast_first_32)
  • bcast_second_32 (bcast_second_32)
  • rotate_16_right (rotate_16_right)
  • bcast_first_16 (bcast_first_16)
  • bcast_second_16 (bcast_second_16)
  • bcast_third_16 (bcast_third_16)
  • bcast_fourth_16 (bcast_fourth_16)

Parameters: 

| Parameter | C++ type | Description |
| --- | --- | --- |
| value | ::mlir::amdgpu::MFMAPermB | an enum of type MFMAPermB |