'amdgpu' Dialect
The AMDGPU dialect provides wrappers around AMD-specific functionality and LLVM intrinsics. These wrappers should be used in conjunction with more generic dialects, such as `gpu` and `vector`, when generating LLVM IR that will eventually be executed on AMD hardware.
Operations ¶
amdgpu.dpp
(amdgpu::DPPOp) ¶
AMDGPU DPP operation
Syntax:
operation ::= `amdgpu.dpp` $old $src $kind (`(` $permArgument^ `)`)? attr-dict `:` type($result)
This operation represents DPP functionality in a GPU program. DPP provides the following operations:
- Full crossbar in a group of four (`quad_perm`)
- Wavefront shift left by one lane (`wave_shl`)
- Wavefront shift right by one lane (`wave_shr`)
- Wavefront rotate right by one lane (`wave_ror`)
- Wavefront rotate left by one lane (`wave_rol`)
- Row shift left by 1–15 lanes (`row_shl`)
- Row shift right by 1–15 lanes (`row_shr`)
- Row rotate right by 1–15 lanes (`row_ror`)
- Reverse within a row (`row_mirror`)
- Reverse within a half-row (`row_half_mirror`)
- Broadcast the 15th lane of each row to the next row (`row_bcast_15`)
- Broadcast lane 31 to rows 2 and 3 (`row_bcast_31`)
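For example, a minimal sketch following the assembly format above (the `%old` and `%src` values and attribute spellings are illustrative placeholders, not taken from a specific test):

```mlir
// Full crossbar within each group of four lanes: lane i of the quad reads
// from lane perm[i], here swapping neighboring pairs of lanes.
%swapped = amdgpu.dpp %old %src quad_perm([1 : i32, 0 : i32, 3 : i32, 2 : i32]) : f32

// Shift each row left by one lane.
%shifted = amdgpu.dpp %old %src row_shl(1 : i32) : f32
```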
Traits: SameTypeOperands
Interfaces: InferTypeOpInterface
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
kind | ::mlir::amdgpu::DPPPermAttr | The possible permutations for a DPP operation (see the DPPPerm enum below) |
permArgument | ::mlir::Attribute | 32-bit signless integer attribute or array attribute or unit attribute |
row_mask | ::mlir::IntegerAttr | 32-bit signless integer attribute |
bank_mask | ::mlir::IntegerAttr | 32-bit signless integer attribute |
bound_ctrl | ::mlir::BoolAttr | bool attribute |
Operands: ¶
Operand | Description |
---|---|
old | any type |
src | any type |
Results: ¶
Result | Description |
---|---|
result | any type |
amdgpu.ext_packed_fp8
(amdgpu::ExtPackedFp8Op) ¶
Extend an fp8 value to a float or a vector of packed fp8 values to two floats
Syntax:
operation ::= `amdgpu.ext_packed_fp8` attr-dict $source `[` $index `]` `:` type($source) `to` type($res)
Extend one or two 8-bit floats in source[index]
to a 32-bit float or
two floats and return them.
This rather unusual signature arises from the fact that AMD GPUs cannot easily work with sub 32-bit quantities, so the compiler intrinsics for extending 8-bit floats (which are, currently, the only way to work with this operation) take packed vectors of 4 such floats.
If the passed-in vector has fewer than four elements, or the input is scalar, the remaining values in the <4 x i8> will be filled with undefined values as needed.
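An illustrative sketch of the format above (value names are placeholders):

```mlir
// Extend the fp8 element at index 2 of a packed vector to a 32-bit float.
%x = amdgpu.ext_packed_fp8 %v[2] : vector<4xf8E4M3FNUZ> to f32

// A scalar fp8 source is also allowed; it is treated as a partially-filled
// packed vector.
%y = amdgpu.ext_packed_fp8 %s[0] : f8E5M2FNUZ to f32
```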
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable
, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
index | ::mlir::IntegerAttr | 32-bit signless integer attribute whose value is non-negative whose maximum value is 3 |
Operands: ¶
Operand | Description |
---|---|
source | f8E5M2FNUZ type or f8E4M3FNUZ type or f8E5M2 type or f8E4M3FN type or vector of f8E5M2FNUZ type or f8E4M3FNUZ type or f8E5M2 type or f8E4M3FN type values of length 1/2/3/4 |
Results: ¶
Result | Description |
---|---|
res | 32-bit float or fixed-length vector of 32-bit float values of length 2 |
amdgpu.fat_raw_buffer_cast
(amdgpu::FatRawBufferCastOp) ¶
Create a raw buffer fat pointer that matches memref
Syntax:
operation ::= `amdgpu.fat_raw_buffer_cast` $source oilist (`validBytes` `(` $validBytes `)`
| `cacheSwizzleStride` `(` $cacheSwizzleStride `)`
| `boundsCheck` `(` $boundsCheck `)`
| `resetOffset` $resetOffset )
attr-dict `:` type($source) `to` type($result)
Wraps the memory pointed to by source
as a raw buffer fat pointer, or,
in LLVM terms, a ptr addrspace(7)
, returning a memref that has the same
sizes and layout but the #amdgpu.address_space<fat_raw_buffer>
address space.
This memref can be used with standard memref operations like memref.load
,
memref.store
, and memref.atomicrmw
, which will be lowered to the relevant
buffer intrinsics. (vector.masked_load/store
will work once there’s backend
support for lowering them, and then this document will be updated)
If validBytes
is given, it is the number of bytes that will be valid as
an offset to out
. If it is not provided, this will be inferred from
the size of the memref during lowering. This size is
max_{d = 0 upto rank(source)} (sizes[d] * strides[d]) * sizeof(element type).
The flags of the buffer descriptor will be set up to enable raw usage -
for example, stride = 0, add_tid = 0, and so on. The boundsCheck
property determines if bounds checking is enabled or not (on architectures
where this can be controlled - that is, on RDNA chips).
If cacheSwizzleStride
is provided, L1 cache swizzling will be enabled
on architectures that support it. This swizzling, unlike the main swizzling
mode (whose usage makes a buffer non-raw) does not affect index calculation,
but does affect cache behavior. Mixing access between cache-swizzled raw
buffers and other forms of memory access, like ordinary pointer loads or
unswizzled buffer pointers can cause incorrect behavior and must be avoided.
This operation preserves the sizes, strides, and offset of the input
memref - they’ll be added in by memref.load
later. However, if
resetOffset
is set, that offset will be added to the base pointer.
If the value of the memref’s offset is not uniform (independent of the lane/thread ID),
this will lead to substantially decreased performance due to the need for
a waterfall loop on the base address of the buffer resource.
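As an illustrative sketch (assuming an ordinary global memref `%src` and hypothetical indices; not a verbatim test case):

```mlir
// Wrap %src as a raw buffer fat pointer; the result can then be used with
// ordinary memref.load / memref.store, which lower to buffer intrinsics.
%buf = amdgpu.fat_raw_buffer_cast %src resetOffset
  : memref<8x64xf32> to memref<8x64xf32, #amdgpu.address_space<fat_raw_buffer>>
%v = memref.load %buf[%i, %j] : memref<8x64xf32, #amdgpu.address_space<fat_raw_buffer>>
```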
Traits: AlwaysSpeculatableImplTrait
, AttrSizedOperandSegments
Interfaces: ConditionallySpeculatable
, InferTypeOpInterface
, NoMemoryEffect (MemoryEffectOpInterface)
, ViewLikeOpInterface
Effects: MemoryEffects::Effect{}
Operands: ¶
Operand | Description |
---|---|
source | memref of any type values |
validBytes | 32-bit signless integer |
cacheSwizzleStride | 14-bit signless integer |
Results: ¶
Result | Description |
---|---|
result | memref of any type values |
amdgpu.lds_barrier
(amdgpu::LDSBarrierOp) ¶
Barrier that includes a wait for LDS memory operations.
Syntax:
operation ::= `amdgpu.lds_barrier` attr-dict
amdgpu.lds_barrier
is both a barrier (all workitems in a workgroup must reach
the barrier before any of them may proceed past it) and a wait for all
operations that affect the Local Data Store (LDS) issued from that workgroup
to complete before the workgroup may continue. Since the LDS is per-workgroup
memory, this barrier may be used, for example, to ensure all workitems have
written data to LDS before any workitem attempts to read from it.
Note that lds_barrier
does not force reads to or from global memory
to complete before execution continues. Therefore, it should be used when
operations on global memory can be issued far in advance of when their results
are used (for example, by writing them to LDS).
WARNING: On architectures that do not support the BackOffBarrier feature (those which will implement this barrier by emitting inline assembly), use of this operation will impede the usability of memory watches (including breakpoints set on variables) when debugging.
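A typical usage pattern, sketched below (the workgroup-memory memref and index values are hypothetical):

```mlir
// Each workitem writes its own element of an LDS tile ...
memref.store %val, %lds[%lid] : memref<256xf32, #gpu.address_space<workgroup>>
// ... then all workitems wait until every outstanding LDS access completes ...
amdgpu.lds_barrier
// ... so it is now safe to read elements written by other lanes.
%other = memref.load %lds[%nbr] : memref<256xf32, #gpu.address_space<workgroup>>
```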
amdgpu.mfma
(amdgpu::MFMAOp) ¶
MLIR wrapper for CDNA mfma instructions
Syntax:
operation ::= `amdgpu.mfma` $sourceA `*` $sourceB `+` $destC
attr-dict
`blgp` `=` $blgp
`:` type($sourceA) `,` type($sourceB) `,` type($destC)
The amdgpu.mfma
op is an MLIR wrapper around intrinsics
for various mfma
instructions in the CDNA architecture, which perform
multiple outer products in order to allow fast matrix multiplication.
The wrapper will select an appropriate mfma
instruction, if one is available,
based on the provided m
, k
, n
, and nBlks
attributes, along with the
types of the source and destination arguments.
For information on the layouts of the input and output matrices (which are stored in `sourceA`, `sourceB`, `destC`, and `destD`), see the CDNA ISA documentation.
The cbsz
, abid
, and blgp
parameters control how the lanes of the wave
are permuted when matrix data is being loaded: blgp
can be any number of
fixed permutations, cbsz
specifies the log_2 of the number of chunks the lanes
holding sourceA are split into, and abid
selects one of those chunks.
Note, this wrapper allows specifying vector<4Kxi8>
arguments to MFMA
intrinsics that take an integer type of width 4K
. For example,
one can provide a vector<4xi8> as an argument to an MFMA instruction that
logically takes 4 i8s but whose intrinsics are specified to take an i32.
In these cases, the bytes in the vector will be concatenated in little-endian
order (that is, v[0] will go to arg[7:0], v[1] to arg[15:8] and so on).
The negateA, negateB, and negateC flags are only supported for double-precision operations on gfx94x.
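For example, a sketch of a 32x32x8 f16 MFMA (a real CDNA instruction shape; the operand values are placeholders):

```mlir
// Multiply 4-element f16 fragments of A and B and accumulate into a
// 16-element f32 accumulator, selecting the single-block 32x32x8 variant.
%d = amdgpu.mfma %a * %b + %c {
    m = 32 : i32, n = 32 : i32, k = 8 : i32, blocks = 1 : i32,
    cbsz = 0 : i32, abid = 0 : i32
  } blgp = none : vector<4xf16>, vector<4xf16>, vector<16xf32>
```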
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable
, InferTypeOpInterface
, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
m | ::mlir::IntegerAttr | 32-bit signless integer attribute |
n | ::mlir::IntegerAttr | 32-bit signless integer attribute |
k | ::mlir::IntegerAttr | 32-bit signless integer attribute |
blocks | ::mlir::IntegerAttr | 32-bit signless integer attribute |
cbsz | ::mlir::IntegerAttr | 32-bit signless integer attribute |
abid | ::mlir::IntegerAttr | 32-bit signless integer attribute |
blgp | ::mlir::amdgpu::MFMAPermBAttr | The possible permutations of the lanes storing B available in an MFMA (see the MFMAPermB enum below) |
reducePrecision | ::mlir::UnitAttr | unit attribute |
negateA | ::mlir::UnitAttr | unit attribute |
negateB | ::mlir::UnitAttr | unit attribute |
negateC | ::mlir::UnitAttr | unit attribute |
Operands: ¶
Operand | Description |
---|---|
sourceA | 32-bit float or 64-bit float or 32-bit signless integer or 64-bit signless integer or vector of 32-bit float values of length 2 or vector of 16-bit float values of length 4 or vector of bfloat16 type values of length 2/4 or vector of 8-bit signless integer values of length 4/8 or vector of f8E5M2FNUZ type or f8E4M3FNUZ type or f8E5M2 type or f8E4M3FN type values of length 8 |
sourceB | 32-bit float or 64-bit float or 32-bit signless integer or 64-bit signless integer or vector of 32-bit float values of length 2 or vector of 16-bit float values of length 4 or vector of bfloat16 type values of length 2/4 or vector of 8-bit signless integer values of length 4/8 or vector of f8E5M2FNUZ type or f8E4M3FNUZ type or f8E5M2 type or f8E4M3FN type values of length 8 |
destC | 64-bit float or vector of 32-bit float values of length 4/16/32 or vector of 32-bit signless integer values of length 4/16/32 or vector of 64-bit float values of length 4 |
Results: ¶
Result | Description |
---|---|
destD | 64-bit float or vector of 32-bit float values of length 4/16/32 or vector of 32-bit signless integer values of length 4/16/32 or vector of 64-bit float values of length 4 |
amdgpu.packed_stoch_round_fp8
(amdgpu::PackedStochRoundFp8Op) ¶
Round a float stochastically into a packed vector of 8-bit floats
Syntax:
operation ::= `amdgpu.packed_stoch_round_fp8` attr-dict $source `+` $stochiasticParam
`into` ($existing^):(`undef`)? `[` $storeIndex `]`
`:` type($source) `to` type($res) (`into` type($existing)^)?
Round the input source
, adding in stochiasticParam
, and place it into
the storeIndex
th element of res
.
If existing
is passed in, elements of res
other than the one at storeIndex
are copied from existing
.
The reason for this odd signature is that AMD GPUs cannot easily work with sub-registers, and so the conversion intrinsics (which are currently the only way to work with 8-bit float types) take packed vectors of 4 8-bit values.
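A minimal sketch following the format above (value names are placeholders):

```mlir
// Stochastically round %v using %seed as the random source, placing the
// result in element 1 of a fresh packed vector (other elements undefined).
%r = amdgpu.packed_stoch_round_fp8 %v + %seed into undef[1] : f32 to vector<4xf8E4M3FNUZ>
```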
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable
, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
storeIndex | ::mlir::IntegerAttr | 32-bit signless integer attribute whose value is non-negative whose maximum value is 3 |
Operands: ¶
Operand | Description |
---|---|
source | 32-bit float |
stochiasticParam | 32-bit signless integer |
existing | fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4 |
Results: ¶
Result | Description |
---|---|
res | fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4 |
amdgpu.packed_trunc_2xfp8
(amdgpu::PackedTrunc2xFp8Op) ¶
Round two floats into a packed vector of 8-bit floats
Syntax:
operation ::= `amdgpu.packed_trunc_2xfp8` attr-dict $sourceA `,` ($sourceB^):(`undef`)?
`into` ($existing^):(`undef`)? `[` `word` $wordIndex `]`
`:` type($sourceA) `to` type($res) (`into` type($existing)^)?
Round the inputs sourceA
and sourceB
(which is undefined if not
specified) into the low or high word (bottom two or top two) elements
of the returned vector, keeping the other two elements of existing
unchanged if present (or undefined if it was not passed in).
The reason for this odd signature is that AMD GPUs cannot easily work with sub-registers, and so the conversion intrinsics (which are currently the only way to work with 8-bit float types) take packed vectors of 4 8-bit values.
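A minimal sketch following the format above (value names are placeholders):

```mlir
// Truncate %a and %b into the low word (elements 0 and 1) of a packed fp8
// vector, leaving the high word undefined.
%r = amdgpu.packed_trunc_2xfp8 %a, %b into undef[word 0] : f32 to vector<4xf8E4M3FNUZ>
```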
Traits: AlwaysSpeculatableImplTrait
, AttrSizedOperandSegments
Interfaces: ConditionallySpeculatable
, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
wordIndex | ::mlir::IntegerAttr | 32-bit signless integer attribute whose value is non-negative whose maximum value is 1 |
Operands: ¶
Operand | Description |
---|---|
sourceA | 32-bit float |
sourceB | 32-bit float |
existing | fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4 |
Results: ¶
Result | Description |
---|---|
res | fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4 |
amdgpu.raw_buffer_atomic_cmpswap
(amdgpu::RawBufferAtomicCmpswapOp) ¶
Raw Buffer Atomic compare-and-swap
Syntax:
operation ::= `amdgpu.raw_buffer_atomic_cmpswap` attr-dict $src `,` $cmp `->` $memref `[` $indices `]`
(`sgprOffset` $sgprOffset^)? `:`
type($value) `->` type($memref) `,` type($indices)
The amdgpu.raw_buffer_atomic_cmpswap op is a wrapper around the buffer-based atomic compare-and-swap available on AMD GPUs.
The index into the buffer is computed as for memref.store
with the addition
of indexOffset
(which is used to aid in emitting vectorized code) and,
if present, sgprOffset
(which is added after bounds checks and includes
any non-zero offset on the memref type).
All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.
Out of bounds atomic operations are ignored in hardware.
See amdgpu.raw_buffer_load
for a description of how the underlying
instruction is constructed.
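An illustrative sketch (buffer, indices, and values are placeholders):

```mlir
// Compare the f32 at %buf[%i] with %cmp and, if equal, store %src; the
// value previously in memory is returned either way.
%old = amdgpu.raw_buffer_atomic_cmpswap %src, %cmp -> %buf[%i] : f32 -> memref<64xf32>, i32
```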
Traits: AttrSizedOperandSegments
Interfaces: InferTypeOpInterface
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
boundsCheck | ::mlir::BoolAttr | bool attribute |
indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands: ¶
Operand | Description |
---|---|
src | any type |
cmp | any type |
memref | memref of any type values |
indices | variadic of 32-bit signless integer |
sgprOffset | 32-bit signless integer |
Results: ¶
Result | Description |
---|---|
value | any type |
amdgpu.raw_buffer_atomic_fadd
(amdgpu::RawBufferAtomicFaddOp) ¶
Raw Buffer Floating-point Atomic Add (MI-* only)
Syntax:
operation ::= `amdgpu.raw_buffer_atomic_fadd` attr-dict $value `->` $memref `[` $indices `]`
(`sgprOffset` $sgprOffset^)? `:`
type($value) `->` type($memref) `,` type($indices)
The amdgpu.raw_buffer_atomic_fadd
op is a wrapper around the
buffer-based atomic floating point addition available on the MI-* series
of AMD GPUs.
The index into the buffer is computed as for memref.store
with the addition
of indexOffset
(which is used to aid in emitting vectorized code) and,
if present, sgprOffset
(which is added after bounds checks and includes
any non-zero offset on the memref type).
All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.
Out of bounds atomic operations are ignored in hardware.
See amdgpu.raw_buffer_load
for a description of how the underlying
instruction is constructed.
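An illustrative sketch (buffer, index, and value are placeholders):

```mlir
// Atomically add %val to the f32 element at %buf[%i].
amdgpu.raw_buffer_atomic_fadd %val -> %buf[%i] : f32 -> memref<64xf32>, i32
```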
Traits: AttrSizedOperandSegments
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
boundsCheck | ::mlir::BoolAttr | bool attribute |
indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands: ¶
Operand | Description |
---|---|
value | 32-bit float or vector of 16-bit float or bfloat16 type values of length 2 |
memref | memref of any type values |
indices | variadic of 32-bit signless integer |
sgprOffset | 32-bit signless integer |
amdgpu.raw_buffer_atomic_fmax
(amdgpu::RawBufferAtomicFmaxOp) ¶
Raw Buffer Floating-point Atomic Max (non-GFX9)
Syntax:
operation ::= `amdgpu.raw_buffer_atomic_fmax` attr-dict $value `->` $memref `[` $indices `]`
(`sgprOffset` $sgprOffset^)? `:`
type($value) `->` type($memref) `,` type($indices)
The amdgpu.raw_buffer_atomic_fmax
op is a wrapper around the
buffer-based atomic floating point max available on AMD GPUs (except GFX9).
The index into the buffer is computed as for memref.store
with the addition
of indexOffset
(which is used to aid in emitting vectorized code) and,
if present, sgprOffset
(which is added after bounds checks and includes
any non-zero offset on the memref type).
All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.
Out of bounds atomic operations are ignored in hardware.
See amdgpu.raw_buffer_load
for a description of how the underlying
instruction is constructed.
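An illustrative sketch (buffer, index, and value are placeholders):

```mlir
// Atomically take the floating-point maximum of %val and the element at %buf[%i].
amdgpu.raw_buffer_atomic_fmax %val -> %buf[%i] : f32 -> memref<64xf32>, i32
```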
Traits: AttrSizedOperandSegments
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
boundsCheck | ::mlir::BoolAttr | bool attribute |
indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands: ¶
Operand | Description |
---|---|
value | 32-bit float or 64-bit float |
memref | memref of any type values |
indices | variadic of 32-bit signless integer |
sgprOffset | 32-bit signless integer |
amdgpu.raw_buffer_atomic_smax
(amdgpu::RawBufferAtomicSmaxOp) ¶
Raw Buffer Signed Integer Atomic Max
Syntax:
operation ::= `amdgpu.raw_buffer_atomic_smax` attr-dict $value `->` $memref `[` $indices `]`
(`sgprOffset` $sgprOffset^)? `:`
type($value) `->` type($memref) `,` type($indices)
The amdgpu.raw_buffer_atomic_smax
op is a wrapper around the
buffer-based atomic signed integer max available on AMD GPUs.
The index into the buffer is computed as for memref.store
with the addition
of indexOffset
(which is used to aid in emitting vectorized code) and,
if present, sgprOffset
(which is added after bounds checks and includes
any non-zero offset on the memref type).
All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.
Out of bounds atomic operations are ignored in hardware.
See amdgpu.raw_buffer_load
for a description of how the underlying
instruction is constructed.
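An illustrative sketch (buffer, index, and value are placeholders):

```mlir
// Atomically take the signed maximum of %val and the i32 element at %buf[%i].
amdgpu.raw_buffer_atomic_smax %val -> %buf[%i] : i32 -> memref<64xi32>, i32
```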
Traits: AttrSizedOperandSegments
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
boundsCheck | ::mlir::BoolAttr | bool attribute |
indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands: ¶
Operand | Description |
---|---|
value | 32-bit signless integer |
memref | memref of any type values |
indices | variadic of 32-bit signless integer |
sgprOffset | 32-bit signless integer |
amdgpu.raw_buffer_atomic_umin
(amdgpu::RawBufferAtomicUminOp) ¶
Raw Buffer Unsigned Integer Atomic Min
Syntax:
operation ::= `amdgpu.raw_buffer_atomic_umin` attr-dict $value `->` $memref `[` $indices `]`
(`sgprOffset` $sgprOffset^)? `:`
type($value) `->` type($memref) `,` type($indices)
The amdgpu.raw_buffer_atomic_umin op is a wrapper around the buffer-based atomic unsigned integer min available on AMD GPUs.
The index into the buffer is computed as for memref.store
with the addition
of indexOffset
(which is used to aid in emitting vectorized code) and,
if present, sgprOffset
(which is added after bounds checks and includes
any non-zero offset on the memref type).
All indexing components are given in terms of the memref’s element size, not the byte lengths required by the intrinsic.
Out of bounds atomic operations are ignored in hardware.
See amdgpu.raw_buffer_load
for a description of how the underlying
instruction is constructed.
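An illustrative sketch (buffer, index, and value are placeholders):

```mlir
// Atomically take the unsigned minimum of %val and the i32 element at %buf[%i].
amdgpu.raw_buffer_atomic_umin %val -> %buf[%i] : i32 -> memref<64xi32>, i32
```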
Traits: AttrSizedOperandSegments
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
boundsCheck | ::mlir::BoolAttr | bool attribute |
indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands: ¶
Operand | Description |
---|---|
value | 32-bit signless integer |
memref | memref of any type values |
indices | variadic of 32-bit signless integer |
sgprOffset | 32-bit signless integer |
amdgpu.raw_buffer_load
(amdgpu::RawBufferLoadOp) ¶
Raw Buffer load, exposing GCN features
Syntax:
operation ::= `amdgpu.raw_buffer_load` attr-dict $memref `[` $indices `]`
(`sgprOffset` $sgprOffset^)? `:`
type($memref) (`,` type($indices)^)? `->` type($value)
The amdgpu.raw_buffer_load
op is a wrapper around the buffer load intrinsics
available on AMD GPUs, including extensions in newer GPUs.
The index into the buffer is computed as for memref.load
with the addition
of indexOffset
and sgprOffset
(which may or may not be considered
in bounds checks and includes any offset present on the memref type if it’s
non-zero).
All indices and offsets are in units of the memref’s data type and are converted to bytes during lowering.
When a load is out of bounds, the instruction returns zero. Partially out-of-bounds accesses have chipset-dependent behavior: whether reading 2 elements starting at index 7 of a memref<8xf32> returns the last element in the first vector component depends on the architecture.
The memref struct is converted into a buffer resource (a V#) and the arguments are translated to intrinsic arguments as follows:
- The base address of the buffer is the base address of the memref
- The stride is 0 to enable raw mode
- The number of records is the size of the memref, in bytes. For dynamically-shaped memrefs, this is computed at runtime as max_d (size(d) * stride(d)) * sizeof(elementType(memref))
- The offset enable bit is 1, the index enable bit is 0
- The thread ID addition bit is off
- If `boundsCheck` is false and the target chipset is RDNA, OOB_SELECT is set to 2 to disable bounds checks, otherwise it is 3
- The cache coherency bits are off
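For example, an illustrative sketch (memref shapes and indices are hypothetical):

```mlir
// Scalar load from a 2-D buffer with hardware bounds checking enabled.
%x = amdgpu.raw_buffer_load {boundsCheck = true} %buf[%i, %j]
  : memref<128x64xf32>, i32, i32 -> f32

// Vectorized load of four contiguous elements starting at element %i.
%v = amdgpu.raw_buffer_load {boundsCheck = true} %flat[%i] : memref<8192xf32>, i32 -> vector<4xf32>
```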
Traits: AttrSizedOperandSegments
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
boundsCheck | ::mlir::BoolAttr | bool attribute |
indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands: ¶
Operand | Description |
---|---|
memref | memref of any type values |
indices | variadic of 32-bit signless integer |
sgprOffset | 32-bit signless integer |
Results: ¶
Result | Description |
---|---|
value | any type |
amdgpu.raw_buffer_store
(amdgpu::RawBufferStoreOp) ¶
Raw Buffer Store, exposing GCN features
Syntax:
operation ::= `amdgpu.raw_buffer_store` attr-dict $value `->` $memref `[` $indices `]`
(`sgprOffset` $sgprOffset^)? `:`
type($value) `->` type($memref) (`,` type($indices)^)?
The amdgpu.raw_buffer_store
op is a wrapper around the buffer store
intrinsics available on AMD GPUs, including extensions in newer GPUs.
The store index is computed as in memref.store
with the addition of
indexOffset
(which is included for uniformity with atomics and may be useful
when writing vectorized code) and sgprOffset
(which is added after bounds
checks and implicitly includes the offset of the memref type if non-zero).
All index components are in terms of the elements of the memref, not bytes,
and are scaled up appropriately.
Out of bounds stores are ignored in hardware. Whether a vector write that includes some in-bounds and some out-of-bounds components is partially completed is chipset-dependent.
See amdgpu.raw_buffer_load
for a description of how the underlying
instruction is constructed.
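An illustrative sketch (memref shape, index, and value are hypothetical):

```mlir
// Store a four-element vector to the buffer starting at element %i.
amdgpu.raw_buffer_store {boundsCheck = true} %v -> %flat[%i] : vector<4xf32> -> memref<8192xf32>, i32
```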
Traits: AttrSizedOperandSegments
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
boundsCheck | ::mlir::BoolAttr | bool attribute |
indexOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands: ¶
Operand | Description |
---|---|
value | any type |
memref | memref of any type values |
indices | variadic of 32-bit signless integer |
sgprOffset | 32-bit signless integer |
amdgpu.sched_barrier
(amdgpu::SchedBarrierOp) ¶
Barrier that limits the movement of instructions by the backend scheduler
Syntax:
operation ::= `amdgpu.sched_barrier` `allow` `=` $opts attr-dict
amdgpu.sched_barrier serves as a barrier that can be configured to restrict the backend scheduler from moving instructions across it, as defined by `sched_barrier_opts`.
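An illustrative sketch; the exact printed form of the `allow` bitmask follows the `sched_barrier_opt_enum` attribute syntax:

```mlir
// Prevent the backend scheduler from moving any instruction across this point.
amdgpu.sched_barrier allow = <none>

// Allow only VALU instructions and DS (LDS) reads to be reordered across it.
amdgpu.sched_barrier allow = <valu|ds_read>
```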
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
opts | ::mlir::amdgpu::sched_barrier_opt_enumAttr | The possible options for scheduling barriers (see the sched_barrier_opt_enum cases below) |
amdgpu.wmma
(amdgpu::WMMAOp) ¶
MLIR wrapper for RDNA3 and RDNA4 wmma instructions
Syntax:
operation ::= `amdgpu.wmma` $sourceA `*` $sourceB `+` $destC
attr-dict
`:` type($sourceA) `,` type($sourceB) `,` type($destC)
The amdgpu.wmma
op is an MLIR wrapper around intrinsics
for various wmma
instructions in the RDNA3 or RDNA4 architecture, which
perform a 16x16 * 16x16 matrix multiplication for different data types.
Note that in gfx12/RDNA4, there is also a 16x32 * 32x16 instruction for 4-bit
integer inputs.
On gfx11/RDNA3, when emitting an f16->f16 (or bf16->bf16) wmma, the output is a 16xf16 (or 16xbf16) vector containing only 8 valid values:
- If `subwordOffset` is 0, the output is stored at indices 0, 2, 4, …, 14.
- If `subwordOffset` is 1, the output is stored at indices 1, 3, 5, …, 15.
On gfx12/RDNA4, the result is instead returned as a vector<8 x f16/bf16> where all values are valid and `subwordOffset` must be 0, as it cannot be used.
`unsignedA` and `unsignedB` flag that the int8 LLVM inputs are unsigned.
The `clamp` flag is used to saturate the output of type T to numeric_limits<T>::max() in case of overflow.
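For example, an illustrative sketch of an f16 WMMA accumulating into f32 (operand values are placeholders):

```mlir
// 16x16x16 multiply-accumulate of f16 inputs into an f32 accumulator.
%d = amdgpu.wmma %a * %b + %c : vector<16xf16>, vector<16xf16>, vector<8xf32>
```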
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable
, InferTypeOpInterface
, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes: ¶
Attribute | MLIR Type | Description |
---|---|---|
subwordOffset | ::mlir::IntegerAttr | 32-bit signless integer attribute whose minimum value is 0 whose maximum value is 1 |
unsignedA | ::mlir::UnitAttr | unit attribute |
unsignedB | ::mlir::UnitAttr | unit attribute |
clamp | ::mlir::UnitAttr | unit attribute |
Operands: ¶
Operand | Description |
---|---|
sourceA | vector of 16-bit float or bfloat16 type or 8-bit signless integer or 8-bit signed integer or 8-bit unsigned integer or 4-bit signless integer or 4-bit signed integer or 4-bit unsigned integer or f8E4M3FN type or f8E5M2 type values of length 4/8/16 |
sourceB | vector of 16-bit float or bfloat16 type or 8-bit signless integer or 8-bit signed integer or 8-bit unsigned integer or 4-bit signless integer or 4-bit signed integer or 4-bit unsigned integer or f8E4M3FN type or f8E5M2 type values of length 4/8/16 |
destC | vector of 32-bit float or 32-bit signless integer values of length 4/8 or vector of 16-bit float or bfloat16 type values of length 4/8/16 |
Results: ¶
Result | Description |
---|---|
destD | vector of 32-bit float or 32-bit signless integer values of length 4/8 or vector of 16-bit float or bfloat16 type values of length 4/8/16 |
Attributes ¶
AddressSpaceAttr ¶
AMDGPU-specific address spaces
Syntax:
#amdgpu.address_space<
::mlir::amdgpu::AddressSpace # value
>
AMDGPU-specific memory spaces that may not have exact analogues on other GPU targets or backends.
- `fat_raw_buffer` is the memory space used when a memref is stored as a "buffer fat pointer" - that is, a buffer resource (that is set up to use raw byte-level indexing) along with its offset. The AMDGPU backend implements `ptr addrspace(7)` to represent these fat pointers so that buffer resources (which allow advanced features like bounds checking or cache swizzling) can be used like ordinary LLVM pointers or memrefs. See also the `fat_raw_buffer_cast` operation.
- `buffer_rsrc` is the memory space for `ptr addrspace(8)`, representing a buffer resource. It should not be used for memrefs, since it does not support indexing.
- `fat_structured_buffer` represents `ptr addrspace(9)`, a buffer resource that carries both an index and an offset field, which are used for complex structured indexing that is primarily seen in graphics applications. This is also incompatible with the simple indexing model supported by memref.
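An illustrative sketch of the attribute used as a memref memory space (memref shape and index are hypothetical):

```mlir
// A memref placed in the fat raw buffer address space, e.g. as produced by
// amdgpu.fat_raw_buffer_cast; ordinary memref operations work on it.
%x = memref.load %buf[%i] : memref<64xf32, #amdgpu.address_space<fat_raw_buffer>>
```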
Parameters: ¶
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::amdgpu::AddressSpace | an enum of type AddressSpace |
DPPPermAttr ¶
The possible permutations for a DPP operation
Syntax:
#amdgpu.dpp_perm<
::mlir::amdgpu::DPPPerm # value
>
Enum cases:
- quad_perm (`quad_perm`)
- row_shl (`row_shl`)
- row_shr (`row_shr`)
- row_ror (`row_ror`)
- wave_shl (`wave_shl`)
- wave_shr (`wave_shr`)
- wave_ror (`wave_ror`)
- wave_rol (`wave_rol`)
- row_mirror (`row_mirror`)
- row_half_mirror (`row_half_mirror`)
- row_bcast_15 (`row_bcast_15`)
- row_bcast_31 (`row_bcast_31`)
Parameters: ¶
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::amdgpu::DPPPerm | an enum of type DPPPerm |
MFMAPermBAttr ¶
The possible permutations of the lanes storing B available in an MFMA
Syntax:
#amdgpu.mfma_perm_b<
::mlir::amdgpu::MFMAPermB # value
>
Enum cases:
- none (`none`)
- bcast_first_32 (`bcast_first_32`)
- bcast_second_32 (`bcast_second_32`)
- rotate_16_right (`rotate_16_right`)
- bcast_first_16 (`bcast_first_16`)
- bcast_second_16 (`bcast_second_16`)
- bcast_third_16 (`bcast_third_16`)
- bcast_fourth_16 (`bcast_fourth_16`)
Parameters: ¶
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::amdgpu::MFMAPermB | an enum of type MFMAPermB |
sched_barrier_opt_enumAttr ¶
The possible options for scheduling barriers
Syntax:
#amdgpu.sched_barrier_opt<
::mlir::amdgpu::sched_barrier_opt_enum # value
>
Enum cases:
- none (`none`)
- non_mem_non_sideffect (`non_mem_non_sideffect`)
- valu (`valu`)
- salu (`salu`)
- mfma_wmma (`mfma_wmma`)
- all_vmem (`all_vmem`)
- vmem_read (`vmem_read`)
- vmem_write (`vmem_write`)
- all_ds (`all_ds`)
- ds_read (`ds_read`)
- ds_write (`ds_write`)
- transcendental (`transcendental`)
Parameters: ¶
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::amdgpu::sched_barrier_opt_enum | an enum of type sched_barrier_opt_enum |
Enums ¶
AddressSpace ¶
AMDGPU-specific address spaces
Cases: ¶
Symbol | Value | String |
---|---|---|
FatRawBuffer | 0 | fat_raw_buffer |
BufferRsrc | 1 | buffer_rsrc |
FatStructuredBuffer | 2 | fat_structured_buffer |
DPPPerm ¶
The possible permutations for a DPP operation
Cases: ¶
Symbol | Value | String |
---|---|---|
quad_perm | 0 | quad_perm |
row_shl | 1 | row_shl |
row_shr | 2 | row_shr |
row_ror | 3 | row_ror |
wave_shl | 4 | wave_shl |
wave_shr | 5 | wave_shr |
wave_ror | 6 | wave_ror |
wave_rol | 7 | wave_rol |
row_mirror | 8 | row_mirror |
row_half_mirror | 9 | row_half_mirror |
row_bcast_15 | 10 | row_bcast_15 |
row_bcast_31 | 11 | row_bcast_31 |
MFMAPermB ¶
The possible permutations of the lanes storing B available in an MFMA
Cases: ¶
Symbol | Value | String |
---|---|---|
none | 0 | none |
bcast_first_32 | 1 | bcast_first_32 |
bcast_second_32 | 2 | bcast_second_32 |
rotate_16_right | 3 | rotate_16_right |
bcast_first_16 | 4 | bcast_first_16 |
bcast_second_16 | 5 | bcast_second_16 |
bcast_third_16 | 6 | bcast_third_16 |
bcast_fourth_16 | 7 | bcast_fourth_16 |
sched_barrier_opt_enum ¶
The possible options for scheduling barriers
Cases: ¶
Symbol | Value | String |
---|---|---|
none | 0 | none |
non_mem_non_sideffect | 1 | non_mem_non_sideffect |
valu | 2 | valu |
salu | 4 | salu |
mfma_wmma | 8 | mfma_wmma |
all_vmem | 16 | all_vmem |
vmem_read | 32 | vmem_read |
vmem_write | 64 | vmem_write |
all_ds | 128 | all_ds |
ds_read | 256 | ds_read |
ds_write | 512 | ds_write |
transcendental | 1024 | transcendental |