MLIR

Multi-Level IR Compiler Framework

'xegpu' Dialect

The XeGPU dialect models the Intel GPU Xe ISA. It captures Xe ISA semantics but operates on the vector and TensorDesc data types, providing 1:1 mappings to Xe instructions such as DPAS and the 2D block load. The matrix sizes processed at this level exactly match the hardware instructions or the intrinsics supported by the lower-level GPU compiler.
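
For orientation, here is a minimal sketch of what code at this level typically looks like, chaining the operations documented below: a descriptor for an 8x16 tile is created, the tile is loaded into a vector, and then stored to another buffer. The buffers %src and %dst and the tile size are illustrative assumptions, not part of the dialect definition.

  // Illustrative only: copy one 8x16 f32 tile from %src to %dst.
  %c0 = arith.constant 0 : index
  %src_td = xegpu.create_nd_tdesc %src[%c0, %c0]
      : memref<1024x1024xf32> -> !xegpu.tensor_desc<8x16xf32>
  %tile = xegpu.load_nd %src_td
      : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %dst_td = xegpu.create_nd_tdesc %dst[%c0, %c0]
      : memref<1024x1024xf32> -> !xegpu.tensor_desc<8x16xf32>
  xegpu.store_nd %tile, %dst_td
      : vector<8x16xf32>, !xegpu.tensor_desc<8x16xf32>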

Operations 


xegpu.create_nd_tdesc (xegpu::CreateNdDescOp) 

Create an n-D tensor descriptor operation

Syntax:

operation ::= `xegpu.create_nd_tdesc` $source ``
              custom<DynamicIndexList>($offsets, $const_offsets)
              (`,` custom<DynamicIndexList>($shape, $const_shape)^
              `,` custom<DynamicIndexList>($strides, $const_strides))?
              attr-dict `:` type($source) `->` qualified(type($TensorDesc))

The “create_nd_tdesc” operation creates a TensorDescType value that represents a sub-view of a 2D memory region (it can be extended to support n-D memory regions in the future if needed). Elements in the sub-view are contiguous in each dimension. It encodes the following information needed to support Intel hardware features:

  • source: an object representing the (starting address/pointer of a) 2D memory region. It can be either a 2D memref object or simply a pointer represented by a uint64_t value. In the latter case, the shape and layout information of the 2D memory region must be passed explicitly via the dynamic_shape and dynamic_strides parameters.
  • offsets: two index values representing the offsets from “source” in each dimension at which the sub-view of the target memory is created. They are encoded via two variables, “dynamic_offsets” and “static_offsets”, so that they can accept various forms, such as operands (e.g., [%c0, %c1]) and attributes (e.g., [2, 4]).
  • shape: the shape of the memory region pointed to by “source”. It is typically encoded via the MemRefType of the source, e.g., memref<4096x4096xf16>. But if “source” is simply a pointer represented as a uint64_t value, or a memref type without shape information, e.g., memref<?x?xf16>, the shape information has to be passed explicitly via the “dynamic_shape” argument. Currently “dynamic_shape” only accepts operands (e.g., [%c4096, %c4096]), not attributes (e.g., [4096, 4096]).
  • strides: the strides of the memory region pointed to by “source”. Similar to shape, they are typically encoded via the MemRefType of the source. But if “source” is simply a pointer represented as a uint64_t value, or a memref type without shape information, e.g., memref<?x?xf16>, the strides have to be passed explicitly via the “dynamic_strides” argument, which currently also only accepts operands, not attributes.

Example 1 (suppose the tensor shape inferred by the compiler is 8x16):

  %0 = memref.alloc() : memref<1024x1024xf32>
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %1 = xegpu.create_nd_tdesc %0[%c0, %c0] : memref<1024x1024xf32> -> !xegpu.tensor_desc<8x16xf32>

Example 2 (suppose the tensor shape inferred by the compiler is 8x16):

  %0 = memref.alloc(%h, %w) : memref<?x?xf32>
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %1 = xegpu.create_nd_tdesc %0[%c0, %c0], [%h, %w], [%w, %c1] : memref<?x?xf32> -> !xegpu.tensor_desc<8x16xf32>

Example 3 (suppose the tensor shape inferred by the compiler is 8x16):

  %0 = … : ui64
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %1 = xegpu.create_nd_tdesc %0[%c0, %c0], [%h, %w], [%w, %c1] : ui64 -> !xegpu.tensor_desc<8x16xf32>
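
Since the offsets accept both operands and attributes, they can also be written as constants directly in the op. A minimal sketch, reusing %0 from Example 1:

  %2 = xegpu.create_nd_tdesc %0[8, 16] : memref<1024x1024xf32> -> !xegpu.tensor_desc<8x16xf32>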

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OffsetSizeAndStrideOpInterface, ViewLikeOpInterface

Effects: MemoryEffects::Effect{}

Attributes: 

Attribute       | MLIR Type                    | Description
const_offsets   | ::mlir::DenseI64ArrayAttr    | i64 dense array attribute
const_shape     | ::mlir::DenseI64ArrayAttr    | i64 dense array attribute
const_strides   | ::mlir::DenseI64ArrayAttr    | i64 dense array attribute

Operands: 

Operand   | Description
source    | 1D/2D memref of signless, signed, or unsigned integer (1/8/16/32/64-bit), f16, f32, f64, bf16, or tf32 values; or a ui64, ui32, i64, or i32 value (an address represented as an integer)
offsets   | variadic of index
shape     | variadic of index
strides   | variadic of index

Results: 

Result       | Description
TensorDesc   | TensorDesc describing a region of interest in the data.

xegpu.load_nd (xegpu::LoadNdOp) 

Loads an n-D block from memory (represented by a TensorDesc) into registers (represented by a vector)

Syntax:

operation ::= `xegpu.load_nd` $TensorDesc prop-dict attr-dict `:` qualified(type($TensorDesc)) `->` type($value)

LoadNdOp essentially mimics the hardware block-read instruction, reading a block of data from memory into registers. It takes a set of optional cache hints, one for each level of cache (L1, L2 and L3). If the hardware does not have a corresponding cache, the corresponding cache hint attribute will be masked. The vnni transform is an Intel GPU hardware feature used to pack data during the load for the B operand of a matrix operation when the bit width of the data type is less than 32 bits, e.g., fp16. Transpose is another Intel hardware feature, which transposes the data while loading it when the data type is fp32 or fp64. This implies that vnni and transpose cannot be used at the same time.

Example:

  xegpu.load_nd %1 {transpose = [1, 0],
                    l1_hint = #xegpu.cache_hint<cached>, 
                    l2_hint = #xegpu.cache_hint<uncached>, 
                    l3_hint = #xegpu.cache_hint<streaming>}
          : !xegpu.tensor_desc<8x16xf32> -> vector<16x8xf32>
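
For the packed (vnni) form, a hedged sketch is shown below; the vnni_axis value and the 8x16x2 result shape follow my reading of the description above (packing factor 2 for f16) and are not taken verbatim from the dialect definition:

  // Packed load of the B operand: each pair of f16 rows is interleaved.
  xegpu.load_nd %2 {vnni_axis = 0,
                    l1_hint = #xegpu.cache_hint<cached>}
          : !xegpu.tensor_desc<16x16xf16> -> vector<8x16x2xf16>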

Attributes: 

Attribute   | MLIR Type                        | Description
vnni_axis   | ::mlir::IntegerAttr              | 64-bit signless integer attribute
transpose   | ::mlir::DenseI64ArrayAttr        | i64 dense array attribute
l1_hint     | ::mlir::xegpu::CachePolicyAttr   | Cache policy (see CachePolicyAttr below for the enum cases)
l2_hint     | ::mlir::xegpu::CachePolicyAttr   | Cache policy
l3_hint     | ::mlir::xegpu::CachePolicyAttr   | Cache policy

Operands: 

Operand      | Description
TensorDesc   | TensorDesc describing a region of interest in the data.

Results: 

Result   | Description
value    | vector of rank 1/2/3/4 with signless, signed, or unsigned integer (1/8/16/32/64-bit), f16, f32, f64, bf16, or tf32 elements; or a scalar of one of those element types

xegpu.prefetch_nd (xegpu::PrefetchNdOp) 

Prefetches an n-D block into the cache

Syntax:

operation ::= `xegpu.prefetch_nd` $TensorDesc prop-dict attr-dict `:` qualified(type($TensorDesc))

It issues an instruction to prefetch the data from memory into each level of the cache according to the corresponding cache policy.

Example:

  xegpu.prefetch_nd %tdesc {l1_hint = #xegpu.cache_hint<cached>, 
                            l2_hint = #xegpu.cache_hint<cached>, 
                            l3_hint = #xegpu.cache_hint<cached>}
    : !xegpu.tensor_desc<8x16xf16>
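
A prefetch typically reuses the same descriptor that a later load consumes. A minimal sketch, assuming a memref %A and an 8x16 f16 tile:

  %td = xegpu.create_nd_tdesc %A[%c0, %c0]
      : memref<4096x4096xf16> -> !xegpu.tensor_desc<8x16xf16>
  xegpu.prefetch_nd %td {l1_hint = #xegpu.cache_hint<cached>,
                         l2_hint = #xegpu.cache_hint<cached>}
      : !xegpu.tensor_desc<8x16xf16>
  // ... later, the same descriptor is read for real:
  %v = xegpu.load_nd %td : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>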

Attributes: 

Attribute   | MLIR Type                        | Description
l1_hint     | ::mlir::xegpu::CachePolicyAttr   | Cache policy (see CachePolicyAttr below for the enum cases)
l2_hint     | ::mlir::xegpu::CachePolicyAttr   | Cache policy
l3_hint     | ::mlir::xegpu::CachePolicyAttr   | Cache policy

Operands: 

Operand      | Description
TensorDesc   | TensorDesc describing a region of interest in the data.

xegpu.store_nd (xegpu::StoreNdOp) 

Stores an n-D block from registers back to memory; currently only 2D is supported

Syntax:

operation ::= `xegpu.store_nd` $value `,` $TensorDesc prop-dict attr-dict
              `:` type($value) `,` qualified(type($TensorDesc))

StoreNdOp essentially mimics the hardware block-write instruction, writing a block of data from registers into the memory region described by the TensorDesc. It takes a set of optional cache hints, one for each level of cache (L1, L2 and L3). If the hardware does not have a corresponding cache, the corresponding cache hint attribute will be masked.

Example:

  xegpu.store_nd %3, %2 {l1_hint = #xegpu.cache_hint<uncached>,
                         l2_hint = #xegpu.cache_hint<write_back>, 
                         l3_hint = #xegpu.cache_hint<write_through>}
                         : vector<8x16xf16>, !xegpu.tensor_desc<8x16xf16>
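
Since all three cache hints are optional, the op can also be written without them. A minimal sketch, assuming a value %v and a matching descriptor %td:

  xegpu.store_nd %v, %td : vector<8x16xf16>, !xegpu.tensor_desc<8x16xf16>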

Attributes: 

Attribute   | MLIR Type                        | Description
l1_hint     | ::mlir::xegpu::CachePolicyAttr   | Cache policy (see CachePolicyAttr below for the enum cases)
l2_hint     | ::mlir::xegpu::CachePolicyAttr   | Cache policy
l3_hint     | ::mlir::xegpu::CachePolicyAttr   | Cache policy

Operands: 

Operand      | Description
value        | vector of rank 1/2/3/4 with signless, signed, or unsigned integer (1/8/16/32/64-bit), f16, f32, f64, bf16, or tf32 elements; or a scalar of one of those element types
TensorDesc   | TensorDesc describing a region of interest in the data.

Attributes 

CachePolicyAttr 

Cache policy

Syntax:

#xegpu.cache_hint<
  ::mlir::xegpu::CachePolicy   # value
>

Enum cases:

  • cached (CACHED)
  • uncached (UNCACHED)
  • streaming (STREAMING)
  • read_invalidate (READ_INVALIDATE)
  • write_back (WRITE_BACK)
  • write_through (WRITE_THROUGH)

Parameters: 

Parameter   | C++ type                      | Description
value       | ::mlir::xegpu::CachePolicy    | an enum of type CachePolicy

MemoryScopeAttr 

The address space of the memory the tensor descriptor is created for

Syntax:

#xegpu.memory_scope<
  ::mlir::xegpu::MemoryScope   # value
>

Enum cases:

  • global (Global)
  • slm (SLM)

Parameters: 

Parameter   | C++ type                      | Description
value       | ::mlir::xegpu::MemoryScope    | an enum of type MemoryScope

TensorDescAttr 

Syntax:

#xegpu.tdesc_attr<
  MemoryScopeAttr,   # memory_scope
  IntegerAttr,   # array_length
  BoolAttr   # boundary_check
>

Parameters: 

Parameter        | C++ type          | Description
memory_scope     | MemoryScopeAttr   |
array_length     | IntegerAttr       | default: 1
boundary_check   | BoolAttr          | default: true

Types 

TensorDescType 

TensorDesc describing a region of interest in the data.

TensorDesc is a type designed to describe a region of interest in the data as well as some features that are unique to Intel hardware. Unlike the builtin tensor type in MLIR, it essentially contains only metadata and does not hold the data itself. It is designed mainly to support 2D block load/store and DPAS (the matrix-multiplication instruction) on Intel GPUs. It encodes the following information:

  • shape: the sizes/shape of the data block of interest, e.g., 8x16 means 8 rows, each containing 16 contiguous data elements. The rows may or may not be contiguous with each other, depending on whether the encoding attribute is set.
  • element_type: the data type of each element, e.g., f16, f32.

Similar to the builtin tensor, it also provides an optional encoding attribute that carries the following information via a TensorDescAttr object:

  • memory_scope (xegpu::MemoryScope): [optional] where the data is located: global memory or shared memory. The default is global.
  • array_length (int): [optional] the number of contiguous blocks of size shape that a block load loads at a time. The default is 1.
  • boundary_check (bool): [optional] indicates whether the operation detects out-of-boundary accesses and pads them with zero. The default is to perform the boundary check.

Syntax:

TensorDesc-type ::= `tensor_desc` `<` dim-list element-type (attr-list)? `>`
element-type ::= float-type | integer-type | index-type
dim-list := (static-dim-list `x`)?
static-dim-list ::= decimal-literal `x` decimal-literal
attr-list = (, memory_scope = value)? (, arr_len = value)? (, boundary_check = value)?

Examples:

// A block TensorDesc with 8x16 i32 elements
xegpu.tensor_desc<8x16xi32>

// A block TensorDesc with 8x16 f32 elements
xegpu.tensor_desc<8x16xf32>

// A TensorDesc with 8x16 f32 elements for a memory region in shared memory space.
xegpu.tensor_desc<8x16xf32, #xegpu.tdesc_attr<memory_scope = slm>>
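
The remaining attributes can be combined in the same way. A hedged sketch; the exact printed form of the array_length and boundary_check parameters is an assumption based on the TensorDescAttr parameters above:

// A TensorDesc for two contiguous 8x16 f16 blocks in global memory,
// with out-of-boundary accesses padded with zero.
xegpu.tensor_desc<8x16xf16, #xegpu.tdesc_attr<memory_scope = global, array_length = 2, boundary_check = true>>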

Parameters: 

Parameter     | C++ type                      | Description
shape         | ::llvm::ArrayRef<int64_t>     |
elementType   | mlir::Type                    |
encoding      | mlir::Attribute               |