MLIR

Multi-Level IR Compiler Framework

'nvvm' Dialect

The NVVM dialect is MLIR’s LLVM-IR-based, NVIDIA-specific backend dialect. It models NVVM intrinsics and public ISA functionality and introduces NVIDIA extensions to the MLIR/LLVM type system and address spaces (e.g., global, shared, and cluster memory), enabling faithful lowering of GPU kernels to the NVPTX toolchain. While a NVVM op usually maps to a single LLVM IR intrinsic, the NVVM dialect uses type polymorphism and other attributes so that a single NVVM op can map to different LLVM intrinsics.

Scope and Capabilities 

The dialect covers core GPU features such as thread/block builtins, barriers and atomics, warp-level collectives (e.g., shuffle/vote), matrix/tensor core operations (e.g., mma.sync, wgmma), tensor memory accelerator (TMA) operations, asynchronous copies (cp.async, bulk/tensor variants) with memory barriers, cache and prefetch controls, and NVVM-specific attributes and enums (e.g., FP rounding modes, memory scopes, and MMA types/layouts).

Placement in the Lowering Pipeline 

NVVM sits below target-agnostic dialects like gpu and NVIDIA’s nvgpu. Typical pipelines convert gpu/nvgpu ops into NVVM using -convert-gpu-to-nvvm and -convert-nvgpu-to-nvvm, then translate into LLVM for final code generation via NVPTX backend.

Target Configuration and Serialization 

NVVM provides a #nvvm.target attribute to describe the GPU target (SM, features, and flags). In conjunction with gpu serialization (e.g., gpu-module-to-binary), this enables producing architecture-specific GPU binaries (such as CUBIN) from nested GPU modules.

Inline PTX 

When an intrinsic is unavailable or a performance-critical sequence must be expressed directly, NVVM provides an nvvm.inline_ptx op to embed PTX inline as a last-resort escape hatch, with explicit operands and results.

Memory Spaces 

The NVVM dialect introduces the following memory spaces, each with distinct scopes and lifetimes:

Memory SpaceAddress SpaceScope
generic0All threads
global1All threads (device)
shared3Thread block (CTA)
constant4All threads
local5Single thread
tensor6Thread block (CTA)
shared_cluster7Thread block cluster

Memory Space Details 

  • generic: Can point to any memory space; requires runtime resolution of actual address space. Use when pointer origin is unknown at compile time. Performance varies based on the underlying memory space. A pointer to this memory space is represented by LLVM_PointerGeneric in the NVVM Ops.
  • global: Accessible by all threads across all blocks; persists across kernel launches. Highest latency but largest capacity (device memory). Best for large data and inter-kernel communication. A pointer to this memory space is represented by LLVM_PointerGlobal in the NVVM Ops.
  • shared: Shared within a thread block (CTA); very fast on-chip memory for cooperation between threads in the same block. Limited capacity. Ideal for block-level collaboration, caching, and reducing global memory traffic. This memory is usually referred as shared_cta in the NVVMOps and as shared::cta in the PTX ISA. A pointer to this memory space is represented by the LLVM_PointerShared type in the NVVM Ops.
  • constant: Read-only memory cached per SM. Size typically limited to 64KB. Best for read-only data and uniform values accessed by all threads. A pointer to this memory space is represented by LLVM_PointerConst type in NVVM Ops.
  • local: Private to each thread. Use for per-thread private data and automatic variables that don’t fit in registers. A pointer to this memory is represented by LLVM_PointerLocal type in NVVM Ops.
  • tensor: Special memory space for tensor core operations. Used by tcgen05 instructions on SM 100+ for tensor input/output operations. A pointer to this memory space is represented by the LLVM_PointerTensor type in the NVVM Ops.
  • shared_cluster: Distributed shared memory across thread blocks within a cluster (SM 90+). Enables collaboration beyond single-block scope with fast access across cluster threads. This memory is usually referred as shared_cluster in the NVVMOps and as shared::cluster in the PTX ISA. A pointer to this memory space is represented by the LLVM_PointerSharedCluster type in the NVVM Ops.

MBarrier objects 

An mbarrier is a barrier created in shared memory that supports synchronizing any subset of threads within a CTA. An mbarrier object is an opaque object in shared memory with .b64 type and an alignment of 8-bytes. Unlike nvvm.barrier Op which can access only a limited number of barriers per CTA, the mbarrier objects are user-defined and are only limited by the total shared memory size available. The list of operations supported on an mbarrier object is exposed through the nvvm.mbarrier.* family of NVVM Ops.

Non-Goals 

NVVM is not a place for convenience or “wrapper” ops. It is not intended to introduce high-level ops that expand into multiple unrelated NVVM intrinsics or that lower to no intrinsic at all. Such abstractions belong in higher-level dialects (e.g., nvgpu, gpu, or project-specific dialects). The design intent is a thin, predictable, low-level surface with near-mechanical lowering to NVVM/LLVM IR.

Operations 

All operations in the NVIDIA’s instruction set have a custom form in MLIR. The mnemonic of an operation is that used in LLVM IR prefixed with “nvvm.”.

source

nvvm.barrier0 (NVVM::Barrier0Op) 

CTA Barrier Synchronization Op (Barrier ID 0)

Syntax:

operation ::= `nvvm.barrier0` attr-dict

The nvvm.barrier0 operation is a convenience operation that performs barrier synchronization and communication within a CTA (Cooperative Thread Array) using barrier ID 0. It is functionally equivalent to nvvm.barrier or nvvm.barrier id=0.

For more information, see PTX ISA

nvvm.barrier.arrive (NVVM::BarrierArriveOp) 

Syntax:

operation ::= `nvvm.barrier.arrive` (`id` `=` $barrierId^)? `number_of_threads` `=` $numberOfThreads attr-dict

Thread that executes this op announces their arrival at the barrier with given id and continue their execution.

The default barrier id is 0 that is similar to nvvm.barrier Op. When barrierId is not present, the default barrier id is used.

For more information, see PTX ISA

Interfaces: BasicPtxBuilderInterface

Operands: 

OperandDescription
barrierId32-bit signless integer
numberOfThreads32-bit signless integer

nvvm.barrier (NVVM::BarrierOp) 

CTA Barrier Synchronization Op

Syntax:

operation ::= `nvvm.barrier` (`id` `=` $barrierId^)? (`number_of_threads` `=` $numberOfThreads^)? (qualified($reductionOp)^ $reductionPredicate)? (`->` type($res)^)? attr-dict

The nvvm.barrier operation performs barrier synchronization and communication within a CTA (Cooperative Thread Array). It causes executing threads to wait for all non-exited threads participating in the barrier to arrive.

The operation takes two optional operands:

  • barrierId: Specifies a logical barrier resource with value 0 through 15. Each CTA instance has sixteen barriers numbered 0..15. Defaults to 0 if not specified.
  • numberOfThreads: Specifies the number of threads participating in the barrier. When specified, the value must be a multiple of the warp size. If not specified, all threads in the CTA participate in the barrier.
  • reductionOp: specifies the reduction operation (popc, and, or).
  • reductionPredicate: specifies the predicate to be used with the reductionOp.

The barrier operation guarantees that when the barrier completes, prior memory accesses requested by participating threads are performed relative to all threads participating in the barrier. It also ensures that no new memory access is requested by participating threads before the barrier completes.

When a barrier completes, the waiting threads are restarted without delay, and the barrier is reinitialized so that it can be immediately reused.

This operation generates an aligned barrier, indicating that all threads in the CTA will execute the same barrier instruction. Behavior is undefined if all threads in the CTA do not reach this instruction.

For more information, see PTX ISA

Traits: AttrSizedOperandSegments

Attributes: 

AttributeMLIR TypeDescription
reductionOp::mlir::NVVM::BarrierReductionAttrNVVM barrier reduction operation

Operands: 

OperandDescription
barrierId32-bit signless integer
numberOfThreads32-bit signless integer
reductionPredicate32-bit signless integer

Results: 

ResultDescription
res32-bit signless integer

nvvm.read.ptx.sreg.ntid.x (NVVM::BlockDimXOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.ntid.x` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`<` `i`(width($lower)) $lower `,` $upper `>`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.ntid.y (NVVM::BlockDimYOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.ntid.y` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`<` `i`(width($lower)) $lower `,` $upper `>`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.ntid.z (NVVM::BlockDimZOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.ntid.z` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`<` `i`(width($lower)) $lower `,` $upper `>`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.ctaid.x (NVVM::BlockIdXOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.ctaid.x` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`<` `i`(width($lower)) $lower `,` $upper `>`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.ctaid.y (NVVM::BlockIdYOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.ctaid.y` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`<` `i`(width($lower)) $lower `,` $upper `>`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.ctaid.z (NVVM::BlockIdZOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.ctaid.z` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`<` `i`(width($lower)) $lower `,` $upper `>`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.cluster.ctaid.x (NVVM::BlockInClusterIdXOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.cluster.ctaid.x` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait, NVVMRequiresSM<90>

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.cluster.ctaid.y (NVVM::BlockInClusterIdYOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.cluster.ctaid.y` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait, NVVMRequiresSM<90>

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.cluster.ctaid.z (NVVM::BlockInClusterIdZOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.cluster.ctaid.z` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait, NVVMRequiresSM<90>

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.breakpoint (NVVM::Breakpoint) 

Breakpoint Op

Syntax:

operation ::= `nvvm.breakpoint` attr-dict

Breakpoint suspends execution of the program for debugging. For more information, see PTX ISA

nvvm.st.bulk (NVVM::BulkStoreOp) 

Bulk Store Op

Syntax:

operation ::= `nvvm.st.bulk` $addr `,` `size` `=` $size (`,` `init` `=` $initVal^)? attr-dict `:` type($addr)

Initializes a region of shared memory at the address given by addr. The size operand specifies the number of bytes to initialize and must be a multiple of 8. The initVal operand specifies the value to initialize the memory to. The only supported value is 0.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
initVal::mlir::IntegerAttr64-bit signless integer attribute

Operands: 

OperandDescription
addrLLVM pointer in address space 0 or LLVM pointer in address space 3
size64-bit signless integer

nvvm.read.ptx.sreg.clock64 (NVVM::Clock64Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.clock64` attr-dict `:` type($res)

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.clock (NVVM::ClockOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.clock` attr-dict `:` type($res)

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.cluster.arrive (NVVM::ClusterArriveOp) 

Cluster Barrier Arrive Op

Syntax:

operation ::= `nvvm.cluster.arrive` attr-dict

The cluster.arrive can be used by the threads within the cluster for synchronization and communication. The cluster.arrive instruction marks the warps’ arrival at the barrier without causing the executing thread to wait for other participating threads.

The aligned attribute, when provided, generates the .aligned version of the PTX instruction.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
aligned::mlir::UnitAttrunit attribute

nvvm.cluster.arrive.relaxed (NVVM::ClusterArriveRelaxedOp) 

Cluster Barrier Relaxed Arrive Op

Syntax:

operation ::= `nvvm.cluster.arrive.relaxed` attr-dict

The cluster.arrive can be used by the threads within the cluster for synchronization and communication. The cluster.arrive instruction marks the warps’ arrival at the barrier without causing the executing thread to wait for other participating threads.

The aligned attribute, when provided, generates the .aligned version of the PTX instruction. The .relaxed qualifier on cluster.arrive specifies that there are no memory ordering and visibility guarantees provided for the memory accesses performed prior to cluster.arrive.

For more information, see PTX ISA

Traits: NVVMRequiresSM<90>

Attributes: 

AttributeMLIR TypeDescription
aligned::mlir::UnitAttrunit attribute

nvvm.read.ptx.sreg.cluster.nctarank (NVVM::ClusterDim) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.cluster.nctarank` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.cluster.nctaid.x (NVVM::ClusterDimBlocksXOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.cluster.nctaid.x` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait, NVVMRequiresSM<90>

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.cluster.nctaid.y (NVVM::ClusterDimBlocksYOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.cluster.nctaid.y` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait, NVVMRequiresSM<90>

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.cluster.nctaid.z (NVVM::ClusterDimBlocksZOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.cluster.nctaid.z` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.nclusterid.x (NVVM::ClusterDimXOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.nclusterid.x` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.nclusterid.y (NVVM::ClusterDimYOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.nclusterid.y` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.nclusterid.z (NVVM::ClusterDimZOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.nclusterid.z` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.cluster.ctarank (NVVM::ClusterId) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.cluster.ctarank` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait, NVVMRequiresSM<90>

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.clusterid.x (NVVM::ClusterIdXOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.clusterid.x` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait, NVVMRequiresSM<90>

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.clusterid.y (NVVM::ClusterIdYOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.clusterid.y` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.clusterid.z (NVVM::ClusterIdZOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.clusterid.z` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.clusterlaunchcontrol.query.cancel (NVVM::ClusterLaunchControlQueryCancelOp) 

Query the response of a clusterlaunchcontrol.try.cancel operation

Syntax:

operation ::= `nvvm.clusterlaunchcontrol.query.cancel` `query` `=` $query_type `,` $try_cancel_response attr-dict `:` type($res)

clusterlaunchcontrol.query.cancel queries the response of a clusterlaunchcontrol.try.cancel operation specified by operand try_cancel_response.

Operand query_type specifies the type of query to perform and can be one of the following:

  • is_canceled : Returns true if the try cancel request succeeded, and false otherwise.
  • get_first_cta_id_{x/y/z} : Returns the x, y, or z coordinate of the first CTA in the canceled cluster. Behaviour is defined only if the try cancel request succeeded.

For more information, see PTX ISA

Traits: NVVMRequiresSM<100>

Attributes: 

AttributeMLIR TypeDescription
query_type::mlir::NVVM::ClusterLaunchControlQueryTypeAttr
NVVM ClusterLaunchControlQueryType

Enum cases:

  • is_canceled (IS_CANCELED)
  • get_first_cta_id_x (GET_FIRST_CTA_ID_X)
  • get_first_cta_id_y (GET_FIRST_CTA_ID_Y)
  • get_first_cta_id_z (GET_FIRST_CTA_ID_Z)

Operands: 

OperandDescription
try_cancel_response128-bit signless integer

Results: 

ResultDescription
res1-bit signless integer or 32-bit signless integer

nvvm.clusterlaunchcontrol.try.cancel (NVVM::ClusterLaunchControlTryCancelOp) 

Request atomically canceling the launch of a cluster that has not started running yet

Syntax:

operation ::= `nvvm.clusterlaunchcontrol.try.cancel` (`multicast` $multicast^ `,`)? $smemAddress `,` $mbarrier attr-dict

clusterlaunchcontrol.try.cancel requests atomically canceling the launch of a cluster that has not started running yet. It asynchronously writes an opaque response to shared memory indicating whether the operation succeeded or failed.

Operand smemAddress specifies the naturally aligned address of the 16-byte wide shared memory location where the request’s response is written.

Operand mbarrier specifies the mbarrier object used to track the completion of the asynchronous operation.

If multicast is specified, the response is asynchronously written to the corresponding local shared memory location (specifed by addr) of each CTA in the requesting cluster.

For more information, see PTX ISA

Traits: NVVMRequiresSM<100>

Attributes: 

AttributeMLIR TypeDescription
multicast::mlir::UnitAttrunit attribute

Operands: 

OperandDescription
smemAddressLLVM pointer in address space 3
mbarrierLLVM pointer in address space 3

nvvm.cluster.wait (NVVM::ClusterWaitOp) 

Cluster Barrier Wait Op

Syntax:

operation ::= `nvvm.cluster.wait` attr-dict

The cluster.wait causes the executing thread to wait for all non-exited threads of the cluster to perform cluster.arrive. The aligned attribute, when provided, generates the .aligned version of the PTX instruction.

For more information, see PTX ISA

Traits: NVVMRequiresSM<90>

Attributes: 

AttributeMLIR TypeDescription
aligned::mlir::UnitAttrunit attribute

nvvm.convert.bf16x2.to.f8x2 (NVVM::ConvertBF16x2ToF8x2Op) 

Convert a pair of bf16 inputs to f8x2

Syntax:

operation ::= `nvvm.convert.bf16x2.to.f8x2` $a attr-dict `:` type($a) `->` type($dst) `(` $dstTy `)`

This Op converts the given bf16 inputs in a bf16x2 vector to the specified f8 type. The result dst is represented as an i16 type or as a vector of two i8 types. If dst is returned as an i16 type, the converted values from a are packed such that the value converted from the first element of a is stored in the upper 8 bits of dst and the value converted from the second element of a is stored in the lower 8 bits of dst. If dst is returned as a vector type, each converted value is stored as an i8 element in the vector. The rnd and sat attributes specify the rounding and saturation modes respectively.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
rnd::mlir::NVVM::FPRoundingModeAttrNVVM FPRoundingMode kind
sat::mlir::NVVM::SaturationModeAttrNVVM SaturationMode kind
dstTy::mlir::TypeAttrany type attribute

Operands: 

OperandDescription
avector of bfloat16 type values of length 2

Results: 

ResultDescription
dst16-bit signless integer or vector of 8-bit signless integer values of length 2

nvvm.convert.f16x2.to.f8x2 (NVVM::ConvertF16x2ToF8x2Op) 

Convert an f16x2 input to f8x2

Syntax:

operation ::= `nvvm.convert.f16x2.to.f8x2` $a attr-dict `:` type($a) `->` type($dst) `(` $dstTy `)`

This Op converts the given f16 inputs in an f16x2 vector to the specified f8 type. The result dst is represented as an i16 type or as a vector of two i8 types. If dst is returned as an i16 type, the converted values from a are packed such that the value converted from the first element of a is stored in the upper 8 bits of dst and the value converted from the second element of a is stored in the lower 8 bits of dst. If dst is returned as a vector type, each converted value is stored as an i8 element in the vector. The relu attribute, when set, lowers to the ‘.relu’ variant of the cvt instruction.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
relu::mlir::BoolAttrbool attribute
dstTy::mlir::TypeAttrany type attribute

Operands: 

OperandDescription
avector of 16-bit float values of length 2

Results: 

ResultDescription
dst16-bit signless integer or vector of 8-bit signless integer values of length 2

nvvm.convert.f32x2.to.bf16x2 (NVVM::ConvertF32x2ToBF16x2Op) 

Convert two F32 values to packed bf16x2.

Syntax:

operation ::= `nvvm.convert.f32x2.to.bf16x2` $src_hi `,` $src_lo (`,` $random_bits^)? attr-dict `:` type($dst)

Converts two F32 values to packed bf16x2 format with the specified rounding mode. The src_hi and src_lo parameters correspond to operands a and b in the PTX ISA, respectively.

The random_bits parameter is required for stochastic rounding and provides the random bits to be used for the conversion.

The relu attribute clamps negative results to 0.

The sat attribute determines saturation behavior.

For more information, see PTX ISA

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
rnd::mlir::NVVM::FPRoundingModeAttrNVVM FPRoundingMode kind
sat::mlir::NVVM::SaturationModeAttrNVVM SaturationMode kind
relu::mlir::BoolAttrbool attribute

Operands: 

OperandDescription
src_hi32-bit float
src_lo32-bit float
random_bits32-bit signless integer

Results: 

ResultDescription
dstvector of bfloat16 type values of length 2

nvvm.convert.f32x2.to.f16x2 (NVVM::ConvertF32x2ToF16x2Op) 

Convert two F32 values to packed f16x2.

Syntax:

operation ::= `nvvm.convert.f32x2.to.f16x2` $src_hi `,` $src_lo (`,` $random_bits^)? attr-dict `:` type($dst)

Converts two F32 values to packed f16x2 format with the specified rounding mode. The src_hi and src_lo parameters correspond to operands a and b in the PTX ISA, respectively.

The random_bits parameter is required for stochastic rounding and provides the random bits to be used for the conversion.

The relu attribute clamps negative results to 0.

The sat attribute determines saturation behavior.

For more information, see PTX ISA

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
rnd::mlir::NVVM::FPRoundingModeAttrNVVM FPRoundingMode kind
sat::mlir::NVVM::SaturationModeAttrNVVM SaturationMode kind
relu::mlir::BoolAttrbool attribute

Operands: 

OperandDescription
src_hi32-bit float
src_lo32-bit float
random_bits32-bit signless integer

Results: 

ResultDescription
dstvector of 16-bit float values of length 2

nvvm.convert.f32x2.to.f4x2 (NVVM::ConvertF32x2ToF4x2Op) 

Convert a pair of float inputs to f4x2

Syntax:

operation ::= `nvvm.convert.f32x2.to.f4x2` $a `,` $b attr-dict `:` type($dst) `(` $dstTy `)`

This Op converts each of the given float inputs to the specified fp4 type. The result dst is returned as an i8 type where the converted values are packed such that the value converted from a is stored in the upper 4 bits of dst and the value converted from b is stored in the lower 4 bits of dst. The relu attribute, when set, lowers to the ‘.relu’ variant of the cvt instruction.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
relu::mlir::BoolAttrbool attribute
dstTy::mlir::TypeAttrany type attribute

Operands: 

OperandDescription
a32-bit float
b32-bit float

Results: 

ResultDescription
dst8-bit signless integer

nvvm.convert.f32x2.to.f6x2 (NVVM::ConvertF32x2ToF6x2Op) 

Convert a pair of float inputs to f6x2

Syntax:

operation ::= `nvvm.convert.f32x2.to.f6x2` $a `,` $b attr-dict `:` type($dst) `(` $dstTy `)`

This Op converts each of the given float inputs to the specified fp6 type. The result dst is represented either as an i16 type or as a vector of two i8 types. If dst is returned as an i16 type, the converted values are packed such that the value converted from a is stored in the upper 8 bits of dst with 2 MSB bits padded with zeros and the value converted from b is stored in the lower 8 bits of dst with 2 MSB bits padded with zeros. If dst is returned as a vector type, each converted value is stored as an i8 element in the vector. The relu attribute, when set, lowers to the ‘.relu’ variant of the cvt instruction.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
relu::mlir::BoolAttrbool attribute
dstTy::mlir::TypeAttrany type attribute

Operands: 

OperandDescription
a32-bit float
b32-bit float

Results: 

ResultDescription
dst16-bit signless integer or vector of 8-bit signless integer values of length 2

nvvm.convert.f32x2.to.f8x2 (NVVM::ConvertF32x2ToF8x2Op) 

Convert a pair of float inputs to f8x2

Syntax:

operation ::= `nvvm.convert.f32x2.to.f8x2` $a `,` $b attr-dict `:` type($dst) `(` $dstTy `)`

This Op converts each of the given float inputs to the specified fp8 type. The result dst is represented as an i16 type or as a vector of two i8 types. If dst is returned as an i16 type, the converted values are packed such that the value converted from a is stored in the upper 8 bits of dst and the value converted from b is stored in the lower 8 bits of dst. If dst is returned as a vector type, each converted value is stored as an i8 element in the vector. The rnd and sat attributes specify the rounding and saturation modes respectively. The relu attribute, when set, lowers to the ‘.relu’ variant of the cvt instruction.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
rnd::mlir::NVVM::FPRoundingModeAttrNVVM FPRoundingMode kind
sat::mlir::NVVM::SaturationModeAttrNVVM SaturationMode kind
relu::mlir::BoolAttrbool attribute
dstTy::mlir::TypeAttrany type attribute

Operands: 

OperandDescription
a32-bit float
b32-bit float

Results: 

ResultDescription
dst16-bit signless integer or vector of 8-bit signless integer values of length 2

nvvm.convert.f32x4.to.f4x4 (NVVM::ConvertF32x4ToF4x4Op) 

Convert vector<4xf32> to packed f4x4 with stochastic rounding (.rs) and satfinite

Syntax:

operation ::= `nvvm.convert.f32x4.to.f4x4` $src `,` $rbits attr-dict `:` type($src) `->` type($dst) `(` $dstTy `)`

Converts a vector<4xf32> to packed f4x4 format using stochastic rounding (.rs) mode with SATFINITE saturation. Randomness is provided by the rbits parameter. The dstTy attribute specifies the target floating-point format. The relu attribute clamps negative results to 0.

Note: These operations always use RS rounding mode and SATFINITE saturation mode.

For more information, see PTX ISA

Traits: AlwaysSpeculatableImplTrait, NVVMRequiresSMa<100,103>

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
relu::mlir::BoolAttrbool attribute
dstTy::mlir::TypeAttrany type attribute

Operands: 

OperandDescription
srcvector of 32-bit float values of length 4
rbits32-bit signless integer

Results: 

ResultDescription
dst16-bit signless integer

nvvm.convert.f32x4.to.f6x4 (NVVM::ConvertF32x4ToF6x4Op) 

Convert vector<4xf32> to packed f6x4 with stochastic rounding (.rs) and satfinite

Syntax:

operation ::= `nvvm.convert.f32x4.to.f6x4` $src `,` $rbits attr-dict `:` type($src) `->` type($dst) `(` $dstTy `)`

Converts a vector<4xf32> to packed f6x4 format using stochastic rounding (.rs) mode with SATFINITE saturation. Randomness is provided by the rbits parameter. The dstTy attribute specifies the target floating-point format. The relu attribute clamps negative results to 0.

Note: These operations always use RS rounding mode and SATFINITE saturation mode.

For more information, see PTX ISA

Traits: AlwaysSpeculatableImplTrait, NVVMRequiresSMa<100,103>

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
relu::mlir::BoolAttrbool attribute
dstTy::mlir::TypeAttrany type attribute

Operands: 

OperandDescription
srcvector of 32-bit float values of length 4
rbits32-bit signless integer

Results: 

ResultDescription
dstvector of 8-bit signless integer values of length 4

nvvm.convert.f32x4.to.f8x4 (NVVM::ConvertF32x4ToF8x4Op) 

Convert vector<4xf32> to packed f8x4 with stochastic rounding (.rs) and satfinite

Syntax:

operation ::= `nvvm.convert.f32x4.to.f8x4` $src `,` $rbits attr-dict `:` type($src) `->` type($dst) `(` $dstTy `)`

Converts a vector<4xf32> to packed f8x4 format using stochastic rounding (.rs) mode with SATFINITE saturation. Randomness is provided by the rbits parameter. The dstTy attribute specifies the target floating-point format. The relu attribute clamps negative results to 0.

Note: These operations always use RS rounding mode and SATFINITE saturation mode.

For more information, see PTX ISA

Traits: AlwaysSpeculatableImplTrait, NVVMRequiresSMa<100,103>

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
relu::mlir::BoolAttrbool attribute
dstTy::mlir::TypeAttrany type attribute

Operands: 

OperandDescription
srcvector of 32-bit float values of length 4
rbits32-bit signless integer

Results: 

ResultDescription
dstvector of 8-bit signless integer values of length 4

nvvm.convert.f4x2.to.f16x2 (NVVM::ConvertF4x2ToF16x2Op) 

Convert a pair of f4 inputs to f16x2

Syntax:

operation ::= `nvvm.convert.f4x2.to.f16x2` $src attr-dict `:` type($src) `(` $srcType `)` `->` type($dst)

This Op converts the given f4 inputs in a packed i8 to f16.

The result dst is represented as a vector of f16 elements. The relu attribute, when set, lowers to the ‘.relu’ variant of the cvt instruction."

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
relu::mlir::BoolAttrbool attribute
srcType::mlir::TypeAttrany type attribute

Operands: 

OperandDescription
src8-bit signless integer

Results: 

ResultDescription
dstvector of 16-bit float values of length 2

nvvm.convert.f6x2.to.f16x2 (NVVM::ConvertF6x2ToF16x2Op) 

Convert a pair of f6 inputs to f16x2

Syntax:

operation ::= `nvvm.convert.f6x2.to.f16x2` $src attr-dict `:` type($src) `(` $srcType `)` `->` type($dst)

This Op converts the given f6 inputs in a i8x2 vector to f16.

The result dst is represented as a vector of f16 elements. The relu attribute, when set, lowers to the ‘.relu’ variant of the cvt instruction."

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
relu::mlir::BoolAttrbool attribute
srcType::mlir::TypeAttrany type attribute

Operands: 

OperandDescription
srcvector of 8-bit signless integer values of length 2

Results: 

ResultDescription
dstvector of 16-bit float values of length 2

nvvm.convert.f8x2.to.bf16x2 (NVVM::ConvertF8x2ToBF16x2Op) 

Convert a pair of f8 inputs to bf16x2

Syntax:

operation ::= `nvvm.convert.f8x2.to.bf16x2` $src attr-dict `:` type($src) `(` $srcType `)` `->` type($dst)

This Op converts the given f8 inputs in a i8x2 vector to bf16.

The result dst is represented as a vector of bf16 elements.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
srcType::mlir::TypeAttrany type attribute

Operands: 

OperandDescription
srcvector of 8-bit signless integer values of length 2

Results: 

ResultDescription
dstvector of bfloat16 type values of length 2

nvvm.convert.f8x2.to.f16x2 (NVVM::ConvertF8x2ToF16x2Op) 

Convert a pair of f8 inputs to f16x2

Syntax:

operation ::= `nvvm.convert.f8x2.to.f16x2` $src attr-dict `:` type($src) `(` $srcType `)` `->` type($dst)

This Op converts the given f8 inputs in a i8x2 vector to f16.

The result dst is represented as a vector of f16 elements. The relu attribute, when set, lowers to the ‘.relu’ variant of the cvt instruction."

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
relu::mlir::BoolAttrbool attribute
srcType::mlir::TypeAttrany type attribute

Operands: 

OperandDescription
srcvector of 8-bit signless integer values of length 2

Results: 

ResultDescription
dstvector of 16-bit float values of length 2

nvvm.convert.float.to.tf32 (NVVM::ConvertFloatToTF32Op) 

Convert the given float input to TF32

Syntax:

operation ::= `nvvm.convert.float.to.tf32` $src attr-dict

This Op converts the given f32 input to tf32. The result res is represented as an i32 type. The relu attribute, when set, lowers to the ‘.relu’ variant of the cvt instruction. The rnd and sat attributes specify the the rounding and saturation modes respectively.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
rnd::mlir::NVVM::FPRoundingModeAttrNVVM FPRoundingMode kind
sat::mlir::NVVM::SaturationModeAttrNVVM SaturationMode kind
relu::mlir::BoolAttrbool attribute

Operands: 

OperandDescription
src32-bit float

Results: 

ResultDescription
res32-bit signless integer

nvvm.cp.async.bulk.commit.group (NVVM::CpAsyncBulkCommitGroupOp) 

Syntax:

operation ::= `nvvm.cp.async.bulk.commit.group` attr-dict

This Op commits all prior initiated but uncommitted cp.async.bulk instructions into a cp.async.bulk-group.

For more information, see PTX ISA

nvvm.cp.async.bulk.shared.cluster.global (NVVM::CpAsyncBulkGlobalToSharedClusterOp) 

Async bulk copy from global to Shared {cta or cluster} memory

Syntax:

operation ::= `nvvm.cp.async.bulk.shared.cluster.global` $dstMem `,` $srcMem `,` $mbar `,` $size
              (`multicast_mask` `=` $multicastMask^ )?
              (`l2_cache_hint` `=` $l2CacheHint^ )?
              attr-dict  `:` type($dstMem) `,` type($srcMem)

Initiates an asynchronous copy operation from global memory to shared memory or shared_cluster memory.

The multicastMask operand is optional and can be used only when the destination is shared::cluster memory. When it is present, this Op copies data from global memory to shared memory of multiple CTAs in the cluster. Operand multicastMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit multicastMask operand corresponds to the nvvm.read.ptx.sreg.ctaid of the destination CTA.

The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.

For more information, see PTX ISA

Traits: AttrSizedOperandSegments

Operands: 

OperandDescription
dstMemLLVM pointer in address space 3 or LLVM pointer in address space 7
srcMemLLVM pointer in address space 1
mbarLLVM pointer in address space 3
size32-bit signless integer
multicastMask16-bit signless integer
l2CacheHint64-bit signless integer

nvvm.cp.async.bulk.prefetch (NVVM::CpAsyncBulkPrefetchOp) 

Async bulk prefetch from global memory to L2 cache

Syntax:

operation ::= `nvvm.cp.async.bulk.prefetch` $srcMem `,` $size (`l2_cache_hint` `=` $l2CacheHint^ )?
              attr-dict  `:` type($srcMem)

Initiates an asynchronous prefetch of data from the location specified by srcMem to the L2 cache.

The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.

Example:

  nvvm.cp.async.bulk.prefetch %src, %size : !llvm.ptr<1>

  // with l2_cache_hint
  nvvm.cp.async.bulk.prefetch %src, %size l2_cache_hint = %ch : !llvm.ptr<1>

For more information, see PTX ISA

Operands: 

OperandDescription
srcMemLLVM pointer in address space 1
size32-bit signless integer
l2CacheHint64-bit signless integer

nvvm.cp.async.bulk.global.shared.cta (NVVM::CpAsyncBulkSharedCTAToGlobalOp) 

Async bulk copy from Shared CTA memory to Global memory

Syntax:

operation ::= `nvvm.cp.async.bulk.global.shared.cta` $dstMem `,` $srcMem `,` $size
              (`l2_cache_hint` `=` $l2CacheHint^ )?
              (`byte_mask` `=` $byteMask^ )?
              attr-dict `:` type($dstMem) `,` type($srcMem)

Initiates an asynchronous copy operation from Shared CTA memory to global memory. The 32-bit operand size specifies the amount of memory to be copied, in terms of number of bytes. size must be a multiple of 16. The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access. The byteMask operand is optional. The i-th bit in the 16-bit wide byteMask specifies whether the i-th byte of each 16-byte wide chunk of source data is copied to the destination. If the bit is set, the byte is copied.

Example:

  nvvm.cp.async.bulk.global.shared.cta %dst, %src, %size
      : !llvm.ptr<1>, !llvm.ptr<3>

  // with l2_cache_hint
  nvvm.cp.async.bulk.global.shared.cta %dst, %src, %size l2_cache_hint = %ch
      : !llvm.ptr<1>, !llvm.ptr<3>

  // with byte_mask
  nvvm.cp.async.bulk.global.shared.cta %dst, %src, %size byte_mask = %mask
      : !llvm.ptr<1>, !llvm.ptr<3>

  // with both l2_cache_hint and byte_mask
  nvvm.cp.async.bulk.global.shared.cta %dst, %src, %size l2_cache_hint = %ch byte_mask = %mask
      : !llvm.ptr<1>, !llvm.ptr<3>

For more information, see PTX ISA

Traits: AttrSizedOperandSegments

Operands: 

OperandDescription
dstMemLLVM pointer in address space 1
srcMemLLVM pointer in address space 3
size32-bit signless integer
l2CacheHint64-bit signless integer
byteMask16-bit signless integer

nvvm.cp.async.bulk.shared.cluster.shared.cta (NVVM::CpAsyncBulkSharedCTAToSharedClusterOp) 

Async bulk copy from Shared CTA memory to Shared cluster memory

Syntax:

operation ::= `nvvm.cp.async.bulk.shared.cluster.shared.cta` $dstMem `,` $srcMem `,` $mbar `,` $size
              attr-dict  `:` type($dstMem) `,` type($srcMem)

Initiates an asynchronous copy operation from Shared CTA memory to Shared cluster memory.

For more information, see PTX ISA

Operands: 

OperandDescription
dstMemLLVM pointer in address space 7
srcMemLLVM pointer in address space 3
mbarLLVM pointer in address space 3
size32-bit signless integer

nvvm.cp.async.bulk.tensor.shared.cluster.global (NVVM::CpAsyncBulkTensorGlobalToSharedClusterOp) 

Syntax:

operation ::= `nvvm.cp.async.bulk.tensor.shared.cluster.global` $dstMem `,`
              $tmaDescriptor `,`
              $mbar `,`
              `box` `[`$coordinates `]`
              (`im2col` `[` $im2colOffsets^ `]` )?
              (`multicast_mask` `=` $multicastMask^ )?
              (`l2_cache_hint` `=` $l2CacheHint^ )?
              (`predicate` `=` $predicate^)?
              attr-dict  `:` type($dstMem) `,` type($tmaDescriptor)

Initiates an asynchronous copy operation on the tensor data from global memory to shared::cluster (or) shared::cta memory. This Op supports all the load modes specified in TMALoadMode.

The multicastMask operand is optional. When it is present, the Op copies data from global memory to shared memory of multiple CTAs in the cluster. Operand multicastMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit multicastMask operand corresponds to the nvvm.read.ptx.sreg.ctaid of the destination CTA.

The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.

When the isCTAOnly attribute is set to true, the destination is shared::cta only. Hence, multicastMask and CTAGroup are not applicable when isCTAOnly is true.

For more information, see PTX ISA

Traits: AttrSizedOperandSegments, NVVMRequiresSM<90>

Interfaces: BasicPtxBuilderInterface

Attributes: 

AttributeMLIR TypeDescription
mode::mlir::NVVM::TMALoadModeAttr
List of Load-Modes supported for TMA Tensor Ops
TMA Tensor Ops support the following modes, when copying data from
global memory to shared memory (i.e. load):

Tile Mode: It’s the default mode. The source multi-dimensional tensor layout is preserved at the destination. For more information, see PTX ISA

Im2col Mode: This mode is used when im2colOffsets operands are present. The elements in the Bounding Box of the source tensor are rearranged into columns at the destination. In this mode, the tensor has to be at least 3-dimensional. The number of im2colOffsets is dims - 2 where dims is the dimension of the tensor. For more information, see PTX ISA

Im2col_W Mode: This mode is similar to Im2Col mode with the restriction that elements are accessed across the W dimension only. The number of im2colOffsets are always two, referred as wHalo and wOffset. For more information, see PTX ISA

Im2col_W_128 Mode: This mode is similar to Im2Col_W mode with the number of elements accessed across the W dimension is always 128 only. For more information, see PTX ISA

Tile_Gather4 Mode: This mode is similar to Tile mode but works only on 2D tensor. In gather4 mode, four rows in the source 2D tensor are combined to form a single 2D tensor at the destination. This mode requires five co-ordinates. The first one represents the column-index followed by four row indices. For more information, see PTX ISA

isCTAOnly::mlir::BoolAttrbool attribute
group::mlir::NVVM::CTAGroupKindAttrNVVM CTA group kind

Operands: 

OperandDescription
dstMemLLVM pointer in address space 3 or LLVM pointer in address space 7
tmaDescriptorLLVM pointer in address space 0
coordinatesvariadic of 32-bit signless integer
mbarLLVM pointer in address space 3
im2colOffsetsvariadic of 16-bit signless integer
multicastMask16-bit signless integer
l2CacheHint64-bit signless integer
predicate1-bit signless integer

nvvm.cp.async.bulk.tensor.prefetch (NVVM::CpAsyncBulkTensorPrefetchOp) 

Syntax:

operation ::= `nvvm.cp.async.bulk.tensor.prefetch` $tmaDescriptor `,`
              `box` `[`$coordinates `]`
              (`im2col` `[` $im2colOffsets^ `]` )?
              (`l2_cache_hint` `=` $l2CacheHint^ )?
              attr-dict  `:` type($tmaDescriptor)

Initiates an asynchronous prefetch operation on the tensor data from global memory to L2 cache. This Op supports all the load modes specified in TMALoadMode.

The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.

For more information, see PTX ISA

Traits: AttrSizedOperandSegments

Attributes: 

AttributeMLIR TypeDescription
mode::mlir::NVVM::TMALoadModeAttr
List of Load-Modes supported for TMA Tensor Ops
TMA Tensor Ops support the following modes, when copying data from
global memory to shared memory (i.e. load):

Tile Mode: It’s the default mode. The source multi-dimensional tensor layout is preserved at the destination. For more information, see PTX ISA

Im2col Mode: This mode is used when im2colOffsets operands are present. The elements in the Bounding Box of the source tensor are rearranged into columns at the destination. In this mode, the tensor has to be at least 3-dimensional. The number of im2colOffsets is dims - 2 where dims is the dimension of the tensor. For more information, see PTX ISA

Im2col_W Mode: This mode is similar to Im2Col mode with the restriction that elements are accessed across the W dimension only. The number of im2colOffsets are always two, referred as wHalo and wOffset. For more information, see PTX ISA

Im2col_W_128 Mode: This mode is similar to Im2Col_W mode with the number of elements accessed across the W dimension is always 128 only. For more information, see PTX ISA

Tile_Gather4 Mode: This mode is similar to Tile mode but works only on 2D tensor. In gather4 mode, four rows in the source 2D tensor are combined to form a single 2D tensor at the destination. This mode requires five co-ordinates. The first one represents the column-index followed by four row indices. For more information, see PTX ISA

Operands: 

OperandDescription
tmaDescriptorLLVM pointer in address space 0
coordinatesvariadic of 32-bit signless integer
im2colOffsetsvariadic of 16-bit signless integer
l2CacheHint64-bit signless integer

nvvm.cp.async.bulk.tensor.reduce (NVVM::CpAsyncBulkTensorReduceOp) 

Syntax:

operation ::= `nvvm.cp.async.bulk.tensor.reduce` $tmaDescriptor `,`
              $srcMem `,`
              `box` `[`$coordinates `]`
              (`l2_cache_hint` `=` $l2CacheHint^ )?
              attr-dict  `:` type($tmaDescriptor) `,` type($srcMem)

Initiates an asynchronous reduction operation of tensor data in global memory with tensor data in shared memory.

The mode attribute indicates whether the copy mode is tile or im2col. The redOp attribute specifies the reduction operations applied. The supported reduction operations are: {add, min, max, inc, dec, and, or, xor}

The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.

For more information, see PTX ISA

Traits: AttrSizedOperandSegments

Attributes: 

AttributeMLIR TypeDescription
redKind::mlir::NVVM::TMAReduxKindAttrNVVM TMA redux kind
mode::mlir::NVVM::TMAStoreModeAttrNVVM TMA Store Mode

Operands: 

OperandDescription
tmaDescriptorLLVM pointer type
srcMemLLVM pointer in address space 3
coordinatesvariadic of 32-bit signless integer
l2CacheHint64-bit signless integer

nvvm.cp.async.bulk.tensor.global.shared.cta (NVVM::CpAsyncBulkTensorSharedCTAToGlobalOp) 

Syntax:

operation ::= `nvvm.cp.async.bulk.tensor.global.shared.cta` $tmaDescriptor `,`
              $srcMem `,`
              `box` `[`$coordinates `]`
              (`l2_cache_hint` `=` $l2CacheHint^ )?
              (`,` `predicate` `=` $predicate^)?
              attr-dict `:` type($tmaDescriptor) `,` type($srcMem)

Initiates an asynchronous copy of the tensor data from shared::cta memory to global memory. This Op supports all the store modes specified in TMAStoreMode.

The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.

For more information, see PTX ISA

Traits: AttrSizedOperandSegments

Interfaces: BasicPtxBuilderInterface

Attributes: 

AttributeMLIR TypeDescription
mode::mlir::NVVM::TMAStoreModeAttrNVVM TMA Store Mode

Operands: 

OperandDescription
tmaDescriptorLLVM pointer in address space 0
srcMemLLVM pointer in address space 3
coordinatesvariadic of 32-bit signless integer
l2CacheHint64-bit signless integer
predicate1-bit signless integer

nvvm.cp.async.bulk.wait_group (NVVM::CpAsyncBulkWaitGroupOp) 

Syntax:

operation ::= `nvvm.cp.async.bulk.wait_group` $group attr-dict

Op waits for completion of the most recent bulk async-groups.

The $group operand tells waiting has to be done until for $group or fewer of the most recent bulk async-groups. If $group is 0, the op wait until all the most recent bulk async-groups have completed.

The $read indicates that the waiting has to be done until all the bulk async operations in the specified bulk async-group have completed reading from their source locations.

For more information, see PTX ISA

Traits: NVVMRequiresSM<90>

Attributes: 

AttributeMLIR TypeDescription
group::mlir::IntegerAttr32-bit signless integer attribute whose minimum value is 0
read::mlir::UnitAttrunit attribute

nvvm.cp.async.commit.group (NVVM::CpAsyncCommitGroupOp) 

Syntax:

operation ::= `nvvm.cp.async.commit.group` attr-dict

nvvm.cp.async.mbarrier.arrive (NVVM::CpAsyncMBarrierArriveOp) 

NVVM Dialect Op for cp.async.mbarrier.arrive

Syntax:

operation ::= `nvvm.cp.async.mbarrier.arrive` $addr attr-dict `:` type(operands)

The cp.async.mbarrier.arrive Op makes the mbarrier object track all prior cp.async operations initiated by the executing thread. The addr operand specifies the address of the mbarrier object in generic or shared::cta address space. When it is generic, the underlying memory should fall within the shared::cta space; otherwise the behavior is undefined. The noinc attr impacts how the mbarrier’s state is updated.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
noinc::mlir::IntegerAttr1-bit signless integer attribute

Operands: 

OperandDescription
addrLLVM pointer in address space 0 or LLVM pointer in address space 3

nvvm.cp.async.shared.global (NVVM::CpAsyncOp) 

Syntax:

operation ::= `nvvm.cp.async.shared.global` $dst `,` $src `,` $size `,` `cache` `=` $modifier (`,` $cpSize^)? attr-dict `:` type(operands)

Attributes: 

AttributeMLIR TypeDescription
size::mlir::IntegerAttr32-bit signless integer attribute
modifier::mlir::NVVM::LoadCacheModifierKindAttr
NVVM load cache modifier kind
Enum attribute of the different kinds of cache operators for load instructions.

For more information, see PTX ISA

Operands: 

OperandDescription
dstLLVM pointer in address space 3
srcLLVM pointer in address space 1
cpSize32-bit signless integer

nvvm.cp.async.wait.group (NVVM::CpAsyncWaitGroupOp) 

Syntax:

operation ::= `nvvm.cp.async.wait.group` $n attr-dict

Attributes: 

AttributeMLIR TypeDescription
n::mlir::IntegerAttr32-bit signless integer attribute

nvvm.dot.accumulate.2way (NVVM::DotAccumulate2WayOp) 

Two-way 16-bit to 8-bit dot product-accumulate instruction

Syntax:

operation ::= `nvvm.dot.accumulate.2way` $a $a_type `,` $b $b_type `,` $c attr-dict `:` type($a) `,` type($b)

Performs a two-way 16-bit to 8-bit dot-product which is accumulated in a 32-bit result. Operand a is a vector of two 16-bit elements and operand b a vector of four 8-bit elements between which the dot product is computed.

The a_type and b_type attributes specify the type of the elements in a and b respectively. If a_type or b_type is s, then the elements in the corresponding vector are sign-extended to 32-bit before the dot product is computed. If a_type or b_type is u, then the elements in the corresponding vector are zero-extended to 32-bit instead.

The b_hi boolean attribute specifies which two bytes of b are used for the dot product. If b_hi is true, then the dot product is computed between a and elements at indices 2 and 3 of b. If b_hi is false, then the dot product is computed between a and elements at indices 0 and 1 of b.

Operand c is a 32-bit integer to which the result is accumulated. It is treated as holding a signed integer if any of a_type or b_type is signed.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
a_type::mlir::NVVM::DotAccumulateTypeAttrNVVM DotAccumulateType
b_type::mlir::NVVM::DotAccumulateTypeAttrNVVM DotAccumulateType
b_hi::mlir::BoolAttrbool attribute

Operands: 

OperandDescription
avector of 16-bit signless integer values of length 2
bvector of 8-bit signless integer values of length 4
c32-bit signless integer

Results: 

ResultDescription
res32-bit signless integer

nvvm.dot.accumulate.4way (NVVM::DotAccumulate4WayOp) 

Four-way byte dot product-accumulate instruction

Syntax:

operation ::= `nvvm.dot.accumulate.4way` $a $a_type `,` $b $b_type `,` $c attr-dict `:` type($a) `,` type($b)

Performs a four-way byte dot-product which is accumulated in a 32-bit result. Operand a and b are vectors of 4 bytes between which the dot product is computed.

The a_type and b_type attributes specify the type of the elements in a and b respectively. If a_type or b_type is signed, then the elements in the corresponding vector are sign-extended to 32-bit before the dot product is computed. If a_type or b_type is unsigned, then the elements in the corresponding vector are zero-extended to 32-bit instead.

Operand c is a 32-bit integer to which the result is accumulated. It is treated as holding a signed integer if any of a_type or b_type is s8.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
a_type::mlir::NVVM::DotAccumulateTypeAttrNVVM DotAccumulateType
b_type::mlir::NVVM::DotAccumulateTypeAttrNVVM DotAccumulateType

Operands: 

OperandDescription
avector of 8-bit signless integer values of length 4
bvector of 8-bit signless integer values of length 4
c32-bit signless integer

Results: 

ResultDescription
res32-bit signless integer

nvvm.elect.sync (NVVM::ElectSyncOp) 

Elect one leader thread

Syntax:

operation ::= `nvvm.elect.sync` ($membermask^)? attr-dict `->` type(results)

The elect.sync instruction elects one predicated active leader thread from among a set of threads specified in the membermask. When the membermask is not provided explicitly, a default value of 0xFFFFFFFF is used. The predicate result is set to True for the leader thread, and False for all other threads.

For more information, see PTX ISA

Operands: 

OperandDescription
membermask32-bit signless integer

Results: 

ResultDescription
pred1-bit signless integer

nvvm.read.ptx.sreg.envreg0 (NVVM::EnvReg0Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg0` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg10 (NVVM::EnvReg10Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg10` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg11 (NVVM::EnvReg11Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg11` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg12 (NVVM::EnvReg12Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg12` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg13 (NVVM::EnvReg13Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg13` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg14 (NVVM::EnvReg14Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg14` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg15 (NVVM::EnvReg15Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg15` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg16 (NVVM::EnvReg16Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg16` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg17 (NVVM::EnvReg17Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg17` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg18 (NVVM::EnvReg18Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg18` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg19 (NVVM::EnvReg19Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg19` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg1 (NVVM::EnvReg1Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg1` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg20 (NVVM::EnvReg20Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg20` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg21 (NVVM::EnvReg21Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg21` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg22 (NVVM::EnvReg22Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg22` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg23 (NVVM::EnvReg23Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg23` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg24 (NVVM::EnvReg24Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg24` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg25 (NVVM::EnvReg25Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg25` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg26 (NVVM::EnvReg26Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg26` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg27 (NVVM::EnvReg27Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg27` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg28 (NVVM::EnvReg28Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg28` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg29 (NVVM::EnvReg29Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg29` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg2 (NVVM::EnvReg2Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg2` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg30 (NVVM::EnvReg30Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg30` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg31 (NVVM::EnvReg31Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg31` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg3 (NVVM::EnvReg3Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg3` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg4 (NVVM::EnvReg4Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg4` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg5 (NVVM::EnvReg5Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg5` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg6 (NVVM::EnvReg6Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg6` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg7 (NVVM::EnvReg7Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg7` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg8 (NVVM::EnvReg8Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg8` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.envreg9 (NVVM::EnvReg9Op) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.envreg9` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.exit (NVVM::Exit) 

Exit Op

Syntax:

operation ::= `nvvm.exit` attr-dict

Ends execution of a thread. For more information, see PTX ISA

nvvm.fence.mbarrier.init (NVVM::FenceMbarrierInitOp) 

Syntax:

operation ::= `nvvm.fence.mbarrier.init` attr-dict

Fence operation that applies on the prior nvvm.mbarrier.init

For more information, see PTX ISA

Interfaces: BasicPtxBuilderInterface

nvvm.fence.proxy.acquire (NVVM::FenceProxyAcquireOp) 

Uni-directional proxy fence operation with acquire semantics

Syntax:

operation ::= `nvvm.fence.proxy.acquire` $scope $addr `,` $size (`from_proxy` `=` $fromProxy^)? (`to_proxy` `=` $toProxy^)? attr-dict

fence.proxy.acquire is a uni-directional fence used to establish ordering between a prior memory access performed via the generic proxy and a subsequent memory access performed via the tensormap proxy

The address operand addr and the operand size together specify the memory range [addr, addr+size) on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the size operand is 128 and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the .global state space. Otherwise, the behavior is undefined

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
scope::mlir::NVVM::MemScopeKindAttrNVVM Memory Scope kind
fromProxy::mlir::NVVM::ProxyKindAttrProxy kind
toProxy::mlir::NVVM::ProxyKindAttrProxy kind

Operands: 

OperandDescription
addrLLVM pointer in address space 0
size32-bit signless integer

nvvm.fence.proxy (NVVM::FenceProxyOp) 

Syntax:

operation ::= `nvvm.fence.proxy` attr-dict

Fence operation with proxy to establish an ordering between memory accesses that may happen through different proxies.

For more information, see PTX ISA

Interfaces: BasicPtxBuilderInterface

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::ProxyKindAttrProxy kind
space::mlir::NVVM::SharedSpaceAttrShared memory space

nvvm.fence.proxy.release (NVVM::FenceProxyReleaseOp) 

Uni-directional proxy fence operation with release semantics

Syntax:

operation ::= `nvvm.fence.proxy.release` $scope (`from_proxy` `=` $fromProxy^)? (`to_proxy` `=` $toProxy^)? attr-dict

fence.proxy.release is a uni-directional fence used to establish ordering between a prior memory access performed via the generic proxy and a subsequent memory access performed via the tensormap proxy. fence.proxy.release operation can form a release sequence that synchronizes with an acquire sequence that contains the fence.proxy.acquire proxy fence operation

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
scope::mlir::NVVM::MemScopeKindAttrNVVM Memory Scope kind
fromProxy::mlir::NVVM::ProxyKindAttrProxy kind
toProxy::mlir::NVVM::ProxyKindAttrProxy kind

nvvm.fence.sc.cluster (NVVM::FenceScClusterOp) 

Syntax:

operation ::= `nvvm.fence.sc.cluster` attr-dict

nvvm.read.ptx.sreg.globaltimer.lo (NVVM::GlobalTimerLoOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.globaltimer.lo` attr-dict `:` type($res)

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.globaltimer (NVVM::GlobalTimerOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.globaltimer` attr-dict `:` type($res)

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.nctaid.x (NVVM::GridDimXOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.nctaid.x` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.nctaid.y (NVVM::GridDimYOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.nctaid.y` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.nctaid.z (NVVM::GridDimZOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.nctaid.z` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.gridid (NVVM::GridIdOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.gridid` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.griddepcontrol (NVVM::GriddepcontrolOp) 

Syntax:

operation ::= `nvvm.griddepcontrol` $kind attr-dict

If the $kind attribute is set to wait, it causes the executing thread to wait until all prerequisite grids in flight have completed and all the memory operations from the prerequisite grids are performed and made visible to the current grid.

When the $kind is launch_dependents, it signals that specific dependents the runtime system designated to react to this instruction can be scheduled as soon as all other CTAs in the grid issue the same instruction or have completed.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::GridDepActionKindAttr
Action kind for grid dependency control

Enum cases:

  • wait (wait)
  • launch_dependents (launch_dependents)

nvvm.inline_ptx (NVVM::InlinePtxOp) 

Inline PTX Op

Syntax:

operation ::= `nvvm.inline_ptx` $ptxCode
              ( `ro` `(` $readOnlyArgs^ `:` type($readOnlyArgs) `)` )?
              ( `rw` `(` $readWriteArgs^ `:` type($readWriteArgs) `)` )?
              (`,` `predicate` `=` $predicate^)?
              attr-dict
              ( `->` type($writeOnlyArgs)^ )?

This op allows using PTX directly within the NVVM dialect, while greatly simplifying llvm.inline_asm generation. It automatically handles register size selection and sets the correct read/write access for each operand. The operation leverages the BasicPtxBuilderInterface to abstract away low-level details of PTX assembly formatting.

The `predicate` attribute is used to specify a predicate for the 
PTX instruction.

Example 1: Read-only Parameters
```mlir
nvvm.inline_ptx "mbarrier.init.b64 [$0], $1;" (%barrier_gen, %count) : !llvm.ptr, i32

// Lowers to:
llvm.inline_asm has_side_effects asm_dialect = att 
  "mbarrier.init.b64 [$0], $1;", "l,r" %arg0, %arg2 : (!llvm.ptr, i32) -> ()
```

Example 2: Read-only and Write-only Parameters
```mlir
%0 = nvvm.inline_ptx "ex2.approx.ftz.f32 $0, $1;" (%input) : f32 -> f32

// Lowers to:
%0 = llvm.inline_asm has_side_effects asm_dialect = att 
  "ex2.approx.ftz.f32 $0, $1;", "=f,f" %arg0 : (f32) -> f32
```

Example 3: Predicate Usage
```mlir
nvvm.inline_ptx "mbarrier.init.b64 [$0], $1;" (%barrier_gen, %count), 
  predicate = %pred : !llvm.ptr, i32, i1

// Lowers to:
llvm.inline_asm has_side_effects asm_dialect = att 
  "@$2 mbarrier.init.b64 [$0], $1;", "l,r,b" %arg0, %arg2, %arg3 
  : (!llvm.ptr, i32, i1) -> ()
```

Traits: AttrSizedOperandSegments

Interfaces: BasicPtxBuilderInterface

Attributes: 

AttributeMLIR TypeDescription
ptxCode::mlir::StringAttrstring attribute

Operands: 

OperandDescription
readOnlyArgsvariadic of any type
readWriteArgsvariadic of any type
predicate1-bit signless integer

Results: 

ResultDescription
writeOnlyArgsvariadic of any type

nvvm.read.ptx.sreg.laneid (NVVM::LaneIdOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.laneid` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.lanemask.eq (NVVM::LaneMaskEqOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.lanemask.eq` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.lanemask.ge (NVVM::LaneMaskGeOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.lanemask.ge` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.lanemask.gt (NVVM::LaneMaskGtOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.lanemask.gt` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.lanemask.le (NVVM::LaneMaskLeOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.lanemask.le` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.lanemask.lt (NVVM::LaneMaskLtOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.lanemask.lt` attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.ldmatrix (NVVM::LdMatrixOp) 

Cooperative matrix load

Syntax:

operation ::= `nvvm.ldmatrix` $ptr attr-dict `:` functional-type($ptr, $res)

Attributes: 

AttributeMLIR TypeDescription
num::mlir::IntegerAttr32-bit signless integer attribute
layout::mlir::NVVM::MMALayoutAttrNVVM MMA layout
shape::mlir::NVVM::LdStMatrixShapeAttrMatrix shape for ldmatrix and stmatrix
eltType::mlir::NVVM::LdStMatrixEltTypeAttrElement type for ldmatrix and stmatrix

Operands: 

OperandDescription
ptrLLVM pointer in address space 3

Results: 

ResultDescription
resany type

nvvm.mbarrier.arrive_drop.expect_tx (NVVM::MBarrierArriveDropExpectTxOp) 

MBarrier arrive_drop with expected transaction count

Syntax:

operation ::= `nvvm.mbarrier.arrive_drop.expect_tx` $addr `,` $txcount attr-dict `:` type(operands) (`->` type($res)^)?

The nvvm.mbarrier.arrive_drop.expect_tx operation is similar to the nvvm.mbarrier.arrive.expect_tx operation except that it performs an arrive_drop operation instead of only an arrive operation.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
scope::mlir::NVVM::MemScopeKindAttrNVVM Memory Scope kind
relaxed::mlir::BoolAttrbool attribute

Operands: 

OperandDescription
addrLLVM pointer in address space 0 or LLVM pointer in address space 3 or LLVM pointer in address space 7
txcount32-bit signless integer

Results: 

ResultDescription
res64-bit signless integer

nvvm.mbarrier.arrive_drop.nocomplete (NVVM::MBarrierArriveDropNocompleteOp) 

MBarrier Arrive-Drop No-Complete Operation

Syntax:

operation ::= `nvvm.mbarrier.arrive_drop.nocomplete` $addr `,` $count attr-dict `:` type(operands) `->` type($res)

The nvvm.mbarrier.arrive_drop.nocomplete operation decrements the expected arrival count of the mbarrier object by the amount count and then performs an arrive-on operation on the mbarrier object with the guarantee that it will not cause the barrier to complete its current phase.

For more information, see PTX ISA

Operands: 

OperandDescription
addrLLVM pointer in address space 0 or LLVM pointer in address space 3
count32-bit signless integer

Results: 

ResultDescription
res64-bit signless integer

nvvm.mbarrier.arrive_drop (NVVM::MBarrierArriveDropOp) 

MBarrier Arrive-Drop Operation

Syntax:

operation ::= `nvvm.mbarrier.arrive_drop` $addr (`,` $count^)? attr-dict `:` type($addr) (`->` type($res)^)?

The nvvm.mbarrier.arrive_drop operation decrements the expected arrival count of the mbarrier object by count and then performs an arrive-on operation. When count is not specified, it defaults to 1. The decrement of the expected arrival count applies to all the subsequent phases of the mbarrier object. The remaining semantics are identical to those of the nvvm.mbarrier.arrive operation.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
scope::mlir::NVVM::MemScopeKindAttrNVVM Memory Scope kind
relaxed::mlir::BoolAttrbool attribute

Operands: 

OperandDescription
addrLLVM pointer in address space 0 or LLVM pointer in address space 3 or LLVM pointer in address space 7
count32-bit signless integer

Results: 

ResultDescription
res64-bit signless integer

nvvm.mbarrier.arrive.expect_tx (NVVM::MBarrierArriveExpectTxOp) 

MBarrier Arrive with Expected Transaction Count

Syntax:

operation ::= `nvvm.mbarrier.arrive.expect_tx` $addr `,` $txcount (`,` `predicate` `=` $predicate^)? attr-dict `:` type(operands) (`->` type($res)^)?

The nvvm.mbarrier.arrive.expect_tx operation performs an expect-tx operation followed by an arrive-on operation on the mbarrier object. Uses the default .release.cta semantics. This release pattern establishes memory ordering for operations occurring in program order before this arrive instruction by making operations from the current thread visible to subsequent operations in other threads within the CTA. When other threads perform corresponding acquire operations (like ‘mbarrier.test.wait’), they synchronize with this release pattern.

This operation first performs an expect-tx operation with the specified transaction count, then performs an arrive-on operation with an implicit count of 1. The expect-tx operation increases the expect-count of the mbarrier object by the specified value (i.e. txcount), setting the current phase to expect and track the completion of additional asynchronous transactions.

The operation takes the following operands:

  • addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.
  • txcount: An unsigned integer specifying the expected transaction count for the expect-tx operation. This represents the number of asynchronous transactions expected to complete before the barrier phase completes.
  • scope: This specifies the set of threads that directly observe the memory synchronizing effect of the mbarrier.test.wait operation.
  • relaxed: When set to true, the arrive operation has relaxed memory semantics and does not provide any ordering or visibility guarantees.
  • predicate: Optional predicate for conditional execution used only when lowering to inline-ptx.

For more information, see PTX ISA

Interfaces: BasicPtxBuilderInterface

Attributes: 

AttributeMLIR TypeDescription
scope::mlir::NVVM::MemScopeKindAttrNVVM Memory Scope kind
relaxed::mlir::BoolAttrbool attribute

Operands: 

OperandDescription
addrLLVM pointer in address space 0 or LLVM pointer in address space 3 or LLVM pointer in address space 7
txcount32-bit signless integer
predicate1-bit signless integer

Results: 

ResultDescription
res64-bit signless integer

nvvm.mbarrier.arrive.nocomplete (NVVM::MBarrierArriveNocompleteOp) 

MBarrier Arrive No-Complete Operation

Syntax:

operation ::= `nvvm.mbarrier.arrive.nocomplete` $addr `,` $count attr-dict `:` type(operands) `->` type($res)

The nvvm.mbarrier.arrive.nocomplete operation performs an arrive-on operation on the mbarrier object with the guarantee that it will not cause the barrier to complete its current phase. Uses the default .release.cta semantics. This release pattern establishes memory ordering for operations occurring in program order before this arrive instruction by making operations from the current thread visible to subsequent operations in other threads within the CTA. When other threads perform corresponding acquire operations (like ‘mbarrier.test.wait’), they synchronize with this release pattern.

This operation causes the executing thread to signal its arrival at the barrier with a specified count, but ensures that the barrier phase will not complete as a result of this operation. The operation returns an opaque value that captures the phase of the mbarrier object prior to the arrive-on operation.

The operation takes the following operands:

  • addr: A pointer to the memory location of the mbarrier object. The addr must be a pointer to generic or shared::cta memory. When it is generic, the underlying address must be within the shared::cta memory space; otherwise the behavior is undefined.
  • count: Integer specifying the count argument to the arrive-on operation. Must be in the valid range as specified in the mbarrier object contents.

For more information, see PTX ISA

Operands: 

OperandDescription
addrLLVM pointer in address space 0 or LLVM pointer in address space 3
count32-bit signless integer

Results: 

ResultDescription
res64-bit signless integer

nvvm.mbarrier.arrive (NVVM::MBarrierArriveOp) 

MBarrier Arrive Operation

Syntax:

operation ::= `nvvm.mbarrier.arrive` $addr (`,` $count^)? attr-dict `:` type($addr) (`->` type($res)^)?

The nvvm.mbarrier.arrive operation performs an arrive-on operation on the mbarrier object at the specified address. Uses the default .release.cta semantics. This release pattern establishes memory ordering for operations occurring in program order before this arrive instruction by making operations from the current thread visible to subsequent operations in other threads within the CTA. When other threads perform corresponding acquire operations (like ‘mbarrier.test.wait’), they synchronize with this release pattern.

This operation causes the executing thread to signal its arrival at the barrier.

  • res: When the space is not shared_cluster, this operation returns an opaque 64-bit value capturing the phase of the mbarrier object prior to the arrive-on operation. The contents of this return value are implementation-specific. An mbarrier object located in the shared_cluster space cannot return a value.

The operation takes the following operands:

  • addr: A pointer to the memory location of the mbarrier object. The addr must be a pointer to generic or shared_cta or shared_cluster memory. When it is generic, the underlying address must be within the shared_cta memory space; otherwise the behavior is undefined.
  • count: This specifies the amount by which the pending arrival count is decremented. If the count argument is not specified, the pending arrival count is decremented by 1.
  • scope: This specifies the set of threads that directly observe the memory synchronizing effect of the mbarrier.arrive operation.
  • space: This indicates the memory space where the mbarrier object resides.
  • relaxed: When set to true, the arrive operation has relaxed memory semantics and does not provide any ordering or visibility guarantees.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
scope::mlir::NVVM::MemScopeKindAttrNVVM Memory Scope kind
relaxed::mlir::BoolAttrbool attribute

Operands: 

OperandDescription
addrLLVM pointer in address space 0 or LLVM pointer in address space 3 or LLVM pointer in address space 7
count32-bit signless integer

Results: 

ResultDescription
res64-bit signless integer

nvvm.mbarrier.complete_tx (NVVM::MBarrierCompleteTxOp) 

MBarrier complete-tx Operation

Syntax:

operation ::= `nvvm.mbarrier.complete_tx` $addr `,` $txcount attr-dict `:` type(operands)

The nvvm.mbarrier.complete_tx operation decrements the transaction count of the mbarrier object at addr by txcount. It also signals the completion of asynchronous transactions that were tracked by the current phase. The scope specifies the set of threads that can directly observe the memory synchronizing effect of the mbarrier.complete_tx operation. CTA and CLUSTER are the only allowed values for scope.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
scope::mlir::NVVM::MemScopeKindAttrNVVM Memory Scope kind

Operands: 

OperandDescription
addrLLVM pointer in address space 3 or LLVM pointer in address space 7
txcount32-bit signless integer

nvvm.mbarrier.expect_tx (NVVM::MBarrierExpectTxOp) 

MBarrier expect-tx Operation

Syntax:

operation ::= `nvvm.mbarrier.expect_tx` $addr `,` $txcount attr-dict `:` type(operands)

The nvvm.mbarrier.expect_tx operation increases the transaction count of the mbarrier located at addr by txcount amount. The scope specifies the set of threads that can directly observe the memory synchronizing effect of the mbarrier.expect_tx operation. CTA and CLUSTER are the only allowed values for scope.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
scope::mlir::NVVM::MemScopeKindAttrNVVM Memory Scope kind

Operands: 

OperandDescription
addrLLVM pointer in address space 3 or LLVM pointer in address space 7
txcount32-bit signless integer

nvvm.mbarrier.init (NVVM::MBarrierInitOp) 

MBarrier Initialization Op

Syntax:

operation ::= `nvvm.mbarrier.init` $addr `,` $count (`,` `predicate` `=` $predicate^)? attr-dict `:` type(operands)

The nvvm.mbarrier.init operation initializes an mbarrier object at the specified memory location.

This operation initializes the mbarrier object with the following state:

  • Current phase: 0
  • Expected arrival count: count
  • Pending arrival count: count
  • Transaction count (tx-count): 0

The operation takes the following operands:

  • addr: A pointer to the memory location of the mbarrier object. The addr must be a pointer to generic or shared::cta memory. When it is generic, the underlying address must be within the shared::cta memory space; otherwise the behavior is undefined.
  • count: Integer specifying the number of threads that will participate in barrier synchronization. Must be in the range [1, 2²⁰ - 1].
  • predicate: Optional predicate for conditional execution.

For more information, see PTX ISA

Interfaces: BasicPtxBuilderInterface

Operands: 

OperandDescription
addrLLVM pointer in address space 0 or LLVM pointer in address space 3
count32-bit signless integer
predicate1-bit signless integer

nvvm.mbarrier.inval (NVVM::MBarrierInvalOp) 

MBarrier Invalidation Operation

Syntax:

operation ::= `nvvm.mbarrier.inval` $addr attr-dict `:` type(operands)

The nvvm.mbarrier.inval operation invalidates an mbarrier object at the specified memory location.

This operation marks the mbarrier object as invalid, making it safe to repurpose the memory location for other uses or to reinitialize it as a new mbarrier object. It is undefined behavior if the mbarrier object is already invalid.

The operation takes the following operand:

  • addr: A pointer to the memory location of the mbarrier object. The addr must be a pointer to generic or shared::cta memory. When it is generic, the underlying address must be within the shared::cta memory space; otherwise the behavior is undefined.

For more information, see PTX ISA

Operands: 

OperandDescription
addrLLVM pointer in address space 0 or LLVM pointer in address space 3

nvvm.mbarrier.test.wait (NVVM::MBarrierTestWaitOp) 

MBarrier Non-Blocking Test Wait Operation

Syntax:

operation ::= `nvvm.mbarrier.test.wait` $addr `,` $stateOrPhase attr-dict `:` type(operands) `->` type($res)

The nvvm.mbarrier.test.wait operation performs a non-blocking test for the completion of a specific phase of an mbarrier object. It uses the default .acquire.cta semantics. This acquire pattern establishes memory ordering for operations occurring in program order after this wait instruction by making operations from other threads in the CTA visible to subsequent operations in the current thread. When this wait completes, it synchronizes with the corresponding release pattern from the mbarrier.arrive operation, establishing memory ordering within the CTA.

This operation tests whether the mbarrier phase specified by the state operand has completed. It is a non-blocking instruction that immediately returns the completion status without suspending the executing thread.

The operation takes the following operands:

  • addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.
  • stateOrPhase: This argument represents a state when it is a 64-bit value and represents a phase when it is a 32-bit value. The state is an opaque value returned by a previous mbarrier.arrive operation on the same mbarrier object during the current or immediately preceding phase. The phase is an integer specifying the phase parity (0 or 1). Even phases have parity 0, odd phases have parity 1.
  • scope: This specifies the set of threads that directly observe the memory synchronizing effect of the mbarrier.test.wait operation.
  • relaxed: When set to true, the arrive operation has relaxed memory semantics and does not provide any ordering or visibility guarantees.

The operation returns a boolean value indicating whether the specified phase has completed:

  • true: The immediately preceding phase has completed
  • false: The phase is still incomplete (current phase)

Memory ordering guarantees: When this wait returns true, the following ordering guarantees hold:

  1. All memory accesses (except async operations) requested prior to mbarrier.arrive having release semantics by participating CTA threads are visible to the executing thread.
  2. All cp.async operations requested prior to cp.async.mbarrier.arrive by participating CTA threads are visible to the executing thread.
  3. All cp.async.bulk operations using the same mbarrier object requested prior to mbarrier.arrive having release semantics by participating CTA threads are visible to the executing thread.
  4. Memory accesses requested after this wait are not visible to memory accesses performed prior to mbarrier.arrive by other participating threads.
  5. No ordering guarantee exists for memory accesses by the same thread between mbarrier.arrive and this wait.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
scope::mlir::NVVM::MemScopeKindAttrNVVM Memory Scope kind
relaxed::mlir::BoolAttrbool attribute

Operands: 

OperandDescription
addrLLVM pointer in address space 0 or LLVM pointer in address space 3
stateOrPhase64-bit signless integer or 32-bit signless integer

Results: 

ResultDescription
res1-bit signless integer

nvvm.mbarrier.try_wait.parity (NVVM::MBarrierTryWaitParityOp) 

MBarrier Potentially-Blocking Try Wait with Phase Parity

Syntax:

operation ::= `nvvm.mbarrier.try_wait.parity` $addr `,` $phase `,` $ticks attr-dict `:` type(operands)

The nvvm.mbarrier.try_wait.parity operation performs a potentially-blocking test for the completion of a specific phase of an mbarrier object using phase parity. It uses the default .acquire.cta semantics. This acquire pattern establishes memory ordering for operations occurring in program order after this wait instruction by making operations from other threads in the CTA visible to subsequent operations in the current thread. When this wait completes, it synchronizes with the corresponding release pattern from the mbarrier.arrive operation, establishing memory ordering within the CTA.

This operation waits for the completion of the mbarrier phase indicated by the phase parity. While it uses the underlying PTX mbarrier.try_wait.parity instruction, this MLIR operation generates a loop that enforces the test to complete before continuing execution, ensuring the barrier phase is actually completed rather than potentially timing out.

The operation takes the following operands:

  • addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.
  • phase: An integer specifying the phase parity (0 or 1). Even phases have parity 0, odd phases have parity 1.
  • ticks: An unsigned integer specifying the suspend time hint in nanoseconds. This may be used instead of the system-dependent time limit.

Memory ordering guarantees: When this wait returns true, the following ordering guarantees hold:

  1. All memory accesses (except async operations) requested prior to mbarrier.arrive having release semantics by participating CTA threads are visible to the executing thread.
  2. All cp.async operations requested prior to cp.async.mbarrier.arrive by participating CTA threads are visible to the executing thread.
  3. All cp.async.bulk operations using the same mbarrier object requested prior to mbarrier.arrive having release semantics by participating CTA threads are visible to the executing thread.
  4. Memory accesses requested after this wait are not visible to memory accesses performed prior to mbarrier.arrive by other participating threads.
  5. No ordering guarantee exists for memory accesses by the same thread between mbarrier.arrive and this wait.

Implementation behavior: This operation generates a PTX loop that repeatedly calls the underlying mbarrier.try_wait.parity instruction until the barrier phase completes. Unlike the raw PTX instruction which may return without completion after a timeout, this MLIR operation guarantees completion by continuing to loop until the specified phase is reached.

For more information, see PTX ISA

Interfaces: BasicPtxBuilderInterface

Operands: 

OperandDescription
addrLLVM pointer in address space 0 or LLVM pointer in address space 3
phase32-bit signless integer
ticks32-bit signless integer

nvvm.mapa (NVVM::MapaOp) 

Syntax:

operation ::= `nvvm.mapa` $a`,` $b attr-dict `:` type($a) `->` type($res)

Traits: NVVMRequiresSM<90>

Operands: 

OperandDescription
aLLVM pointer in address space 0 or LLVM pointer in address space 3
b32-bit signless integer

Results: 

ResultDescription
resLLVM pointer in address space 0 or LLVM pointer in address space 7

nvvm.match.sync (NVVM::MatchSyncOp) 

Broadcast and compare a value across threads in warp

Syntax:

operation ::= `nvvm.match.sync` $kind $thread_mask `,` $val attr-dict `:` type($val) `->` type($res)

The match.sync op performs broadcast and compare of operand val across all non-exited threads in thread_mask and returns a mask depending on the kind and an optional predicate.

The matching operation kinds are:

  • any: Returns a mask corresponding to the non-exited threads in the thread_mask that have the same value of operand val.
  • all: Returns a mask and a predicate. If all non-exited threads in the thread_mask have the same value of operand val, the predicate is set to true and the mask corresponds to the non-exited threads in the thread_mask. Otherwise, the predicate is set to false and the mask is 0.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::MatchSyncKindAttrNVVM match sync kind

Operands: 

OperandDescription
thread_mask32-bit signless integer
val32-bit signless integer or 64-bit signless integer

Results: 

ResultDescription
res32-bit signless integer or LLVM struct type

nvvm.memory.barrier (NVVM::MembarOp) 

Memory barrier operation

Syntax:

operation ::= `nvvm.memory.barrier` $scope attr-dict

membar operation guarantees that prior memory accesses requested by this thread are performed at the specified scope, before later memory operations requested by this thread following the membar instruction.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
scope::mlir::NVVM::MemScopeKindAttrNVVM Memory Scope kind

nvvm.mma.sync (NVVM::MmaOp) 

Cooperative matrix-multiply and accumulate

The nvvm.mma.sync operation collectively performs the operation D = matmul(A, B) + C using all threads in a warp.

All the threads in the warp must execute the same mma.sync operation.

For each possible multiplicand PTX data type, there are one or more possible instruction shapes given as “mMnNkK”. The below table describes the posssibilities as well as the types required for the operands. Note that the data type for C (the accumulator) and D (the result) can vary independently when there are multiple possibilities in the “C/D Type” column.

When an optional attribute cannot be immediately inferred from the types of the operands and the result during parsing or validation, an error will be raised.

b1Op is only relevant when the binary (b1) type is given to multiplicandDataType. It specifies how the multiply-and-acumulate is performed and is either xor_popc or and_poc. The default is xor_popc.

intOverflowBehavior is only relevant when the multiplicandType attribute is one of u8, s8, u4, s4, this attribute describes how overflow is handled in the accumulator. When the attribute is satfinite, the accumulator values are clamped in the int32 range on overflow. This is the default behavior. Alternatively, accumulator behavior wrapped can also be specified, in which case overflow wraps from one end of the range to the other.

layoutA and layoutB are required and should generally be set to #nvvm.mma_layout<row> and #nvvm.mma_layout<col> respectively, but other combinations are possible for certain layouts according to the table below.

| A/B Type | Shape     | ALayout | BLayout | A Type   | B Type   | C/D Type          |
|----------|-----------|---------|---------|----------|----------|-------------------|
| f64      | .m8n8k4   | row     | col     | 1x f64   | 1x f64   | 2x f64            |
| f16      | .m8n8k4   | row/col | row/col | 2x f16x2 | 2x f16x2 | 4x f16x2 or 8xf32 |
|          | .m16n8k8  | row     | col     | 2x f16x2 | 1x f16x2 | 2x f16x2 or 4 f32 |
|          | .m16n8k16 | row     | col     | 4x f16x2 | 2x f16x2 | 2x f16x2 or 4 f32 |
| bf16     | .m16n8k8  | row     | col     | 2x i32   | 1x i32   | 4x f32            |
|          | .m16n8k16 | row     | col     | 4x i32   | 2x i32   | 4x f32            |
| tf32     | .m16n8k4  | row     | col     | 2x i32   | 1x i32   | 4x f32            |
|          | .m16n8k8  | row     | col     | 4x i32   | 2x i32   | 2x f16x2 or 4 f32 |
| u8/s8    | .m8n8k16  | row     | col     | 1x i32   | 1x i32   | 2x i32            |
|          | .m16n8k16 | row     | col     | 2x i32   | 1x i32   | 4x i32            |
|          | .m16n8k32 | row     | col     | 4x i32   | 2x i32   | 4x i32            |
| u4/s4    | .m8n8k32  | row     | col     | 1x i32   | 1x i32   | 2x i32            |
|          | m16n8k32  | row     | col     | 2x i32   | 1x i32   | 4x i32            |
|          | m16n8k64  | row     | col     | 4x i32   | 2x i32   | 4x i32            |
| b1       | m8n8k128  | row     | col     | 1x i32   | 1x i32   | 2x i32            |
|          | m16n8k128 | row     | col     | 2x i32   | 1x i32   | 4x i32            |

Example:


%128 = nvvm.mma.sync A[%120, %121, %122, %123]
                     B[%124, %125]
                     C[%126, %127]
                     {layoutA = #nvvm.mma_layout<row>,
                      layoutB = #nvvm.mma_layout<col>,
                      shape = {k = 16 : i32, m = 16 : i32, n = 8 : i32}}
    : (vector<2xf16>, vector<2xf16>, vector<2xf16>)
       -> !llvm.struct<(vector<2xf16>, vector<2xf16>)>

Traits: AttrSizedOperandSegments

Attributes: 

AttributeMLIR TypeDescription
shape::mlir::NVVM::MMAShapeAttrAttribute for MMA operation shape.
b1Op::mlir::NVVM::MMAB1OpAttrMMA binary operations
intOverflowBehavior::mlir::NVVM::MMAIntOverflowAttrMMA overflow options
layoutA::mlir::NVVM::MMALayoutAttrNVVM MMA layout
layoutB::mlir::NVVM::MMALayoutAttrNVVM MMA layout
multiplicandAPtxType::mlir::NVVM::MMATypesAttrNVVM MMA types
multiplicandBPtxType::mlir::NVVM::MMATypesAttrNVVM MMA types

Operands: 

OperandDescription
operandAvariadic of LLVM dialect-compatible type
operandBvariadic of LLVM dialect-compatible type
operandCvariadic of LLVM dialect-compatible type

Results: 

ResultDescription
resLLVM structure type

nvvm.mma.sp.sync (NVVM::MmaSpOp) 

Cooperative sparse matrix-multiply and accumulate

The nvvm.mma.sp.sync operation collectively performs the sparse operation D = matmul(A_sparse, B) + C using all threads in a warp.

This operation is similar to nvvm.mma.sync but with structured sparsity in the A operand. The sparsity follows the 2:4 structured sparse pattern where 2 out of every 4 elements are non-zero.

All the threads in the warp must execute the same mma.sp.sync operation.

The sparseMetadata operand provides the sparsity indices that indicate which elements in the A operand are non-zero. The sparsitySelector controls how the indices are distributed among threads in the warp and should typically be 0 or 1.

The optional orderedMetadata attribute specifies the metadata ordering:

  • Absence (default): Uses standard sparse metadata ordering
  • Presence: Uses ordered metadata (PTX ISA 8.5+, sm_90+)

The optional kind attribute specifies mixed-precision modes for FP8 operations:

  • f8f6f4: Enables e3m2, e2m3, e2m1 FP8 types and f16 accumulator (PTX ISA 8.7+, sm_90+)
  • Only valid with ordered metadata and m16n8k64 shape

The shapes, layouts, and data types follow the same constraints as the regular nvvm.mma.sync operation, but the A operand contains only the non-zero elements in compressed format.

Example:

%d = nvvm.mma.sp.sync A[%a0, %a1] B[%b0, %b1] C[%c0, %c1]
                      sparseMetadata[%meta] selector[%sel]
                      {shape = {k = 32 : i32, m = 16 : i32, n = 8 : i32}}
    : (vector<2xf16>, vector<2xf16>, vector<2xf16>) -> !llvm.struct<(vector<2xf16>, vector<2xf16>)>

// With ordered metadata:
%d = nvvm.mma.sp.sync A[%a0, %a1] B[%b0, %b1] C[%c0, %c1]
                      sparseMetadata[%meta] selector[%sel]
                      {orderedMetadata, shape = {k = 32 : i32, m = 16 : i32, n = 8 : i32}}
    : (vector<2xf16>, vector<2xf16>, vector<2xf16>) -> !llvm.struct<(vector<2xf16>, vector<2xf16>)>

Traits: AttrSizedOperandSegments

Attributes: 

AttributeMLIR TypeDescription
shape::mlir::NVVM::MMAShapeAttrAttribute for MMA operation shape.
intOverflowBehavior::mlir::NVVM::MMAIntOverflowAttrMMA overflow options
multiplicandAPtxType::mlir::NVVM::MMATypesAttrNVVM MMA types
multiplicandBPtxType::mlir::NVVM::MMATypesAttrNVVM MMA types
orderedMetadata::mlir::UnitAttrunit attribute
kind::mlir::NVVM::MMAKindAttrMMA operation kind

Operands: 

OperandDescription
operandAvariadic of LLVM dialect-compatible type
operandBvariadic of LLVM dialect-compatible type
operandCvariadic of LLVM dialect-compatible type
sparseMetadata32-bit signless integer
sparsitySelector32-bit signless integer

Results: 

ResultDescription
resLLVM structure type

nvvm.nanosleep (NVVM::NanosleepOp) 

Suspends the thread for a specified duration.

Syntax:

operation ::= `nvvm.nanosleep` attr-dict $duration

The op suspends the thread for a sleep duration approximately close to the delay $duration, specified in nanoseconds.

The sleep duration is approximated, but guaranteed to be in the interval [0, 2*t]. The maximum sleep duration is 1 millisecond. The implementation may reduce the sleep duration for individual threads within a warp such that all sleeping threads in the warp wake up together.

For more information, see PTX ISA

Operands: 

OperandDescription
duration32-bit signless integer

nvvm.pmevent (NVVM::PMEventOp) 

Trigger one or more Performance Monitor events.

Syntax:

operation ::= `nvvm.pmevent` attr-dict (`id` `=` $eventId^)? (`mask` `=` $maskedEventId^)?

Triggers one or more of a fixed number of performance monitor events, with event index or mask specified by immediate operand.

Without mask it triggers a single performance monitor event indexed by immediate operand a, in the range 0..15.

With mask it triggers one or more of the performance monitor events. Each bit in the 16-bit immediate operand controls an event.

For more information, see PTX ISA

Interfaces: BasicPtxBuilderInterface

Attributes: 

AttributeMLIR TypeDescription
maskedEventId::mlir::IntegerAttr16-bit signless integer attribute
eventId::mlir::IntegerAttr32-bit signless integer attribute

nvvm.prmt (NVVM::PermuteOp) 

Permute bytes from two 32-bit registers

Syntax:

operation ::= `nvvm.prmt` $mode $selector `,` $lo  (`,` $hi^)?  attr-dict `:` type($res)

The nvvm.prmt operation constructs a permutation of the bytes of the first one or two operands, selecting based on the 2 least significant bits of the final operand.

The bytes in the first one or two source operands are numbered. The first source operand (%lo) is numbered {b3, b2, b1, b0}, in the case of the ‘default’, ‘f4e’ and ‘b4e’ variants, the second source operand (%hi) is numbered {b7, b6, b5, b4}.

Modes:

  • default: Index mode - each nibble in selector selects a byte from the 8-byte pool
  • f4e : Forward 4 extract - extracts 4 contiguous bytes starting from position in selector
  • b4e : Backward 4 extract - extracts 4 contiguous bytes in reverse order
  • rc8 : Replicate 8 - replicates the lower 8 bits across the 32-bit result
  • ecl : Edge clamp left - clamps out-of-range indices to the leftmost valid byte
  • ecr : Edge clamp right - clamps out-of-range indices to the rightmost valid byte
  • rc16 : Replicate 16 - replicates the lower 16 bits across the 32-bit result

Depending on the 2 least significant bits of the %selector operand, the result of the permutation is defined as follows:

+————+—————-+————–+ | Mode | %selector[1:0] | Output | +————+—————-+————–+ | ‘f4e’ | 0 | {3, 2, 1, 0} | | +—————-+————–+ | | 1 | {4, 3, 2, 1} | | +—————-+————–+ | | 2 | {5, 4, 3, 2} | | +—————-+————–+ | | 3 | {6, 5, 4, 3} | +————+—————-+————–+ | ‘b4e’ | 0 | {5, 6, 7, 0} | | +—————-+————–+ | | 1 | {6, 7, 0, 1} | | +—————-+————–+ | | 2 | {7, 0, 1, 2} | | +—————-+————–+ | | 3 | {0, 1, 2, 3} | +————+—————-+————–+ | ‘rc8’ | 0 | {0, 0, 0, 0} | | +—————-+————–+ | | 1 | {1, 1, 1, 1} | | +—————-+————–+ | | 2 | {2, 2, 2, 2} | | +—————-+————–+ | | 3 | {3, 3, 3, 3} | +————+—————-+————–+ | ‘ecl’ | 0 | {3, 2, 1, 0} | | +—————-+————–+ | | 1 | {3, 2, 1, 1} | | +—————-+————–+ | | 2 | {3, 2, 2, 2} | | +—————-+————–+ | | 3 | {3, 3, 3, 3} | +————+—————-+————–+ | ‘ecr’ | 0 | {0, 0, 0, 0} | | +—————-+————–+ | | 1 | {1, 1, 1, 0} | | +—————-+————–+ | | 2 | {2, 2, 1, 0} | | +—————-+————–+ | | 3 | {3, 2, 1, 0} | +————+—————-+————–+ | ‘rc16’ | 0 | {1, 0, 1, 0} | | +—————-+————–+ | | 1 | {3, 2, 3, 2} | | +—————-+————–+ | | 2 | {1, 0, 1, 0} | | +—————-+————–+ | | 3 | {3, 2, 3, 2} | +————+—————-+————–+

[For more information, see PTX ISA] ( https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prmt)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
mode::mlir::NVVM::PermuteModeAttrNVVM permute mode

Operands: 

OperandDescription
lo32-bit signless integer
hi32-bit signless integer
selector32-bit signless integer

Results: 

ResultDescription
res32-bit signless integer

nvvm.prefetch (NVVM::PrefetchOp) 

Brings the cache line containing an address into the specified cache level

Syntax:

operation ::= `nvvm.prefetch` (`level` `=` $cacheLevel^ (`uniform` $uniform^)? `,`)? (`tensormap` $tensormap^ (`in_param_space` $in_param_space^)? `,`)? (`evict_priority` `=` $evictPriority^ `,`)? $addr (`,` `predicate` `=` $predicate^)? attr-dict `:` type(operands)

Prefetches the cache line containing the address given by addr. The operand may be a global, local, or generic pointer. When tensormap is specified, the operand may instead be a constant or generic pointer. If the address maps to shared memory, the operation has no effect.

At most one of cacheLevel or tensormap may be present. The cacheLevel attribute selects the target cache level. When combined with uniform, the prefetch is performed to the uniform cache, in which case addr must be a generic pointer.

When tensormap is used, the line containing addr is brought from the constant or parameter state space for later use by cp.async.bulk.tensor. If in_param_space is specified, the generic pointer is interpreted as referring to the parameter state space.

uniform can be specified after the cacheLevel to indicate that the prefetch is performed to the specified uniform cache level. If uniform is specified, addr must be a generic address pointer and no operation is performed if addr maps to a const, local, or shared memory location.

The evictPriority attribute is optional and specifies the cache eviction priority when cacheLevel is L2.

For more information, see PTX ISA

Interfaces: BasicPtxBuilderInterface

Attributes: 

AttributeMLIR TypeDescription
cacheLevel::mlir::NVVM::PrefetchCacheLevelAttr
NVVM Prefetch Cache Level

Enum cases:

  • L1 (L1)
  • L2 (L2)
evictPriority::mlir::NVVM::CacheEvictionPriorityAttr
NVVM Cache Eviction Priority

Enum cases:

  • evict_normal (EvictNormal)
  • evict_first (EvictFirst)
  • evict_last (EvictLast)
  • evict_unchanged (EvictUnchanged)
  • no_allocate (NoAllocate)
tensormap::mlir::UnitAttrunit attribute
uniform::mlir::UnitAttrunit attribute
in_param_space::mlir::UnitAttrunit attribute

Operands: 

OperandDescription
addrLLVM pointer in address space 1 or LLVM pointer in address space 5 or LLVM pointer in address space 0 or LLVM pointer in address space 4
predicate1-bit signless integer

nvvm.rcp.approx.ftz.f (NVVM::RcpApproxFtzF32Op) 

Syntax:

operation ::= `nvvm.rcp.approx.ftz.f` $arg attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands: 

OperandDescription
arg32-bit float

Results: 

ResultDescription
res32-bit float

nvvm.redux.sync (NVVM::ReduxOp) 

Redux Sync Op

Syntax:

operation ::= `nvvm.redux.sync` $kind $val `,` $mask_and_clamp  attr-dict `:` type($val) `->` type($res)

redux.sync performs a reduction operation kind of the 32 bit source register across all non-exited threads in the membermask.

The abs and nan attributes can be used in the case of f32 input type, where the abs attribute causes the absolute value of the input to be used in the reduction operation, and the nan attribute causes the reduction operation to return NaN if any of the inputs to participating threads are NaN.

For more information, see PTX ISA

Traits: NVVMRequiresSM<80>

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::ReduxKindAttrNVVM redux kind
abs::mlir::BoolAttrbool attribute
nan::mlir::BoolAttrbool attribute

Operands: 

OperandDescription
val32-bit signless integer or 32-bit float
mask_and_clamp32-bit signless integer

Results: 

ResultDescription
res32-bit signless integer or 32-bit float

nvvm.setmaxregister (NVVM::SetMaxRegisterOp) 

Syntax:

operation ::= `nvvm.setmaxregister` $action $regCount attr-dict

Attributes: 

AttributeMLIR TypeDescription
regCount::mlir::IntegerAttr32-bit signless integer attribute
action::mlir::NVVM::SetMaxRegisterActionAttrNVVM set max register action

nvvm.shfl.sync (NVVM::ShflOp) 

NVVM Dialect Op for shfl.sync

Syntax:

operation ::= `nvvm.shfl.sync` $kind $thread_mask `,` $val `,` $offset `,` $mask_and_clamp  attr-dict
              `:` type($val) `->` type($res)

The shfl.sync Op implements data shuffle within threads of a warp. The thread_mask denotes the threads participating in the Op where the bit position corresponds to a particular thread’s laneid. The offset specifies a source lane or source lane offset (depending on kind). The val is the input value to be copied from the source. The mask_and_clamp contains two packed values specifying a mask for logically splitting warps into sub-segments and an upper bound for clamping the source lane index.

The return_value_and_is_valid unit attribute can be specified to indicate that the return value is a two-element struct, where the first element is the result value and the second element is a predicate indicating if the computed source lane index is valid.

For more information, see PTX ISA

Traits: NVVMRequiresSM<30>

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::ShflKindAttrNVVM shuffle kind
return_value_and_is_valid::mlir::UnitAttrunit attribute

Operands: 

OperandDescription
thread_mask32-bit signless integer
val32-bit signless integer or 32-bit float
offset32-bit signless integer
mask_and_clamp32-bit signless integer

Results: 

ResultDescription
res32-bit signless integer or 32-bit float or LLVM struct type

nvvm.read.ptx.sreg.nsmid (NVVM::SmDimOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.nsmid` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.smid (NVVM::SmIdOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.smid` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.stmatrix (NVVM::StMatrixOp) 

Cooperative matrix store

Syntax:

operation ::= `nvvm.stmatrix` $ptr `,` $sources attr-dict `:` type(operands)

Collectively store one or more matrices across all threads in a warp to the location indicated by the address operand $ptr in shared memory.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
layout::mlir::NVVM::MMALayoutAttrNVVM MMA layout
shape::mlir::NVVM::LdStMatrixShapeAttrMatrix shape for ldmatrix and stmatrix
eltType::mlir::NVVM::LdStMatrixEltTypeAttrElement type for ldmatrix and stmatrix

Operands: 

OperandDescription
ptrLLVM pointer in address space 3
sourcesvariadic of 32-bit signless integer

nvvm.bar.warp.sync (NVVM::SyncWarpOp) 

Warp Barrier Synchronization Op

Syntax:

operation ::= `nvvm.bar.warp.sync` $mask attr-dict `:` type($mask)

The nvvm.bar.warp.sync operation performs barrier synchronization for threads within a warp.

This operation causes the executing thread to wait until all threads corresponding to the mask operand have executed a bar.warp.sync with the same mask value before resuming execution.

The mask operand specifies the threads participating in the barrier, where each bit position corresponds to the thread’s lane ID within the warp. Only threads with their corresponding bit set in the mask participate in the barrier synchronization.

Important constraints:

  • The behavior is undefined if the executing thread is not included in the mask (i.e., the bit corresponding to the thread’s lane ID is not set)
  • For compute capability sm_6x or below, all threads in the mask must execute the same bar.warp.sync instruction in convergence

This operation also guarantees memory ordering among participating threads. Threads within the warp that wish to communicate via memory can store to memory, execute bar.warp.sync, and then safely read values stored by other threads in the warp.

For more information, see PTX ISA

Operands: 

OperandDescription
mask32-bit signless integer

nvvm.tcgen05.alloc (NVVM::Tcgen05AllocOp) 

Tcgen05 alloc operation

Syntax:

operation ::= `nvvm.tcgen05.alloc` $addr `,` $nCols attr-dict `:` type(operands)

The tcgen05.alloc Op allocates tensor core memory for the amount specified by nCols and writes the destination address to the addr argument. The nCols operand specifies the number of columns to be allocated and it must be a power-of-two. For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,101>

Attributes: 

AttributeMLIR TypeDescription
group::mlir::NVVM::CTAGroupKindAttrNVVM CTA group kind

Operands: 

OperandDescription
addrLLVM pointer type or LLVM pointer in address space 3
nCols32-bit signless integer

nvvm.tcgen05.commit (NVVM::Tcgen05CommitOp) 

Tcgen05 commit operations

Syntax:

operation ::= `nvvm.tcgen05.commit` $addr (`,` `multicast_mask` `=` $multicastMask^)?
              attr-dict `:` type(operands)

The tcgen05.commit makes the mbarrier object, specified by the operand addr, track the completion of all the prior async-tcgen05 operations initiated by the executing thread. The multicast variants allow signaling on the mbarrier objects of multiple CTAs within the cluster. Operand multicastMask, when present, specifies the destination CTAs in the cluster such that each bit position in the 16-bit multicastMask operand corresponds to the nvvm.read.ptx.sreg.ctaid of the destination CTA. For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,101>

Attributes: 

AttributeMLIR TypeDescription
group::mlir::NVVM::CTAGroupKindAttrNVVM CTA group kind

Operands: 

OperandDescription
addrLLVM pointer type or LLVM pointer in address space 3
multicastMask16-bit signless integer

nvvm.tcgen05.cp (NVVM::Tcgen05CpOp) 

Tcgen05 copy operation

Syntax:

operation ::= `nvvm.tcgen05.cp` $taddr`,` $smem_desc attr-dict

Instruction tcgen05.cp initiates an asynchronous copy operation from shared memory to the location specified by the address operand taddr in the Tensor Memory. The 64-bit register operand smem_desc specifies the matrix descriptor representing the source matrix in the shared memory that needs to be copied.

Example:

  nvvm.tcgen05.cp %taddr, %smem_desc {
    group = #nvvm.tcgen05_group<cta_2>,
    shape = #nvvm.tcgen05_cp_shape<shape_64x128b>,
    multicast = #nvvm.tcgen05_cp_multicast<warpx2_01_23>,
    srcFormat = #nvvm.tcgen05_cp_src_fmt<b6x16_p32>
  }

For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,101>

Attributes: 

AttributeMLIR TypeDescription
shape::mlir::NVVM::Tcgen05CpShapeAttrtcgen05 cp shapes
group::mlir::NVVM::CTAGroupKindAttrNVVM CTA group kind
multicast::mlir::NVVM::Tcgen05CpMulticastAttrtcgen05 cp multicast
srcFormat::mlir::NVVM::Tcgen05CpSrcFormatAttrtcgen05 cp source format

Operands: 

OperandDescription
taddrLLVM pointer in address space 6
smem_desc64-bit signless integer

nvvm.tcgen05.dealloc (NVVM::Tcgen05DeallocOp) 

Tcgen05 dealloc operation

Syntax:

operation ::= `nvvm.tcgen05.dealloc` $taddr `,` $nCols attr-dict `:` type(operands)

The tcgen05.dealloc Op de-allocates the tensor core memory specified by tmemAddr, which must be from a previous tensor memory allocation. The nCols operand specifies the number of columns to be de-allocated, and it must be a power-of-two. For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,101>

Attributes: 

AttributeMLIR TypeDescription
group::mlir::NVVM::CTAGroupKindAttrNVVM CTA group kind

Operands: 

OperandDescription
taddrLLVM pointer in address space 6
nCols32-bit signless integer

nvvm.tcgen05.fence (NVVM::Tcgen05FenceOp) 

Tcgen05 fence operations

Syntax:

operation ::= `nvvm.tcgen05.fence` $kind attr-dict

The tcgen05.fence<before> orders all prior async tcgen05 operations with respect to the subsequent tcgen05 and execution ordering operations. The tcgen05.fence<after> orders all subsequent async tcgen05 operations with respect to the prior tcgen05 and execution ordering operations.

For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,101>

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::Tcgen05FenceKindAttrNVVM Tcgen05 fence kind

nvvm.tcgen05.ld (NVVM::Tcgen05LdOp) 

Tensor memory load instructions

Syntax:

operation ::= `nvvm.tcgen05.ld` $tmemAddr (`,` $offset^)? (`pack` $pack^)? attr-dict `:` type($res)

Instruction tcgen05.ld asynchronously loads data from the Tensor Memory at the location specified by the 32-bit address operand tmemAddr into the destination register res, collectively across all threads of the warps.

The shape and the num attribute together determines the total dimension of the data which is loaded from the Tensor Memory. The shape attribute indicates the base dimension of data to be accessed as described in the Data Movement Shape. The num attribute indicates the repeat factor on the base dimension resulting in the total dimension of the data that is accessed.

The shape 16x32bx2 performs two accesses into Tensor Memory of the shape 16x32b. The base address of the first access is specified by tmemAddr and the base address of the second access is specified by tmemAddr + offset, where offset is an immediate argument.

The unit attribute pack can be used to pack two 16-bit elements from adjacent columns into a single 32-bit element during the load.

The following table describes the size of the vector for various combinations of num and shape attributes:

|=====================================================================|
| num/shape      |     16x32bx2/16x64b/32x32b |  16x128b   | 16x256b  |
|=====================================================================|
| x1             |          1                 |    2       |    4     |
| x2             |          2                 |    4       |    8     |
| x4             |          4                 |    8       |    16    |
| x8             |          8                 |    16      |    32    |
| x16            |          16                |    32      |    64    |
| x32            |          32                |    64      |    128   |
| x64            |          64                |    128     |    NA    |
| x128           |          128               |    NA      |    NA    |
|=====================================================================|

Example:

  nvvm.tcgen05.ld %tmemAddr, %offset pack {
    shape = #nvvm.tcgen05_ldst_shape<shape_16x32bx2>,
  } : <2xi32>

For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,101>

Attributes: 

AttributeMLIR TypeDescription
pack::mlir::UnitAttrunit attribute
shape::mlir::NVVM::Tcgen05LdStShapeAttrallowed 32-bit signless integer cases: 0, 1, 2, 3, 4

Operands: 

OperandDescription
tmemAddrLLVM pointer in address space 6
offset64-bit signless integer

Results: 

ResultDescription
res32-bit signless integer or vector of 32-bit signless integer values of length 2/4/8/16/32/64/128

nvvm.tcgen05.mma.block_scale (NVVM::Tcgen05MMABlockScaleOp) 

Performs block scaled MMA operation on 5th-gen tensor cores

Syntax:

operation ::= `nvvm.tcgen05.mma.block_scale` $matrixD `,` $matrixA `,` $matrixB `,` $idesc `,` $enableInputD `,` $scaleA `,` $scaleB
              attr-dict `:` `(` type(operands) `)`

The tcgen05.mma.block_scale operation is an asynchronous tensor core instruction that performs matrix multiplication, accumulation with block scaling in a single fused operation. It targets 5th-generation tensor cores, providing developers with fine-grained control over execution and scheduling.

D = (A * scale_a)  * (B * scale_b)`      // if `enableInputD` is false
D = (A * scale_a)  * (B * scale_b) + D`

where:

  • A is an M x (K / 2) matrix in tensor memory or described using shared memory descriptor
  • B is a K x N matrix described using shared memory descriptor
  • D is an M x N accumulator matrix in tensor memory
  • scale_a and scale_b are matrices in tensor memory used to scale A and B respectively

The shared memory descriptor can be generated using tcgen05.mma_smem_desc Op

Required Attributes:

  • kind is a Tcgen05MMABlockScaleKind attribute

  • ctaGroup specifies CTA group configuration

    • cta_1: MMA will be performed on the current thread’s CTA
    • cta_2: MMA will be performed on the current thread and it’s peer CTA

Default Attributes:

  • collectorOp is a Tcgen05MMACollectorOp attribute with matrix A as the collector buffer

For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,110>

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::Tcgen05MMABlockScaleKindAttr
tcgen05.mma.block_scale supported types
The Tcgen05MMABlockScaleKind attribute describes the allowed set of types for matrix A and B in the tcgen05.mma.{sp}.block_scale Op. The following are supported types for each kind:
+--------------+-------------------------------------------+
| Matrix Kind  |      supported types for A / B            |
+--------------+-------------------------------------------+
| mxf8f6f4     | e4m3, e5m3, e2m3, e3m2, e2m1              |
| mxf4         | e2m1                                      |
| mxf4nvf4     | e2m1                                      |
+--------------+-------------------------------------------+

ctaGroup::mlir::NVVM::CTAGroupKindAttrNVVM CTA group kind
blockScale::mlir::NVVM::Tcgen05MMABlockScaleAttrtcgen05.mma block scale attribute
collectorOp::mlir::NVVM::Tcgen05MMACollectorOpAttr
tcgen05.mma Collector Buffer Operation
Tcgen05MMACollectorOp attribute specifies the collector buffer operations.
The following are the supported operations:
  * discard : Release buffer after use (default)
  * lastuse : Mark buffer for last use
  * fill    : Fill buffer
  * use     : Use buffer without modification

Operands: 

OperandDescription
matrixDLLVM pointer in address space 6
matrixALLVM pointer in address space 6 or 64-bit signless integer
matrixB64-bit signless integer
idesc32-bit signless integer
enableInputD1-bit signless integer
scaleALLVM pointer in address space 6
scaleBLLVM pointer in address space 6

nvvm.tcgen05.mma (NVVM::Tcgen05MMAOp) 

Performs MMA operation on 5th-gen tensor cores

Syntax:

operation ::= `nvvm.tcgen05.mma` $matrixD `,` $matrixA `,` $matrixB `,` $idesc `,` $enableInputD (`scale` `=` $scaleInputD^)?
              (`mask` `=` $disableOutputLane^)? attr-dict `:` `(` type(operands) `)`

The tcgen05.mma operation is an asynchronous tensor core instruction that performs matrix multiplication, accumulation in a single fused operation. It targets 5th-generation tensor cores, providing developers with fine-grained control over execution and scheduling.

D = A * B + (D * 2^ -scaleInputD)    // if `scaleInputD` is provided
D = A * B                            // if `enableInputD` is false
D = A * B + D                        // otherwise

where:

  • A is an M x K matrix in tensor memory or described using shared memory descriptor
  • B is a K x N matrix described using shared memory descriptor
  • D is an M x N accumulator matrix in tensor memory

The shared memory descriptor can be generated using tcgen05.mma_smem_desc Op

Optional Operands:

  • scaleInputD is an Immediate value operand used for scaling D matrix by 2 ^ (-scaleInputD). The valid range is [0, 15]

  • disableOutputLane is a vector mask for selective output

    • vector<4 x i32> when ctaGroup is CTA_1
    • vector<8 x i32> when ctaGroup is CTA_2

Required Attributes:

  • kind is a Tcgen05MMAKind attribute

  • ctaGroup specifies CTA group configuration

    • cta_1: MMA will be performed on the current thread’s CTA
    • cta_2: MMA will be performed on the current thread and it’s peer CTA

Default Attributes:

  • collectorOp is a Tcgen05MMACollectorOp attribute with matrix A as the collector buffer

  • aShift shifts the rows of the A matrix down by one row and can only be applied if A is in tensor memory

For more information, see PTX ISA

Traits: AttrSizedOperandSegments, NVVMRequiresSMa<100,110>

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::Tcgen05MMAKindAttr
tcgen05 MMA Supported Types
The Tcgen05MMAKind attribute describes the allowed set of types for matrix A and B in the tcgen05.mma.{sp} Op. The following are supported types for each kind:
+-------------+--------------------------------------------+
| Matrix Kind |      supported types for A / B             |
+-------------+--------------------------------------------+
| f16         | f16, bf16                                  |
| tf32        | tf32                                       |
| f8f6f4      | e4m3, e5m2, e2m3, e3m2, e2m1               |
| i8          | unsigned 8b, signed 8b                     |
+-------------+--------------------------------------------+

ctaGroup::mlir::NVVM::CTAGroupKindAttrNVVM CTA group kind
collectorOp::mlir::NVVM::Tcgen05MMACollectorOpAttr
tcgen05.mma Collector Buffer Operation
Tcgen05MMACollectorOp attribute specifies the collector buffer operations.
The following are the supported operations:
  * discard : Release buffer after use (default)
  * lastuse : Mark buffer for last use
  * fill    : Fill buffer
  * use     : Use buffer without modification
aShift::mlir::UnitAttrunit attribute

Operands: 

OperandDescription
matrixDLLVM pointer in address space 6
matrixALLVM pointer in address space 6 or 64-bit signless integer
matrixB64-bit signless integer
idesc32-bit signless integer
enableInputD1-bit signless integer
scaleInputD64-bit signless integer
disableOutputLanefixed-length vector of 32-bit signless integer values of length 4/8

nvvm.tcgen05.mma.sp.block_scale (NVVM::Tcgen05MMASparseBlockScaleOp) 

Performs block scaled MMA operation with sparse A matrix on 5th-gen tensor cores

Syntax:

operation ::= `nvvm.tcgen05.mma.sp.block_scale` $matrixD `,` $matrixA `,` $matrixB `,` $idesc `,` $enableInputD `,` $sparseMetadata `,`  $scaleA `,`  $scaleB
              attr-dict `:` `(` type(operands) `)`

The tcgen05.mma.sp.block_scale operation is an asynchronous tensor core instruction that performs matrix multiplication, accumulation with block scaling, and sparse A matrix in a single fused operation. It targets 5th-generation tensor cores, providing developers with fine-grained control over execution, and scheduling.

D = (A * scale_a)  * (B * scale_b)      // if `enableInputD` is specified
D = (A * scale_a)  * (B * scale_b) + D  // otherwise

where:

  • A is an M x (K / 2) matrix in tensor memory or described using shared memory descriptor
  • B is a K x N matrix described using shared memory descriptor
  • D is an M x N accumulator matrix in tensor memory
  • scale_a and scale_b are matrices in tensor memory used to scale A and B respectively

Other attributes and operands are similar to that of tcgen05.mma.block_scale Op

For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,110>

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::Tcgen05MMABlockScaleKindAttr
tcgen05.mma.block_scale supported types
The Tcgen05MMABlockScaleKind attribute describes the allowed set of types for matrix A and B in the tcgen05.mma.{sp}.block_scale Op. The following are supported types for each kind:
+--------------+-------------------------------------------+
| Matrix Kind  |      supported types for A / B            |
+--------------+-------------------------------------------+
| mxf8f6f4     | e4m3, e5m3, e2m3, e3m2, e2m1              |
| mxf4         | e2m1                                      |
| mxf4nvf4     | e2m1                                      |
+--------------+-------------------------------------------+

ctaGroup::mlir::NVVM::CTAGroupKindAttrNVVM CTA group kind
blockScale::mlir::NVVM::Tcgen05MMABlockScaleAttrtcgen05.mma block scale attribute
collectorOp::mlir::NVVM::Tcgen05MMACollectorOpAttr
tcgen05.mma Collector Buffer Operation
Tcgen05MMACollectorOp attribute specifies the collector buffer operations.
The following are the supported operations:
  * discard : Release buffer after use (default)
  * lastuse : Mark buffer for last use
  * fill    : Fill buffer
  * use     : Use buffer without modification

Operands: 

OperandDescription
matrixDLLVM pointer in address space 6
matrixALLVM pointer in address space 6 or 64-bit signless integer
matrixB64-bit signless integer
idesc32-bit signless integer
enableInputD1-bit signless integer
sparseMetadataLLVM pointer in address space 6
scaleALLVM pointer in address space 6
scaleBLLVM pointer in address space 6

nvvm.tcgen05.mma.sp (NVVM::Tcgen05MMASparseOp) 

Performs MMA operation with sparse A matrix on 5th-gen tensor cores

Syntax:

operation ::= `nvvm.tcgen05.mma.sp` $matrixD `,` $matrixA `,` $matrixB `,` $idesc `,` $enableInputD `,` $sparseMetadata (`scale` `=` $scaleInputD^)? (`mask` `=` $disableOutputLane^)? attr-dict `:` `(` type(operands) `)`

The tcgen05.mma.sp operation is an asynchronous tensor core instruction that performs matrix multiplication, accumulation with sparse A matrix in a single fused operation. It targets 5th-generation tensor cores, providing developers with fine-grained control over execution and scheduling.

D = A * B + (D * 2^ -scaleInputD)    // if `scaleInputD` is provided
D = A * B                            // if `enableInputD` is false
D = A * B + D                        // otherwise

where:

  • A is an M x (K / 2) matrix in tensor memory or described using shared memory descriptor
  • B is a K x N matrix described using shared memory descriptor
  • D is an M x N accumulator matrix in tensor memory
  • sparseMetadata located in tensor memory specifies the mapping of the K / 2 non-zero elements to the K elements before performing the MMA operation

Other attributes and operands are similar to that of tcgen05.mma Op

For more information, see PTX ISA

Traits: AttrSizedOperandSegments, NVVMRequiresSMa<100,110>

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::Tcgen05MMAKindAttr
tcgen05 MMA Supported Types
The Tcgen05MMAKind attribute describes the allowed set of types for matrix A and B in the tcgen05.mma.{sp} Op. The following are supported types for each kind:
+-------------+--------------------------------------------+
| Matrix Kind |      supported types for A / B             |
+-------------+--------------------------------------------+
| f16         | f16, bf16                                  |
| tf32        | tf32                                       |
| f8f6f4      | e4m3, e5m2, e2m3, e3m2, e2m1               |
| i8          | unsigned 8b, signed 8b                     |
+-------------+--------------------------------------------+

ctaGroup::mlir::NVVM::CTAGroupKindAttrNVVM CTA group kind
collectorOp::mlir::NVVM::Tcgen05MMACollectorOpAttr
tcgen05.mma Collector Buffer Operation
Tcgen05MMACollectorOp attribute specifies the collector buffer operations.
The following are the supported operations:
  * discard : Release buffer after use (default)
  * lastuse : Mark buffer for last use
  * fill    : Fill buffer
  * use     : Use buffer without modification
aShift::mlir::UnitAttrunit attribute

Operands: 

OperandDescription
matrixDLLVM pointer in address space 6
matrixALLVM pointer in address space 6 or 64-bit signless integer
matrixB64-bit signless integer
idesc32-bit signless integer
enableInputD1-bit signless integer
sparseMetadataLLVM pointer in address space 6
scaleInputD64-bit signless integer
disableOutputLanefixed-length vector of 32-bit signless integer values of length 4/8

nvvm.tcgen05.mma.ws (NVVM::Tcgen05MMAWsOp) 

Performs weight stationary convolution MMA operation on 5th-gen tensor cores

Syntax:

operation ::= `nvvm.tcgen05.mma.ws` $matrixD `,` $matrixA `,` $matrixB `,` $idesc `,` $enableInputD (`,` $zeroColMask^)?
              attr-dict `:` `(` type(operands) `)`

The tcgen05.mma.ws operation is an asynchronous tensor core instruction that performs weight stationary convolution matrix multiplication, accumulation in a single fused operation. It targets 5th-generation tensor cores, providing developers with fine-grained control over execution, and scheduling.

D = A * B`      // if `enableInputD` is false
D = A * B + D`  // otherwise

where:

  • A is an M x K matrix in tensor memory or described using shared memory descriptor
  • B is a K x N matrix described using shared memory descriptor
  • D is an M x N accumulator matrix in tensor memory

The shared memory descriptor can be generated using tcgen05.mma_smem_desc Op

Optional Operands:

Required Attributes:

  • kind is a Tcgen05MMAKind attribute

Default Valued Attributes:

  • collectorBBuffer specifies collector buffer for matrix B: b0 (default), b1, b2, b3

  • collectorOp is a Tcgen05MMACollectorOp attribute with matrix B as the collector buffer

For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,110>

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::Tcgen05MMAKindAttr
tcgen05 MMA Supported Types
The Tcgen05MMAKind attribute describes the allowed set of types for matrix A and B in the tcgen05.mma.{sp} Op. The following are supported types for each kind:
+-------------+--------------------------------------------+
| Matrix Kind |      supported types for A / B             |
+-------------+--------------------------------------------+
| f16         | f16, bf16                                  |
| tf32        | tf32                                       |
| f8f6f4      | e4m3, e5m2, e2m3, e3m2, e2m1               |
| i8          | unsigned 8b, signed 8b                     |
+-------------+--------------------------------------------+

collectorBBuffer::mlir::NVVM::Tcgen05MMACollectorBBufferAttrtcgen05 MMA Collector Buffer B Attribute
collectorOp::mlir::NVVM::Tcgen05MMACollectorOpAttr
tcgen05.mma Collector Buffer Operation
Tcgen05MMACollectorOp attribute specifies the collector buffer operations.
The following are the supported operations:
  * discard : Release buffer after use (default)
  * lastuse : Mark buffer for last use
  * fill    : Fill buffer
  * use     : Use buffer without modification

Operands: 

OperandDescription
matrixDLLVM pointer in address space 6
matrixALLVM pointer in address space 6 or 64-bit signless integer
matrixB64-bit signless integer
idesc32-bit signless integer
enableInputD1-bit signless integer
zeroColMask64-bit signless integer

nvvm.tcgen05.mma.ws.sp (NVVM::Tcgen05MMAWsSparseOp) 

Performs weight stationary convolution MMA with sparse A matrix on 5th-gen tensor cores

Syntax:

operation ::= `nvvm.tcgen05.mma.ws.sp` $matrixD `,` $matrixA `,` $matrixB `,` $idesc `,` $enableInputD `,` $sparseMetadata (`,` $zeroColMask^)? attr-dict `:` `(` type(operands) `)`

The tcgen05.mma.ws.sp operation is an asynchronous tensor core instruction that performs weight stationary convolution matrix multiplication, accumulation with sparse A matrix in a single fused operation. It targets 5th-generation tensor cores, providing developers with fine-grained control over execution, and scheduling.

D = A * B`      // if `enableInputD` is false
D = A * B + D`  // otherwise

where:

  • A is an M x (K / 2) matrix in memory or descriptor format
  • B is a K x N matrix
  • D is an M x N accumulator matrix
  • sparseMetadata located in tensor memory specifies the mapping of the K / 2 non-zero elements to the K elements before performing the MMA operation

Other attributes and operands are similar to that of tcgen05.mma.ws Op

For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,110>

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::Tcgen05MMAKindAttr
tcgen05 MMA Supported Types
The Tcgen05MMAKind attribute describes the allowed set of types for matrix A and B in the tcgen05.mma.{sp} Op. The following are supported types for each kind:
+-------------+--------------------------------------------+
| Matrix Kind |      supported types for A / B             |
+-------------+--------------------------------------------+
| f16         | f16, bf16                                  |
| tf32        | tf32                                       |
| f8f6f4      | e4m3, e5m2, e2m3, e3m2, e2m1               |
| i8          | unsigned 8b, signed 8b                     |
+-------------+--------------------------------------------+

collectorBBuffer::mlir::NVVM::Tcgen05MMACollectorBBufferAttrtcgen05 MMA Collector Buffer B Attribute
collectorOp::mlir::NVVM::Tcgen05MMACollectorOpAttr
tcgen05.mma Collector Buffer Operation
Tcgen05MMACollectorOp attribute specifies the collector buffer operations.
The following are the supported operations:
  * discard : Release buffer after use (default)
  * lastuse : Mark buffer for last use
  * fill    : Fill buffer
  * use     : Use buffer without modification

Operands: 

OperandDescription
matrixDLLVM pointer in address space 6
matrixALLVM pointer in address space 6 or 64-bit signless integer
matrixB64-bit signless integer
idesc32-bit signless integer
enableInputD1-bit signless integer
sparseMetadataLLVM pointer in address space 6
zeroColMask64-bit signless integer

nvvm.tcgen05.mma_smem_desc (NVVM::Tcgen05MmaSmemDescOp) 

Constructs a Shared Memory descriptor for MMA Operands A or B

Syntax:

operation ::= `nvvm.tcgen05.mma_smem_desc` `(` operands `)` attr-dict `:` `(` type(operands) `)` `->` type($res)

The nvvm.tcgen05_mma_smem_desc constructs a Shared Memory descriptor for tcgen05.mma. This descriptor is a 64-bit value which describes the properties of multiplicand matrix in shared memory including its location in the shared memory of the current CTA.

+-----------+------+------------------------------------------------------+
| Bit-field | Size | Description                                          |
+-----------+------+------------------------------------------------------+
| 0-13      | 14   | Matrix start address                                 |
| 14-15     | 2    | Reserved                                             |
| 16-29     | 14   | Leading dim relative-offset (or) absolute-address    |
| 30-31     | 2    | Reserved                                             |
| 32-45     | 14   | Stride dimension byte offset                         |
| 46-48     | 3    | Fixed constant value of 0b001                        |
| 49-51     | 3    | Matrix base offset                                   |
| 52        | 1    | Leading dimension stride mode:                       |
|           |      |   0: byte offset relative                            |
|           |      |   1: byte address absolute                           |
| 53-60     | 8    | Fixed constant value of 0xb00000000                  |
| 61-63     | 3    | Swizzling mode:                                      |
|           |      |   0: No swizzling                                    |
|           |      |   1: 128-Byte with 32B atomic swizzling              |
|           |      |   2: 128-Byte swizzling                              |
|           |      |   4: 64-Byte swizzling                               |
|           |      |   6: 32-Byte swizzling                               |
|           |      |   (Values 3, 5 and 7 are invalid)                    |
+-----------+------+------------------------------------------------------+    

Example:

  %desc = nvvm.tcgen05.mma_smem_desc (%startAddr, %leadingDimOffset, %strideDimOffset,
                                      %baseOffset, %leadingDimMode, %swizzleMode) : (i32, i32, i32, i8, i1, i8) -> i64

For more information, see PTX ISA

Operands: 

OperandDescription
startAddr32-bit signless integer
leadingDimOffset32-bit signless integer
strideDimOffset32-bit signless integer
baseOffset8-bit signless integer
leadingDimMode1-bit signless integer
swizzleMode8-bit signless integer

Results: 

ResultDescription
res64-bit signless integer

nvvm.tcgen05.relinquish_alloc_permit (NVVM::Tcgen05RelinquishAllocPermitOp) 

Tcgen05 Op to relinquish the right to allocate

Syntax:

operation ::= `nvvm.tcgen05.relinquish_alloc_permit` attr-dict

The tcgen05.relinquish_alloc_permit Op specifies that the CTA of the executing thread is relinquishing the right to allocate Tensor Memory. So, it is illegal for a CTA to perform tcgen05.alloc after any of its constituent threads execute tcgen05.relinquish_alloc_permit. For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,101>

Attributes: 

AttributeMLIR TypeDescription
group::mlir::NVVM::CTAGroupKindAttrNVVM CTA group kind

nvvm.tcgen05.shift (NVVM::Tcgen05ShiftOp) 

Tcgen05 shift operation

Syntax:

operation ::= `nvvm.tcgen05.shift` $taddr attr-dict `:` type(operands)

The tcgen05.shift is an asynchronous instruction which initiates the shifting of 32-byte elements downwards across all the rows, except the last, by one row. The operand taddr specifies the base address of the matrix in Tensor Memory whose rows must be down shifted.

For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,101,103>

Attributes: 

AttributeMLIR TypeDescription
group::mlir::NVVM::CTAGroupKindAttrNVVM CTA group kind

Operands: 

OperandDescription
taddrLLVM pointer in address space 6

nvvm.tcgen05.st (NVVM::Tcgen05StOp) 

Tensor memory store instructions

Syntax:

operation ::= `nvvm.tcgen05.st` $tmemAddr `,` $val (`,` $offset^)? (`unpack` $unpack^)? attr-dict `:` type($val)

Instruction tcgen05.st asynchronously stores data from the source register r into the Tensor Memory at the location specified by the 32-bit address operand tmemAddr, collectively across all threads of the warps.

The shape and the num attribute together determines the total dimension of the data which is stored to the Tensor Memory. The shape indicates the base dimension of data to be accessed. The num attribute indicates the repeat factor on the base dimension resulting in the total dimension of the data that is accessed.

The shape 16x32bx2 performs two accesses into Tensor Memory of the shape 16x32b. The base address of the first access is specified by tmemAddr and the base address of the second access is specified by tmemAddr + offset, where offset is an immediate argument.

The unit attribute unpack can be used to unpack a 32-bit element in the register into two 16-bit elements and store them in adjacent columns.

The following table describes the size of the vector for various combinations of num and shape attributes:

|=====================================================================|
| num/shape      |     16x32bx2/16x64b/32x32b |  16x128b   | 16x256b  |
|=====================================================================|
| x1             |          1                 |    2       |    4     |
| x2             |          2                 |    4       |    8     |
| x4             |          4                 |    8       |    16    |
| x8             |          8                 |    16      |    32    |
| x16            |          16                |    32      |    64    |
| x32            |          32                |    64      |    128   |
| x64            |          64                |    128     |    NA    |
| x128           |          128               |    NA      |    NA    |
|=====================================================================|

Example:

  nvvm.tcgen05.st %tmemAddr, %val, %offset unpack {
    shape = #nvvm.tcgen05_ldst_shape<shape_16x32bx2>,
  } : <2xi32>

For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,101>

Attributes: 

AttributeMLIR TypeDescription
unpack::mlir::UnitAttrunit attribute
shape::mlir::NVVM::Tcgen05LdStShapeAttrallowed 32-bit signless integer cases: 0, 1, 2, 3, 4

Operands: 

OperandDescription
tmemAddrLLVM pointer in address space 6
val32-bit signless integer or vector of 32-bit signless integer values of length 2/4/8/16/32/64/128
offset64-bit signless integer

nvvm.tcgen05.wait (NVVM::Tcgen05WaitOp) 

Tcgen05 wait operations

Syntax:

operation ::= `nvvm.tcgen05.wait` $kind attr-dict

The tcgen05.wait<load> causes the executing thread to block until all prior tcgen05.ld operations issued by the executing thread have completed. Similarly, the tcgen05.wait<store> causes the executing thread to block until all prior tcgen05.st operations issued by the executing thread have completed. For more information, see PTX ISA

Traits: NVVMRequiresSMa<100,101>

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::Tcgen05WaitKindAttrNVVM Tcgen05 wait kind

nvvm.read.ptx.sreg.tid.x (NVVM::ThreadIdXOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.tid.x` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.tid.y (NVVM::ThreadIdYOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.tid.y` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.tid.z (NVVM::ThreadIdZOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.tid.z` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.vote.sync (NVVM::VoteSyncOp) 

Vote across thread group

Syntax:

operation ::= `nvvm.vote.sync` $kind $mask `,` $pred attr-dict `->` type($res)

The vote.sync op will cause executing thread to wait until all non-exited threads corresponding to membermask have executed vote.sync with the same qualifiers and same membermask value before resuming execution.

The vote operation kinds are:

  • any: True if source predicate is True for some thread in membermask.
  • all: True if source predicate is True for all non-exited threads in membermask.
  • uni: True if source predicate has the same value in all non-exited threads in membermask.
  • ballot: In the ballot form, the destination result is a 32 bit integer. In this form, the predicate from each thread in membermask are copied into the corresponding bit position of the result, where the bit position corresponds to the thread’s lane id.

For more information, see PTX ISA

Attributes: 

AttributeMLIR TypeDescription
kind::mlir::NVVM::VoteSyncKindAttrNVVM vote sync kind

Operands: 

OperandDescription
mask32-bit signless integer
pred1-bit signless integer

Results: 

ResultDescription
res32-bit signless integer or 1-bit signless integer

nvvm.wmma.load (NVVM::WMMALoadOp) 

Warp synchronous matrix load

Syntax:

operation ::= `nvvm.wmma.load` $ptr `,` $stride attr-dict `:` functional-type($ptr, $res)

Attributes: 

AttributeMLIR TypeDescription
m::mlir::IntegerAttr32-bit signless integer attribute
n::mlir::IntegerAttr32-bit signless integer attribute
k::mlir::IntegerAttr32-bit signless integer attribute
layout::mlir::NVVM::MMALayoutAttrNVVM MMA layout
eltype::mlir::NVVM::MMATypesAttrNVVM MMA types
frag::mlir::NVVM::MMAFragAttrNVVM MMA frag type

Operands: 

OperandDescription
ptrLLVM pointer type
stride32-bit signless integer

Results: 

ResultDescription
resLLVM structure type or 64-bit float

nvvm.wmma.mma (NVVM::WMMAMmaOp) 

Warp synchronous matrix-multiply accumulate using tensor cores.

Syntax:

operation ::= `nvvm.wmma.mma` $args attr-dict `:` functional-type($args, $res)

Attributes: 

AttributeMLIR TypeDescription
m::mlir::IntegerAttr32-bit signless integer attribute
n::mlir::IntegerAttr32-bit signless integer attribute
k::mlir::IntegerAttr32-bit signless integer attribute
layoutA::mlir::NVVM::MMALayoutAttrNVVM MMA layout
layoutB::mlir::NVVM::MMALayoutAttrNVVM MMA layout
eltypeA::mlir::NVVM::MMATypesAttrNVVM MMA types
eltypeB::mlir::NVVM::MMATypesAttrNVVM MMA types

Operands: 

OperandDescription
argsvariadic of LLVM dialect-compatible type

Results: 

ResultDescription
resLLVM structure type

nvvm.wmma.store (NVVM::WMMAStoreOp) 

Warp synchronous matrix store

Syntax:

operation ::= `nvvm.wmma.store` $ptr `,` $stride `,` $args attr-dict `:` qualified(type($ptr)) `,`
              type($args)

Attributes: 

AttributeMLIR TypeDescription
m::mlir::IntegerAttr32-bit signless integer attribute
n::mlir::IntegerAttr32-bit signless integer attribute
k::mlir::IntegerAttr32-bit signless integer attribute
layout::mlir::NVVM::MMALayoutAttrNVVM MMA layout
eltype::mlir::NVVM::MMATypesAttrNVVM MMA types

Operands: 

OperandDescription
ptrLLVM pointer type
argsvariadic of LLVM dialect-compatible type
stride32-bit signless integer

nvvm.read.ptx.sreg.nwarpid (NVVM::WarpDimOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.nwarpid` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.warpid (NVVM::WarpIdOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.warpid` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.read.ptx.sreg.warpsize (NVVM::WarpSizeOp) 

Syntax:

operation ::= `nvvm.read.ptx.sreg.warpsize` (`range` $range^)? attr-dict `:` type($res)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferIntRangeInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes: 

AttributeMLIR TypeDescription
range::mlir::LLVM::ConstantRangeAttr
A range of two integers, corresponding to LLVM's ConstantRange
A pair of two integers, mapping to the ConstantRange structure in LLVM IR,
which is allowed to wrap or be empty.

The range represented is [Lower, Upper), and is either signed or unsigned depending on context.

lower and upper must have the same width.

Syntax:

`&lt;` `i`(width($lower)) $lower `,` $upper `&gt;`

Results: 

ResultDescription
resLLVM dialect-compatible type

nvvm.wgmma.fence.aligned (NVVM::WgmmaFenceAlignedOp) 

Syntax:

operation ::= `nvvm.wgmma.fence.aligned` attr-dict

Enforce an ordering of register accesses between warpgroup level matrix multiplication and other operations.

For more information, see PTX ISA

Traits: NVVMRequiresSMa<90>

nvvm.wgmma.commit.group.sync.aligned (NVVM::WgmmaGroupSyncAlignedOp) 

Syntax:

operation ::= `nvvm.wgmma.commit.group.sync.aligned` attr-dict

Commits all prior uncommitted warpgroup level matrix multiplication operations.

For more information, see PTX ISA

Traits: NVVMRequiresSMa<90>

nvvm.wgmma.mma_async (NVVM::WgmmaMmaAsyncOp) 

Syntax:

operation ::= `nvvm.wgmma.mma_async` $descriptorA `,` $descriptorB `,` $inouts `,` $shape `,`
              `D` `[` $typeD `,` $scaleD (`,` $satfinite^)? `]` `,`
              `A` `[` $typeA `,` $scaleA `,` $layoutA `]` `,`
              `B` `[` $typeB `,` $scaleB `,` $layoutB `]`
              attr-dict `:`
              type($inouts) `->` type($results)

The warpgroup (128 threads) level matrix multiply and accumulate operation has either of the following forms, where matrix D is called accumulator: D = A * B + D D = A * B, where the input from accumulator D is disabled.

Supported shapes:

|--------------|--------------|------------|--------------|---------------|
|              |              |            |              |f16+=e4m3*e4m3 |
|              |              |            |              |f16+=e5m2*e5m2 |
|f32+=tf32*tf32|f16+=f16 *f16 | s32+=s8*s8 |s32 += b1 * b1|f16+=e5m2*e4m3 |
|              |f32+=f16 *f16 | s32+=u8*u8 |              |f16+=e4m3*e5m2 |
|              |f32+=bf16*bf16| s32+=u8*u8 |              |f16+=e4m3*e5m2 |
|              |f32+=bf16*bf16| s32+=s8*u8 |              |f32+=e4m3*e4m3 |
|              |              | s32+=u8*s8 |              |f32+=e5m2*e5m2 |
|              |              |            |              |f32+=e4m3*e5m2 |
|              |              |            |              |f32+=e4m3*e5m2 |
|--------------|--------------|------------|--------------|---------------|
|   .m64n8k8   |  .m64n8k16   | .m64n8k32  | .m64n8k256   | .m64n8k32     |
|   .m64n16k8  |  .m64n16k16  | .m64n16k32 | .m64n16k256  | .m64n16k32    |
|   .m64n24k8  |  .m64n24k16  | .m64n24k32 | .m64n24k256  | .m64n24k32    |
|   .m64n32k8  |  .m64n32k16  | .m64n32k32 | .m64n32k256  | .m64n32k32    |
|   .m64n40k8  |  .m64n40k16  | .m64n48k32 | .m64n48k256  | .m64n40k32    |
|   .m64n48k8  |  .m64n48k16  | .m64n64k32 | .m64n64k256  | .m64n48k32    |
|   .m64n56k8  |  .m64n56k16  | .m64n80k32 | .m64n80k256  | .m64n56k32    |
|   .m64n64k8  |  .m64n64k16  | .m64n96k32 | .m64n96k256  | .m64n64k32    |
|   .m64n72k8  |  .m64n72k16  | .m64n112k32| .m64n112k256 | .m64n72k32    |
|   .m64n80k8  |  .m64n80k16  | .m64n128k32| .m64n128k256 | .m64n80k32    |
|   .m64n88k8  |  .m64n88k16  | .m64n144k32| .m64n144k256 | .m64n88k32    |
|   .m64n96k8  |  .m64n96k16  | .m64n160k32| .m64n160k256 | .m64n96k32    |
|   .m64n104k8 |  .m64n104k16 | .m64n176k32| .m64n176k256 | .m64n104k32   |
|   .m64n112k8 |  .m64n112k16 | .m64n192k32| .m64n192k256 | .m64n112k32   |
|   .m64n120k8 |  .m64n120k16 | .m64n208k32| .m64n208k256 | .m64n120k32   |
|   .m64n128k8 |  .m64n128k16 | .m64n224k32| .m64n224k256 | .m64n128k32   |
|   .m64n136k8 |  .m64n136k16 | .m64n240k32| .m64n240k256 | .m64n136k32   |
|   .m64n144k8 |  .m64n144k16 | .m64n256k32| .m64n256k256 | .m64n144k32   |
|   .m64n152k8 |  .m64n152k16 |            |              | .m64n152k32   |
|   .m64n160k8 |  .m64n160k16 |            |              | .m64n160k32   |
|   .m64n168k8 |  .m64n168k16 |            |              | .m64n168k32   |
|   .m64n176k8 |  .m64n176k16 |            |              | .m64n176k32   |
|   .m64n184k8 |  .m64n184k16 |            |              | .m64n184k32   |
|   .m64n192k8 |  .m64n192k16 |            |              | .m64n192k32   |
|   .m64n200k8 |  .m64n200k16 |            |              | .m64n200k32   |
|   .m64n208k8 |  .m64n208k16 |            |              | .m64n208k32   |
|   .m64n216k8 |  .m64n216k16 |            |              | .m64n216k32   |
|   .m64n224k8 |  .m64n224k16 |            |              | .m64n224k32   |
|   .m64n232k8 |  .m64n232k16 |            |              | .m64n232k32   |
|   .m64n240k8 |  .m64n240k16 |            |              | .m64n240k32   |
|   .m64n248k8 |  .m64n248k16 |            |              | .m64n248k32   |
|   .m64n256k8 |  .m64n256k16 |            |              | .m64n256k32   |
|--------------|--------------|------------|--------------|---------------|

For more information, see PTX ISA

Interfaces: BasicPtxBuilderInterface

Attributes: 

AttributeMLIR TypeDescription
shape::mlir::NVVM::MMAShapeAttrAttribute for MMA operation shape.
typeA::mlir::NVVM::WGMMATypesAttrNVVM WGMMA types
typeB::mlir::NVVM::WGMMATypesAttrNVVM WGMMA types
typeD::mlir::NVVM::WGMMATypesAttrNVVM WGMMA types
scaleD::mlir::NVVM::WGMMAScaleOutAttrWGMMA input predicate
scaleA::mlir::NVVM::WGMMAScaleInAttrWGMMA overflow options
scaleB::mlir::NVVM::WGMMAScaleInAttrWGMMA overflow options
layoutA::mlir::NVVM::MMALayoutAttrNVVM MMA layout
layoutB::mlir::NVVM::MMALayoutAttrNVVM MMA layout
satfinite::mlir::NVVM::MMAIntOverflowAttrMMA overflow options

Operands: 

OperandDescription
inoutsLLVM structure type
descriptorA64-bit signless integer
descriptorB64-bit signless integer

Results: 

ResultDescription
resultsLLVM structure type

nvvm.wgmma.wait.group.sync.aligned (NVVM::WgmmaWaitGroupSyncOp) 

Syntax:

operation ::= `nvvm.wgmma.wait.group.sync.aligned` attr-dict $group

Signal the completion of a preceding warpgroup operation.

For more information, see PTX ISA

Traits: NVVMRequiresSMa<90>

Attributes: 

AttributeMLIR TypeDescription
group::mlir::IntegerAttr64-bit signless integer attribute