MLIR  19.0.0git
Classes | Enumerations | Functions
mlir::nvgpu Namespace Reference

Classes

struct  WarpMatrixInfo
 Collects information about a warp-level matrix operand represented by a VectorType. More...
 
struct  FragmentElementInfo
 Specifies information about the registers which compose a matrix fragment according to the PTX documentation. More...
 
struct  LdMatrixParams
 Encapsulates the parameters needed to lower a nvgpu.ldmatrix operation to nvvm.ldmatrix. More...
 

Enumerations

enum class  MmaSyncF32Lowering { TF32 = 0 , TF32x3 = 1 , Unkown = 2 }
 Rewrites patterns. More...
 
enum class  MatMulOperandRole : int32_t { A = 0 , B , C }
 Represents the role of an operand in an MMA instruction: result := matmul(A, B) + C More...
 

Functions

Attribute getMbarrierMemorySpace (MLIRContext *context, MBarrierGroupType barrierType)
 Returns the memory space attribute of the mbarrier object. More...
 
MemRefType getMBarrierMemrefType (MLIRContext *context, MBarrierGroupType barrierType)
 Return the memref type that can be used to represent an mbarrier object. More...
 
void registerTransformDialectExtension (DialectRegistry &registry)
 
std::unique_ptr< PasscreateOptimizeSharedMemoryPass ()
 Create a pass to optimize shared memory reads and writes. More...
 
mlir::LogicalResult optimizeSharedMemoryReadsAndWrites (Operation *parentOp, Value memrefValue)
 Passes. More...
 
void populateMmaSyncF32ToTF32Patterns (RewritePatternSet &patterns, nvgpu::MmaSyncF32Lowering precision=nvgpu::MmaSyncF32Lowering::TF32)
 Collect patterns to convert mma.sync on f32 input and rewrite to use tensor cores with user provided level of accuracy: (a) tf32 (1 mma.sync per warp-level matrix-multiply-accumulate) (b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate) Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits. More...
 
void createAsyncGroups (RewriterBase &rewriter, Operation *op, bool bypassL1)
 Convert global->shared vector transfers to async device copies. More...
 
Operation::operand_range getIndices (Operation *op)
 Get the indices that the given load/store operation is operating on. More...
 
void setIndices (Operation *op, ArrayRef< Value > indices)
 Set the indices that the given load/store operation is operating on. More...
 
Value getValueStored (Operation *op)
 Get the value that is stored by the given store operation. More...
 
Value getMemrefOperand (Operation *op)
 Get the memref that is loaded from/stored into by the given load/store operation. More...
 
FailureOr< vector::ContractionOp > getUserContract (Operation *op)
 Returns the first user of the op that is vector.contract. More...
 
FailureOr< WarpMatrixInfogetWarpMatrixInfo (Operation *op)
 If op is a vector.transfer_write, return the WarpMatrixInfo for the vector operand. More...
 
int64_t inferTileWidthInBits (const WarpMatrixInfo &type)
 Returns the number of bits in a single tile row. More...
 
FailureOr< FragmentElementInfogetMmaSyncRegisterType (const WarpMatrixInfo &type)
 Returns a FragmentElementInfo struct describing the register types for the given matrix fragment type. More...
 
FailureOr< AffineMapgetLaneIdAndValueIdToOperandCoord (OpBuilder &builder, Location loc, const WarpMatrixInfo &fragmentType)
 Returns an AffineMap which maps a two dimensions representing (laneId, logicalValueId) and returns two results representing offsets within a matrix operand. More...
 
FailureOr< LdMatrixParamsgetLdMatrixParams (const WarpMatrixInfo &type, bool transpose)
 Given type that contains info for a warp-matrix operand and whether or not the load is a transposed load, return the LdMatrixParams. More...
 
FailureOr< AffineMapgetLaneIdToLdMatrixMatrixCoord (OpBuilder &builder, Location loc, const LdMatrixParams &params)
 Returns an AffineMap which maps a single dimension representing the laneId to two results representing offsets within the matrix operand that should be the pointer locations a thread should pass to the ldmatrix instruction. More...
 
bool canLowerToWarpMatrixOperation (vector::TransferReadOp op)
 Returns whether the vector.transfer_read instruction can be interpreted as a warp-level cooperative matrix load operation. More...
 
bool canLowerToWarpMatrixOperation (vector::TransferWriteOp op)
 Returns whether the vector.transfer_write instruction can be interpreted as a warp-level cooperative matrix store operation. More...
 

Enumeration Type Documentation

◆ MatMulOperandRole

enum mlir::nvgpu::MatMulOperandRole : int32_t
strong

Represents the role of an operand in an MMA instruction: result := matmul(A, B) + C

Enumerator

Definition at line 26 of file MMAUtils.h.

◆ MmaSyncF32Lowering

Rewrites patterns.

Enum to control the lowering of nvgpu.mmasync.

Enumerator
TF32 
TF32x3 
Unkown 

Definition at line 58 of file Transforms.h.

Function Documentation

◆ canLowerToWarpMatrixOperation() [1/2]

bool mlir::nvgpu::canLowerToWarpMatrixOperation ( vector::TransferReadOp  op)

Returns whether the vector.transfer_read instruction can be interpreted as a warp-level cooperative matrix load operation.

This function is meant to be used to establish whether op is part of a chain of such warp-level operations.

Definition at line 276 of file MMAUtils.cpp.

◆ canLowerToWarpMatrixOperation() [2/2]

bool mlir::nvgpu::canLowerToWarpMatrixOperation ( vector::TransferWriteOp  op)

Returns whether the vector.transfer_write instruction can be interpreted as a warp-level cooperative matrix store operation.

This function is meant to be used to establish whether op is part of a chain of such warp-level operations.

Definition at line 303 of file MMAUtils.cpp.

◆ createAsyncGroups()

void mlir::nvgpu::createAsyncGroups ( RewriterBase rewriter,
Operation op,
bool  bypassL1 
)

Convert global->shared vector transfers to async device copies.

This function looks for suitable vector transfers within the specified op and converts them to "nvgpu.device_async_copy" ops. Consecutive copies are put into the same sync group. If bypassL1 is set, the "bypassL1" attribute is set for suitable (i.e., transfer size 16 bytes) transfers.

Definition at line 153 of file CreateAsyncGroups.cpp.

◆ createOptimizeSharedMemoryPass()

std::unique_ptr< Pass > mlir::nvgpu::createOptimizeSharedMemoryPass ( )

Create a pass to optimize shared memory reads and writes.

Definition at line 242 of file OptimizeSharedMemory.cpp.

◆ getIndices()

Operation::operand_range mlir::nvgpu::getIndices ( Operation op)

Get the indices that the given load/store operation is operating on.

Definition at line 18 of file Utils.cpp.

Referenced by optimizeSharedMemoryReadsAndWrites().

◆ getLaneIdAndValueIdToOperandCoord()

FailureOr< AffineMap > mlir::nvgpu::getLaneIdAndValueIdToOperandCoord ( OpBuilder builder,
Location  loc,
const WarpMatrixInfo fragmentType 
)

Returns an AffineMap which maps a two dimensions representing (laneId, logicalValueId) and returns two results representing offsets within a matrix operand.

The offsets point to the values the thread is responsible for (AKA the matrix fragment values) during a warp-collective matrix operation. For a visual reference of this LaneId -> (row, col) mapping, please see NVIDIA's PTX documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-mma

Definition at line 173 of file MMAUtils.cpp.

References mlir::bindDims(), mlir::failed(), mlir::failure(), mlir::AffineExpr::floorDiv(), mlir::AffineMap::get(), mlir::Builder::getContext(), mlir::Type::getIntOrFloatBitWidth(), getMmaSyncRegisterType(), getRegisterIndexToTileOffsetMap(), mlir::AffineMap::getResult(), inferTileWidthInBits(), isAccumulatorOrResult(), kThreadsPerRow, mlir::nvgpu::WarpMatrixInfo::operandRole, and mlir::nvgpu::WarpMatrixInfo::vectorType.

◆ getLaneIdToLdMatrixMatrixCoord()

FailureOr< AffineMap > mlir::nvgpu::getLaneIdToLdMatrixMatrixCoord ( OpBuilder builder,
Location  loc,
const LdMatrixParams params 
)

Returns an AffineMap which maps a single dimension representing the laneId to two results representing offsets within the matrix operand that should be the pointer locations a thread should pass to the ldmatrix instruction.

Definition at line 238 of file MMAUtils.cpp.

References mlir::nvgpu::LdMatrixParams::contiguousDimType, mlir::failure(), mlir::AffineExpr::floorDiv(), mlir::nvgpu::LdMatrixParams::fragmentType, mlir::AffineMap::get(), mlir::getAffineDimExpr(), and mlir::Builder::getContext().

◆ getLdMatrixParams()

FailureOr< nvgpu::LdMatrixParams > mlir::nvgpu::getLdMatrixParams ( const WarpMatrixInfo type,
bool  transpose 
)

◆ getMbarrierMemorySpace()

Attribute mlir::nvgpu::getMbarrierMemorySpace ( MLIRContext context,
MBarrierGroupType  barrierType 
)

Returns the memory space attribute of the mbarrier object.

◆ getMBarrierMemrefType()

MemRefType mlir::nvgpu::getMBarrierMemrefType ( MLIRContext context,
MBarrierGroupType  barrierType 
)

Return the memref type that can be used to represent an mbarrier object.

◆ getMemrefOperand()

Value mlir::nvgpu::getMemrefOperand ( Operation op)

Get the memref that is loaded from/stored into by the given load/store operation.

Definition at line 68 of file Utils.cpp.

◆ getMmaSyncRegisterType()

FailureOr< FragmentElementInfo > mlir::nvgpu::getMmaSyncRegisterType ( const WarpMatrixInfo type)

◆ getUserContract()

FailureOr< vector::ContractionOp > mlir::nvgpu::getUserContract ( Operation op)

Returns the first user of the op that is vector.contract.

If no vector.contract user exists, return failure.

Definition at line 50 of file MMAUtils.cpp.

◆ getValueStored()

Value mlir::nvgpu::getValueStored ( Operation op)

Get the value that is stored by the given store operation.

Definition at line 58 of file Utils.cpp.

◆ getWarpMatrixInfo()

FailureOr< WarpMatrixInfo > mlir::nvgpu::getWarpMatrixInfo ( Operation op)

If op is a vector.transfer_write, return the WarpMatrixInfo for the vector operand.

If op is a vector.transfer_read, vector.contraction, or arith.constant, return the WarpMatrixInfo corresponding to the result. Otherwise, return failure.

Definition at line 58 of file MMAUtils.cpp.

◆ inferTileWidthInBits()

int64_t mlir::nvgpu::inferTileWidthInBits ( const WarpMatrixInfo type)

Returns the number of bits in a single tile row.

It is either 128, 256, or 512 bits depending on the data type and` whether the operand is an accumulator/result operand

Definition at line 87 of file MMAUtils.cpp.

References mlir::Type::getIntOrFloatBitWidth(), isAccumulatorOrResult(), mlir::nvgpu::WarpMatrixInfo::operandRole, and mlir::nvgpu::WarpMatrixInfo::vectorType.

Referenced by getLaneIdAndValueIdToOperandCoord(), and inferNumRegistersPerMatrixFragment().

◆ optimizeSharedMemoryReadsAndWrites()

mlir::LogicalResult mlir::nvgpu::optimizeSharedMemoryReadsAndWrites ( Operation parentOp,
Value  memrefValue 
)

Passes.

Optimizes vectorized accesses to a shared memory buffer specified by memrefValue. This transformation assumes the following: 1) All relevant accesses to memrefValue are contained with parentOp. 2) The function will fail precondition checks if any subviews are taken of memrefValue. All reads/writes to memrefValue should occur through memrefValue directly.

Shared memory bank conflicts occur when multiple threads attempt to read or write locations assigned to the same shared memory bank. For 2^N byte vectorized accesses, we need to be concerned with conflicts among threads identified as (tid) -> tid.floordiv(2^{7-N}). As such, this transformation changes any indexed memory access (vector.load, memref.load, nvgpu.ldmatrix, etc) such that the final dimension's index value is permuted such that newColIndex = oldColIndex % vectorSize + perm[rowIndex](oldColIndex/vectorSize, rowIndex) where rowIndex is the index for the second-to last dimension and perm[rowIndex] is a permutation function that depends on the row Index. The permutation function is chosen to ensure that sequential distributed+vectorized reads/writes down a single dimension of the memref have minimal conflicts.

Definition at line 150 of file OptimizeSharedMemory.cpp.

References mlir::failed(), mlir::failure(), mlir::Operation::getContext(), getIndices(), mlir::Operation::getLoc(), getShmReadAndWriteOps(), mlir::Value::getType(), kDefaultVectorSizeBits, kSharedMemoryLineSizeBytes, setIndices(), mlir::OpBuilder::setInsertionPoint(), mlir::success(), transformIndices(), and mlir::Operation::walk().

◆ populateMmaSyncF32ToTF32Patterns()

void mlir::nvgpu::populateMmaSyncF32ToTF32Patterns ( RewritePatternSet patterns,
nvgpu::MmaSyncF32Lowering  precision = nvgpu::MmaSyncF32Lowering::TF32 
)

Collect patterns to convert mma.sync on f32 input and rewrite to use tensor cores with user provided level of accuracy: (a) tf32 (1 mma.sync per warp-level matrix-multiply-accumulate) (b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate) Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits.

While f32 has 23 precision bits, tf32 has only 10 precision bits. tf32x3 aims to recover the precision bits by spliting each operand into two tf32 values and issue three mma.sync tensor core operations.

Definition at line 71 of file MmaSyncTF32Transform.cpp.

References mlir::RewritePatternSet::add(), and mlir::RewritePatternSet::getContext().

◆ registerTransformDialectExtension()

void mlir::nvgpu::registerTransformDialectExtension ( DialectRegistry registry)

◆ setIndices()

void mlir::nvgpu::setIndices ( Operation op,
ArrayRef< Value indices 
)

Set the indices that the given load/store operation is operating on.

Definition at line 38 of file Utils.cpp.

Referenced by optimizeSharedMemoryReadsAndWrites().