MLIR 22.0.0git
mlir::nvgpu Namespace Reference

Namespaces

namespace  impl

Classes

struct  WarpMatrixInfo
 Collects information about a warp-level matrix operand represented by a VectorType. More...

Enumerations

enum class  MmaSyncF32Lowering { TF32 = 0 , TF32x3 = 1 , Unkown = 2 }
 Rewrites patterns. More...
enum class  MatMulOperandRole : int32_t { A = 0 , B , C }
 Represents the role of an operand in an MMA instruction: result := matmul(A, B) + C More...

Functions

Attribute getMbarrierMemorySpace (MLIRContext *context, MBarrierGroupType barrierType)
 Returns the memory space attribute of the mbarrier object.
MemRefType getMBarrierMemrefType (MLIRContext *context, MBarrierGroupType barrierType)
 Return the memref type that can be used to represent an mbarrier object.
void registerTransformDialectExtension (DialectRegistry &registry)
std::unique_ptr< PasscreateOptimizeSharedMemoryPass ()
 Create a pass to optimize shared memory reads and writes.
llvm::LogicalResult optimizeSharedMemoryReadsAndWrites (Operation *parentOp, Value memrefValue)
 Passes.
void populateMmaSyncF32ToTF32Patterns (RewritePatternSet &patterns, nvgpu::MmaSyncF32Lowering precision=nvgpu::MmaSyncF32Lowering::TF32)
 Collect patterns to convert mma.sync on f32 input and rewrite to use tensor cores with user provided level of accuracy: (a) tf32 (1 mma.sync per warp-level matrix-multiply-accumulate) (b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate) Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits.
void createAsyncGroups (RewriterBase &rewriter, Operation *op, bool bypassL1)
 Convert global->shared vector transfers to async device copies.
Operation::operand_range getIndices (Operation *op)
 Get the indices that the given load/store operation is operating on.
void setIndices (Operation *op, ArrayRef< Value > indices)
 Set the indices that the given load/store operation is operating on.
Value getValueStored (Operation *op)
 Get the value that is stored by the given store operation.
Value getMemrefOperand (Operation *op)
 Get the memref that is loaded from/stored into by the given load/store operation.
FailureOr< vector::ContractionOp > getUserContract (Operation *op)
 Returns the first user of the op that is vector.contract.
FailureOr< WarpMatrixInfogetWarpMatrixInfo (Operation *op)
 If op is a vector.transfer_write, return the WarpMatrixInfo for the vector operand.
bool canLowerToWarpMatrixOperation (vector::TransferWriteOp op)
 Returns the number of bits in a single tile row.
void registerOptimizeSharedMemory ()
void registerOptimizeSharedMemoryPass ()
void registerNVGPUPasses ()

Enumeration Type Documentation

◆ MatMulOperandRole

enum class mlir::nvgpu::MatMulOperandRole : int32_t
strong

Represents the role of an operand in an MMA instruction: result := matmul(A, B) + C

Enumerator

Definition at line 26 of file MMAUtils.h.

◆ MmaSyncF32Lowering

Rewrites patterns.

Enum to control the lowering of nvgpu.mmasync.

Enumerator
TF32 
TF32x3 
Unkown 

Definition at line 57 of file Transforms.h.

Function Documentation

◆ canLowerToWarpMatrixOperation()

bool mlir::nvgpu::canLowerToWarpMatrixOperation ( vector::TransferWriteOp op)

Returns the number of bits in a single tile row.

It is either 128, 256, or 512 bits depending on the data type and` whether the operand is an / accumulator/result operand int64_t inferTileWidthInBits(const WarpMatrixInfo &type);

/ Specifies information about the registers which compose a matrix fragment / according to the PTX documentation. struct FragmentElementInfo { Type registerLLVMType; int64_t elementsPerRegister; int64_t registerWidthBits; int64_t numRegistersPerFragment; };

/ Returns a FragmentElementInfo struct describing the register types for the / given matrix fragment type. FailureOr<FragmentElementInfo> getMmaSyncRegisterType(const WarpMatrixInfo &type);

/ Returns an AffineMap which maps a two dimensions representing (laneId, / logicalValueId) and returns two results representing offsets within a / matrix operand. The offsets point to the values the thread is responsible / for (AKA the matrix fragment values) during a warp-collective matrix / operation. For a visual reference of this LaneId -> (row, col) mapping, / please see NVIDIA's PTX documentation: / https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-mma FailureOr<AffineMap> getLaneIdAndValueIdToOperandCoord(OpBuilder &builder, Location loc, const WarpMatrixInfo &fragmentType);

/ Encapsulates the parameters needed to lower a nvgpu.ldmatrix operation to / nvvm.ldmatrix. struct LdMatrixParams { VectorType fragmentType; bool isAccum; int64_t numTiles; vector::IteratorType contiguousDimType; NVVM::MMALayout targetLayout; };

/ Given type that contains info for a warp-matrix operand and whether or not / the load is a transposed load, return the LdMatrixParams. FailureOr<LdMatrixParams> getLdMatrixParams(const WarpMatrixInfo &type, bool transpose); / Returns an AffineMap which maps a single dimension representing the laneId / to two results representing offsets within the matrix operand that should / be the pointer locations a thread should pass to the ldmatrix instruction. FailureOr<AffineMap> getLaneIdToLdMatrixMatrixCoord(OpBuilder &builder, Location loc, const LdMatrixParams &params);

/ Returns whether the vector.transfer_read instruction can be interpreted / as a warp-level cooperative matrix load operation. This function is meant to / be used to establish whether op is part of a chain of such warp-level / operations. bool canLowerToWarpMatrixOperation(vector::TransferReadOp op);

/ Returns whether the vector.transfer_write instruction can be interpreted / as a warp-level cooperative matrix store operation. This function is meant / to be used to establish whether op is part of a chain of such warp-level operations.

Definition at line 296 of file MMAUtils.cpp.

Referenced by supportsMMaMatrixType().

◆ createAsyncGroups()

void mlir::nvgpu::createAsyncGroups ( RewriterBase & rewriter,
Operation * op,
bool bypassL1 )

Convert global->shared vector transfers to async device copies.

This function looks for suitable vector transfers within the specified op and converts them to "nvgpu.device_async_copy" ops. Consecutive copies are put into the same sync group. If bypassL1 is set, the "bypassL1" attribute is set for suitable (i.e., transfer size 16 bytes) transfers.

Definition at line 152 of file CreateAsyncGroups.cpp.

References buildNumReadElements(), mlir::RewriterBase::eraseOp(), mlir::getConstantIntValue(), mlir::Operation::getContext(), mlir::Value::getDefiningOp(), mlir::Builder::getIndexAttr(), getIndices(), mlir::Operation::getLoc(), getMaskOp(), getMemrefOperand(), mlir::Value::getType(), mlir::Builder::getUnitAttr(), getValueStored(), mlir::Operation::hasTrait(), isContiguousRead(), isContiguousStore(), mlir::Operation::remove(), resultsInSupportedAsyncCopy(), mlir::OpBuilder::setInsertionPoint(), and mlir::Operation::walk().

◆ createOptimizeSharedMemoryPass()

std::unique_ptr< Pass > mlir::nvgpu::createOptimizeSharedMemoryPass ( )

Create a pass to optimize shared memory reads and writes.

Definition at line 245 of file OptimizeSharedMemory.cpp.

◆ getIndices()

Operation::operand_range mlir::nvgpu::getIndices ( Operation * op)

Get the indices that the given load/store operation is operating on.

Definition at line 18 of file Utils.cpp.

Referenced by createAsyncGroups(), getShmReadAndWriteOps(), and optimizeSharedMemoryReadsAndWrites().

◆ getMbarrierMemorySpace()

Attribute mlir::nvgpu::getMbarrierMemorySpace ( MLIRContext * context,
MBarrierGroupType barrierType )

Returns the memory space attribute of the mbarrier object.

◆ getMBarrierMemrefType()

MemRefType mlir::nvgpu::getMBarrierMemrefType ( MLIRContext * context,
MBarrierGroupType barrierType )

Return the memref type that can be used to represent an mbarrier object.

◆ getMemrefOperand()

Value mlir::nvgpu::getMemrefOperand ( Operation * op)

Get the memref that is loaded from/stored into by the given load/store operation.

Definition at line 68 of file Utils.cpp.

Referenced by createAsyncGroups(), and isContiguousXferOp().

◆ getUserContract()

FailureOr< vector::ContractionOp > mlir::nvgpu::getUserContract ( Operation * op)

Returns the first user of the op that is vector.contract.

If no vector.contract user exists, return failure.

Definition at line 48 of file MMAUtils.cpp.

References mlir::Operation::getUsers().

Referenced by extractStridedSliceSupportsMMAMatrixType(), and getWarpMatrixInfo().

◆ getValueStored()

Value mlir::nvgpu::getValueStored ( Operation * op)

Get the value that is stored by the given store operation.

Definition at line 58 of file Utils.cpp.

Referenced by createAsyncGroups().

◆ getWarpMatrixInfo()

FailureOr< WarpMatrixInfo > mlir::nvgpu::getWarpMatrixInfo ( Operation * op)

If op is a vector.transfer_write, return the WarpMatrixInfo for the vector operand.

If op is a vector.transfer_read, vector.contraction, or arith.constant, return the WarpMatrixInfo corresponding to the result. Otherwise, return failure.

Definition at line 56 of file MMAUtils.cpp.

References A, B, C, mlir::Operation::emitError(), mlir::Operation::getResult(), mlir::Value::getType(), getUserContract(), mlir::nvgpu::WarpMatrixInfo::operandRole, and mlir::nvgpu::WarpMatrixInfo::vectorType.

Referenced by convertConstantOpMmaSync(), convertExtractStridedSlice(), convertTransferReadToLoads(), convertTransferWriteToStores(), createNonLdMatrixLoads(), creatLdMatrixCompatibleLoads(), and extractStridedSliceSupportsMMAMatrixType().

◆ optimizeSharedMemoryReadsAndWrites()

llvm::LogicalResult mlir::nvgpu::optimizeSharedMemoryReadsAndWrites ( Operation * parentOp,
Value memrefValue )

Passes.

Optimizes vectorized accesses to a shared memory buffer specified by memrefValue. This transformation assumes the following: 1) All relevant accesses to memrefValue are contained with parentOp. 2) The function will fail precondition checks if any subviews are taken of memrefValue. All reads/writes to memrefValue should occur through memrefValue directly.

Shared memory bank conflicts occur when multiple threads attempt to read or write locations assigned to the same shared memory bank. For 2^N byte vectorized accesses, we need to be concerned with conflicts among threads identified as (tid) -> tid.floordiv(2^{7-N}). As such, this transformation changes any indexed memory access (vector.load, memref.load, nvgpu.ldmatrix, etc) such that the final dimension's index value is permuted such that newColIndex = oldColIndex % vectorSize + / perm[rowIndex](oldColIndex/vectorSize, rowIndex) where rowIndex is the index for the second-to last dimension and perm[rowIndex] is a permutation function that depends on the row Index. The permutation function is chosen to ensure that sequential distributed+vectorized reads/writes down a single dimension of the memref have minimal conflicts.

Definition at line 149 of file OptimizeSharedMemory.cpp.

References mlir::Operation::getContext(), getIndices(), mlir::Operation::getLoc(), getShmReadAndWriteOps(), mlir::Value::getType(), indices, kDefaultVectorSizeBits, kSharedMemoryLineSizeBytes, setIndices(), mlir::OpBuilder::setInsertionPoint(), success(), transformIndices(), and mlir::Operation::walk().

◆ populateMmaSyncF32ToTF32Patterns()

void mlir::nvgpu::populateMmaSyncF32ToTF32Patterns ( RewritePatternSet & patterns,
nvgpu::MmaSyncF32Lowering precision = nvgpu::MmaSyncF32Lowering::TF32 )

Collect patterns to convert mma.sync on f32 input and rewrite to use tensor cores with user provided level of accuracy: (a) tf32 (1 mma.sync per warp-level matrix-multiply-accumulate) (b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate) Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits.

While f32 has 23 precision bits, tf32 has only 10 precision bits. tf32x3 aims to recover the precision bits by spliting each operand into two tf32 values and issue three mma.sync tensor core operations.

Definition at line 65 of file MmaSyncTF32Transform.cpp.

References mlir::patterns.

◆ registerNVGPUPasses()

void mlir::nvgpu::registerNVGPUPasses ( )
inline

Definition at line 103 of file OptimizeSharedMemory.cpp.

References indices, and permuteVectorOffset().

◆ registerOptimizeSharedMemory()

void mlir::nvgpu::registerOptimizeSharedMemory ( )
inline

Definition at line 82 of file OptimizeSharedMemory.cpp.

References b, and mlir::arith::ConstantIndexOp::create().

◆ registerOptimizeSharedMemoryPass()

void mlir::nvgpu::registerOptimizeSharedMemoryPass ( )
inline

Definition at line 89 of file OptimizeSharedMemory.cpp.

References b.

◆ registerTransformDialectExtension()

void mlir::nvgpu::registerTransformDialectExtension ( DialectRegistry & registry)

◆ setIndices()

void mlir::nvgpu::setIndices ( Operation * op,
ArrayRef< Value > indices )

Set the indices that the given load/store operation is operating on.

Definition at line 38 of file Utils.cpp.

References indices.

Referenced by optimizeSharedMemoryReadsAndWrites().