MLIR  16.0.0git
mlir::nvgpu Namespace Reference

Classes

struct  FragmentElementInfo
 Specifies information about the registers which compose a matrix fragment according to the PTX documentation.
 
struct  LdMatrixParams
 
struct  PrepareContractToGPUMMASync
 
struct  WarpMatrixInfo
 Collects information about a warp-level matrix operand represented by a VectorType.
 

Enumerations

enum  MmaSyncF32Lowering { MmaSyncF32Lowering::TF32 = 0, MmaSyncF32Lowering::TF32x3 = 1, MmaSyncF32Lowering::Unkown = 2 }
 Enum to control the lowering of nvgpu.mma.sync.
 
enum  MatMulOperandRole : int32_t { MatMulOperandRole::A = 0, MatMulOperandRole::B, MatMulOperandRole::C }
 

Functions

std::unique_ptr< Pass > createOptimizeSharedMemoryPass ()
 Create a pass to optimize shared memory reads and writes.
 
mlir::LogicalResult optimizeSharedMemoryReadsAndWrites (Operation *parentOp, Value memrefValue)
 Optimizes vectorized accesses to a shared memory buffer specified by memrefValue.
 
void populateMmaSyncF32ToTF32Patterns (RewritePatternSet &patterns, nvgpu::MmaSyncF32Lowering precision=nvgpu::MmaSyncF32Lowering::TF32)
 Collect patterns to convert mma.sync on f32 inputs to use tensor cores with a user-provided level of accuracy: (a) tf32 (1 mma.sync per warp-level matrix-multiply-accumulate) or (b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate). Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits.
 
FailureOr< WarpMatrixInfo > getWarpMatrixInfo (Operation *op)
 Given an op that operates on a VectorType representing a warp-level matrix operand, the function returns a struct containing relevant type information.
 
int64_t inferTileWidthInBits (const WarpMatrixInfo &type)
 Returns the number of bits in a single tile row.
 
FailureOr< FragmentElementInfo > getMmaSyncRegisterType (const WarpMatrixInfo &type)
 Returns a FragmentElementInfo struct describing the register types for the given matrix fragment type.
 
static AffineMap getRegisterIndexToTileOffsetMap (int64_t lineSize, Type elementType, ArrayRef< int64_t > operandShape, bool isAccumulator, int64_t elementsPerRegister, AffineExpr logicalValueId)
 
FailureOr< AffineMap > getLaneIdAndValueIdToOperandCoord (Location loc, OpBuilder &builder, const WarpMatrixInfo &fragmentType)
 Returns an AffineMap which takes two dimensions representing (laneId, logicalValueId) and returns two results representing offsets within a matrix operand.
 
FailureOr< nvgpu::LdMatrixParams > getLdMatrixParams (const WarpMatrixInfo &type, bool transpose)
 
FailureOr< AffineMap > getLaneIdToLdMatrixMatrixCoord (Location loc, OpBuilder &builder, const LdMatrixParams &params)
 Returns an AffineMap which maps a single dimension representing the laneId to two results representing offsets within the matrix operand; these are the pointer locations a thread should pass to the ldmatrix instruction.
 

Enumeration Type Documentation

◆ MatMulOperandRole

enum mlir::nvgpu::MatMulOperandRole : int32_t (scoped enum)

Enumerator
A
B
C

Definition at line 27 of file NvGpuSupport.h.

◆ MmaSyncF32Lowering

enum mlir::nvgpu::MmaSyncF32Lowering (scoped enum)

Enum to control the lowering of nvgpu.mma.sync.

Enumerator
TF32 
TF32x3 
Unkown 

Definition at line 56 of file Transforms.h.

Function Documentation

◆ createOptimizeSharedMemoryPass()

std::unique_ptr< Pass > mlir::nvgpu::createOptimizeSharedMemoryPass ( )

Create a pass to optimize shared memory reads and writes.

Definition at line 278 of file OptimizeSharedMemory.cpp.
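
For illustration, a minimal sketch of scheduling the pass from C++ (the include paths and the choice to nest it on func.func are assumptions, not taken from this page):

  #include "mlir/Dialect/Func/IR/FuncOps.h"
  #include "mlir/Pass/PassManager.h"
  // ...plus the NVGPU transforms header that declares createOptimizeSharedMemoryPass().

  void addSharedMemoryOptimization(mlir::PassManager &pm) {
    // Assumption: the pass can be anchored on func.func; adjust to whatever
    // operation your pipeline nests GPU-related passes on.
    pm.addNestedPass<mlir::func::FuncOp>(
        mlir::nvgpu::createOptimizeSharedMemoryPass());
  }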

◆ getLaneIdAndValueIdToOperandCoord()

FailureOr< AffineMap > mlir::nvgpu::getLaneIdAndValueIdToOperandCoord (Location loc, OpBuilder &builder, const WarpMatrixInfo &fragmentType)

Returns an AffineMap which takes two dimensions representing (laneId, logicalValueId) and returns two results representing offsets within a matrix operand.

The offsets point to the values the thread is responsible for (AKA the matrix fragment values) during a warp-collective matrix operation. For a visual reference of this LaneId -> (row, col) mapping, please see NVIDIA's PTX documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-mma

Definition at line 173 of file NvGpuSupport.cpp.

References mlir::bindDims(), mlir::failed(), mlir::failure(), mlir::AffineExpr::floorDiv(), mlir::AffineMap::get(), mlir::Builder::getContext(), mlir::Type::getIntOrFloatBitWidth(), getMmaSyncRegisterType(), getRegisterIndexToTileOffsetMap(), mlir::AffineMap::getResult(), inferTileWidthInBits(), mlir::nvgpu::WarpMatrixInfo::operandRole, and mlir::nvgpu::WarpMatrixInfo::vectorType.

Referenced by convertTransferWriteToStores(), and createNonLdMatrixLoads().
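
A rough sketch of how this helper composes with getWarpMatrixInfo (assumed usage, not taken from the MLIR sources; op and builder are placeholders):

  // `op` operates on a warp-level matrix operand (e.g. a vector transfer).
  FailureOr<nvgpu::WarpMatrixInfo> info = nvgpu::getWarpMatrixInfo(op);
  if (failed(info))
    return failure();

  // Map (laneId, logicalValueId) -> (row, col) offsets within the operand.
  FailureOr<AffineMap> coordMap =
      nvgpu::getLaneIdAndValueIdToOperandCoord(op->getLoc(), builder, *info);
  if (failed(coordMap))
    return failure();
  // The two results of *coordMap can then be materialized (e.g. with
  // affine.apply) to compute per-thread element indices.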

◆ getLaneIdToLdMatrixMatrixCoord()

FailureOr< AffineMap > mlir::nvgpu::getLaneIdToLdMatrixMatrixCoord (Location loc, OpBuilder &builder, const LdMatrixParams &params)

Returns an AffineMap which maps a single dimension representing the laneId to two results representing offsets within the matrix operand; these are the pointer locations a thread should pass to the ldmatrix instruction.

Definition at line 238 of file NvGpuSupport.cpp.

References mlir::nvgpu::LdMatrixParams::contiguousDimType, mlir::failure(), mlir::AffineExpr::floorDiv(), mlir::nvgpu::LdMatrixParams::fragmentType, mlir::AffineMap::get(), mlir::getAffineDimExpr(), and mlir::Builder::getContext().

Referenced by creatLdMatrixCompatibleLoads().
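
A minimal sketch of the intended pairing with getLdMatrixParams (assumptions: warpInfo comes from getWarpMatrixInfo, and loc/builder are in scope):

  FailureOr<nvgpu::LdMatrixParams> params =
      nvgpu::getLdMatrixParams(*warpInfo, /*transpose=*/false);
  if (failed(params))
    return failure();

  // laneId -> (row, col) of the pointer each thread passes to nvgpu.ldmatrix.
  FailureOr<AffineMap> laneMap =
      nvgpu::getLaneIdToLdMatrixMatrixCoord(loc, builder, *params);
  if (failed(laneMap))
    return failure();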

◆ getLdMatrixParams()

FailureOr< LdMatrixParams > mlir::nvgpu::getLdMatrixParams (const WarpMatrixInfo &type, bool transpose)

◆ getMmaSyncRegisterType()

FailureOr< FragmentElementInfo > mlir::nvgpu::getMmaSyncRegisterType (const WarpMatrixInfo &type)

Returns a FragmentElementInfo struct describing the register types for the given matrix fragment type.

◆ getRegisterIndexToTileOffsetMap()

static AffineMap mlir::nvgpu::getRegisterIndexToTileOffsetMap (int64_t lineSize, Type elementType, ArrayRef< int64_t > operandShape, bool isAccumulator, int64_t elementsPerRegister, AffineExpr logicalValueId)

◆ getWarpMatrixInfo()

FailureOr< WarpMatrixInfo > mlir::nvgpu::getWarpMatrixInfo (Operation *op)

Given an op that operates on a VectorType representing a warp-level matrix operand, the function returns a struct containing relevant type information.

◆ inferTileWidthInBits()

int64_t mlir::nvgpu::inferTileWidthInBits (const WarpMatrixInfo &type)

Returns the number of bits in a single tile row.

It is either 128, 256, or 512 bits, depending on the data type and whether the operand is an accumulator/result operand.

Definition at line 87 of file NvGpuSupport.cpp.

References mlir::Type::getIntOrFloatBitWidth(), mlir::nvgpu::WarpMatrixInfo::operandRole, and mlir::nvgpu::WarpMatrixInfo::vectorType.

Referenced by convertTransferReadToLoads(), and getLaneIdAndValueIdToOperandCoord().
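
A hedged sketch of a typical use (assuming warpInfo was obtained from getWarpMatrixInfo): the tile width in bits can be converted into a count of elements of the operand's element type.

  int64_t tileWidthBits = nvgpu::inferTileWidthInBits(*warpInfo);
  Type elementType = warpInfo->vectorType.getElementType();
  int64_t elementsPerTileRow =
      tileWidthBits / elementType.getIntOrFloatBitWidth();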

◆ optimizeSharedMemoryReadsAndWrites()

mlir::LogicalResult mlir::nvgpu::optimizeSharedMemoryReadsAndWrites (Operation *parentOp, Value memrefValue)

Optimizes vectorized accesses to a shared memory buffer specified by memrefValue.

This transformation assumes the following: 1) All relevant accesses to memrefValue are contained within parentOp. 2) The function will fail precondition checks if any subviews are taken of memrefValue; all reads/writes to memrefValue should occur through memrefValue directly.

Shared memory bank conflicts occur when multiple threads attempt to read or write locations assigned to the same shared memory bank. For 2^N byte vectorized accesses, we need to be concerned with conflicts among threads identified as (tid) -> tid.floordiv(2^{7-N}). As such, this transformation changes any indexed memory access (vector.load, memref.load, nvgpu.ldmatrix, etc.) so that the final dimension's index value is permuted to newColIndex = oldColIndex % vectorSize + perm[rowIndex](oldColIndex / vectorSize, rowIndex), where rowIndex is the index of the second-to-last dimension and perm[rowIndex] is a permutation function that depends on the row index. The permutation function is chosen to ensure that sequential distributed+vectorized reads/writes down a single dimension of the memref have minimal conflicts.

Definition at line 181 of file OptimizeSharedMemory.cpp.

References mlir::Type::dyn_cast(), mlir::failed(), mlir::failure(), mlir::Operation::getContext(), getIndices(), mlir::Operation::getLoc(), mlir::detail::getMemorySpaceAsInt(), getShmReadAndWriteOps(), mlir::Value::getType(), kDefaultVectorSizeBits, kSharedMemoryLineSizeBytes, setIndices(), mlir::success(), transformIndices(), and mlir::Operation::walk().
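
For illustration, a hedged sketch of invoking the transformation directly on each shared-memory allocation in a function; the memref.alloc walk and the address-space check are assumptions made for this example, not part of the API:

  funcOp.walk([&](memref::AllocOp alloc) {
    // Assumption: shared memory is modeled here as memref address space 3.
    if (alloc.getType().getMemorySpaceAsInt() != 3)
      return;
    (void)nvgpu::optimizeSharedMemoryReadsAndWrites(funcOp, alloc.getResult());
  });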

◆ populateMmaSyncF32ToTF32Patterns()

void mlir::nvgpu::populateMmaSyncF32ToTF32Patterns (RewritePatternSet &patterns, nvgpu::MmaSyncF32Lowering precision = nvgpu::MmaSyncF32Lowering::TF32)

Collect patterns to convert mma.sync on f32 inputs to use tensor cores with a user-provided level of accuracy: (a) tf32 (1 mma.sync per warp-level matrix-multiply-accumulate) or (b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate). Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits.

While f32 has 23 precision bits, tf32 has only 10 precision bits. tf32x3 aims to recover the precision bits by splitting each operand into two tf32 values and issuing three mma.sync tensor core operations.

Definition at line 69 of file MmaSyncTF32Transform.cpp.

References mlir::RewritePatternSet::add(), and mlir::RewritePatternSet::getContext().
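
A minimal sketch of populating and applying these patterns inside a pass (assumed driver code; getContext(), getOperation(), and signalPassFailure() come from the surrounding Pass):

  RewritePatternSet patterns(&getContext());
  nvgpu::populateMmaSyncF32ToTF32Patterns(patterns,
                                          nvgpu::MmaSyncF32Lowering::TF32x3);
  if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))
    signalPassFailure();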