MLIR
20.0.0git
|
Classes | |
struct | WarpMatrixInfo |
Collects information about a warp-level matrix operand represented by a VectorType. More... | |
struct | FragmentElementInfo |
Specifies information about the registers which compose a matrix fragment according to the PTX documentation. More... | |
struct | LdMatrixParams |
Encapsulates the parameters needed to lower a nvgpu.ldmatrix operation to nvvm.ldmatrix . More... | |
Enumerations | |
enum class | MmaSyncF32Lowering { TF32 = 0 , TF32x3 = 1 , Unkown = 2 } |
Rewrites patterns. More... | |
enum class | MatMulOperandRole : int32_t { A = 0 , B , C } |
Represents the role of an operand in an MMA instruction: result := matmul(A, B) + C More... | |
Functions | |
Attribute | getMbarrierMemorySpace (MLIRContext *context, MBarrierGroupType barrierType) |
Returns the memory space attribute of the mbarrier object. More... | |
MemRefType | getMBarrierMemrefType (MLIRContext *context, MBarrierGroupType barrierType) |
Return the memref type that can be used to represent an mbarrier object. More... | |
void | registerTransformDialectExtension (DialectRegistry ®istry) |
std::unique_ptr< Pass > | createOptimizeSharedMemoryPass () |
Create a pass to optimize shared memory reads and writes. More... | |
llvm::LogicalResult | optimizeSharedMemoryReadsAndWrites (Operation *parentOp, Value memrefValue) |
Passes. More... | |
void | populateMmaSyncF32ToTF32Patterns (RewritePatternSet &patterns, nvgpu::MmaSyncF32Lowering precision=nvgpu::MmaSyncF32Lowering::TF32) |
Collect patterns to convert mma.sync on f32 input and rewrite to use tensor cores with user provided level of accuracy: (a) tf32 (1 mma.sync per warp-level matrix-multiply-accumulate) (b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate) Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits. More... | |
void | createAsyncGroups (RewriterBase &rewriter, Operation *op, bool bypassL1) |
Convert global->shared vector transfers to async device copies. More... | |
Operation::operand_range | getIndices (Operation *op) |
Get the indices that the given load/store operation is operating on. More... | |
void | setIndices (Operation *op, ArrayRef< Value > indices) |
Set the indices that the given load/store operation is operating on. More... | |
Value | getValueStored (Operation *op) |
Get the value that is stored by the given store operation. More... | |
Value | getMemrefOperand (Operation *op) |
Get the memref that is loaded from/stored into by the given load/store operation. More... | |
FailureOr< vector::ContractionOp > | getUserContract (Operation *op) |
Returns the first user of the op that is vector.contract. More... | |
FailureOr< WarpMatrixInfo > | getWarpMatrixInfo (Operation *op) |
If op is a vector.transfer_write , return the WarpMatrixInfo for the vector operand. More... | |
int64_t | inferTileWidthInBits (const WarpMatrixInfo &type) |
Returns the number of bits in a single tile row. More... | |
FailureOr< FragmentElementInfo > | getMmaSyncRegisterType (const WarpMatrixInfo &type) |
Returns a FragmentElementInfo struct describing the register types for the given matrix fragment type. More... | |
FailureOr< AffineMap > | getLaneIdAndValueIdToOperandCoord (OpBuilder &builder, Location loc, const WarpMatrixInfo &fragmentType) |
Returns an AffineMap which maps a two dimensions representing (laneId, logicalValueId) and returns two results representing offsets within a matrix operand. More... | |
FailureOr< LdMatrixParams > | getLdMatrixParams (const WarpMatrixInfo &type, bool transpose) |
Given type that contains info for a warp-matrix operand and whether or not the load is a transposed load, return the LdMatrixParams. More... | |
FailureOr< AffineMap > | getLaneIdToLdMatrixMatrixCoord (OpBuilder &builder, Location loc, const LdMatrixParams ¶ms) |
Returns an AffineMap which maps a single dimension representing the laneId to two results representing offsets within the matrix operand that should be the pointer locations a thread should pass to the ldmatrix instruction. More... | |
bool | canLowerToWarpMatrixOperation (vector::TransferReadOp op) |
Returns whether the vector.transfer_read instruction can be interpreted as a warp-level cooperative matrix load operation. More... | |
bool | canLowerToWarpMatrixOperation (vector::TransferWriteOp op) |
Returns whether the vector.transfer_write instruction can be interpreted as a warp-level cooperative matrix store operation. More... | |
|
strong |
Represents the role of an operand in an MMA instruction: result := matmul(A, B) + C
Enumerator | |
---|---|
A | |
B | |
C |
Definition at line 26 of file MMAUtils.h.
|
strong |
Rewrites patterns.
Enum to control the lowering of nvgpu.mmasync
.
Enumerator | |
---|---|
TF32 | |
TF32x3 | |
Unkown |
Definition at line 57 of file Transforms.h.
bool mlir::nvgpu::canLowerToWarpMatrixOperation | ( | vector::TransferReadOp | op | ) |
Returns whether the vector.transfer_read
instruction can be interpreted as a warp-level cooperative matrix load operation.
This function is meant to be used to establish whether op
is part of a chain of such warp-level operations.
Definition at line 276 of file MMAUtils.cpp.
bool mlir::nvgpu::canLowerToWarpMatrixOperation | ( | vector::TransferWriteOp | op | ) |
Returns whether the vector.transfer_write
instruction can be interpreted as a warp-level cooperative matrix store operation.
This function is meant to be used to establish whether op
is part of a chain of such warp-level operations.
Definition at line 303 of file MMAUtils.cpp.
void mlir::nvgpu::createAsyncGroups | ( | RewriterBase & | rewriter, |
Operation * | op, | ||
bool | bypassL1 | ||
) |
Convert global->shared vector transfers to async device copies.
This function looks for suitable vector transfers within the specified op and converts them to "nvgpu.device_async_copy" ops. Consecutive copies are put into the same sync group. If bypassL1
is set, the "bypassL1" attribute is set for suitable (i.e., transfer size 16 bytes) transfers.
Definition at line 153 of file CreateAsyncGroups.cpp.
std::unique_ptr< Pass > mlir::nvgpu::createOptimizeSharedMemoryPass | ( | ) |
Create a pass to optimize shared memory reads and writes.
Definition at line 241 of file OptimizeSharedMemory.cpp.
Operation::operand_range mlir::nvgpu::getIndices | ( | Operation * | op | ) |
Get the indices that the given load/store operation is operating on.
Definition at line 18 of file Utils.cpp.
Referenced by mlirSparseElementsAttrGetIndices(), and optimizeSharedMemoryReadsAndWrites().
FailureOr< AffineMap > mlir::nvgpu::getLaneIdAndValueIdToOperandCoord | ( | OpBuilder & | builder, |
Location | loc, | ||
const WarpMatrixInfo & | fragmentType | ||
) |
Returns an AffineMap which maps a two dimensions representing (laneId, logicalValueId) and returns two results representing offsets within a matrix operand.
The offsets point to the values the thread is responsible for (AKA the matrix fragment values) during a warp-collective matrix operation. For a visual reference of this LaneId -> (row, col) mapping, please see NVIDIA's PTX documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-mma
Definition at line 173 of file MMAUtils.cpp.
References mlir::bindDims(), mlir::AffineExpr::floorDiv(), mlir::AffineMap::get(), mlir::Builder::getContext(), mlir::Type::getIntOrFloatBitWidth(), getMmaSyncRegisterType(), getRegisterIndexToTileOffsetMap(), mlir::AffineMap::getResult(), inferTileWidthInBits(), isAccumulatorOrResult(), kThreadsPerRow, mlir::nvgpu::WarpMatrixInfo::operandRole, and mlir::nvgpu::WarpMatrixInfo::vectorType.
FailureOr< AffineMap > mlir::nvgpu::getLaneIdToLdMatrixMatrixCoord | ( | OpBuilder & | builder, |
Location | loc, | ||
const LdMatrixParams & | params | ||
) |
Returns an AffineMap which maps a single dimension representing the laneId to two results representing offsets within the matrix operand that should be the pointer locations a thread should pass to the ldmatrix instruction.
Definition at line 238 of file MMAUtils.cpp.
References mlir::nvgpu::LdMatrixParams::contiguousDimType, mlir::AffineExpr::floorDiv(), mlir::nvgpu::LdMatrixParams::fragmentType, mlir::AffineMap::get(), mlir::getAffineDimExpr(), and mlir::Builder::getContext().
FailureOr< nvgpu::LdMatrixParams > mlir::nvgpu::getLdMatrixParams | ( | const WarpMatrixInfo & | type, |
bool | transpose | ||
) |
Given type
that contains info for a warp-matrix operand and whether or not the load is a transposed load, return the LdMatrixParams.
Definition at line 209 of file MMAUtils.cpp.
References A, C, mlir::nvgpu::LdMatrixParams::contiguousDimType, mlir::nvgpu::LdMatrixParams::fragmentType, mlir::Type::getIntOrFloatBitWidth(), kNumRowsPerTile, mlir::nvgpu::LdMatrixParams::numTiles, mlir::nvgpu::WarpMatrixInfo::operandRole, mlir::nvgpu::LdMatrixParams::targetLayout, mlir::xegpu::transpose(), and mlir::nvgpu::WarpMatrixInfo::vectorType.
Attribute mlir::nvgpu::getMbarrierMemorySpace | ( | MLIRContext * | context, |
MBarrierGroupType | barrierType | ||
) |
Returns the memory space attribute of the mbarrier object.
MemRefType mlir::nvgpu::getMBarrierMemrefType | ( | MLIRContext * | context, |
MBarrierGroupType | barrierType | ||
) |
Return the memref type that can be used to represent an mbarrier object.
FailureOr< FragmentElementInfo > mlir::nvgpu::getMmaSyncRegisterType | ( | const WarpMatrixInfo & | type | ) |
Returns a FragmentElementInfo struct describing the register types for the given matrix fragment type.
Definition at line 100 of file MMAUtils.cpp.
References mlir::get(), mlir::LLVM::getFixedVectorType(), inferNumRegistersPerMatrixFragment(), isAccumulatorOrResult(), mlir::Type::isF16(), mlir::Type::isF32(), mlir::Type::isF64(), mlir::Type::isInteger(), mlir::nvgpu::WarpMatrixInfo::operandRole, and mlir::nvgpu::WarpMatrixInfo::vectorType.
Referenced by getLaneIdAndValueIdToOperandCoord().
FailureOr< vector::ContractionOp > mlir::nvgpu::getUserContract | ( | Operation * | op | ) |
Returns the first user of the op
that is vector.contract.
If no vector.contract user exists, return failure.
Definition at line 50 of file MMAUtils.cpp.
FailureOr< WarpMatrixInfo > mlir::nvgpu::getWarpMatrixInfo | ( | Operation * | op | ) |
If op
is a vector.transfer_write
, return the WarpMatrixInfo
for the vector operand.
If op is a vector.transfer_read
, vector.contraction
, or arith.constant
, return the WarpMatrixInfo
corresponding to the result. Otherwise, return failure.
Definition at line 58 of file MMAUtils.cpp.
int64_t mlir::nvgpu::inferTileWidthInBits | ( | const WarpMatrixInfo & | type | ) |
Returns the number of bits in a single tile row.
It is either 128, 256, or 512 bits depending on the data type and` whether the operand is an accumulator/result operand
Definition at line 87 of file MMAUtils.cpp.
References mlir::Type::getIntOrFloatBitWidth(), isAccumulatorOrResult(), mlir::nvgpu::WarpMatrixInfo::operandRole, and mlir::nvgpu::WarpMatrixInfo::vectorType.
Referenced by getLaneIdAndValueIdToOperandCoord(), and inferNumRegistersPerMatrixFragment().
llvm::LogicalResult mlir::nvgpu::optimizeSharedMemoryReadsAndWrites | ( | Operation * | parentOp, |
Value | memrefValue | ||
) |
Passes.
Optimizes vectorized accesses to a shared memory buffer specified by memrefValue. This transformation assumes the following: 1) All relevant accesses to memrefValue
are contained with parentOp
. 2) The function will fail precondition checks if any subviews are taken of memrefValue
. All reads/writes to memrefValue
should occur through memrefValue
directly.
Shared memory bank conflicts occur when multiple threads attempt to read or write locations assigned to the same shared memory bank. For 2^N
byte vectorized accesses, we need to be concerned with conflicts among threads identified as (tid) -> tid.floordiv(2^{7-N})
. As such, this transformation changes any indexed memory access (vector.load, memref.load, nvgpu.ldmatrix, etc) such that the final dimension's index value is permuted such that newColIndex = oldColIndex % vectorSize + perm[rowIndex](oldColIndex/vectorSize, rowIndex)
where rowIndex
is the index for the second-to last dimension and perm[rowIndex]
is a permutation function that depends on the row Index. The permutation function is chosen to ensure that sequential distributed+vectorized reads/writes down a single dimension of the memref have minimal conflicts.
Definition at line 149 of file OptimizeSharedMemory.cpp.
References mlir::Operation::getContext(), getIndices(), mlir::Operation::getLoc(), getShmReadAndWriteOps(), mlir::Value::getType(), kDefaultVectorSizeBits, kSharedMemoryLineSizeBytes, setIndices(), mlir::OpBuilder::setInsertionPoint(), transformIndices(), and mlir::Operation::walk().
void mlir::nvgpu::populateMmaSyncF32ToTF32Patterns | ( | RewritePatternSet & | patterns, |
nvgpu::MmaSyncF32Lowering | precision = nvgpu::MmaSyncF32Lowering::TF32 |
||
) |
Collect patterns to convert mma.sync on f32 input and rewrite to use tensor cores with user provided level of accuracy: (a) tf32 (1 mma.sync per warp-level matrix-multiply-accumulate) (b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate) Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits.
While f32 has 23 precision bits, tf32 has only 10 precision bits. tf32x3 aims to recover the precision bits by spliting each operand into two tf32 values and issue three mma.sync tensor core operations.
Definition at line 70 of file MmaSyncTF32Transform.cpp.
References mlir::RewritePatternSet::add(), and mlir::RewritePatternSet::getContext().
void mlir::nvgpu::registerTransformDialectExtension | ( | DialectRegistry & | registry | ) |
Referenced by mlir::registerAllExtensions().