MLIR  16.0.0git
mlir::nvgpu Namespace Reference

Classes

struct  FragmentElementInfo
 Specifies information about the registers which compose a matrix fragment according to the PTX documentation.
 
struct  LdMatrixParams
 
struct  PrepareContractToGPUMMASync
 
struct  WarpMatrixInfo
 Collects information about a warp-level matrix operand represented by a VectorType.
 

Enumerations

enum  MmaSyncF32Lowering { MmaSyncF32Lowering::TF32 = 0, MmaSyncF32Lowering::TF32x3 = 1, MmaSyncF32Lowering::Unkown = 2 }
 Enum to control the lowering of nvgpu.mma.sync.
 
enum  MatMulOperandRole : int32_t { MatMulOperandRole::A = 0, MatMulOperandRole::B, MatMulOperandRole::C }
 

Functions

std::unique_ptr< Pass > createOptimizeSharedMemoryPass ()
 Create a pass to optimize shared memory reads and writes.
 
mlir::LogicalResult optimizeSharedMemoryReadsAndWrites (Operation *parentOp, Value memrefValue)
 Optimizes vectorized accesses to a shared memory buffer specified by memrefValue.
 
void populateMmaSyncF32ToTF32Patterns (RewritePatternSet &patterns, nvgpu::MmaSyncF32Lowering precision=nvgpu::MmaSyncF32Lowering::TF32)
 Collect patterns to convert mma.sync on f32 inputs to use tensor cores with a user-provided level of accuracy: (a) tf32 (1 mma.sync per warp-level matrix-multiply-accumulate) or (b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate). Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits.
 
FailureOr< WarpMatrixInfo > getWarpMatrixInfo (Operation *op)
 Given an op that operates on a VectorType representing a warp-level matrix operand, the function returns a struct containing relevant type information.
 
int64_t inferTileWidthInBits (const WarpMatrixInfo &type)
 Returns the number of bits in a single tile row.
 
FailureOr< FragmentElementInfo > getMmaSyncRegisterType (const WarpMatrixInfo &type)
 Returns a FragmentElementInfo struct describing the register types for the given matrix fragment type.
 
static AffineMap getRegisterIndexToTileOffsetMap (int64_t lineSize, Type elementType, ArrayRef< int64_t > operandShape, bool isAccumulator, int64_t elementsPerRegister, AffineExpr logicalValueId)
 
FailureOr< AffineMap > getLaneIdAndValueIdToOperandCoord (Location loc, OpBuilder &builder, const WarpMatrixInfo &fragmentType)
 Returns an AffineMap which takes two dimensions representing (laneId, logicalValueId) and returns two results representing offsets within a matrix operand.
 
FailureOr< nvgpu::LdMatrixParams > getLdMatrixParams (const WarpMatrixInfo &type, bool transpose)
 
FailureOr< AffineMap > getLaneIdToLdMatrixMatrixCoord (Location loc, OpBuilder &builder, const LdMatrixParams &params)
 Returns an AffineMap which maps a single dimension representing the laneId to two results representing offsets within the matrix operand; these are the pointer locations a thread should pass to the ldmatrix instruction.
 

Enumeration Type Documentation

◆ MatMulOperandRole

enum mlir::nvgpu::MatMulOperandRole : int32_t (scoped enum)

Enumerator
A
B
C

Definition at line 27 of file NvGpuSupport.h.

◆ MmaSyncF32Lowering

enum mlir::nvgpu::MmaSyncF32Lowering (scoped enum)

Enum to control the lowering of nvgpu.mma.sync.

Enumerator
TF32 
TF32x3 
Unkown 

Definition at line 56 of file Transforms.h.

Function Documentation

◆ createOptimizeSharedMemoryPass()

std::unique_ptr< Pass > mlir::nvgpu::createOptimizeSharedMemoryPass ( )

Create a pass to optimize shared memory reads and writes.

Definition at line 278 of file OptimizeSharedMemory.cpp.
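
For illustration, a minimal sketch of scheduling the pass from C++ (the include paths and the choice to nest it on func.func are assumptions, not taken from this page):

  #include "mlir/Dialect/Func/IR/FuncOps.h"
  #include "mlir/Pass/PassManager.h"
  // ...plus the NVGPU transforms header that declares createOptimizeSharedMemoryPass().

  void addSharedMemoryOptimization(mlir::PassManager &pm) {
    // Assumption: the pass can be anchored on func.func; adjust to whatever
    // operation your pipeline nests GPU-related passes on.
    pm.addNestedPass<mlir::func::FuncOp>(
        mlir::nvgpu::createOptimizeSharedMemoryPass());
  }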

◆ getLaneIdAndValueIdToOperandCoord()

FailureOr< AffineMap > mlir::nvgpu::getLaneIdAndValueIdToOperandCoord (Location loc, OpBuilder &builder, const WarpMatrixInfo &fragmentType)

Returns an AffineMap which takes two dimensions representing (laneId, logicalValueId) and returns two results representing offsets within a matrix operand.

The offsets point to the values the thread is responsible for (AKA the matrix fragment values) during a warp-collective matrix operation. For a visual reference of this LaneId -> (row, col) mapping, please see NVIDIA's PTX documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-mma

Definition at line 173 of file NvGpuSupport.cpp.

References mlir::bindDims(), mlir::failed(), mlir::failure(), mlir::AffineExpr::floorDiv(), mlir::AffineMap::get(), mlir::Builder::getContext(), mlir::Type::getIntOrFloatBitWidth(), getMmaSyncRegisterType(), getRegisterIndexToTileOffsetMap(), mlir::AffineMap::getResult(), inferTileWidthInBits(), mlir::nvgpu::WarpMatrixInfo::operandRole, and mlir::nvgpu::WarpMatrixInfo::vectorType.

Referenced by convertTransferWriteToStores(), and createNonLdMatrixLoads().
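
A rough sketch of how this helper composes with getWarpMatrixInfo (assumed usage, not taken from the MLIR sources; op and builder are placeholders):

  // `op` operates on a warp-level matrix operand (e.g. a vector transfer).
  FailureOr<nvgpu::WarpMatrixInfo> info = nvgpu::getWarpMatrixInfo(op);
  if (failed(info))
    return failure();

  // Map (laneId, logicalValueId) -> (row, col) offsets within the operand.
  FailureOr<AffineMap> coordMap =
      nvgpu::getLaneIdAndValueIdToOperandCoord(op->getLoc(), builder, *info);
  if (failed(coordMap))
    return failure();
  // The two results of *coordMap can then be materialized (e.g. with
  // affine.apply) to compute per-thread element indices.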

◆ getLaneIdToLdMatrixMatrixCoord()

FailureOr< AffineMap > mlir::nvgpu::getLaneIdToLdMatrixMatrixCoord (Location loc, OpBuilder &builder, const LdMatrixParams &params)

Returns an AffineMap which maps a single dimension representing the laneId to two results representing offsets within the matrix operand; these are the pointer locations a thread should pass to the ldmatrix instruction.

Definition at line 238 of file NvGpuSupport.cpp.

References mlir::nvgpu::LdMatrixParams::contiguousDimType, mlir::failure(), mlir::AffineExpr::floorDiv(), mlir::nvgpu::LdMatrixParams::fragmentType, mlir::AffineMap::get(), mlir::getAffineDimExpr(), and mlir::Builder::getContext().

Referenced by creatLdMatrixCompatibleLoads().
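
A minimal sketch of the intended pairing with getLdMatrixParams (assumptions: warpInfo comes from getWarpMatrixInfo, and loc/builder are in scope):

  FailureOr<nvgpu::LdMatrixParams> params =
      nvgpu::getLdMatrixParams(*warpInfo, /*transpose=*/false);
  if (failed(params))
    return failure();

  // laneId -> (row, col) of the pointer each thread passes to nvgpu.ldmatrix.
  FailureOr<AffineMap> laneMap =
      nvgpu::getLaneIdToLdMatrixMatrixCoord(loc, builder, *params);
  if (failed(laneMap))
    return failure();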

◆ getLdMatrixParams()

FailureOr< LdMatrixParams > mlir::nvgpu::getLdMatrixParams (const WarpMatrixInfo &type, bool transpose)

◆ getMmaSyncRegisterType()

FailureOr< FragmentElementInfo > mlir::nvgpu::getMmaSyncRegisterType (const WarpMatrixInfo &type)

Returns a FragmentElementInfo struct describing the register types for the given matrix fragment type.

◆ getRegisterIndexToTileOffsetMap()

static AffineMap mlir::nvgpu::getRegisterIndexToTileOffsetMap (int64_t lineSize, Type elementType, ArrayRef< int64_t > operandShape, bool isAccumulator, int64_t elementsPerRegister, AffineExpr logicalValueId)

◆ getWarpMatrixInfo()

FailureOr< WarpMatrixInfo > mlir::nvgpu::getWarpMatrixInfo (Operation *op)

Given an op that operates on a VectorType representing a warp-level matrix operand, the function returns a struct containing relevant type information.

◆ inferTileWidthInBits()

int64_t mlir::nvgpu::inferTileWidthInBits (const WarpMatrixInfo &type)

Returns the number of bits in a single tile row.

It is either 128, 256, or 512 bits, depending on the data type and whether the operand is an accumulator/result operand.

Definition at line 87 of file NvGpuSupport.cpp.

References mlir::Type::getIntOrFloatBitWidth(), mlir::nvgpu::WarpMatrixInfo::operandRole, and mlir::nvgpu::WarpMatrixInfo::vectorType.

Referenced by convertTransferReadToLoads(), and getLaneIdAndValueIdToOperandCoord().
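
A hedged sketch of a typical use (assuming warpInfo was obtained from getWarpMatrixInfo): the tile width in bits can be converted into a count of elements of the operand's element type.

  int64_t tileWidthBits = nvgpu::inferTileWidthInBits(*warpInfo);
  Type elementType = warpInfo->vectorType.getElementType();
  int64_t elementsPerTileRow =
      tileWidthBits / elementType.getIntOrFloatBitWidth();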

◆ optimizeSharedMemoryReadsAndWrites()

mlir::LogicalResult mlir::nvgpu::optimizeSharedMemoryReadsAndWrites (Operation *parentOp, Value memrefValue)

Optimizes vectorized accesses to a shared memory buffer specified by memrefValue.

This transformation assumes the following: 1) All relevant accesses to memrefValue are contained within parentOp. 2) The function will fail precondition checks if any subviews are taken of memrefValue; all reads/writes to memrefValue should occur through memrefValue directly.

Shared memory bank conflicts occur when multiple threads attempt to read or write locations assigned to the same shared memory bank. For 2^N byte vectorized accesses, we need to be concerned with conflicts among threads identified as (tid) -> tid.floordiv(2^{7-N}). As such, this transformation changes any indexed memory access (vector.load, memref.load, nvgpu.ldmatrix, etc.) so that the final dimension's index value is permuted to newColIndex = oldColIndex % vectorSize + perm[rowIndex](oldColIndex / vectorSize, rowIndex), where rowIndex is the index of the second-to-last dimension and perm[rowIndex] is a permutation function that depends on the row index. The permutation function is chosen to ensure that sequential distributed+vectorized reads/writes down a single dimension of the memref have minimal conflicts.

Definition at line 181 of file OptimizeSharedMemory.cpp.

References mlir::Type::dyn_cast(), mlir::failed(), mlir::failure(), mlir::Operation::getContext(), getIndices(), mlir::Operation::getLoc(), mlir::detail::getMemorySpaceAsInt(), getShmReadAndWriteOps(), mlir::Value::getType(), kDefaultVectorSizeBits, kSharedMemoryLineSizeBytes, setIndices(), mlir::success(), transformIndices(), and mlir::Operation::walk().
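
For illustration, a hedged sketch of invoking the transformation directly on each shared-memory allocation in a function; the memref.alloc walk and the address-space check are assumptions made for this example, not part of the API:

  funcOp.walk([&](memref::AllocOp alloc) {
    // Assumption: shared memory is modeled here as memref address space 3.
    if (alloc.getType().getMemorySpaceAsInt() != 3)
      return;
    (void)nvgpu::optimizeSharedMemoryReadsAndWrites(funcOp, alloc.getResult());
  });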

◆ populateMmaSyncF32ToTF32Patterns()

void mlir::nvgpu::populateMmaSyncF32ToTF32Patterns (RewritePatternSet &patterns, nvgpu::MmaSyncF32Lowering precision = nvgpu::MmaSyncF32Lowering::TF32)

Collect patterns to convert mma.sync on f32 inputs to use tensor cores with a user-provided level of accuracy: (a) tf32 (1 mma.sync per warp-level matrix-multiply-accumulate) or (b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate). Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits.

While f32 has 23 precision bits, tf32 has only 10 precision bits. tf32x3 aims to recover the precision bits by splitting each operand into two tf32 values and issuing three mma.sync tensor core operations.

Definition at line 69 of file MmaSyncTF32Transform.cpp.

References mlir::RewritePatternSet::add(), and mlir::RewritePatternSet::getContext().
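
A minimal sketch of populating and applying these patterns inside a pass (assumed driver code; getContext(), getOperation(), and signalPassFailure() come from the surrounding Pass):

  RewritePatternSet patterns(&getContext());
  nvgpu::populateMmaSyncF32ToTF32Patterns(patterns,
                                          nvgpu::MmaSyncF32Lowering::TF32x3);
  if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))
    signalPassFailure();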