//===- Transforms.h - NVGPU Dialect transformations --------------*- C++-*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file declares functions that assist transformations for the nvgpu
// dialect.
//
//===----------------------------------------------------------------------===//
#ifndef MLIR_DIALECT_NVGPU_TRANSFORMS_TRANSFORMS_H_
#define MLIR_DIALECT_NVGPU_TRANSFORMS_TRANSFORMS_H_

#include "mlir/IR/Operation.h"
#include "mlir/Support/LogicalResult.h"

namespace mlir {
class RewriterBase;

namespace nvgpu {

///
/// Passes
///

/// Optimizes vectorized accesses to a shared memory buffer specified by
/// `memrefValue`. This transformation assumes the following:
/// 1) All relevant accesses to `memrefValue` are contained within `parentOp`.
/// 2) No subviews are taken of `memrefValue`; the function fails its
///    precondition checks otherwise. All reads/writes to `memrefValue` must
///    occur through `memrefValue` directly.
///
/// Shared memory bank conflicts occur when multiple threads attempt to read or
/// write locations assigned to the same shared memory bank. For `2^N`-byte
/// vectorized accesses, conflicts arise among threads identified as
/// `(tid) -> tid.floordiv(2^{7-N})`. This transformation therefore permutes
/// the final dimension's index of every indexed memory access (vector.load,
/// memref.load, nvgpu.ldmatrix, etc.) so that
/// `newColIndex = oldColIndex % vectorSize +
/// perm[rowIndex](oldColIndex / vectorSize, rowIndex)`, where `rowIndex` is
/// the index of the second-to-last dimension and `perm[rowIndex]` is a
/// permutation function that depends on the row index. The permutation
/// function is chosen so that sequential distributed+vectorized reads/writes
/// down a single dimension of the memref have minimal conflicts.
mlir::LogicalResult optimizeSharedMemoryReadsAndWrites(Operation *parentOp,
                                                       Value memrefValue);
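
// A minimal usage sketch (an assumption, not part of this header): a pass
// might gather the shared memory allocations under an op and run the
// optimization on each buffer. The pass boilerplate and names such as
// `runOnFunc` and `shmAllocOps` are hypothetical.
//
//   void runOnFunc(func::FuncOp funcOp) {
//     // Collect allocations living in the GPU shared memory address space.
//     SmallVector<memref::AllocOp> shmAllocOps;
//     funcOp.walk([&](memref::AllocOp allocOp) {
//       if (nvgpu::NVGPUDialect::hasSharedMemoryAddressSpace(allocOp.getType()))
//         shmAllocOps.push_back(allocOp);
//     });
//     // Permute the last-dimension index of every access to each buffer.
//     for (memref::AllocOp allocOp : shmAllocOps)
//       if (failed(nvgpu::optimizeSharedMemoryReadsAndWrites(
//               funcOp, allocOp.getMemref())))
//         return;
//   }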

///
/// Rewrite patterns
///

//===----------------------------------------------------------------------===//
// NVGPU transformation options exposed as auxiliary structs.
//===----------------------------------------------------------------------===//
/// Enum to control the lowering of `nvgpu.mma.sync`.
enum class MmaSyncF32Lowering { TF32 = 0, TF32x3 = 1, Unkown = 2 };

/// Collect patterns to convert mma.sync on f32 inputs to use tensor cores
/// with a user-provided level of accuracy:
/// (a) tf32   (1 mma.sync per warp-level matrix-multiply-accumulate)
/// (b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate)
/// Typically, tf32 tensor core acceleration comes at a cost of accuracy from
/// missing precision bits. While f32 has 23 precision bits, tf32 has only 10.
/// tf32x3 aims to recover the lost precision by splitting each operand into
/// two tf32 values and issuing three mma.sync tensor core operations.
void populateMmaSyncF32ToTF32Patterns(
    RewritePatternSet &patterns,
    nvgpu::MmaSyncF32Lowering precision = nvgpu::MmaSyncF32Lowering::TF32);

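// A minimal usage sketch (an assumption, not part of this header): wiring the
// collected patterns into a greedy rewrite from inside a pass, requesting the
// higher-accuracy tf32x3 lowering.
//
//   RewritePatternSet patterns(op->getContext());
//   nvgpu::populateMmaSyncF32ToTF32Patterns(
//       patterns, nvgpu::MmaSyncF32Lowering::TF32x3);
//   if (failed(applyPatternsAndFoldGreedily(op, std::move(patterns))))
//     signalPassFailure();
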
/// Convert global->shared vector transfers to async device copies. This
/// function looks for suitable vector transfers within the specified op and
/// converts them to "nvgpu.device_async_copy" ops. Consecutive copies are put
/// into the same sync group. If `bypassL1` is set, the "bypassL1" attribute is
/// set for suitable (i.e., transfer size 16 bytes) transfers.
void createAsyncGroups(RewriterBase &rewriter, Operation *op, bool bypassL1);
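
// A minimal usage sketch (an assumption): driving the conversion from inside
// a pass with an IRRewriter, which is a concrete RewriterBase.
//
//   IRRewriter rewriter(op->getContext());
//   // Rewrites eligible transfers to nvgpu.device_async_copy and groups
//   // consecutive copies; 16-byte transfers may additionally bypass L1.
//   nvgpu::createAsyncGroups(rewriter, op, /*bypassL1=*/true);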

} // namespace nvgpu
} // namespace mlir

#endif // MLIR_DIALECT_NVGPU_TRANSFORMS_TRANSFORMS_H_