# 'vector' Dialect

MLIR supports multi-dimensional `vector`

types and custom operations on those
types. A generic, retargetable, higher-order `vector`

type (`n-D`

with `n > 1`

)
is a structured type, that carries semantic information useful for
transformations. This document discusses retargetable abstractions that exist in
MLIR today and operate on ssa-values of type `vector`

along with pattern
rewrites and lowerings that enable targeting specific instructions on concrete
targets. These abstractions serve to separate concerns between operations on
`memref`

(a.k.a buffers) and operations on `vector`

values. This is not a new
proposal but rather a textual documentation of existing MLIR components along
with a rationale.

## Positioning in the Codegen Infrastructure ¶

The following diagram, recently presented with the StructuredOps abstractions , captures the current codegen paths implemented in MLIR in the various existing lowering paths.

The following diagram seeks to isolate `vector`

dialects from the complexity of
the codegen paths and focus on the payload-carrying ops that operate on std and
`vector`

types. This diagram is not to be taken as set in stone and
representative of what exists today but rather illustrates the layering of
abstractions in MLIR.

This separates concerns related to (a) defining efficient operations on
`vector`

types from (b) program analyses + transformations on `memref`

, loops
and other types of structured ops (be they `HLO`

, `LHLO`

, `Linalg`

or other ).
Looking a bit forward in time, we can put a stake in the ground and venture that
the higher level of `vector`

-level primitives we build and target from codegen
(or some user/language level), the simpler our task will be, the more complex
patterns can be expressed and the better performance will be.

## Components of a Generic Retargetable Vector-Level Dialect ¶

The existing MLIR `vector`

-level dialects are related to the following bottom-up
abstractions:

- Representation in
`LLVMIR`

via data structures, instructions and intrinsics. This is referred to as the`LLVM`

level. - Set of machine-specific operations and types that are built to translate
almost 1-1 with the HW ISA. This is referred to as the Hardware Vector
level; a.k.a
`HWV`

. For instance, we have (a) the`NVVM`

dialect (for`CUDA`

) with tensor core ops, (b) accelerator-specific dialects (internal), a potential (future)`CPU`

dialect to capture`LLVM`

intrinsics more closely and other dialects for specific hardware. Ideally this should be auto-generated as much as possible from the`LLVM`

level. - Set of virtual, machine-agnostic, operations that are informed by costs at
the
`HWV`

-level. This is referred to as the Virtual Vector level; a.k.a`VV`

. This is the level that higher-level abstractions (codegen, automatic vectorization, potential vector language, …) targets.

The existing generic, retargetable, `vector`

-level dialect is related to the
following top-down rewrites and conversions:

- MLIR Rewrite Patterns applied by the MLIR
`PatternRewrite`

infrastructure to progressively lower to implementations that match closer and closer to the`HWV`

. Some patterns are “in-dialect”`VV -> VV`

and some are conversions`VV -> HWV`

. `Virtual Vector -> Hardware Vector`

lowering is specified as a set of MLIR lowering patterns that are specified manually for now.`Hardware Vector -> LLVM`

lowering is a mechanical process that is written manually at the moment and that should be automated, following the`LLVM -> Hardware Vector`

ops generation as closely as possible.

## Short Description of the Existing Infrastructure ¶

### LLVM level ¶

On CPU, the `n-D`

`vector`

type currently lowers to `!llvm<array<vector>>`

. More
concretely, `vector<4x8x128xf32>`

lowers to `!llvm<[4 x [ 8 x [ 128 x float ]]]>`

. There are tradeoffs involved related to how one can access subvectors and
how one uses `llvm.extractelement`

, `llvm.insertelement`

and
`llvm.shufflevector`

. A
deeper dive section
discusses the current
lowering choices and tradeoffs.

### Hardware Vector Ops ¶

Hardware Vector Ops are implemented as one dialect per target. For internal
hardware, we are auto-generating the specific HW dialects. For `GPU`

, the `NVVM`

dialect adds operations such as `mma.sync`

, `shfl`

and tests. For `CPU`

things
are somewhat in-flight because the abstraction is close to `LLVMIR`

. The jury is
still out on whether a generic `CPU`

dialect is concretely needed, but it seems
reasonable to have the same levels of abstraction for all targets and perform
cost-based lowering decisions in MLIR even for `LLVM`

. Specialized `CPU`

dialects that would capture specific features not well captured by LLVM peephole
optimizations of on different types that core MLIR supports (e.g. Scalable
Vectors) are welcome future extensions.

### Virtual Vector Ops ¶

Some existing Standard and Vector Dialect on `n-D`

`vector`

types comprise:

```
%2 = arith.addf %0, %1 : vector<3x7x8xf32> // -> vector<3x7x8xf32> %2 =
arith.mulf %0, %1 : vector<3x7x8xf32> // -> vector<3x7x8xf32> %2 = std.splat
%1 : vector<3x7x8xf32> // -> vector<3x7x8xf32>
%1 = vector.extract %0[1]: vector<3x7x8xf32> // -> vector<7x8xf32> %1 =
vector.extract %0[1, 5]: vector<3x7x8xf32> // -> vector<8xf32> %2 =
vector.outerproduct %0, %1: vector<4xf32>, vector<8xf32> // -> vector<4x8xf32>
%3 = vector.outerproduct %0, %1, %2: vector<4xf32>, vector<8xf32> // fma when
adding %2 %3 = vector.strided_slice %0 {offsets = [2, 2], sizes = [2, 2],
strides = [1, 1]}: vector<4x8x16xf32> // Returns a slice of type
vector<2x2x16xf32>
%2 = vector.transfer_read %A[%0, %1] {permutation_map = (d0, d1) -> (d0)}:
memref<7x?xf32>, vector<4xf32>
vector.transfer_write %f1, %A[%i0, %i1, %i2, %i3] {permutation_map = (d0, d1,
d2, d3) -> (d3, d1, d0)} : vector<5x4x3xf32>, memref<?x?x?x?xf32>
```

The list of Vector is currently undergoing evolutions and is best kept track of
by following the evolution of the
VectorOps.td
ODS file (markdown documentation is automatically generated locally when
building and populates the
Vector doc
).
Recent extensions are driven by concrete use cases of interest. A notable such
use case is the `vector.contract`

op which applies principles of the
StructuredOps abstraction to `vector`

types.

### Virtual Vector Rewrite Patterns ¶

The following rewrite patterns exist at the `VV->VV`

level:

- The now retired
`MaterializeVector`

pass used to legalize ops on a coarse-grained virtual`vector`

to a finer-grained virtual`vector`

by unrolling. This has been rewritten as a retargetable unroll-and-jam pattern on`vector`

ops and`vector`

types. - The lowering of
`vector_transfer`

ops legalizes`vector`

load/store ops to permuted loops over scalar load/stores. This should evolve to loops over`vector`

load/stores +`mask`

operations as they become available`vector`

ops at the`VV`

level.

The general direction is to add more Virtual Vector level ops and implement more
useful `VV -> VV`

rewrites as composable patterns that the PatternRewrite
infrastructure can apply iteratively.

### Virtual Vector to Hardware Vector Lowering ¶

For now, `VV -> HWV`

are specified in C++ (see for instance the
SplatOpLowering for n-D vectors
or the
VectorOuterProductOp lowering
).

Simple
conversion tests
are available for the `LLVM`

target starting from the Virtual Vector Level.

## Rationale ¶

### Hardware as `vector`

Machines of Minimum Granularity ¶

Higher-dimensional `vector`

s are ubiquitous in modern HPC hardware. One way to
think about Generic Retargetable `vector`

-Level Dialect is that it operates on
`vector`

types that are multiples of a “good” `vector`

size so the HW can
efficiently implement a set of high-level primitives (e.g.
`vector<8x8x8x16xf32>`

when HW `vector`

size is say `vector<4x8xf32>`

).

Some notable `vector`

sizes of interest include:

- CPU:
`vector<HW_vector_size * k>`

,`vector<core_count * k’ x HW_vector_size * k>`

and`vector<socket_count x core_count * k’ x HW_vector_size * k>`

- GPU:
`vector<warp_size * k>`

,`vector<warp_size * k x float4>`

and`vector<warp_size * k x 4 x 4 x 4>`

for tensor_core sizes, - Other accelerators: n-D
`vector`

as first-class citizens in the HW.

Depending on the target, ops on sizes that are not multiples of the HW `vector`

size may either produce slow code (e.g. by going through `LLVM`

legalization) or
may not legalize at all (e.g. some unsupported accelerator X combination of ops
and types).

### Transformations Problems Avoided ¶

A `vector<16x32x64xf32>`

virtual `vector`

is a coarse-grained type that can be
“unrolled” to HW-specific sizes. The multi-dimensional unrolling factors are
carried in the IR by the `vector`

type. After unrolling, traditional
instruction-level scheduling can be run.

The following key transformations (along with the supporting analyses and
structural constraints) are completely avoided by operating on a `vector`

`ssa-value`

abstraction:

- Loop unroll and unroll-and-jam.
- Loop and load-store restructuring for register reuse.
- Load to store forwarding and Mem2reg.
- Coarsening (raising) from finer-grained
`vector`

form.

Note that “unrolling” in the context of `vector`

s corresponds to partial loop
unroll-and-jam and not full unrolling. As a consequence this is expected to
compose with SW pipelining where applicable and does not result in ICache blow
up.

### The Big Out-Of-Scope Piece: Automatic Vectorization ¶

One important piece not discussed here is automatic vectorization (automatically
raising from scalar to n-D `vector`

ops and types). The TL;DR is that when the
first “super-vectorization” prototype was implemented, MLIR was nowhere near as
mature as it is today. As we continue building more abstractions in `VV -> HWV`

,
there is an opportunity to revisit vectorization in MLIR.

Since this topic touches on codegen abstractions, it is technically out of the
scope of this survey document but there is a lot to discuss in light of
structured op type representations and how a vectorization transformation can be
reused across dialects. In particular, MLIR allows the definition of dialects at
arbitrary levels of granularity and lends itself favorably to progressive
lowering. The argument can be made that automatic vectorization on a loops + ops
abstraction is akin to raising structural information that has been lost.
Instead, it is possible to revisit vectorization as simple pattern rewrites,
provided the IR is in a suitable form. For instance, vectorizing a
`linalg.generic`

op whose semantics match a `matmul`

can be done
quite easily with a pattern
.
In fact this pattern is trivial to generalize to any type of contraction when
targeting the `vector.contract`

op, as well as to any field (`+/*`

, `min/+`

,
`max/+`

, `or/and`

, `logsumexp/+`

…) . In other words, by operating on a higher
level of generic abstractions than affine loops, non-trivial transformations
become significantly simpler and composable at a finer granularity.

Irrespective of the existence of an auto-vectorizer, one can build a notional
vector language based on the VectorOps dialect and build end-to-end models with
expressing `vector`

s in the IR directly and simple pattern-rewrites.
EDSC
s
provide a simple way of driving such a notional language directly in C++.

## Bikeshed Naming Discussion ¶

There are arguments against naming an n-D level of abstraction `vector`

because
most people associate it with 1-D `vector`

s. On the other hand, `vector`

s are
first-class n-D values in MLIR. The alternative name Tile has been proposed,
which conveys higher-D meaning. But it also is one of the most overloaded terms
in compilers and hardware. For now, we generally use the `n-D`

`vector`

name and
are open to better suggestions.

## DeeperDive ¶

This section describes the tradeoffs involved in lowering the MLIR n-D vector type and operations on it to LLVM-IR. Putting aside the LLVM Matrix proposal for now, this assumes LLVM only has built-in support for 1-D vector. The relationship with the LLVM Matrix proposal is discussed at the end of this document.

MLIR does not currently support dynamic vector sizes (i.e. SVE style) so the
discussion is limited to static rank and static vector sizes (e.g.
`vector<4x8x16x32xf32>`

). This section discusses operations on vectors in LLVM
and MLIR.

LLVM instructions are prefixed by the `llvm.`

dialect prefix (e.g.
`llvm.insertvalue`

). Such ops operate exclusively on 1-D vectors and aggregates
following the
LLVM LangRef
. MLIR
operations are prefixed by the `vector.`

dialect prefix (e.g.
`vector.insertelement`

). Such ops operate exclusively on MLIR `n-D`

`vector`

types.

### Alternatives For Lowering an n-D Vector Type to LLVM ¶

Consider a vector of rank n with static sizes `{s_0, ... s_{n-1}}`

(i.e. an MLIR
`vector<s_0x...s_{n-1}xf32>`

). Lowering such an `n-D`

MLIR vector type to an
LLVM descriptor can be done by either:

- Flattening to a
`1-D`

vector:`!llvm<"(s_0*...*s_{n-1})xfloat">`

in the MLIR LLVM dialect. - Nested aggregate type of
`1-D`

vector:`!llvm."[s_0x[s_1x[...<s_{n-1}xf32>]]]">`

in the MLIR LLVM dialect. - A mix of both.

There are multiple tradeoffs involved in choosing one or the other that we
discuss. It is important to note that “a mix of both” immediately reduces to
“nested aggregate type of 1-D vector” with a `vector.cast %0: vector<4x8x16x32xf32> to vector<4x4096xf32>`

operation, that flattens the most
“k” minor dimensions.

### Constraints Inherited from LLVM (see LangRef) ¶

The first constraint was already mentioned: LLVM only supports `1-D`

`vector`

types natively. Additional constraints are related to the difference in LLVM
between vector and aggregate types: `“Aggregate Types are a subset of derived types that can contain multiple member types. Arrays and structs are aggregate types. Vectors are not considered to be aggregate types.”.`

This distinction is also reflected in some of the operations. For `1-D`

vectors,
the operations `llvm.extractelement`

, `llvm.insertelement`

, and
`llvm.shufflevector`

apply, with direct support for dynamic indices. For `n-D`

vectors with `n>1`

, and thus aggregate types at LLVM level, the more restrictive
operations `llvm.extractvalue`

and `llvm.insertvalue`

apply, which only accept
static indices. There is no direct shuffling support for aggregate types.

The next sentence illustrates a recurrent tradeoff, also found in MLIR, between
“value types” (subject to SSA use-def chains) and “memory types” (subject to
aliasing and side-effects): `“Structures in memory are accessed using ‘load’ and ‘store’ by getting a pointer to a field with the llvm.getelementptr instruction. Structures in registers are accessed using the llvm.extractvalue and llvm.insertvalue instructions.”`

When transposing this to MLIR, `llvm.getelementptr`

works on pointers to `n-D`

vectors in memory. For `n-D`

, vectors values that live in registers we can use
`vector.extract`

and `vector.insert`

which do not accept dynamic indices. Note
that this is consistent with hardware considerations as discussed below.

An alternative is to use an LLVM `1-D`

`vector`

type for which one can use
`llvm.extractelement`

, `llvm.insertelement`

and `llvm.shufflevector`

. These
operations accept dynamic indices. The implication is that one has to use a
flattened lowering of an MLIR n-D vector to an LLVM 1-D vector.

There are multiple tradeoffs involved that mix implications on the programming model, execution on actual HW and what is visible or hidden from codegen. They are discussed in the following sections.

### Nested Aggregate ¶

Pros:

- Natural encoding n-D vector -> (n-1)-D aggregate over 1-D vector.
- No need for linearization / delinearization logic inserted everywhere.
`llvm.insertvalue`

,`llvm.extractvalue`

of`(n-k)-D`

aggregate is natural.`llvm.insertelement`

,`llvm.extractelement`

,`llvm.shufflevector`

over`1-D`

vector type is natural.

Cons:

`llvm.insertvalue`

/`llvm.extractvalue`

does not accept dynamic indices but only static ones.- Dynamic indexing on the non-most-minor dimension requires roundtrips to memory.
- Special intrinsics and native instructions in LLVM operate on
`1-D`

vectors. This is not expected to be a practical limitation thanks to a`vector.cast %0: vector<4x8x16x32xf32> to vector<4x4096xf32>`

operation, that flattens the most minor dimensions (see the bigger picture in implications on codegen).

### Flattened 1-D Vector Type ¶

Pros:

`insertelement`

/`extractelement`

/`shufflevector`

with dynamic indexing is possible over the whole lowered`n-D`

vector type.- Supports special intrinsics and native operations.

Cons:

- Requires linearization/delinearization logic everywhere, translations are complex.
- Hides away the real HW structure behind dynamic indexing: at the end of the day, HW vector sizes are generally fixed and multiple vectors will be needed to hold a vector that is larger than the HW.
- Unlikely peephole optimizations will result in good code: arbitrary dynamic accesses, especially at HW vector boundaries unlikely to result in regular patterns.

### Discussion ¶

#### HW Vectors and Implications on the SW and the Programming Model ¶

As of today, the LLVM model only support `1-D`

vector types. This is
unsurprising because historically, the vast majority of HW only supports `1-D`

vector registers. We note that multiple HW vendors are in the process of
evolving to higher-dimensional physical vectors.

In the following discussion, let’s assume the HW vector size is `1-D`

and the SW
vector size is `n-D`

, with `n >= 1`

. The same discussion would apply with `2-D`

HW `vector`

size and `n >= 2`

. In this context, most HW exhibit a vector
register file. The number of such vectors is fixed. Depending on the rank and
sizes of the SW vector abstraction and the HW vector sizes and number of
registers, an `n-D`

SW vector type may be materialized by a mix of multiple
`1-D`

HW vector registers + memory locations at a given point in time.

The implication of the physical HW constraints on the programming model are that
one cannot index dynamically across hardware registers: a register file can
generally not be indexed dynamically. This is because the register number is
fixed and one either needs to unroll explicitly to obtain fixed register numbers
or go through memory. This is a constraint familiar to CUDA programmers: when
declaring a `private float a[4]`

; and subsequently indexing with a *dynamic*
value results in so-called **local memory** usage (i.e. roundtripping to
memory).

#### Implication on codegen ¶

MLIR `n-D`

vector types are currently represented as `(n-1)-D`

arrays of `1-D`

vectors when lowered to LLVM. This introduces the consequences on static vs
dynamic indexing discussed previously: `extractelement`

, `insertelement`

and
`shufflevector`

on `n-D`

vectors in MLIR only support static indices. Dynamic
indices are only supported on the most minor `1-D`

vector but not the outer
`(n-1)-D`

. For other cases, explicit load / stores are required.

The implications on codegen are as follows:

- Loops around
`vector`

values are indirect addressing of vector values, they must operate on explicit load / store operations over`n-D`

vector types. - Once an
`n-D`

`vector`

type is loaded into an SSA value (that may or may not live in`n`

registers, with or without spilling, when eventually lowered), it may be unrolled to smaller`k-D`

`vector`

types and operations that correspond to the HW. This level of MLIR codegen is related to register allocation and spilling that occur much later in the LLVM pipeline. - HW may support >1-D vectors with intrinsics for indirect addressing within
these vectors. These can be targeted thanks to explicit
`vector_cast`

operations from MLIR`k-D`

vector types and operations to LLVM`1-D`

vectors + intrinsics.

Alternatively, we argue that directly lowering to a linearized abstraction hides away the codegen complexities related to memory accesses by giving a false impression of magical dynamic indexing across registers. Instead we prefer to make those very explicit in MLIR and allow codegen to explore tradeoffs. Different HW will require different tradeoffs in the sizes involved in steps 1., 2. and 3.

Decisions made at the MLIR level will have implications at a much later stage in
LLVM (after register allocation). We do not envision to expose concerns related
to modeling of register allocation and spilling to MLIR explicitly. Instead,
each target will expose a set of “good” target operations and `n-D`

vector
types, associated with costs that `PatterRewriters`

at the MLIR level will be
able to target. Such costs at the MLIR level will be abstract and used for
ranking, not for accurate performance modeling. In the future such costs will be
learned.

#### Implication on Lowering to Accelerators ¶

To target accelerators that support higher dimensional vectors natively, we can
start from either `1-D`

or `n-D`

vectors in MLIR and use `vector.cast`

to
flatten the most minor dimensions to `1-D`

`vector<Kxf32>`

where `K`

is an
appropriate constant. Then, the existing lowering to LLVM-IR immediately
applies, with extensions for accelerator-specific intrinsics.

It is the role of an Accelerator-specific vector dialect (see codegen flow in
the figure above) to lower the `vector.cast`

. Accelerator -> LLVM lowering would
then consist of a bunch of `Accelerator -> Accelerator`

rewrites to perform the
casts composed with `Accelerator -> LLVM`

conversions + intrinsics that operate
on `1-D`

`vector<Kxf32>`

.

Some of those rewrites may need extra handling, especially if a reduction is
involved. For example, `vector.cast %0: vector<K1x...xKnxf32> to vector<Kxf32>`

when `K != K1 * … * Kn`

and some arbitrary irregular `vector.cast %0: vector<4x4x17xf32> to vector<Kxf32>`

may introduce masking and intra-vector
shuffling that may not be worthwhile or even feasible, i.e. infinite cost.

However `vector.cast %0: vector<K1x...xKnxf32> to vector<Kxf32>`

when `K = K1 * … * Kn`

should be close to a noop.

As we start building accelerator-specific abstractions, we hope to achieve retargetable codegen: the same infra is used for CPU, GPU and accelerators with extra MLIR patterns and costs.

#### Implication on calling external functions that operate on vectors ¶

It is possible (likely) that we additionally need to linearize when calling an external function.

### Relationship to LLVM matrix type proposal. ¶

The LLVM matrix proposal was formulated 1 year ago but seemed to be somewhat stalled until recently. In its current form, it is limited to 2-D matrix types and operations are implemented with LLVM intrinsics. In contrast, MLIR sits at a higher level of abstraction and allows the lowering of generic operations on generic n-D vector types from MLIR to aggregates of 1-D LLVM vectors. In the future, it could make sense to lower to the LLVM matrix abstraction also for CPU even though MLIR will continue needing higher level abstractions.

On the other hand, one should note that as MLIR is moving to LLVM, this document could become the unifying abstraction that people should target for 1-D vectors and the LLVM matrix proposal can be viewed as a subset of this work.

### Conclusion ¶

The flattened 1-D vector design in the LLVM matrix proposal is good in a HW-specific world with special intrinsics. This is a good abstraction for register allocation, Instruction-Level-Parallelism and SoftWare-Pipelining/Modulo Scheduling optimizations at the register level. However MLIR codegen operates at a higher level of abstraction where we want to target operations on coarser-grained vectors than the HW size and on which unroll-and-jam is applied and patterns across multiple HW vectors can be matched.

This makes “nested aggregate type of 1-D vector” an appealing abstraction for lowering from MLIR because:

- it does not hide complexity related to the buffer vs value semantics and the memory subsystem and
- it does not rely on LLVM to magically make all the things work from a too low-level abstraction.

The use of special intrinsics in a `1-D`

LLVM world is still available thanks to
an explicit `vector.cast`

op.

## Operations ¶

`vector.bitcast`

(::mlir::vector::BitCastOp) ¶

bitcast casts between vectors

Syntax:

```
operation ::= `vector.bitcast` $source attr-dict `:` type($source) `to` type($result)
```

The bitcast operation casts between vectors of the same rank, the minor 1-D vector size is casted to a vector with a different element type but same bitwidth.

Example:

```
// Example casting to a smaller element type.
%1 = vector.bitcast %0 : vector<5x1x4x3xf32> to vector<5x1x4x6xi16>
// Example casting to a bigger element type.
%3 = vector.bitcast %2 : vector<10x12x8xi8> to vector<10x12x2xi32>
// Example casting to an element type of the same size.
%5 = vector.bitcast %4 : vector<5x1x4x3xf32> to vector<5x1x4x3xi32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Operands: ¶

Operand | Description |
---|---|

`source` | vector of any type values |

#### Results: ¶

Result | Description |
---|---|

`result` | vector of any type values |

`vector.broadcast`

(::mlir::vector::BroadcastOp) ¶

broadcast operation

Syntax:

```
operation ::= `vector.broadcast` $source attr-dict `:` type($source) `to` type($vector)
```

Broadcasts the scalar or k-D vector value in the source operand to a n-D result vector such that the broadcast makes sense, i.e., the source operand is duplicated to match the given rank and sizes in the result vector. The legality rules are:

- the source operand must have the same element type as the result type
- a k-D vector <s_1 x .. x s_k x type> can be broadcast to
a n-D vector <t_1 x .. x t_n x type> if
- k <= n, and
- the sizes in the trailing dimensions n-k < i <= n with j=i+k-n match exactly as s_j = t_i or s_j = 1:

`t_1 x .. t_n-k x t_n-k+1 x .. x t_i x .. x t_n s_1 x .. x s_j x .. x s_k <duplication> <potential stretch>`

The source operand is duplicated over all the missing leading dimensions and stretched over the trailing dimensions where the source has a non-equal dimension of 1. These rules imply that any scalar broadcast (k=0) to any shaped vector with the same element type is always legal.

Example:

```
%0 = arith.constant 0.0 : f32
%1 = vector.broadcast %0 : f32 to vector<16xf32>
%2 = vector.broadcast %1 : vector<16xf32> to vector<4x16xf32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Operands: ¶

Operand | Description |
---|---|

`source` | any type |

#### Results: ¶

Result | Description |
---|---|

`vector` | vector of any type values |

`vector.compressstore`

(::mlir::vector::CompressStoreOp) ¶

writes elements selectively from a vector as defined by a mask

Syntax:

```
operation ::= `vector.compressstore` $base `[` $indices `]` `,` $mask `,` $valueToStore attr-dict `:` type($base) `,` type($mask) `,` type($valueToStore)
```

The compress store operation writes elements from a 1-D vector into memory as defined by a base with indices and a 1-D mask vector. When the mask is set, the corresponding element from the vector is written next to memory. Otherwise, no action is taken for the element. Informally the semantics are:

```
index = i
if (mask[0]) base[index++] = value[0]
if (mask[1]) base[index++] = value[1]
etc.
```

Note that the index increment is done conditionally.

The compress store can be used directly where applicable, or can be used
during progressively lowering to bring other memory operations closer to
hardware ISA support for a compress. The semantics of the operation closely
correspond to those of the `llvm.masked.compressstore`

intrinsic
.

Examples:

```
vector.compressstore %base[%i], %mask, %value
: memref<?xf32>, vector<8xi1>, vector<8xf32>
vector.compressstore %base[%i, %j], %mask, %value
: memref<?x?xf32>, vector<16xi1>, vector<16xf32>
```

#### Operands: ¶

Operand | Description |
---|---|

`base` | memref of any type values |

`indices` | index |

`mask` | vector of 1-bit signless integer values of ranks 1 |

`valueToStore` | of ranks 1 |

`vector.constant_mask`

(::mlir::vector::ConstantMaskOp) ¶

creates a constant vector mask

Syntax:

```
operation ::= `vector.constant_mask` $mask_dim_sizes attr-dict `:` type(results)
```

Creates and returns a vector mask where elements of the result vector are set to ‘0’ or ‘1’, based on whether the element indices are contained within a hyper-rectangular region specified by the ‘mask_dim_sizes’ array attribute argument. Each element of the ‘mask_dim_sizes’ array, specifies an exclusive upper bound [0, mask-dim-size-element-value) for a unique dimension in the vector result. The conjunction of the ranges define a hyper-rectangular region within which elements values are set to 1 (otherwise element values are set to 0).

Example:

```
// create a constant vector mask of size 4x3xi1 with elements in range
// 0 <= row <= 2 and 0 <= col <= 1 are set to 1 (others to 0).
%1 = vector.constant_mask [3, 2] : vector<4x3xi1>
print %1
columns
0 1 2
|------------
0 | 1 1 0
rows 1 | 1 1 0
2 | 1 1 0
3 | 0 0 0
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`mask_dim_sizes` | ::mlir::ArrayAttr | 64-bit integer array attribute |

#### Results: ¶

Result | Description |
---|---|

«unnamed» | vector of 1-bit signless integer values |

`vector.contract`

(::mlir::vector::ContractionOp) ¶

vector contraction operation

Computes the sum of products of vector elements along contracting dimension pairs from 2 vectors of rank M and N respectively, adds this intermediate result to the accumulator argument of rank K, and returns a vector result of rank K (where K = num_lhs_free_dims + num_rhs_free_dims + num_batch_dims (see dimension type descriptions below)). For K = 0 (no free or batch dimensions), the accumulator and output are a scalar.

Optional vector mask arguments (produced by CreateMaskOp or ConstantMaskOp) specify the dynamic dimension sizes of valid data within the lhs/rhs vector arguments.

An iterator type attribute list must be specified, where each element of the list represents an iterator with one of the following types:

*) “reduction”: reduction dimensions are present in the lhs and rhs arguments but not in the output (and accumulator argument). These are the dimensions along which the vector contraction op computes the sum of products, and contracting dimension pair dimension sizes must match between lhs/rhs. *) “parallel”: Batch dimensions are iterator type “parallel”, and are non-contracting dimensions present in the lhs, rhs and output. The lhs/rhs co-iterate along the batch dimensions, which should be expressed in their indexing maps.

```
Free dimensions are iterator type "parallel", and are
non-contraction, non-batch dimensions accessed by either the
lhs or rhs (but not both). The lhs and rhs free dimensions
are unrelated to each other and do not co-iterate, which
should be expressed in their indexing maps.
```

An indexing map attribute list must be specified with an entry for lhs, rhs and acc arguments. An indexing map attribute specifies a mapping from each iterator in the iterator type list, to each dimension of an N-D vector.

An optional kind attribute may be used to specify the combining function between the intermediate result and accumulator argument of rank K. This attribute can take the values add/mul/min/max for int/fp, and/or/xor for int only. The default is “add”.

Example:

```
// Simple DOT product (K = 0).
#contraction_accesses = [
affine_map<(i) -> (i)>,
affine_map<(i) -> (i)>,
affine_map<(i) -> ()>
]
#contraction_trait = {
indexing_maps = #contraction_accesses,
iterator_types = ["reduction"]
}
%3 = vector.contract #contraction_trait %0, %1, %2
: vector<10xf32>, vector<10xf32> into f32
// 2D vector contraction with one contracting dimension (matmul, K = 2).
#contraction_accesses = [
affine_map<(i, j, k) -> (i, k)>,
affine_map<(i, j, k) -> (k, j)>,
affine_map<(i, j, k) -> (i, j)>
]
#contraction_trait = {
indexing_maps = #contraction_accesses,
iterator_types = ["parallel", "parallel", "reduction"]
}
%3 = vector.contract #contraction_trait %0, %1, %2
: vector<4x3xf32>, vector<3x7xf32> into vector<4x7xf32>
// 4D to 3D vector contraction with two contracting dimensions and
// one batch dimension (K = 3).
#contraction_accesses = [
affine_map<(b0, f0, f1, c0, c1) -> (c0, b0, c1, f0)>,
affine_map<(b0, f0, f1, c0, c1) -> (b0, c1, c0, f1)>,
affine_map<(b0, f0, f1, c0, c1) -> (b0, f0, f1)>
]
#contraction_trait = {
indexing_maps = #contraction_accesses,
iterator_types = ["parallel", "parallel", "parallel",
"reduction", "reduction"]
}
%4 = vector.contract #contraction_trait %0, %1, %2
: vector<7x8x16x15xf32>, vector<8x16x7x5xf32> into vector<8x15x5xf32>
// 4D vector contraction with two contracting dimensions and optional
// vector mask arguments.
%lhs_mask = vector.constant_mask [7, 8, 16, 15] : vector<7x8x16x15xi1>
%rhs_mask = vector.constant_mask [8, 16, 7, 5] : vector<8x16x7x5xi1>
%5 = vector.contract #contraction_trait %0, %1, %2, %lhs_mask, %rhs_mask
: vector<7x8x16x15xf32>, vector<8x16x7x5xf32> into vector<8x15x8x5xf32>
// Vector contraction with mixed typed. lhs/rhs have different element
// types than accumulator/result.
%6 = vector.contract #contraction_trait %0, %1, %2
: vector<10xf16>, vector<10xf16> into f32
// Contract with max (K = 0).
#contraction_accesses = [
affine_map<(i) -> (i)>,
affine_map<(i) -> (i)>,
affine_map<(i) -> ()>
]
#contraction_trait = {
indexing_maps = #contraction_accesses,
iterator_types = ["reduction"],
kind = #vector.kind<max>
}
%7 = vector.contract #contraction_trait %0, %1, %2
: vector<10xf32>, vector<10xf32> into f32
```

Interfaces: NoSideEffect (MemoryEffectOpInterface), VectorUnrollOpInterface

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`indexing_maps` | ::mlir::ArrayAttr | AffineMap array attribute |

`iterator_types` | ::mlir::ArrayAttr | array attribute |

`kind` | ::mlir::vector::CombiningKindAttr | Kind of combining function for contractions and reductions |

#### Operands: ¶

Operand | Description |
---|---|

`lhs` | vector of any type values |

`rhs` | vector of any type values |

`acc` | any type |

`masks` | vector of 1-bit signless integer values |

#### Results: ¶

Result | Description |
---|---|

«unnamed» | any type |

`vector.create_mask`

(::mlir::vector::CreateMaskOp) ¶

creates a vector mask

Syntax:

```
operation ::= `vector.create_mask` $operands attr-dict `:` type(results)
```

Creates and returns a vector mask where elements of the result vector are set to ‘0’ or ‘1’, based on whether the element indices are contained within a hyper-rectangular region specified by the operands. Specifically, each operand specifies a range [0, operand-value) for a unique dimension in the vector result. The conjunction of the operand ranges define a hyper-rectangular region within which elements values are set to 1 (otherwise element values are set to 0).

Example:

```
// create a vector mask of size 4x3xi1 where elements in range
// 0 <= row <= 2 and 0 <= col <= 1 are set to 1 (others to 0).
%1 = vector.create_mask %c3, %c2 : vector<4x3xi1>
print %1
columns
0 1 2
|------------
0 | 1 1 0
rows 1 | 1 1 0
2 | 1 1 0
3 | 0 0 0
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Operands: ¶

Operand | Description |
---|---|

`operands` | index |

#### Results: ¶

Result | Description |
---|---|

«unnamed» | vector of 1-bit signless integer values |

`vector.expandload`

(::mlir::vector::ExpandLoadOp) ¶

reads elements from memory and spreads them into a vector as defined by a mask

Syntax:

```
operation ::= `vector.expandload` $base `[` $indices `]` `,` $mask `,` $pass_thru attr-dict `:` type($base) `,` type($mask) `,` type($pass_thru) `into` type($result)
```

The expand load reads elements from memory into a 1-D vector as defined by a base with indices and a 1-D mask vector. When the mask is set, the next element is read from memory. Otherwise, the corresponding element is taken from a 1-D pass-through vector. Informally the semantics are:

```
index = i
result[0] := mask[0] ? base[index++] : pass_thru[0]
result[1] := mask[1] ? base[index++] : pass_thru[1]
etc.
```

Note that the index increment is done conditionally.

The expand load can be used directly where applicable, or can be used
during progressively lowering to bring other memory operations closer to
hardware ISA support for an expand. The semantics of the operation closely
correspond to those of the `llvm.masked.expandload`

intrinsic
.

Examples:

```
%0 = vector.expandload %base[%i], %mask, %pass_thru
: memref<?xf32>, vector<8xi1>, vector<8xf32> into vector<8xf32>
%1 = vector.expandload %base[%i, %j], %mask, %pass_thru
: memref<?x?xf32>, vector<16xi1>, vector<16xf32> into vector<16xf32>
```

#### Operands: ¶

Operand | Description |
---|---|

`base` | memref of any type values |

`indices` | index |

`mask` | vector of 1-bit signless integer values of ranks 1 |

`pass_thru` | of ranks 1 |

#### Results: ¶

Result | Description |
---|---|

`result` | of ranks 1 |

`vector.extractelement`

(::mlir::vector::ExtractElementOp) ¶

extractelement operation

Syntax:

```
operation ::= `vector.extractelement` $vector `[` ($position^ `:` type($position))? `]` attr-dict `:` type($vector)
```

Takes a 0-D or 1-D vector and a optional dynamic index position and extracts the scalar at that position.

Note that this instruction resembles vector.extract, but is restricted to 0-D and 1-D vectors and relaxed to dynamic indices. If the vector is 0-D, the position must be llvm::None.

It is meant to be closer to LLVM’s version: https://llvm.org/docs/LangRef.html#extractelement-instruction

Example:

```
%c = arith.constant 15 : i32
%1 = vector.extractelement %0[%c : i32]: vector<16xf32>
%2 = vector.extractelement %z[]: vector<f32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Operands: ¶

Operand | Description |
---|---|

`vector` | vector of any type values |

`position` | signless integer or index |

#### Results: ¶

Result | Description |
---|---|

`result` | any type |

`vector.extract_map`

(::mlir::vector::ExtractMapOp) ¶

vector extract map operation

Syntax:

```
operation ::= `vector.extract_map` $vector `[` $ids `]` attr-dict `:` type($vector) `to` type(results)
```

Takes an N-D vector and extracts a sub-part of the vector starting at id along each dimension.

The dimension associated to each element of `ids`

used to extract are
implicitly deduced from the destination type. For each dimension the
multiplicity is the destination dimension size divided by the source
dimension size, each dimension with a multiplicity greater than 1 is
associated to the next id, following ids order.
For example if the source type is `vector<64x4x32xf32>`

and the destination
type is `vector<4x4x2xf32>`

, the first id maps to dimension 0 and the second
id to dimension 2.

Similarly to vector.tuple_get, this operation is used for progressive lowering and should be folded away before converting to LLVM.

It is different than `vector.extract_slice`

and
`vector.extract_strided_slice`

as it takes a Value as index instead of an
attribute. Also in the future it is meant to support extracting along any
dimensions and not only the most major ones.

For instance:

```
// dynamic computation producing the value 0 of index type
%idx0 = ... : index
// dynamic computation producing the value 1 of index type
%idx1 = ... : index
%0 = arith.constant dense<0, 1, 2, 3>: vector<4xi32>
// extracts values [0, 1]
%1 = vector.extract_map %0[%idx0] : vector<4xi32> to vector<2xi32>
// extracts values [1, 2]
%2 = vector.extract_map %0[%idx1] : vector<4xi32> to vector<2xi32>
```

Example:

```
%ev = vector.extract_map %v[%id] : vector<32xf32> to vector<1xf32>
%ev1 = vector.extract_map %v1[%id1, %id2] : vector<64x4x32xf32>
to vector<4x4x2xf32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Operands: ¶

Operand | Description |
---|---|

`vector` | vector of any type values |

`ids` | index |

#### Results: ¶

Result | Description |
---|---|

«unnamed» | vector of any type values |

`vector.extract`

(::mlir::vector::ExtractOp) ¶

extract operation

Takes an n-D vector and a k-D position and extracts the (n-k)-D vector at the proper position. Degenerates to an element type in the 0-D case.

Example:

```
%1 = vector.extract %0[3]: vector<4x8x16xf32>
%2 = vector.extract %0[3, 3, 3]: vector<4x8x16xf32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`position` | ::mlir::ArrayAttr | 64-bit integer array attribute |

#### Operands: ¶

Operand | Description |
---|---|

`vector` | vector of any type values |

#### Results: ¶

Result | Description |
---|---|

«unnamed» | any type |

`vector.extract_strided_slice`

(::mlir::vector::ExtractStridedSliceOp) ¶

extract_strided_slice operation

Syntax:

```
operation ::= `vector.extract_strided_slice` $vector attr-dict `:` type($vector) `to` type(results)
```

Takes an n-D vector, k-D `offsets`

integer array attribute, a k-sized
`sizes`

integer array attribute, a k-sized `strides`

integer array
attribute and extracts the n-D subvector at the proper offset.

At the moment strides must contain only 1s. // TODO: support non-1 strides.

Returns an n-D vector where the first k-D dimensions match the `sizes`

attribute. The returned subvector contains the elements starting at offset
`offsets`

and ending at `offsets + sizes`

.

Example:

```
%1 = vector.extract_strided_slice %0
{offsets = [0, 2], sizes = [2, 4], strides = [1, 1]}:
vector<4x8x16xf32> to vector<2x4x16xf32>
// TODO: Evolve to a range form syntax similar to:
%1 = vector.extract_strided_slice %0[0:2:1][2:4:1]
vector<4x8x16xf32> to vector<2x4x16xf32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`offsets` | ::mlir::ArrayAttr | 64-bit integer array attribute |

`sizes` | ::mlir::ArrayAttr | 64-bit integer array attribute |

`strides` | ::mlir::ArrayAttr | 64-bit integer array attribute |

#### Operands: ¶

Operand | Description |
---|---|

`vector` | vector of any type values |

#### Results: ¶

Result | Description |
---|---|

«unnamed» | vector of any type values |

`vector.fma`

(::mlir::vector::FMAOp) ¶

vector fused multiply-add

Syntax:

```
operation ::= `vector.fma` $lhs `,` $rhs `,` $acc attr-dict `:` type($lhs)
```

Multiply-add expressions operate on n-D vectors and compute a fused
pointwise multiply-and-accumulate: `$result = `

$lhs * $rhs + $acc`. All operands and result have the same vector type. The semantics of the operation correspond to those of the `

llvm.fma`[intrinsic](https://llvm.org/docs/LangRef.html#int-fma). In the particular case of lowering to LLVM, this is guaranteed to lower to the `

llvm.fma.*` intrinsic.

Example:

```
%3 = vector.fma %0, %1, %2: vector<8x16xf32>
```

Traits: Elementwise, Scalarizable, Tensorizable, Vectorizable

Interfaces: NoSideEffect (MemoryEffectOpInterface), VectorUnrollOpInterface

Effects: MemoryEffects::Effect{}

#### Operands: ¶

Operand | Description |
---|---|

`lhs` | vector of any type values |

`rhs` | vector of any type values |

`acc` | vector of any type values |

#### Results: ¶

Result | Description |
---|---|

`result` | vector of any type values |

`vector.flat_transpose`

(::mlir::vector::FlatTransposeOp) ¶

Vector matrix transposition on flattened 1-D MLIR vectors

Syntax:

```
operation ::= `vector.flat_transpose` $matrix attr-dict `:` type($matrix) `->` type($res)
```

This is the counterpart of llvm.matrix.transpose in MLIR. It serves the purposes of more progressive lowering and localized type conversion. Higher levels typically lower matrix tranpositions into ‘vector.transpose’ operations. Subsequent rewriting rule progressively lower these operations into ‘vector.flat_transpose’ operations to bring the operations closer to the hardware ISA.

The ‘vector.flat_transpose’ op treats the 1-D input `matrix`

as
a 2-D matrix with

Also see:

http://llvm.org/docs/LangRef.html#llvm-matrix-transpose-intrinsic

Example:

```
%1 = vector.flat_transpose %0 { rows = 4: i32, columns = 4: i32 }
: (vector<16xf32>) -> vector<16xf32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`rows` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

`columns` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

#### Operands: ¶

Operand | Description |
---|---|

`matrix` | vector of signless integer or signed integer or index or floating-point values of ranks 1 |

#### Results: ¶

Result | Description |
---|---|

`res` | vector of signless integer or signed integer or index or floating-point values of ranks 1 |

`vector.gather`

(::mlir::vector::GatherOp) ¶

gathers elements from memory into a vector as defined by an index vector and mask

Syntax:

```
operation ::= `vector.gather` $base `[` $indices `]` `[` $index_vec `]` `,` $mask `,` $pass_thru attr-dict `:` type($base) `,` type($index_vec) `,` type($mask) `,` type($pass_thru) `into` type($result)
```

The gather operation gathers elements from memory into a 1-D vector as defined by a base with indices and an additional 1-D index vector, but only if the corresponding bit is set in a 1-D mask vector. Otherwise, the element is taken from a 1-D pass-through vector. Informally the semantics are:

```
result[0] := mask[0] ? base[index[0]] : pass_thru[0]
result[1] := mask[1] ? base[index[1]] : pass_thru[1]
etc.
```

The vector dialect leaves out-of-bounds behavior undefined.

The gather operation can be used directly where applicable, or can be used
during progressively lowering to bring other memory operations closer to
hardware ISA support for a gather. The semantics of the operation closely
correspond to those of the `llvm.masked.gather`

intrinsic
.

Examples:

```
%0 = vector.gather %base[%c0][%v], %mask, %pass_thru
: memref<?xf32>, vector<16xi32>, vector<16xi1>, vector<16xf32> into vector<16xf32>
%1 = vector.gather %base[%i, %j][%v], %mask, %pass_thru
: memref<16x16xf32>, vector<16xi32>, vector<16xi1>, vector<16xf32> into vector<16xf32>
```

#### Operands: ¶

Operand | Description |
---|---|

`base` | memref of any type values |

`indices` | index |

`index_vec` | vector of integer or index values of ranks 1 |

`mask` | vector of 1-bit signless integer values of ranks 1 |

`pass_thru` | of ranks 1 |

#### Results: ¶

Result | Description |
---|---|

`result` | of ranks 1 |

`vector.insertelement`

(::mlir::vector::InsertElementOp) ¶

insertelement operation

Syntax:

```
operation ::= `vector.insertelement` $source `,` $dest `[` ($position^ `:` type($position))? `]` attr-dict `:`
type($result)
```

Takes a scalar source, a 0-D or 1-D destination vector and a dynamic index position and inserts the source into the destination at the proper position.

Note that this instruction resembles vector.insert, but is restricted to 0-D and 1-D vectors and relaxed to dynamic indices.

It is meant to be closer to LLVM’s version: https://llvm.org/docs/LangRef.html#insertelement-instruction

Example:

```
%c = arith.constant 15 : i32
%f = arith.constant 0.0f : f32
%1 = vector.insertelement %f, %0[%c : i32]: vector<16xf32>
%2 = vector.insertelement %f, %z[]: vector<f32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Operands: ¶

Operand | Description |
---|---|

`source` | any type |

`dest` | vector of any type values |

`position` | signless integer or index |

#### Results: ¶

Result | Description |
---|---|

`result` | vector of any type values |

`vector.insert_map`

(::mlir::vector::InsertMapOp) ¶

vector insert map operation

Syntax:

```
operation ::= `vector.insert_map` $vector `,` $dest `[` $ids `]` attr-dict
`:` type($vector) `into` type($result)
```

Inserts a N-D vector and within a larger vector starting at id. The new vector created will have the same size as the destination operand vector.

The dimension associated to each element of `ids`

used to insert is
implicitly deduced from the source type (see `ExtractMapOp`

for details).
For example if source type is `vector<4x4x2xf32>`

and the destination type
is `vector<64x4x32xf32>`

, the first id maps to dimension 0 and the second id
to dimension 2.

Similarly to vector.tuple_get, this operation is used for progressive lowering and should be folded away before converting to LLVM.

It is different than `vector.insert`

and `vector.insert_strided_slice`

as it
takes a Value as index instead of an attribute. Also in the future it is
meant to support inserting along any dimensions and not only the most major
ones.

This operations is meant to be used in combination with vector.extract_map.

For instance:

```
// dynamic computation producing the value 0 of index type
%idx0 = ... : index
// dynamic computation producing the value 1 of index type
%idx1 = ... : index /
%0 = arith.constant dense<0, 1, 2, 3>: vector<4xi32>
// extracts values [0, 1]
%1 = vector.extract_map %0[%idx0] : vector<4xi32> to vector<2xi32>
// extracts values [1, 2]
%2 = vector.extract_map %0[%idx1] : vector<4xi32> to vector<2xi32>
// insert [0, 1] into [x, x, x, x] and produce [0, 1, x, x]
%3 = vector.insert_map %1, %0[%idx0] : vector<2xi32> into vector<4xi32>
// insert [1, 2] into [x, x, x, x] and produce [x, 1, 2, x]
%4 = vector.insert_map %2, %0[%idx1] : vector<2xi32> into vector<4xi32>
```

Example:

```
%v = vector.insert_map %ev %v[%id] : vector<1xf32> into vector<32xf32>
%v1 = vector.insert_map %ev1, %v1[%arg0, %arg1] : vector<2x4x1xf32>
into vector<64x4x32xf32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Operands: ¶

Operand | Description |
---|---|

`vector` | vector of any type values |

`dest` | vector of any type values |

`ids` | index |

#### Results: ¶

Result | Description |
---|---|

`result` | vector of any type values |

`vector.insert`

(::mlir::vector::InsertOp) ¶

insert operation

Syntax:

```
operation ::= `vector.insert` $source `,` $dest $position attr-dict `:` type($source) `into` type($dest)
```

Takes an n-D source vector, an (n+k)-D destination vector and a k-D position and inserts the n-D source into the (n+k)-D destination at the proper position. Degenerates to a scalar source type when n = 0.

Example:

```
%2 = vector.insert %0, %1[3] : vector<8x16xf32> into vector<4x8x16xf32>
%5 = vector.insert %3, %4[3, 3, 3] : f32 into vector<4x8x16xf32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`position` | ::mlir::ArrayAttr | 64-bit integer array attribute |

#### Operands: ¶

Operand | Description |
---|---|

`source` | any type |

`dest` | vector of any type values |

#### Results: ¶

Result | Description |
---|---|

`res` | vector of any type values |

`vector.insert_strided_slice`

(::mlir::vector::InsertStridedSliceOp) ¶

strided_slice operation

Syntax:

```
operation ::= `vector.insert_strided_slice` $source `,` $dest attr-dict `:` type($source) `into` type($dest)
```

Takes a k-D source vector, an n-D destination vector (n >= k), n-sized
`offsets`

integer array attribute, a k-sized `strides`

integer array attribute
and inserts the k-D source vector as a strided subvector at the proper offset
into the n-D destination vector.

At the moment strides must contain only 1s.

Returns an n-D vector that is a copy of the n-D destination vector in which the last k-D dimensions contain the k-D source vector elements strided at the proper location as specified by the offsets.

Example:

```
%2 = vector.insert_strided_slice %0, %1
{offsets = [0, 0, 2], strides = [1, 1]}:
vector<2x4xf32> into vector<16x4x8xf32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`offsets` | ::mlir::ArrayAttr | 64-bit integer array attribute |

`strides` | ::mlir::ArrayAttr | 64-bit integer array attribute |

#### Operands: ¶

Operand | Description |
---|---|

`source` | vector of any type values |

`dest` | vector of any type values |

#### Results: ¶

Result | Description |
---|---|

`res` | vector of any type values |

`vector.load`

(::mlir::vector::LoadOp) ¶

reads an n-D slice of memory into an n-D vector

Syntax:

```
operation ::= `vector.load` $base `[` $indices `]` attr-dict `:` type($base) `,` type($result)
```

The ‘vector.load’ operation reads an n-D slice of memory into an n-D vector. It takes a ‘base’ memref, an index for each memref dimension and a result vector type as arguments. It returns a value of the result vector type. The ‘base’ memref and indices determine the start memory address from which to read. Each index provides an offset for each memref dimension based on the element type of the memref. The shape of the result vector type determines the shape of the slice read from the start memory address. The elements along each dimension of the slice are strided by the memref strides. Only unit strides are allowed along the most minor memref dimension. These constraints guarantee that elements read along the first dimension of the slice are contiguous in memory.

The memref element type can be a scalar or a vector type. If the memref element type is a scalar, it should match the element type of the result vector. If the memref element type is vector, it should match the result vector type.

Example 1: 1-D vector load on a scalar memref.

```
%result = vector.load %base[%i, %j] : memref<100x100xf32>, vector<8xf32>
```

Example 2: 1-D vector load on a vector memref.

```
%result = vector.load %memref[%i, %j] : memref<200x100xvector<8xf32>>, vector<8xf32>
```

Example 3: 2-D vector load on a scalar memref.

```
%result = vector.load %memref[%i, %j] : memref<200x100xf32>, vector<4x8xf32>
```

Example 4: 2-D vector load on a vector memref.

```
%result = vector.load %memref[%i, %j] : memref<200x100xvector<4x8xf32>>, vector<4x8xf32>
```

Representation-wise, the ‘vector.load’ operation permits out-of-bounds reads. Support and implementation of out-of-bounds vector loads is target-specific. No assumptions should be made on the value of elements loaded out of bounds. Not all targets may support out-of-bounds vector loads.

Example 5: Potential out-of-bound vector load.

```
%result = vector.load %memref[%index] : memref<?xf32>, vector<8xf32>
```

Example 6: Explicit out-of-bound vector load.

```
%result = vector.load %memref[%c0] : memref<7xf32>, vector<8xf32>
```

#### Operands: ¶

Operand | Description |
---|---|

`base` | memref of any type values |

`indices` | index |

#### Results: ¶

Result | Description |
---|---|

`result` | vector of any type values |

`vector.maskedload`

(::mlir::vector::MaskedLoadOp) ¶

loads elements from memory into a vector as defined by a mask vector

Syntax:

```
operation ::= `vector.maskedload` $base `[` $indices `]` `,` $mask `,` $pass_thru attr-dict `:` type($base) `,` type($mask) `,` type($pass_thru) `into` type($result)
```

The masked load reads elements from memory into a 1-D vector as defined by a base with indices and a 1-D mask vector. When the mask is set, the element is read from memory. Otherwise, the corresponding element is taken from a 1-D pass-through vector. Informally the semantics are:

```
result[0] := mask[0] ? base[i+0] : pass_thru[0]
result[1] := mask[1] ? base[i+1] : pass_thru[1]
etc.
```

The masked load can be used directly where applicable, or can be used
during progressively lowering to bring other memory operations closer to
hardware ISA support for a masked load. The semantics of the operation
closely correspond to those of the `llvm.masked.load`

intrinsic
.

Examples:

```
%0 = vector.maskedload %base[%i], %mask, %pass_thru
: memref<?xf32>, vector<8xi1>, vector<8xf32> into vector<8xf32>
%1 = vector.maskedload %base[%i, %j], %mask, %pass_thru
: memref<?x?xf32>, vector<16xi1>, vector<16xf32> into vector<16xf32>
```

#### Operands: ¶

Operand | Description |
---|---|

`base` | memref of any type values |

`indices` | index |

`mask` | vector of 1-bit signless integer values of ranks 1 |

`pass_thru` | of ranks 1 |

#### Results: ¶

Result | Description |
---|---|

`result` | of ranks 1 |

`vector.maskedstore`

(::mlir::vector::MaskedStoreOp) ¶

stores elements from a vector into memory as defined by a mask vector

Syntax:

```
operation ::= `vector.maskedstore` $base `[` $indices `]` `,` $mask `,` $valueToStore attr-dict `:` type($base) `,` type($mask) `,` type($valueToStore)
```

The masked store operation writes elements from a 1-D vector into memory as defined by a base with indices and a 1-D mask vector. When the mask is set, the corresponding element from the vector is written to memory. Otherwise, no action is taken for the element. Informally the semantics are:

```
if (mask[0]) base[i+0] = value[0]
if (mask[1]) base[i+1] = value[1]
etc.
```

The masked store can be used directly where applicable, or can be used
during progressively lowering to bring other memory operations closer to
hardware ISA support for a masked store. The semantics of the operation
closely correspond to those of the `llvm.masked.store`

intrinsic
.

Examples:

```
vector.maskedstore %base[%i], %mask, %value
: memref<?xf32>, vector<8xi1>, vector<8xf32>
vector.maskedstore %base[%i, %j], %mask, %value
: memref<?x?xf32>, vector<16xi1>, vector<16xf32>
```

#### Operands: ¶

Operand | Description |
---|---|

`base` | memref of any type values |

`indices` | index |

`mask` | vector of 1-bit signless integer values of ranks 1 |

`valueToStore` | of ranks 1 |

`vector.matrix_multiply`

(::mlir::vector::MatmulOp) ¶

Vector matrix multiplication op that operates on flattened 1-D MLIR vectors

Syntax:

```
operation ::= `vector.matrix_multiply` $lhs `,` $rhs attr-dict `:` `(` type($lhs) `,` type($rhs) `)` `->` type($res)
```

This is the counterpart of llvm.matrix.multiply in MLIR. It serves the purposes of more progressive lowering and localized type conversion. Higher levels typically lower matrix multiplications into ‘vector.contract’ operations. Subsequent rewriting rule progressively lower these operations into ‘vector.matrix_multiply’ operations to bring the operations closer to the hardware ISA.

The ‘vector.matrix_multiply’ op treats `lhs`

as matrix with <lhs_rows> rows
and <lhs_columns> columns, `rhs`

as matrix with <lhs_columns> rows and
<rhs_columns> and multiplies them. The result matrix is returned embedded in
the result vector.

Also see:

http://llvm.org/docs/LangRef.html#llvm-matrix-multiply-intrinsic

Example:

```
%C = vector.matrix_multiply %A, %B
{ lhs_rows = 4: i32, lhs_columns = 16: i32 , rhs_columns = 3: i32 } :
(vector<64xf64>, vector<48xf64>) -> vector<12xf64>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`lhs_rows` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

`lhs_columns` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

`rhs_columns` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

#### Operands: ¶

Operand | Description |
---|---|

`lhs` | vector of signless integer or signed integer or index or floating-point values of ranks 1 |

`rhs` | vector of signless integer or signed integer or index or floating-point values of ranks 1 |

#### Results: ¶

Result | Description |
---|---|

`res` | vector of signless integer or signed integer or index or floating-point values of ranks 1 |

`vector.multi_reduction`

(::mlir::vector::MultiDimReductionOp) ¶

Multi-dimensional reduction operation

Syntax:

```
operation ::= `vector.multi_reduction` $kind `,` $source attr-dict $reduction_dims `:` type($source) `to` type($dest)
```

Reduces an n-D vector into an (n-k)-D vector (or a scalar when k == n) using the given operation (add/mul/min/max for int/fp and and/or/xor for int only).

Example:

```
%1 = vector.multi_reduction "add", %0 [1, 3] :
vector<4x8x16x32xf32> into vector<4x16xf32>
%2 = vector.multi_reduction "add", %1 [0, 1] :
vector<4x16xf32> into f32
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`kind` | ::mlir::vector::CombiningKindAttr | Kind of combining function for contractions and reductions |

`reduction_dims` | ::mlir::ArrayAttr | 64-bit integer array attribute |

#### Operands: ¶

Operand | Description |
---|---|

`source` | vector of any type values |

#### Results: ¶

Result | Description |
---|---|

`dest` | any type |

`vector.outerproduct`

(::mlir::vector::OuterProductOp) ¶

vector outerproduct with optional fused add

Takes 2 1-D vectors and returns the 2-D vector containing the outer-product, as illustrated below:

```
outer | [c, d]
------+------------
[a, | [ [a*c, a*d],
b] | [b*c, b*d] ]
```

This operation also accepts a 1-D vector lhs and a scalar rhs. In this case a simple AXPY operation is performed, which returns a 1-D vector.

```
[a, b] * c = [a*c, b*c]
```

An optional extra vector argument with the same shape as the output
vector may be specified in which case the operation returns the sum of
the outer-product and the extra vector. In this multiply-accumulate
scenario for floating-point arguments, the rounding mode is enforced
by guaranteeing that a fused-multiply add operation is emitted. When
lowered to the LLVMIR dialect, this form emits `llvm.intr.fma`

, which
is guaranteed to lower to actual `fma`

instructions on x86.

An optional kind attribute may be specified to be add/mul/min/max for int/fp, and and/or/xor for int only. The default is “add”, in which case the operation returns a fused multiply-add. In other cases it returns a multiply followed by the appropriate operation (for example, a compare and select for “max”).

Example:

```
%2 = vector.outerproduct %0, %1: vector<4xf32>, vector<8xf32>
return %2: vector<4x8xf32>
%3 = vector.outerproduct %0, %1, %2:
vector<4xf32>, vector<8xf32>, vector<4x8xf32>
return %3: vector<4x8xf32>
%4 = vector.outerproduct %0, %1, %2 {kind = #vector.kind<max>}:
vector<4xf32>, vector<8xf32>, vector<4x8xf32>
return %3: vector<4x8xf32>
%6 = vector.outerproduct %4, %5: vector<10xf32>, f32
return %6: vector<10xf32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`kind` | ::mlir::vector::CombiningKindAttr | Kind of combining function for contractions and reductions |

#### Operands: ¶

Operand | Description |
---|---|

`lhs` | vector of any type values |

`rhs` | any type |

`acc` | vector of any type values |

#### Results: ¶

Result | Description |
---|---|

«unnamed» | vector of any type values |

`vector.print`

(::mlir::vector::PrintOp) ¶

print operation (for testing and debugging)

Syntax:

```
operation ::= `vector.print` $source attr-dict `:` type($source)
```

Prints the source vector (or scalar) to stdout in human readable format (for testing and debugging). No return value.

Example:

```
%0 = arith.constant 0.0 : f32
%1 = vector.broadcast %0 : f32 to vector<4xf32>
vector.print %1 : vector<4xf32>
when lowered to LLVM, the vector print is unrolled into
elementary printing method calls that at runtime will yield
( 0.0, 0.0, 0.0, 0.0 )
on stdout when linked with a small runtime support library,
which only needs to provide a few printing methods (single
value for all data types, opening/closing bracket, comma,
newline).
```

#### Operands: ¶

Operand | Description |
---|---|

`source` | any type |

`vector.reduction`

(::mlir::vector::ReductionOp) ¶

reduction operation

Reduces an 1-D vector “horizontally” into a scalar using the given operation (add/mul/min/max for int/fp and and/or/xor for int only). Some reductions (add/mul for fp) also allow an optional fused accumulator.

Note that these operations are restricted to 1-D vectors to remain close to the corresponding LLVM intrinsics:

http://llvm.org/docs/LangRef.html#vector-reduction-intrinsics

Example:

```
%1 = vector.reduction "add", %0 : vector<16xf32> into f32
%3 = vector.reduction "xor", %2 : vector<4xi32> into i32
%4 = vector.reduction "mul", %0, %1 : vector<16xf32> into f32
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`kind` | ::mlir::StringAttr | string attribute |

#### Operands: ¶

Operand | Description |
---|---|

`vector` | vector of any type values |

`acc` | any type |

#### Results: ¶

Result | Description |
---|---|

`dest` | any type |

`vector.reshape`

(::mlir::vector::ReshapeOp) ¶

vector reshape operation

Syntax:

```
operation ::= `vector.reshape` $vector `,` `[` $input_shape `]` `,` `[` $output_shape `]` `,`
$fixed_vector_sizes attr-dict `:` type($vector) `to` type($result)
```

Reshapes its vector operand from ‘input_shape’ to ‘output_shape’ maintaining fixed vector dimension ‘fixed_vector_sizes’ on the innermost vector dimensions.

The parameters ‘input_shape’ and ‘output_shape’ represent valid data shapes across fixed vector shapes. For example, if a vector has a valid data shape [6] with fixed vector size [8], then the valid data elements are assumed to be stored at the beginning of the vector with the remaining vector elements undefined.

In the examples below, valid data elements are represented by an alphabetic character, and undefined data elements are represented by ‘-’.

Example

vector<1x8xf32> with valid data shape [6], fixed vector sizes [8]

```
input: [a, b, c, d, e, f]
layout map: (d0) -> (d0 floordiv 8, d0 mod 8)
vector layout: [a, b, c, d, e, f, -, -]
```

Example

vector<2x8xf32> with valid data shape [10], fixed vector sizes [8]

```
input: [a, b, c, d, e, f, g, h, i, j]
layout map: (d0) -> (d0 floordiv 8, d0 mod 8)
vector layout: [[a, b, c, d, e, f, g, h],
[i, j, -, -, -, -, -, -]]
```

Example

vector<2x2x2x3xf32> with valid data shape [3, 5], fixed vector sizes [2, 3]

```
input: [[a, b, c, d, e],
[f, g, h, i, j],
[k, l, m, n, o]]
layout map: (d0, d1) -> (d0 floordiv 3, d1 floordiv 5,
d0 mod 3, d1 mod 5)
vector layout: [[[[a, b, c],
[f, g, h]]
[[d, e, -],
[i, j, -]]],
[[[k, l, m],
[-, -, -]]
[[n, o, -],
[-, -, -]]]]
```

Example

%1 = vector.reshape %0, [%c3, %c6], [%c2, %c9], [4] : vector<3x2x4xf32> to vector<2x3x4xf32>

```
input: [[a, b, c, d, e, f],
[g, h, i, j, k, l],
[m, n, o, p, q, r]]
layout map: (d0, d1) -> (d0, d1 floordiv 4, d1 mod 4)
```

Input vector: [[[a, b, c, d], [e, f, -, -]], [[g, h, i, j], [k, l, -, -]], [[m, n, o, p], [q, r, -, -]]]

Output vector: [[[a, b, c, d], [e, f, g, h], [i, -, -, -]], [[j, k, l, m], [n, o, p, q], [r, -, -, -]]]

Traits: AttrSizedOperandSegments

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`fixed_vector_sizes` | ::mlir::ArrayAttr | 64-bit integer array attribute |

#### Operands: ¶

Operand | Description |
---|---|

`vector` | vector of any type values |

`input_shape` | index |

`output_shape` | index |

#### Results: ¶

Result | Description |
---|---|

`result` | vector of any type values |

`vector.scatter`

(::mlir::vector::ScatterOp) ¶

scatters elements from a vector into memory as defined by an index vector and mask

Syntax:

```
operation ::= `vector.scatter` $base `[` $indices `]` `[` $index_vec `]` `,` $mask `,` $valueToStore attr-dict `:` type($base) `,` type($index_vec) `,` type($mask) `,` type($valueToStore)
```

The scatter operation scatters elements from a 1-D vector into memory as defined by a base with indices and an additional 1-D index vector, but only if the corresponding bit in a 1-D mask vector is set. Otherwise, no action is taken for that element. Informally the semantics are:

```
if (mask[0]) base[index[0]] = value[0]
if (mask[1]) base[index[1]] = value[1]
etc.
```

The vector dialect leaves out-of-bounds and repeated index behavior undefined. Underlying implementations may enforce strict sequential semantics for the latter, though. TODO: enforce the latter always?

The scatter operation can be used directly where applicable, or can be used
during progressively lowering to bring other memory operations closer to
hardware ISA support for a scatter. The semantics of the operation closely
correspond to those of the `llvm.masked.scatter`

intrinsic
.

Examples:

```
vector.scatter %base[%c0][%v], %mask, %value
: memref<?xf32>, vector<16xi32>, vector<16xi1>, vector<16xf32>
vector.scatter %base[%i, %j][%v], %mask, %value
: memref<16x16xf32>, vector<16xi32>, vector<16xi1>, vector<16xf32>
```

#### Operands: ¶

Operand | Description |
---|---|

`base` | memref of any type values |

`indices` | index |

`index_vec` | vector of integer or index values of ranks 1 |

`mask` | vector of 1-bit signless integer values of ranks 1 |

`valueToStore` | of ranks 1 |

`vector.shape_cast`

(::mlir::vector::ShapeCastOp) ¶

shape_cast casts between vector shapes

Syntax:

```
operation ::= `vector.shape_cast` $source attr-dict `:` type($source) `to` type($result)
```

The shape_cast operation casts between an n-D source vector shape and a k-D result vector shape (the element type remains the same).

If reducing rank (n > k), result dimension sizes must be a product of contiguous source dimension sizes. If expanding rank (n < k), source dimensions must factor into a contiguous sequence of destination dimension sizes. Each source dim is expanded (or contiguous sequence of source dims combined) in source dimension list order (i.e. 0 <= i < n), to produce a contiguous sequence of result dims (or a single result dim), in result dimension list order (i.e. 0 <= j < k). The product of all source dimension sizes and all result dimension sizes must match.

It is currently assumed that this operation does not require moving data, and that it will be folded away before lowering vector operations.

There is an exception to the folding expectation when targeting llvm.intr.matrix operations. We need a type conversion back and forth from a 2-D MLIR vector to a 1-D flattened LLVM vector.shape_cast lowering to LLVM is supported in that particular case, for now.

Example:

```
// Example casting to a lower vector rank.
%1 = vector.shape_cast %0 : vector<5x1x4x3xf32> to vector<20x3xf32>
// Example casting to a higher vector rank.
%3 = vector.shape_cast %2 : vector<10x12x8xf32> to vector<5x2x3x4x8xf32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Operands: ¶

Operand | Description |
---|---|

`source` | vector of any type values |

#### Results: ¶

Result | Description |
---|---|

`result` | vector of any type values |

`vector.shuffle`

(::mlir::vector::ShuffleOp) ¶

shuffle operation

The shuffle operation constructs a permutation (or duplication) of elements from two input vectors, returning a vector with the same element type as the input and a length that is the same as the shuffle mask. The two input vectors must have the same element type, rank, and trailing dimension sizes and shuffles their values in the leading dimension (which may differ in size) according to the given mask. The legality rules are:

- the two operands must have the same element type as the result
- the two operands and the result must have the same rank and trailing dimension sizes, viz. given two k-D operands v1 : <s_1 x s_2 x .. x s_k x type> and v2 : <t_1 x t_2 x .. x t_k x type> we have s_i = t_i for all 1 < i <= k
- the mask length equals the leading dimension size of the result
- numbering the input vector indices left to right across the operands, all mask values must be within range, viz. given two k-D operands v1 and v2 above, all mask values are in the range [0,s_1+t_1)

Example:

```
%0 = vector.shuffle %a, %b[0, 3]
: vector<2xf32>, vector<2xf32> ; yields vector<2xf32>
%1 = vector.shuffle %c, %b[0, 1, 2]
: vector<2x16xf32>, vector<1x16xf32> ; yields vector<3x16xf32>
%2 = vector.shuffle %a, %b[3, 2, 1, 0]
: vector<2xf32>, vector<2xf32> ; yields vector<4xf32>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`mask` | ::mlir::ArrayAttr | 64-bit integer array attribute |

#### Operands: ¶

Operand | Description |
---|---|

`v1` | vector of any type values |

`v2` | vector of any type values |

#### Results: ¶

Result | Description |
---|---|

`vector` | vector of any type values |

`vector.store`

(::mlir::vector::StoreOp) ¶

writes an n-D vector to an n-D slice of memory

Syntax:

```
operation ::= `vector.store` $valueToStore `,` $base `[` $indices `]` attr-dict `:` type($base) `,` type($valueToStore)
```

The ‘vector.store’ operation writes an n-D vector to an n-D slice of memory. It takes the vector value to be stored, a ‘base’ memref and an index for each memref dimension. The ‘base’ memref and indices determine the start memory address from which to write. Each index provides an offset for each memref dimension based on the element type of the memref. The shape of the vector value to store determines the shape of the slice written from the start memory address. The elements along each dimension of the slice are strided by the memref strides. Only unit strides are allowed along the most minor memref dimension. These constraints guarantee that elements written along the first dimension of the slice are contiguous in memory.

The memref element type can be a scalar or a vector type. If the memref element type is a scalar, it should match the element type of the value to store. If the memref element type is vector, it should match the type of the value to store.

Example 1: 1-D vector store on a scalar memref.

```
vector.store %valueToStore, %memref[%i, %j] : memref<200x100xf32>, vector<8xf32>
```

Example 2: 1-D vector store on a vector memref.

```
vector.store %valueToStore, %memref[%i, %j] : memref<200x100xvector<8xf32>>, vector<8xf32>
```

Example 3: 2-D vector store on a scalar memref.

```
vector.store %valueToStore, %memref[%i, %j] : memref<200x100xf32>, vector<4x8xf32>
```

Example 4: 2-D vector store on a vector memref.

```
vector.store %valueToStore, %memref[%i, %j] : memref<200x100xvector<4x8xf32>>, vector<4x8xf32>
```

Representation-wise, the ‘vector.store’ operation permits out-of-bounds writes. Support and implementation of out-of-bounds vector stores are target-specific. No assumptions should be made on the memory written out of bounds. Not all targets may support out-of-bounds vector stores.

Example 5: Potential out-of-bounds vector store.

```
vector.store %valueToStore, %memref[%index] : memref<?xf32>, vector<8xf32>
```

Example 6: Explicit out-of-bounds vector store.

```
vector.store %valueToStore, %memref[%c0] : memref<7xf32>, vector<8xf32>
```

#### Operands: ¶

Operand | Description |
---|---|

`valueToStore` | vector of any type values |

`base` | memref of any type values |

`indices` | index |

`vector.transfer_read`

(::mlir::vector::TransferReadOp) ¶

Reads a supervector from memory into an SSA vector value.

The `vector.transfer_read`

op performs a read from a slice within a
MemRef
or a Ranked
Tensor
supplied as its first operand into a
vector
of the same base elemental type.

A memref/tensor operand with vector element type, must have its vector element type match a suffix (shape and element type) of the vector (e.g. memref<3x2x6x4x3xf32>, vector<1x1x4x3xf32>).

The slice is further defined by a full-rank index within the MemRef/Tensor,
supplied as the operands `2 .. 1 + rank(memref/tensor)`

.

The permutation_map attribute is an affine-map which specifies the transposition on the slice to match the vector shape. The permutation map may be implicit and omitted from parsing and printing if it is the canonical minor identity map (i.e. if it does not permute or broadcast any dimension).

The size of the slice is specified by the size of the vector, given as the return type.

An SSA value `padding`

of the same elemental type as the MemRef/Tensor is
provided to specify a fallback value in the case of out-of-bounds accesses
and/or masking.

An optional SSA value `mask`

of the same shape as the vector type may be
specified to mask out elements. Such elements will be replaces with
`padding`

. Elements whose corresponding mask element is `0`

are masked out.

An optional boolean array attribute is provided to specify which dimensions
of the transfer are guaranteed to be within bounds. The length of the array
must equal the rank of the vector type. Broadcast dimensions must always be
in-bounds. The absence of this optional `in_bounds`

attribute signifies that
any dimension of the transfer (except for broadcasts) may be out-of-bounds.
A `vector.transfer_read`

can be lowered to a simple load if all dimensions
are specified to be within bounds and no `mask`

was specified.

Note that `in_bounds`

is specified for result dimensions and not input
dimensions. The starting point of the transfer, i.e.,
`%A[%expr1, %expr2, %expr3, %expr4]`

in the example below, is expected to
be in-bounds and as indices are increasing, accesses may run out-of-bounds.

This operation is called ‘read’ by opposition to ‘load’ because the
super-vector granularity is generally not representable with a single
hardware register. A `vector.transfer_read`

is thus a mid-level abstraction
that supports super-vectorization with non-effecting padding for full-tile
only operations.

More precisely, let’s dive deeper into the permutation_map for the following MLIR:

```
vector.transfer_read %A[%expr1, %expr2, %expr3, %expr4]
{ permutation_map : (d0,d1,d2,d3) -> (d2,0,d0) } :
memref<?x?x?x?xf32>, vector<3x4x5xf32>
```

This operation always reads a slice starting at `%A[%expr1, %expr2, %expr3, %expr4]`

. The size of the slice is 3 along d2 and 5 along d0, so the slice
is: `%A[%expr1 : %expr1 + 5, %expr2, %expr3:%expr3 + 3, %expr4]`

That slice needs to be read into a `vector<3x4x5xf32>`

. Since the
permutation map is not full rank, there must be a broadcast along vector
dimension `1`

.

A notional lowering of vector.transfer_read could generate code resembling:

```
// %expr1, %expr2, %expr3, %expr4 defined before this point
%tmp = alloc() : vector<3x4x5xf32>
%view_in_tmp = "element_type_cast"(%tmp) : memref<1xvector<3x4x5xf32>>
for %i = 0 to 3 {
affine.for %j = 0 to 4 {
affine.for %k = 0 to 5 {
%a = load %A[%expr1 + %k, %expr2, %expr3 + %i, %expr4] :
memref<?x?x?x?xf32>
store %tmp[%i, %j, %k] : vector<3x4x5xf32>
}}}
%c0 = arith.constant 0 : index
%vec = load %view_in_tmp[%c0] : vector<3x4x5xf32>
```

On a GPU one could then map `i`

, `j`

, `k`

to blocks and threads. Notice that
the temporary storage footprint is `3 * 5`

values but `3 * 4 * 5`

values are
actually transferred between `%A`

and `%tmp`

.

Alternatively, if a notional vector broadcast operation were available, the lowered code would resemble:

```
// %expr1, %expr2, %expr3, %expr4 defined before this point
%tmp = alloc() : vector<3x4x5xf32>
%view_in_tmp = "element_type_cast"(%tmp) : memref<1xvector<3x4x5xf32>>
for %i = 0 to 3 {
affine.for %k = 0 to 5 {
%a = load %A[%expr1 + %k, %expr2, %expr3 + %i, %expr4] :
memref<?x?x?x?xf32>
store %tmp[%i, 0, %k] : vector<3x4x5xf32>
}}
%c0 = arith.constant 0 : index
%tmpvec = load %view_in_tmp[%c0] : vector<3x4x5xf32>
%vec = broadcast %tmpvec, 1 : vector<3x4x5xf32>
```

where `broadcast`

broadcasts from element 0 to all others along the
specified dimension. This time, the temporary storage footprint is `3 * 5`

values which is the same amount of data as the `3 * 5`

values transferred.
An additional `1`

broadcast is required. On a GPU this broadcast could be
implemented using a warp-shuffle if loop `j`

were mapped to `threadIdx.x`

.

Syntax

```
operation ::= ssa-id `=` `vector.transfer_read` ssa-use-list
`{` attribute-entry `} :` memref-type `,` vector-type
```

Example:

```
// Read the slice `%A[%i0, %i1:%i1+256, %i2:%i2+32]` into vector<32x256xf32>
// and pad with %f0 to handle the boundary case:
%f0 = arith.constant 0.0f : f32
for %i0 = 0 to %0 {
affine.for %i1 = 0 to %1 step 256 {
affine.for %i2 = 0 to %2 step 32 {
%v = vector.transfer_read %A[%i0, %i1, %i2], (%f0)
{permutation_map: (d0, d1, d2) -> (d2, d1)} :
memref<?x?x?xf32>, vector<32x256xf32>
}}}
// Read the slice `%A[%i0, %i1]` (i.e. the element `%A[%i0, %i1]`) into
// vector<128xf32>. The underlying implementation will require a 1-D vector
// broadcast:
for %i0 = 0 to %0 {
affine.for %i1 = 0 to %1 {
%3 = vector.transfer_read %A[%i0, %i1]
{permutation_map: (d0, d1) -> (0)} :
memref<?x?xf32>, vector<128xf32>
}
}
// Read from a memref with vector element type.
%4 = vector.transfer_read %arg1[%c3, %c3], %vf0
{permutation_map = (d0, d1)->(d0, d1)}
: memref<?x?xvector<4x3xf32>>, vector<1x1x4x3xf32>
// Read from a tensor with vector element type.
%4 = vector.transfer_read %arg1[%c3, %c3], %vf0
{permutation_map = (d0, d1)->(d0, d1)}
: tensor<?x?xvector<4x3xf32>>, vector<1x1x4x3xf32>
// Special encoding for 0-d transfer with 0-d tensor/memref, vector shape
// {1} and permutation_map () -> (0).
%0 = vector.transfer_read %arg0[], %f0 {permutation_map = affine_map<()->(0)>} :
tensor<f32>, vector<1xf32>
```

Traits: AttrSizedOperandSegments

Interfaces: MemoryEffectOpInterface, VectorTransferOpInterface, VectorUnrollOpInterface

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`permutation_map` | ::mlir::AffineMapAttr | AffineMap attribute |

`in_bounds` | ::mlir::ArrayAttr | 1-bit boolean array attribute |

#### Operands: ¶

Operand | Description |
---|---|

`source` | shaped of any type values |

`indices` | index |

`padding` | any type |

`mask` | vector of 1-bit signless integer values |

#### Results: ¶

Result | Description |
---|---|

`vector` | vector of any type values |

`vector.transfer_write`

(::mlir::vector::TransferWriteOp) ¶

The vector.transfer_write op writes a supervector to memory.

The `vector.transfer_write`

op performs a write from a
vector
, supplied as its first operand, into a
slice within a
MemRef
or a Ranked
Tensor
of the same base elemental type,
supplied as its second operand.

A vector memref/tensor operand must have its vector element type match a suffix (shape and element type) of the vector (e.g. memref<3x2x6x4x3xf32>, vector<1x1x4x3xf32>). If the operand is a tensor, the operation returns a new tensor of the same type.

The slice is further defined by a full-rank index within the MemRef/Tensor,
supplied as the operands `3 .. 2 + rank(memref/tensor)`

.

The permutation_map
attribute
is an
affine-map
which specifies the transposition on the
slice to match the vector shape. The permutation map may be implicit and
omitted from parsing and printing if it is the canonical minor identity map
(i.e. if it does not permute any dimension). In contrast to `transfer_read`

,
write ops cannot have broadcast dimensions.

The size of the slice is specified by the size of the vector.

An optional SSA value `mask`

of the same shape as the vector type may be
specified to mask out elements. Elements whose corresponding mask element
is `0`

are masked out.

An optional boolean array attribute is provided to specify which dimensions
of the transfer are guaranteed to be within bounds. The absence of this
`in_bounds`

attribute signifies that any dimension of the transfer may be
out-of-bounds. A `vector.transfer_write`

can be lowered to a simple store
if all dimensions are specified to be within bounds and no `mask`

was
specified.

An optional boolean array attribute is provided to specify which dimensions
of the transfer are guaranteed to be within bounds. The length of the array
must equal the rank of the vector type. The absence of this optional
`in_bounds`

attribute signifies that any dimension of the transfer
may be out-of-bounds. A `vector.transfer_write`

can be lowered to a simple
store if all dimensions are specified to be within bounds and no `mask`

was
specified.

Note that `in_bounds`

is specified for result dimensions and not input
dimensions. The starting point of the transfer, i.e.,
`%A[%expr1, %expr2, %expr3, %expr4]`

in the example below, is expected to
be in-bounds and as indices are increasing, accesses may run out-of-bounds.

This operation is called ‘write’ by opposition to ‘store’ because the
super-vector granularity is generally not representable with a single
hardware register. A `vector.transfer_write`

is thus a
mid-level abstraction that supports super-vectorization with non-effecting
padding for full-tile-only code. It is the responsibility of
`vector.transfer_write`

’s implementation to ensure the memory writes are
valid. Different lowerings may be pertinent depending on the hardware
support.

Example:

```
// write vector<16x32x64xf32> into the slice
// `%A[%i0, %i1:%i1+32, %i2:%i2+64, %i3:%i3+16]`:
for %i0 = 0 to %0 {
affine.for %i1 = 0 to %1 step 32 {
affine.for %i2 = 0 to %2 step 64 {
affine.for %i3 = 0 to %3 step 16 {
%val = `ssa-value` : vector<16x32x64xf32>
vector.transfer_write %val, %A[%i0, %i1, %i2, %i3]
{permutation_map: (d0, d1, d2, d3) -> (d3, d1, d2)} :
vector<16x32x64xf32>, memref<?x?x?x?xf32>
}}}}
// write to a memref with vector element type.
vector.transfer_write %4, %arg1[%c3, %c3]
{permutation_map = (d0, d1)->(d0, d1)}
: vector<1x1x4x3xf32>, memref<?x?xvector<4x3xf32>>
// return a tensor where the vector is inserted into the source tensor.
%5 = vector.transfer_write %4, %arg1[%c3, %c3]
{permutation_map = (d0, d1)->(d0, d1)}
: vector<1x1x4x3xf32>, tensor<?x?xvector<4x3xf32>>
// Special encoding for 0-d transfer with 0-d tensor/memref, vector shape
// {1} and permutation_map () -> (0).
%1 = vector.transfer_write %0, %arg0[] {permutation_map = affine_map<()->(0)>} :
vector<1xf32>, tensor<f32>
```

Traits: AttrSizedOperandSegments

Interfaces: MemoryEffectOpInterface, VectorTransferOpInterface, VectorUnrollOpInterface

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`permutation_map` | ::mlir::AffineMapAttr | AffineMap attribute |

`in_bounds` | ::mlir::ArrayAttr | 1-bit boolean array attribute |

#### Operands: ¶

Operand | Description |
---|---|

`vector` | vector of any type values |

`source` | shaped of any type values |

`indices` | index |

`mask` | vector of 1-bit signless integer values |

#### Results: ¶

Result | Description |
---|---|

`result` | ranked tensor of any type values |

`vector.transpose`

(::mlir::vector::TransposeOp) ¶

vector transpose operation

Syntax:

```
operation ::= `vector.transpose` $vector `,` $transp attr-dict `:` type($vector) `to` type($result)
```

Takes a n-D vector and returns the transposed n-D vector defined by the permutation of ranks in the n-sized integer array attribute. In the operation

```
%1 = vector.transpose %0, [i_1, .., i_n]
: vector<d_1 x .. x d_n x f32>
to vector<d_trans[0] x .. x d_trans[n-1] x f32>
```

the transp array [i_1, .., i_n] must be a permutation of [0, .., n-1].

Example:

```
%1 = vector.transpose %0, [1, 0] : vector<2x3xf32> to vector<3x2xf32>
[ [a, b, c], [ [a, d],
[d, e, f] ] -> [b, e],
[c, f] ]
```

Interfaces: NoSideEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

#### Attributes: ¶

Attribute | MLIR Type | Description |
---|---|---|

`transp` | ::mlir::ArrayAttr | 64-bit integer array attribute |

#### Operands: ¶

Operand | Description |
---|---|

`vector` | vector of any type values |

#### Results: ¶

Result | Description |
---|---|

`result` | vector of any type values |

`vector.type_cast`

(::mlir::vector::TypeCastOp) ¶

type_cast op converts a scalar memref to a vector memref

Syntax:

```
operation ::= `vector.type_cast` $memref attr-dict `:` type($memref) `to` type($result)
```

Performs a conversion from a memref with scalar element to a memref with a
*single* vector element, copying the shape of the memref to the vector. This
is the minimal viable operation that is required to makeke
super-vectorization operational. It can be seen as a special case of the
`view`

operation but scoped in the super-vectorization context.

Syntax:

```
operation ::= `vector.type_cast` ssa-use : memref-type to memref-type
```

Example:

```
%A = alloc() : memref<5x4x3xf32>
%VA = vector.type_cast %A : memref<5x4x3xf32> to memref<vector<5x4x3xf32>>
```

Interfaces: NoSideEffect (MemoryEffectOpInterface), ViewLikeOpInterface

Effects: MemoryEffects::Effect{}

#### Operands: ¶

Operand | Description |
---|---|

`memref` | statically shaped memref of any type values |

#### Results: ¶

Result | Description |
---|---|

`result` | memref of any type values |