Non-Confidential | PDF version | 100891_0609_00_en | ||
| ||||
Home > Coding Considerations > Using SVE intrinsics directly in your C code |
Intrinsics are C or C++ pseudo-function calls that the compiler replaces with the appropriate SIMD instructions. This lets you use the data types and operations available in the SIMD implementation, while allowing the compiler to handle instruction scheduling and register allocation.
These intrinsics are defined in the Arm^{®} C Language Extensions for SVE specification.
The Arm^{®} C language extensions for SVE provide a set of types and accessors for SVE vectors and predicates, and a function interface for all relevant SVE instructions.
The function interface is more general than the underlying architecture, so not every function maps directly to an architectural instruction. The intention is to provide a regular interface and leave the compiler to pick the best mapping to SVE instructions.
The Arm^{®} C Language Extensions for SVE specification has a detailed description of this interface, and must be used as the primary reference. This section introduces a selection of features to help you get started with the Arm C Language Extensions (ACLE) for SVE.
Translation units that use the ACLE must first include arm_sve.h, guarded by __ARM_FEATURE_SVE
:
#ifdef __ARM_FEATURE_SVE #include <arm_sve.h> #endif /* __ARM_FEATURE_SVE */
All functions and types that are defined in the header file have the prefix
sv
, to reduce the chance of collisions with
other extensions.
arm_sve.h
defines the following C types to represent values
in SVE vector registers. Each type describes the type of the elements within the
vector:
svint8_t svuint8_t
svint16_t svuint16_t svfloat16_t
svint32_t svuint32_t svfloat32_t
svint64_t svuint64_t svfloat64_t
For example, svint64_t represents a vector of 64-bit signed integers, and svfloat16_t represents a vector of half-precision floating-point numbers.
The extension also defines a single sizeless predicate type svbool_t
, which has enough bits to control an operation
on a vector of bytes.
The main use of predicates is to select elements in a vector. When the elements in the vector have N bytes, only the low bit in each sequence of N predicate bits is significant, as shown in the following table:
Table 3-2 Element selection by predicate type svbool_t
Vector type | Element selected by each svbool_t bit | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
svint8_t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... |
svint16_t | 0 | 1 | 2 | 3 | 4 | ... | ||||
svint32_t | 0 | 1 | 2 | ... | ||||||
svint64_t | 0 | 1 | ... |
SVE is a vector-length agnostic architecture, allowing an implementation to choose a vector length of any multiple of 128 bits, up to a maximum of 2048 bits. Therefore, the size of SVE ACLE types are unknown at compile time, which limits how these types can be used.
Common situations where SVE types might be used include:
(type) {value}
compound literalBecause of their unknown size at compile time, SVE types must not be used:
sizeof
and _Alignof
+
, -
, ++
, and --
operators)std::vector
.For a comprehensive list of valid usage, refer to the Arm^{®} C Language Extensions for SVE specification.
SVE ACLE functions have the form:
svbase[_disambiguator][_type0][_type1]...[_predication]
where the function is built using the following:
base
For most functions, this is the lowercase name of the SVE instruction. Sometimes, letters indicating the type or size of data being operated on are dropped, where it can be implied from the argument types.
Unsigned extending loads add a u
to indicate that the data will be zero
extended, to more explicitly differentiate them from their signed
equivalent.
disambiguator
This field distinguishes between different forms of a function, for example:
type0 type1 ...
A list of types for vectors and predicates, starting with the return type
then with each argument type. For example, _s8
, _u32
, and _f32
, which represent signed 8-bit
integer, an unsigned 32-bit integer and single precision 32-bit float
types, respectively.
Predicate types are represented by _b8
,
_b16
and so on, for predicates
suitable for 8-bit and 16-bit types respectively. A predicate type
suitable for all element types is represented by _b
. Where a type is not needed to
disambiguate between variants of a base function, it is omitted.
predication
This suffix describes the inactive elements in the result of a predicated operation. It can be one of the following:
Load, store, prefetch, and ADR functions have arguments that describe the memory area being addressed. The first addressing argument is the base – either a single pointer to an element type, or a 32-bit or 64-bit vector of addresses. The second argument, when present, offsets the base (or bases) by some number of bytes, elements, or vectors. This offset argument can be an immediate constant value, a scalar argument, or a vector of offsets.
Not every combination of the above options exists. The following table gives examples of some common addressing mode disambiguators, and describes how to interpret the address arguments:
Table 3-3 Common addressing mode disambiguators
Disambiguator | Interpretation |
---|---|
_u32base | The base argument is a vector of unsigned 32-bit addresses. |
_u64base | The base argument is a vector of unsigned 64-bit addresses. |
_s32offset _s64offset _u32offset _u64offset |
The offset argument is a vector of byte offsets. These offsets are signed or unsigned 32-bit or 64-bit numbers. |
_s32index _s64index _u32index _u64index |
The offset argument is a vector of element-sized indices. These indices are signed or unsigned 32-bit or 64-bit numbers. |
_offset | The offset argument is a scalar, and must be treated as a byte offset. |
_index | The offset argument is a scalar, and must be treated as an index into an array of elements. |
_vnum | The offset argument is a scalar, and must be treated an index into an array of SVE vectors. |
In the following example, the address of element i
is &base[indices[i]]
.
svuint32_t svld1_gather_[s32]index[_u32] (svbool_t pg, const uint32_t *base, svint32_t indices)
All arithmetic functions that take two vector inputs have an alternative form that takes a vector and a scalar. Conceptually, this scalar is duplicated across a vector, and that vector is used as the second vector argument.
Similarly, arithmetic functions that take three vector inputs have an alternative form that takes two vectors and one scalar.
To differentiate these forms, the disambiguator _n
is added to the form that takes a scalar.
Sometimes, it is possible to omit part of the full name, and still uniquely identify the correct form of a function, by inspecting the argument types. Where this is possible, these simplified forms are provided as aliases to their fully named equivalents, and will be used for preference in the rest of this document.
In the Arm^{®} C Language Extensions for SVE
specification, the portion that can be removed is enclosed in square brackets. For
example svclz[_s16]_m
has the full name svclz_s16_m
, and an overloaded alias, svclz_m
.
daxpy is a BLAS (Basic Linear Algebra Subroutines) subroutine that operates on two arrays of double precision floating-point numbers. A slice is taken of each of these arrays. For each element in these slices, an element (x) in the first array is multiplied by a constant (a), then added to the element (y) from the second array. The result is stored back to the second array at the same index.
This example presents a step-1 daxpy implementation, where the indices of x and y start at 0 and increment by 1 each iteration. A C code implementation might look like this:
void daxpy_1_1(int64_t n, double da, double *dx, double *dy) { for (int64_t i = 0; i < n; ++i) { dy[i] = dx[i] * da + dy[i]; } }
Here is an ACLE equivalent:
void daxpy_1_1(int64_t n, double da, double *dx, double *dy) { int64_t i = 0; svbool_t pg = svwhilelt_b64(i, n); // [1] do { svfloat64_t dx_vec = svld1(pg, &dx[i]); // [2] svfloat64_t dy_vec = svld1(pg, &dy[i]); // [2] svst1(pg, &dy[i], svmla_x(pg, dy_vec, dx_vec, da)); // [3] i += svcntd(); // [4] pg = svwhilelt_b64(i, n); // [1] } while (svptest_any(svptrue_b64(), pg)); // [5] }
The following steps explain this example:
[1] - Initialize a predicate register to control the loop. _b64
specifies a predicate for 64-bit elements.
Conceptually, this operation creates an integer vector starting at i
and incrementing by 1 in each subsequent lane. The
predicate lane is active if this value is less than n
. Therefore, this loop is safe, if inefficient, even if n
≤ 0. The same operation is used at the bottom of the
loop, to update the predicate for the next iteration.
[2] - Load some values into an SVE vector, guarded by the loop predicate. Lanes where this predicate is false do not perform any load (and so will not generate a fault), and set the result value to 0.0. The number of lanes that are loaded depends on the vector width, which is only known at runtime.
[3] - Perform a floating-point multiply-add operation, and pass the result to
a store. The _x
on the MLA indicates we don’t care
about the result for inactive lanes. This gives the compiler maximum flexibility in
choosing the most efficient instruction. The result of this operation is stored at
address &dy[i]
, guarded by the loop predicate. Lanes where the predicate is
false are not stored, and so the value in memory will retain its prior value.
[4] - Increment i
by the number of double-precision lanes in the vector.
[5] - ptest
returns true if any lane of the (newly updated) predicate is
active, which causes control to return to the start of the while loop if there is
any work left to do.
“Ideal” assembler output:
daxpy_1_1: MOV Z2.D, D0 // da MOV X3, #0 // i WHILELT P0.D, X3, X0 // i, n loop: LD1D Z1.D, P0/Z, [X1, X3, LSL #3] LD1D Z0.D, P0/Z, [X2, X3, LSL #3] FMLA Z0.D, P0/M, Z1.D, Z2.D ST1D Z0.D, P0, [X2, X3, LSL #3] INCD X3 // i WHILELT P0.D, X3, X0 // i, n B.ANY loop RET
This example presents a general daxpy implementation, where the indices of x and y start at 0 and are then incremented by unknown (but loop-invariant) strides each iteration.
void daxpy(int64_t n, double da, double *dx, int64_t incx, double *dy, int64_t incy) { svint64_t incx_vec = svindex_s64(0, incx); // [1] svint64_t incy_vec = svindex_s64(0, incy); // [1] int64_t i = 0; svbool_t pg = svwhilelt_b64(i, n); // [2] do { svfloat64_t dx_vec = svld1_gather_index(pg, dx, incx_vec); // [3] svfloat64_t dy_vec = svld1_gather_index(pg, dy, incy_vec); // [3] svst1_scatter_index(pg, dy, incy_vec, svmla_x(pg, dy_vec, dx_vec, da)); // [4] dx += incx * svcntd(); // [5] dy += incy * svcntd(); // [5] i += svcntd(); // [6] pg = svwhilelt_b64(i, n); // [2] } while (svptest_any(svptrue_b64(), pg)); // [7] }
The following steps explain this example:
[1] - For each of x
and y
, initialize a vector of indices, starting at 0 for
the first lane and incrementing by incx
and
incy
respectively in each subsequent lane.
[2] - Initialize or update the loop predicate.
[3] - Load a vector’s worth of values, guarded by the loop predicate. Lanes where this predicate is false do not perform any load (and so will not generate a fault), and set the result value to 0.0. This time, a base + vector-of-indices gather load, is used to load the required non-consecutive values.
[4] - Perform a floating-point multiply-add operation, and pass the result to
a store. This time, the base + vector-of-indices scatter store is used to store each
result in the correct index of the dy[]
array.
[5] - Instead of using i
to calculate the
load address, increment the base pointer, by multiplying the vector length by the
stride.
[6] - Increment i
by the number of double-precision lanes in the vector.
[7] - Test the loop predicate to work out whether there is any more work to do, and loop back if appropriate.