3.5 Using SVE intrinsics directly in your C code

Intrinsics are C or C++ pseudo-function calls that the compiler replaces with the appropriate SIMD instructions. This lets you use the data types and operations available in the SIMD implementation, while allowing the compiler to handle instruction scheduling and register allocation.

These intrinsics are defined in the ARM C Language Extensions for SVE specification.

Introduction

The ARM C language extensions for SVE provide a set of types and accessors for SVE vectors and predicates, and a function interface for all relevant SVE instructions.

The function interface is more general than the underlying architecture, so not every function maps directly to an architectural instruction. The intention is to provide a regular interface and leave the compiler to pick the best mapping to SVE instructions.

The ARM C Language Extensions for SVE specification has a detailed description of this interface, and should be used as the primary reference. This section introduces a selection of features to help you get started with the ARM C Language Extensions (ACLE) for SVE.

Note:

Please be aware of issue 1758, described in 5.2 Known limitations in SVE support.

Header file inclusion

Translation units that use the ACLE should first include arm_sve.h, guarded by __ARM_FEATURE_SVE:

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif /* __ARM_FEATURE_SVE */ 

All functions and types that are defined in the header file have the prefix sv, to reduce the chance of collisions with other extensions.
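As a minimal sketch of the guard in practice, the following helper compiles with or without SVE support. The name sve_vector_bytes is hypothetical and is not part of the ACLE; svcntb() is the ACLE function that returns the number of 8-bit elements in a vector.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif /* __ARM_FEATURE_SVE */

/* Hypothetical helper: the SVE vector length in bytes, or 0 when this
   translation unit is compiled without SVE support. */
uint64_t sve_vector_bytes(void)
{
#ifdef __ARM_FEATURE_SVE
    uint64_t bytes = svcntb();   /* number of 8-bit elements per vector */
    assert(bytes % 16 == 0);     /* vector lengths are multiples of 128 bits */
    return bytes;
#else
    return 0;
#endif /* __ARM_FEATURE_SVE */
}
```

Code outside the guarded regions stays portable, so the same source file can be built for targets that do not implement SVE.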

SVE vector types

arm_sve.h defines the following C types to represent values in SVE vector registers. Each type describes the type of the elements within the vector:

svint8_t svuint8_t

svint16_t svuint16_t svfloat16_t

svint32_t svuint32_t svfloat32_t

svint64_t svuint64_t svfloat64_t

For example, svint64_t represents a vector of 64-bit signed integers, and svfloat16_t represents a vector of half-precision floating-point numbers.

SVE predicate type

The extension also defines a single sizeless predicate type svbool_t, which has enough bits to control an operation on a vector of bytes.

The main use of predicates is to select elements in a vector. When the elements in the vector have N bytes, only the low bit in each sequence of N predicate bits is significant, as shown in the following table:

Table 3-2 Element selection by predicate type svbool_t

Vector type Element selected by each svbool_t bit
svint8_t 0 1 2 3 4 5 6 7 8 ...
svint16_t 0   1   2   3   4 ...
svint32_t 0       1       2 ...
svint64_t 0               1 ...
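The table can be checked with a short sketch that counts how many elements an all-true byte predicate selects at each element size. The helper name lanes_selected is hypothetical. The non-SVE branch is only a model that assumes the minimum 128-bit vector length, so that the example also builds on other targets.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif /* __ARM_FEATURE_SVE */

/* Hypothetical helper: number of elements of elem_bytes bytes (1, 2, 4
   or 8) that an all-true byte predicate selects. */
uint64_t lanes_selected(unsigned elem_bytes)
{
    assert(elem_bytes == 1 || elem_bytes == 2 ||
           elem_bytes == 4 || elem_bytes == 8);
#ifdef __ARM_FEATURE_SVE
    svbool_t all = svptrue_b8();   /* every predicate bit set */
    switch (elem_bytes) {
    case 1:  return svcntp_b8(svptrue_b8(), all);
    case 2:  return svcntp_b16(svptrue_b16(), all);
    case 4:  return svcntp_b32(svptrue_b32(), all);
    default: return svcntp_b64(svptrue_b64(), all);
    }
#else
    /* Model only: assume a 128-bit vector, so 16 predicate bits, and
       count the low bit of each group of elem_bytes bits. */
    uint64_t count = 0;
    for (unsigned bit = 0; bit < 16; bit += elem_bytes)
        ++count;
    return count;
#endif /* __ARM_FEATURE_SVE */
}
```

Whatever the vector length, lanes_selected(1) is eight times lanes_selected(8), which matches the spacing of the significant bits in the table.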

Limitations on how SVE ACLE types can be used

SVE is a vector-length agnostic architecture, allowing an implementation to choose a vector length of any multiple of 128 bits, up to a maximum of 2048 bits. Therefore, the size of an SVE ACLE type is unknown at compile time, which limits how these types can be used.

Common situations where SVE types may be used include:

  • as the type of an object with automatic storage duration
  • as a function parameter or return type
  • as the type in a (type) {value} compound literal
  • as the target of a pointer or reference type
  • as a template type argument.

Due to their unknown size at compile time, SVE types may not be used:

  • to declare or define a static or thread-local storage variable
  • as the type of an array element
  • as the operand to a new expression
  • as the type of object deleted by a delete expression
  • as the argument to sizeof and _Alignof
  • with pointer arithmetic on pointers to SVE objects (this affects the +, -, ++, and -- operators)
  • as members of unions, structures and classes
  • in standard library containers like std::vector.

For a comprehensive list of valid usage, refer to the ARM C Language Extensions for SVE specification.
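The following sketch shows a few of the permitted uses listed above: a parameter and return type, a pointer target, and an object with automatic storage duration. The function names are hypothetical, and the scalar branch exists only so that the example builds without SVE. Some of the disallowed uses are shown in a comment rather than as code, because they do not compile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>

/* Allowed: SVE types as function parameter and return types. */
static svfloat32_t twice(svbool_t pg, svfloat32_t v)
{
    return svmul_x(pg, v, 2.0f);
}

/* Allowed: an SVE type as the target of a pointer. */
static void twice_in_place(svbool_t pg, svfloat32_t *v)
{
    *v = twice(pg, *v);
}

/* Not allowed, so shown only as a comment:
     static svfloat32_t cached;            static storage
     svfloat32_t pair[2];                  array element
     struct s { svfloat32_t v; };          structure member
     size_t size = sizeof(svfloat32_t);    sizeof                */
#endif /* __ARM_FEATURE_SVE */

/* Hypothetical example function: doubles n floats in place. */
void double_floats(float *data, int64_t n)
{
    assert(n <= 0 || data != NULL);
#ifdef __ARM_FEATURE_SVE
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svfloat32_t v = svld1(pg, &data[i]);   /* automatic storage */
        twice_in_place(pg, &v);
        svst1(pg, &data[i], v);
    }
#else
    for (int64_t i = 0; i < n; ++i)
        data[i] *= 2.0f;
#endif /* __ARM_FEATURE_SVE */
}
```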

Writing SVE ACLE functions

SVE ACLE functions have the form:

svbase[_disambiguator][_type0][_type1]...[_predication]

where the function is built using the following:

base

For most functions, this is the lowercase name of the SVE instruction. Sometimes, letters indicating the type or size of the data being operated on are dropped, where they can be inferred from the argument types.

Unsigned extending loads add a u to indicate that the data will be zero extended, to more explicitly differentiate them from their signed equivalent.

disambiguator

This field distinguishes between different forms of a function, for example:

  • To distinguish between addressing modes
  • To distinguish forms that take a scalar rather than a vector as the final argument.

type0 type1 ...

A list of types for vectors and predicates, starting with the return type and followed by each argument type. For example, _s8, _u32, and _f32 represent signed 8-bit integer, unsigned 32-bit integer, and single-precision 32-bit floating-point types, respectively.

Predicate types are represented by _b8, _b16 and so on, for predicates suitable for 8-bit and 16-bit types respectively. A predicate type suitable for all element types is represented by _b. Where a type is not needed to disambiguate between variants of a base function, it is omitted.

predication

This suffix describes the inactive elements in the result of a predicated operation. It can be one of the following:

  • z – Zero predication: Set all inactive elements of the result to zero.
  • m – Merge predication: Copy all inactive elements from the first vector argument.
  • x – ‘Don’t care’ predication: Use this form when you do not care about the inactive elements. The compiler is then free to choose between zeroing, merging, or unpredicated forms to give the best code quality, but gives no guarantee of what data, if any, is left in the inactive elements.
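The difference between zeroing and merging can be sketched with a four-lane addition in which only the first k lanes are active. The helper name add_first_k and the constant LANES are hypothetical. Four 32-bit lanes fit in any SVE vector, so the result does not depend on the vector length, and the scalar branch restates the same semantics for targets without SVE.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif /* __ARM_FEATURE_SVE */

enum { LANES = 4 };   /* four 32-bit lanes fit in any SVE vector */

/* Hypothetical helper: adds a and b in the first k lanes only.
   A nonzero `zeroing` selects the _z form, otherwise the _m form. */
void add_first_k(const int32_t *a, const int32_t *b, int32_t *out,
                 int32_t k, int zeroing)
{
    assert(k >= 0 && k <= LANES);
#ifdef __ARM_FEATURE_SVE
    svbool_t four = svwhilelt_b32((int32_t)0, (int32_t)LANES); /* loads, store */
    svbool_t pg = svwhilelt_b32((int32_t)0, k);                /* active lanes */
    svint32_t va = svld1(four, a);
    svint32_t vb = svld1(four, b);
    svint32_t sum;
    if (zeroing)
        sum = svadd_z(pg, va, vb);   /* inactive lanes become 0 */
    else
        sum = svadd_m(pg, va, vb);   /* inactive lanes keep a[i] */
    svst1(four, out, sum);
#else
    for (int32_t i = 0; i < LANES; ++i)
        out[i] = (i < k) ? a[i] + b[i] : (zeroing ? 0 : a[i]);
#endif /* __ARM_FEATURE_SVE */
}
```

With the _x form, the contents of lanes k and above would be unspecified, so a caller must not read them.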

Addressing modes

Load, store, prefetch, and ADR functions have arguments that describe the memory area being addressed. The first addressing argument is the base: either a single pointer to an element type, or a vector of 32-bit or 64-bit addresses. The second argument, when present, offsets the base (or bases) by some number of bytes, elements, or vectors. This offset argument can be an immediate constant value, a scalar argument, or a vector of offsets.

Not every combination of the above options exists. The following table gives examples of some common addressing mode disambiguators, and describes how to interpret the address arguments:

Table 3-3 Common addressing mode disambiguators

Disambiguator Interpretation
_u32base The base argument is a vector of unsigned 32-bit addresses.
_u64base The base argument is a vector of unsigned 64-bit addresses.
_s32offset, _s64offset, _u32offset, _u64offset The offset argument is a vector of byte offsets. These offsets are signed or unsigned 32-bit or 64-bit numbers.
_s32index, _s64index, _u32index, _u64index The offset argument is a vector of element-sized indices. These indices are signed or unsigned 32-bit or 64-bit numbers.
_offset The offset argument is a scalar, and should be treated as a byte offset.
_index The offset argument is a scalar, and should be treated as an index into an array of elements.
_vnum The offset argument is a scalar, and should be treated as an index into an array of SVE vectors.

In the following example, the address of element i is &base[indices[i]].

svuint32_t svld1_gather_[s32]index[_u32]
	(svbool_t pg, const uint32_t *base, svint32_t indices)

Operations involving vectors and scalars

All arithmetic functions that take two vector inputs have an alternative form that takes a vector and a scalar. Conceptually, this scalar is duplicated across a vector, and that vector is used as the second vector argument.

Similarly, arithmetic functions that take three vector inputs have an alternative form that takes two vectors and one scalar.

To differentiate these forms, the disambiguator _n is added to the form that takes a scalar.
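The two forms can be compared in a short sketch. The hypothetical helper add_scalar adds k to every element, either with the _n form or by duplicating the scalar explicitly with svdup_n_s32, and both paths are expected to produce the same values. Full function names are used here to make the disambiguator visible.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif /* __ARM_FEATURE_SVE */

/* Hypothetical helper: data[i] += k for 0 <= i < n.
   A nonzero use_n selects the vector-and-scalar (_n) form. */
void add_scalar(int32_t *data, int64_t n, int32_t k, int use_n)
{
    assert(n >= 0);
#ifdef __ARM_FEATURE_SVE
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svint32_t v = svld1(pg, &data[i]);
        if (use_n)
            v = svadd_n_s32_x(pg, v, k);              /* scalar operand */
        else
            v = svadd_s32_x(pg, v, svdup_n_s32(k));   /* explicit duplicate */
        svst1(pg, &data[i], v);
    }
#else
    (void)use_n;   /* both forms compute the same result */
    for (int64_t i = 0; i < n; ++i)
        data[i] += k;
#endif /* __ARM_FEATURE_SVE */
}
```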

Short forms

Sometimes, it is possible to omit part of the full name, and still uniquely identify the correct form of a function, by inspecting the argument types. Where this is possible, these simplified forms are provided as aliases to their fully named equivalents, and are used in preference to the full names in the rest of this document.

In the ARM C Language Extensions for SVE specification, the portion that can be removed is enclosed in square brackets. For example, svclz[_s16]_m has the full name svclz_s16_m, and an overloaded alias, svclz_m.
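A brief sketch of the svclz example follows. The helper name clz16 and its use_short parameter are hypothetical. For a unary function with merge predication, the first argument supplies the values for the inactive lanes, which this sketch sets to zero. The scalar branch counts leading zero bits by hand so that the example also builds without SVE.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif /* __ARM_FEATURE_SVE */

/* Hypothetical helper: out[i] = number of leading zero bits in in[i].
   A nonzero use_short selects the overloaded alias. */
void clz16(const int16_t *in, uint16_t *out, int64_t n, int use_short)
{
    assert(n >= 0);
#ifdef __ARM_FEATURE_SVE
    for (int64_t i = 0; i < n; i += (int64_t)svcnth()) {
        svbool_t pg = svwhilelt_b16(i, n);
        svint16_t v = svld1(pg, &in[i]);
        svuint16_t inactive = svdup_n_u16(0);   /* values for inactive lanes */
        svuint16_t r;
        if (use_short)
            r = svclz_m(inactive, pg, v);       /* overloaded alias */
        else
            r = svclz_s16_m(inactive, pg, v);   /* full name */
        svst1(pg, &out[i], r);
    }
#else
    (void)use_short;   /* both names refer to the same operation */
    for (int64_t i = 0; i < n; ++i) {
        uint16_t bits = (uint16_t)in[i];
        uint16_t count = 0;
        for (uint16_t mask = 0x8000; mask != 0 && !(bits & mask); mask >>= 1)
            ++count;
        out[i] = count;
    }
#endif /* __ARM_FEATURE_SVE */
}
```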

Example – Naïve step-1 daxpy

daxpy is a BLAS (Basic Linear Algebra Subprograms) subroutine that operates on two arrays of double-precision floating-point numbers. A slice is taken of each of these arrays. For each element in these slices, an element (x) in the first array is multiplied by a constant (a), then added to the element (y) from the second array. The result is stored back to the second array at the same index.

This example presents a step-1 daxpy implementation, where the indices of x and y start at 0 and increment by 1 each iteration. A C code implementation might look like this:

#include <stdint.h>

void daxpy_1_1(int64_t n, double da, double *dx, double *dy)
{
	for (int64_t i = 0; i < n; ++i) {
		dy[i] = dx[i] * da + dy[i];
	}
}

Here is an ACLE equivalent:

#include <stdint.h>
#include <arm_sve.h>

void daxpy_1_1(int64_t n, double da, double *dx, double *dy)
{
    int64_t i = 0;
    svbool_t pg = svwhilelt_b64(i, n);                          // [1]
    do {
        svfloat64_t dx_vec = svld1(pg, &dx[i]);                 // [2]
        svfloat64_t dy_vec = svld1(pg, &dy[i]);                 // [2]
        svst1(pg, &dy[i], svmla_x(pg, dy_vec, dx_vec, da));     // [3]
        i += svcntd();                                          // [4]
        pg = svwhilelt_b64(i, n);                               // [1]
    } while (svptest_any(svptrue_b64(), pg));                   // [5]
}

The following steps explain this example:

[1] - Initialize a predicate register to control the loop. _b64 specifies a predicate for 64-bit elements. Conceptually, this operation creates an integer vector starting at i and incrementing by 1 in each subsequent lane. A predicate lane is active if its value is less than n. Therefore, this loop is safe, if inefficient, even if n ≤ 0: the body runs once with every lane inactive, so nothing is loaded or stored. The same operation is used at the bottom of the loop, to update the predicate for the next iteration.

[2] - Load some values into an SVE vector, guarded by the loop predicate. Lanes where this predicate is false do not perform any load (and so will not generate a fault), and set the result value to 0.0. The number of lanes that are loaded depends on the vector width, which is only known at runtime.

[3] - Perform a floating-point multiply-add operation, and pass the result to a store. The _x on the MLA indicates we don’t care about the result for inactive lanes. This gives the compiler maximum flexibility in choosing the most efficient instruction. The result of this operation is stored at address &dy[i], guarded by the loop predicate. Lanes where the predicate is false are not stored, and so the value in memory will retain its prior value.

[4] - Increment i by the number of double-precision lanes in the vector.

[5] - svptest_any returns true if any lane of the (newly updated) predicate is active, which causes control to return to the top of the do-while loop if there is any work left to do.
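The predicate described in step [1], and the loop control in step [5], can be modelled in plain C. This is a sketch, not the ACLE implementation: whilelt_model and loop_trips are hypothetical names, and the lanes parameter stands in for the run-time value of svcntd().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Scalar model of step [1]: lane k holds the value i + k and is active
   when that value is less than n. Returns the number of active lanes. */
int64_t whilelt_model(int64_t i, int64_t n, int64_t lanes, bool *pred)
{
    assert(lanes > 0);
    int64_t active = 0;
    for (int64_t k = 0; k < lanes; ++k) {
        pred[k] = (i + k) < n;
        active += pred[k];
    }
    return active;
}

/* Counts how many times the do-while body runs for a given n. */
int64_t loop_trips(int64_t n, int64_t lanes)
{
    bool pred[32];   /* 2048-bit vector / 64-bit elements = 32 lanes at most */
    assert(lanes > 0 && lanes <= 32);
    int64_t i = 0;
    int64_t trips = 0;
    int64_t active = whilelt_model(i, n, lanes, pred);
    do {
        ++trips;     /* the body runs even when no lane is active */
        i += lanes;
        active = whilelt_model(i, n, lanes, pred);
    } while (active > 0);   /* models svptest_any in step [5] */
    return trips;
}
```

With two lanes per vector, five elements take three trips, and n ≤ 0 still takes one trip in which every lane is inactive.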

“Ideal” assembler output:

daxpy_1_1:
	MOV Z2.D, D0            // da
	MOV X3, #0              // i
	WHILELT P0.D, X3, X0    // i, n
loop:
	LD1D Z1.D, P0/Z, [X1, X3, LSL #3]
	LD1D Z0.D, P0/Z, [X2, X3, LSL #3]
	FMLA Z0.D, P0/M, Z1.D, Z2.D
	ST1D Z0.D, P0, [X2, X3, LSL #3]
	INCD X3                 // i
	WHILELT P0.D, X3, X0    // i, n
	B.ANY loop
	RET
			

Example – Naïve general daxpy

This example presents a general daxpy implementation, where the indices of x and y start at 0 and are then incremented by unknown (but loop-invariant) strides each iteration.

#include <stdint.h>
#include <arm_sve.h>

void daxpy(int64_t n, double da, double *dx, int64_t incx,
           double *dy, int64_t incy)
{
    svint64_t incx_vec = svindex_s64(0, incx);                         // [1]
    svint64_t incy_vec = svindex_s64(0, incy);                         // [1]
    int64_t i = 0;
    svbool_t pg = svwhilelt_b64(i, n);                                 // [2]
    do {
        svfloat64_t dx_vec = svld1_gather_index(pg, dx, incx_vec);     // [3]
        svfloat64_t dy_vec = svld1_gather_index(pg, dy, incy_vec);     // [3]
        svfloat64_t result = svmla_x(pg, dy_vec, dx_vec, da);          // [4]
        svst1_scatter_index(pg, dy, incy_vec, result);                 // [4]
        // The cast keeps the stride arithmetic signed.
        dx += incx * (int64_t)svcntd();                                // [5]
        dy += incy * (int64_t)svcntd();                                // [5]
        i += svcntd();                                                 // [6]
        pg = svwhilelt_b64(i, n);                                      // [2]
    } while (svptest_any(svptrue_b64(), pg));                          // [7]
}

The following steps explain this example:

[1] - For each of x and y, initialize a vector of indices, starting at 0 for the first lane and incrementing by incx and incy respectively in each subsequent lane.

[2] - Initialize or update the loop predicate.

[3] - Load a vector’s worth of values, guarded by the loop predicate. Lanes where this predicate is false do not perform any load (and so will not generate a fault), and set the result value to 0.0. This time, a base + vector-of-indices gather load is used to load the required non-consecutive values.

[4] - Perform a floating-point multiply-add operation, and pass the result to a store. This time, the base + vector-of-indices scatter store is used to store each result in the correct index of the dy[] array.

[5] - Instead of using i to calculate the load address, advance each base pointer by its stride multiplied by the number of double-precision lanes in the vector.

[6] - Increment i by the number of double-precision lanes in the vector.

[7] - Test the loop predicate to work out whether there is any more work to do, and loop back if appropriate.
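To see which elements each lane touches, the loop can be modelled lane by lane in plain C. This is a sketch under stated assumptions: daxpy_model is a hypothetical name, the lanes parameter stands in for the run-time value of svcntd(), and integer offsets stand in for the advancing base pointers of step [5].

```c
#include <assert.h>
#include <stdint.h>

/* Lane-by-lane scalar model of the general daxpy loop. */
void daxpy_model(int64_t n, double da, const double *dx, int64_t incx,
                 double *dy, int64_t incy, int64_t lanes)
{
    assert(lanes > 0);
    int64_t x0 = 0;   /* offset of the current dx base */
    int64_t y0 = 0;   /* offset of the current dy base */
    for (int64_t i = 0; i < n; i += lanes) {
        for (int64_t k = 0; k < lanes; ++k) {
            if (i + k >= n)
                continue;   /* inactive lane: no load, no store */
            /* Steps [1], [3] and [4]: lane k uses index k * stride. */
            int64_t xi = x0 + k * incx;
            int64_t yi = y0 + k * incy;
            dy[yi] = dx[xi] * da + dy[yi];
        }
        /* Step [5]: move each base past the lanes just processed. */
        x0 += incx * lanes;
        y0 += incy * lanes;
    }
}
```

For example, with incx = 2 and incy = 1, element k of dy is updated from element 2k of dx, whatever value lanes takes.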

Non-Confidential ARM 100891_0607_01_en
Copyright © 2016, 2017 ARM Limited or its affiliates. All rights reserved.