3.2 Auto-vectorization examples

This section presents example C loops that the compiler can vectorize, with a short description of the notable features of each loop.

Reductions

Loops that perform a reduction operation, such as the following "running total" calculation, can be vectorized.

If the `-ffp-mode=fast` option is specified, the compiler maintains a vector of per-lane running totals. When the loop completes, it performs a cross-lane reduction operation to sum the per-lane totals into a single scalar value.

Without the `-ffp-mode=fast` option, the compiler can still vectorize the loop, but is likely to produce less efficient code. Floating-point addition is not associative, so the compiler is constrained to perform the additions in the same order as the original source code.

```c
float reduction(float *restrict a, long count)
{
    float r = 0;
    for (long i = 0; i < count; i++)
    {
        r += a[i];
    }
    return r;
}
```
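The per-lane running totals described above can be modelled in scalar C. The following is only an illustrative sketch, not the code the compiler generates: the function name `reduction_by_lanes` and the fixed `LANES` value are assumptions made for this example, and real SVE code is vector-length agnostic, so the lane count is not known until run time.

```c
/* Hypothetical lane count, fixed here only for illustration. */
#define LANES 4

/* Scalar model of a vectorized sum: one running total per lane,
   followed by a cross-lane reduction. */
float reduction_by_lanes(const float *a, long count)
{
    float lane_total[LANES] = {0};
    long i = 0;

    /* Main loop: each step adds LANES consecutive elements, one per lane. */
    for (; i + LANES <= count; i += LANES)
    {
        for (int lane = 0; lane < LANES; lane++)
        {
            lane_total[lane] += a[i + lane];
        }
    }

    /* Leftover elements: this models a final, partially active vector. */
    for (int lane = 0; i < count; i++, lane++)
    {
        lane_total[lane] += a[i];
    }

    /* Cross-lane reduction: combine the per-lane totals into one scalar. */
    float r = 0;
    for (int lane = 0; lane < LANES; lane++)
    {
        r += lane_total[lane];
    }
    return r;
}
```

Because the elements are added in a different order than in the source loop, the result can differ in the last bits from the sequential sum for general floating-point inputs, which is why this transformation depends on a relaxed floating-point mode.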

Strided access

Access to memory does not need to be sequential. In this example, every fifth element of two arrays is added together, and the result is written to the corresponding element of a destination array.

```c
void stride(int *restrict a, int *restrict b, int *restrict c, long count)
{
    for (long i = 0; i < count; i += 5)
    {
        a[i] = b[i] + c[i];
    }
}
```
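One way to picture a strided loop on vector hardware is that each lane handles the element five positions after the one handled by the previous lane. The sketch below is a scalar model of that idea under stated assumptions: `stride_by_lanes` and the fixed `LANES` value are hypothetical names introduced for illustration, and the instructions that the compiler actually selects may differ.

```c
/* Hypothetical lane count, fixed here only for illustration. */
#define LANES 4

/* Scalar model of a strided vector loop: lane n of each step handles the
   element at index i + 5 * n. */
void stride_by_lanes(int *restrict a, const int *restrict b,
                     const int *restrict c, long count)
{
    for (long i = 0; i < count; i += 5L * LANES)
    {
        for (int lane = 0; lane < LANES; lane++)
        {
            long idx = i + 5L * lane;

            /* Lanes that fall past the end of the arrays stay inactive. */
            if (idx < count)
            {
                a[idx] = b[idx] + c[idx];
            }
        }
    }
}
```

Elements of `a` whose index is not a multiple of five are left unmodified, as in the original loop.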

Scatter and gather

The compiler can use the SVE scatter and gather operations to auto-vectorize efficiently when loop iterations do not have a regular access pattern. In the following example, the `indices` array indicates which elements of the `data` array should be added together. The compiler loads as many `indices` elements as can fit in an SVE vector, and then uses them as offsets from a base register to perform a gather-load operation from the `data` array.

```c
float gather_reduce(float *restrict data, int *restrict indices, long c)
{
    float r = 0;
    for (long i = 0; i < c; i++)
    {
        r += data[indices[i]];
    }
    return r;
}
```
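The two-step load described above, a contiguous load of indices followed by a gather from the data array, can be modelled in scalar C. This is an illustrative sketch only: `gather_reduce_by_lanes` and the fixed `LANES` value are assumptions made for this example, and the accumulation here stays in source order for simplicity.

```c
/* Hypothetical lane count, fixed here only for illustration. */
#define LANES 4

/* Scalar model of a gather-load loop. */
float gather_reduce_by_lanes(const float *data, const int *indices, long c)
{
    float r = 0;
    for (long i = 0; i < c; i += LANES)
    {
        int offsets[LANES];
        float gathered[LANES];

        /* Number of lanes that hold valid elements in this step. */
        int active = (c - i < LANES) ? (int)(c - i) : LANES;

        /* Step 1: contiguous load of a vector's worth of indices. */
        for (int lane = 0; lane < active; lane++)
        {
            offsets[lane] = indices[i + lane];
        }

        /* Step 2: gather load, where each lane reads data[offset]. */
        for (int lane = 0; lane < active; lane++)
        {
            gathered[lane] = data[offsets[lane]];
        }

        /* Step 3: accumulate the gathered values. */
        for (int lane = 0; lane < active; lane++)
        {
            r += gathered[lane];
        }
    }
    return r;
}
```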

Conditions within the loop body

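The predication features of the SVE architectural extension make it possible to efficiently support unpredictable control flow within vectorized loops. In the following example, an element of the `data` array is added to the running total only when its index value is even.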

```c
float cond_gather_reduce(float *restrict data, int *restrict indices, long count)
{
    float r = 0;
    for (long i = 0; i < count; i++)
    {
        if (indices[i] % 2 == 0)
        {
            r += data[indices[i]];
        }
    }
    return r;
}
```
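Predication can be pictured as computing a per-lane mask and then letting only the active lanes perform the operation. The following scalar model is a sketch under stated assumptions: `cond_gather_reduce_by_lanes` and the fixed `LANES` value are hypothetical, and the `pred` array stands in for an SVE predicate register.

```c
/* Hypothetical lane count, fixed here only for illustration. */
#define LANES 4

/* Scalar model of a predicated loop body. */
float cond_gather_reduce_by_lanes(const float *data, const int *indices,
                                  long count)
{
    float r = 0;
    for (long i = 0; i < count; i += LANES)
    {
        int pred[LANES];

        /* Build the predicate: a lane is active when it is within bounds
           and its index value is even. The bounds test comes first, so
           indices[] is never read past the end of the array. */
        for (int lane = 0; lane < LANES; lane++)
        {
            pred[lane] = (i + lane < count) && (indices[i + lane] % 2 == 0);
        }

        /* Apply the operation only on active lanes. */
        for (int lane = 0; lane < LANES; lane++)
        {
            if (pred[lane])
            {
                r += data[indices[i + lane]];
            }
        }
    }
    return r;
}
```

The same mask handles both the `if` condition and the final partial vector, so no separate scalar tail loop is needed in this model.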