3.4 Using pragmas to encourage or suppress auto-vectorization

ARM® Compiler supports pragmas to both encourage and suppress auto-vectorization. These pragmas make use of, and extend, the pragma clang loop directives.

For more information about the pragma clang loop directives, see Auto-Vectorization in LLVM, at llvm.org.

Note:

In all the following cases, the pragma only affects the loop statement immediately following it. If your code contains multiple nested loops, you must insert a pragma before each one in order to affect all the loops in the nest.
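
As a minimal sketch of the note above, the following function places a pragma before each loop in a two-level nest. The function name add_scalar and its parameters are hypothetical and serve only as illustration; vectorize(enable) is used as a representative hint, and any of the hints described in this section would be placed the same way.

```c
#include <stddef.h>

/* Hypothetical helper: adds value to every element of a rows x cols
   matrix stored in row-major order. */
void add_scalar(int *restrict m, size_t rows, size_t cols, int value)
{
  /* This pragma applies to the outer loop only. */
  #pragma clang loop vectorize(enable)
  for (size_t r = 0; r < rows; r++)
  {
    /* The inner loop does not inherit the hint above,
       so it needs its own pragma. */
    #pragma clang loop vectorize(enable)
    for (size_t c = 0; c < cols; c++)
    {
      m[r * cols + c] += value;
    }
  }
}
```

Whether either loop is actually vectorized remains the compiler's decision.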

Encouraging auto-vectorization with pragmas

If the SVE auto-vectorization pass is enabled with -O2 or above, then by default it examines all loops.

If static analysis of a loop indicates that it might contain dependencies that hinder parallelism, auto-vectorization might not be performed. If you know that these dependencies do not hinder vectorization, you can use the vectorize directive to indicate this to the compiler by placing the following line immediately before the loop:

#pragma clang loop vectorize(assume_safety)

This pragma indicates to the compiler that the following loop contains no data dependencies between loop iterations that would prevent vectorization. The compiler might be able to use this information to vectorize a loop, where it would not typically be possible.

Note:

Use of this pragma does not guarantee auto-vectorization. There might be other reasons why auto-vectorization is not possible or worthwhile for a particular loop.

Ensure that you only use this pragma when it is safe to do so. Using this pragma when there are data dependencies between loop iterations results in incorrect behavior.
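
To illustrate the kind of loop where the pragma is not safe, the following sketch shows an in-place running sum. The function name prefix_sum is hypothetical. Each iteration reads the element that the previous iteration has just written, so there is a genuine data dependency between iterations and the loop is deliberately left without the pragma.

```c
/* Hypothetical example: running (prefix) sum computed in place. */
void prefix_sum(int *data, int count)
{
  /* Do NOT add vectorize(assume_safety) here: iteration i reads
     data[i - 1], which iteration i - 1 has just written. */
  for (int i = 1; i < count; i++)
  {
    data[i] += data[i - 1];
  }
}
```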

For example, consider the following loop, which processes an array called indices. Each element in indices specifies an index into a larger histogram array. The referenced element in the histogram array is incremented.

void update(int *restrict histogram, int *restrict indices, int count)
{
  for (int i = 0; i < count; i++)
  {
    histogram[ indices[i] ]++;
  }
}

The compiler is unable to vectorize this loop, because the same index could appear more than once in the indices array. Therefore a vectorized version of the algorithm would lose some of the increment operations if two identical indices are processed in the same vector load/increment/store sequence.
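
The lost increments can be made concrete with a scalar emulation of the vector load/increment/store sequence. This is only a sketch: the function name update_emulated and the LANES constant are hypothetical, and a fixed lane count of 4 is assumed purely for illustration (real SVE vector lengths are not fixed at compile time).

```c
#define LANES 4  /* assumed number of elements per vector, illustration only */

/* Scalar emulation of the gather / increment / scatter sequence that a
   vectorized version of the loop would perform, LANES elements at a time. */
void update_emulated(int *histogram, const int *indices, int count)
{
  for (int base = 0; base < count; base += LANES)
  {
    int lanes = (count - base < LANES) ? count - base : LANES;
    int values[LANES];

    /* Gather: every lane reads its element before any lane writes. */
    for (int l = 0; l < lanes; l++)
    {
      values[l] = histogram[indices[base + l]];
    }

    /* Increment each lane independently. */
    for (int l = 0; l < lanes; l++)
    {
      values[l] += 1;
    }

    /* Scatter: lanes that share an index overwrite one another. */
    for (int l = 0; l < lanes; l++)
    {
      histogram[indices[base + l]] = values[l];
    }
  }
}
```

With indices containing {2, 2, 1, 0}, the scalar loop increments histogram[2] twice, whereas this emulation increments it only once, because both lanes read the old value before either one stores its result.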

However, if you know that the indices array only ever contains unique elements, it is useful to be able to tell the compiler that vectorizing this loop is safe. To do so, place the pragma before the loop:

void update_unique(int *restrict histogram, int *restrict indices, int count)
{
  #pragma clang loop vectorize(assume_safety)
  for (int i = 0; i < count; i++)
  {
    histogram[ indices[i] ]++;
  }
}

The following table shows the differences between the compiler output for these functions, where the only difference is the presence of the pragma to encourage vectorization:

Table 3-1 Compiler output with and without auto-vectorization

With no pragma:

update:
       cmp      w2, #1
       b.lt     .LBB0_2
.LBB0_1:
       ldrsw    x8, [x1], #4
       sub      w2, w2, #1
       lsl      x8, x8, #2
       ldr      w9, [x0, x8]
       add      w9, w9, #1
       str      w9, [x0, x8]
       cbnz     w2, .LBB0_1
.LBB0_2:
       ret

With #pragma clang loop vectorize(assume_safety):

update_unique:
// BB#0:
       subs     w9, w2, #1
       b.lt     .LBB0_3
// BB#1:
       add      x9, x9, #1
       mov      x8, xzr
       whilelo  p0.d, xzr, x9
.LBB0_2:
       ld1sw    {z0.d}, p0/z, [x1, x8, lsl #2]
       incd     x8
       ld1sw    {z1.d}, p0/z, [x0, z0.d, lsl #2]
       add      z1.d, z1.d, #1
       st1w     {z1.d}, p0, [x0, z0.d, lsl #2]
       whilelo  p0.d, x8, x9
       b.mi     .LBB0_2
.LBB0_3:
       ret
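
Because the pragma is only safe when the indices array contains no duplicates, it can be worth checking that precondition during testing. The following is a minimal sketch of such a check; the helper name indices_are_unique is hypothetical and is not part of ARM Compiler. It uses a simple quadratic comparison, which is acceptable for debug builds but too slow for large arrays in production code.

```c
#include <stdbool.h>

/* Hypothetical debug helper: returns true if no value occurs more than
   once in indices[0 .. count - 1]. Runs in O(count * count) time. */
bool indices_are_unique(const int *indices, int count)
{
  for (int i = 0; i < count; i++)
  {
    for (int j = i + 1; j < count; j++)
    {
      if (indices[i] == indices[j])
      {
        return false;
      }
    }
  }
  return true;
}
```

One possible use is to call the helper inside an assert() before invoking the vectorized function, so that the check disappears when NDEBUG is defined.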

Suppressing auto-vectorization with pragmas

If SVE auto-vectorization is not required for a specific loop, you can disable it or restrict it to only use ARM NEON instructions.

You can suppress auto-vectorization on a specific loop by adding #pragma clang loop vectorize(disable) immediately before the loop. In this example, the compiler leaves a loop unvectorized that it would otherwise vectorize trivially:

void update_unique(int *restrict a, int *restrict b, int count)
{
  #pragma clang loop vectorize(disable)
  for ( int i = 0; i < count; i++ )
  {
    a[i] = b[i] + 1;
  }
}

You can also suppress SVE instructions while allowing ARM NEON instructions by adding a vectorize_style hint:

vectorize_style(fixed_width)

Prefer fixed-width vectorization, resulting in ARM NEON instructions. For a loop with vectorize_style(fixed_width), the compiler prefers to generate ARM NEON instructions, though SVE instructions may still be used with a fixed-width predicate (such as gather loads or scatter stores).

vectorize_style(scaled_width)

Prefer scaled-width vectorization, resulting in SVE instructions. For a loop with vectorize_style(scaled_width), the compiler prefers SVE instructions but can choose to generate ARM NEON instructions or not vectorize at all.

This is the default.

For example:

void update_unique(int *restrict a, int *restrict b, int count)
{
  #pragma clang loop vectorize(enable) vectorize_style(fixed_width)
  for ( int i = 0; i < count; i++ )
  {
    a[i] = b[i] + 1;
  }
}
Non-Confidential ARM 100891_0607_01_en
Copyright © 2016, 2017 ARM Limited or its affiliates. All rights reserved.