3.24 Vectorization diagnostics to tune code for improved performance

The compiler can provide diagnostic information that indicates where it applied vectorization optimizations successfully and where it could not apply them.

The command-line options that provide this information are --diag_warning=optimizations and --remarks.

The following example shows two functions that add a scalar value to each element of an array. This code does not vectorize, because the loop body contains a function call.

int addition(int a, int b)
{
    return a + b;
}
void add_int(int *pa, int *pb, unsigned int n, int x)
{
    unsigned int i;
    for(i = 0; i < n; i++) *(pa + i) = addition(*(pb + i),x);
    /* Function calls cannot be vectorized */
}

Compile with the --diag_warning=optimizations option to produce an optimization warning message for the addition() function:

armcc -O3 -Otime --vectorize --diag_warning=optimizations test.c

Using the --remarks option produces the same messages.

Adding the __inline qualifier to the definition of addition() enables this code to vectorize. However, it is still not optimal. Using the --diag_warning=optimizations option again produces optimization warning messages that indicate that the loop vectorizes but that the pointers might alias.

Because the pointers might alias, the compiler must generate a runtime test for aliasing and output both vectorized and scalar copies of the code. If you know that the pointers are not aliased, you can use the restrict keyword, written as __restrict in the following example, to reduce the runtime test overhead and improve vectorization performance:

__inline int addition(int a, int b)
{
    return a + b;
}
void add_int(int * __restrict pa, int * __restrict pb, unsigned int n, int x)
{
    unsigned int i;
    for(i = 0; i < n; i++) *(pa + i) = addition(*(pb + i),x);
}
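
The following portable C sketch illustrates why the compiler needs the runtime aliasing test when the pointers are not restrict-qualified. It is not the code that armcc generates. The names add_scalar, add_blocks_of_4, ranges_overlap, and add_int_checked are hypothetical and exist only for this illustration. The sketch models a 4-lane vector step as "load four elements, add, then store four elements" and shows that this ordering gives a different result from the scalar loop when the two ranges overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: one element per iteration, strictly in order. */
void add_scalar(int *pa, const int *pb, unsigned int n, int x)
{
    unsigned int i;
    for (i = 0; i < n; i++) pa[i] = pb[i] + x;
}

/* Hypothetical model of a 4-lane vector step: all four loads happen
   before any of the four stores. Assumes n is a multiple of 4. */
void add_blocks_of_4(int *pa, const int *pb, unsigned int n, int x)
{
    unsigned int i, lane;
    int tmp[4];
    for (i = 0; i + 4 <= n; i += 4) {
        for (lane = 0; lane < 4; lane++) tmp[lane] = pb[i + lane] + x; /* load and add */
        for (lane = 0; lane < 4; lane++) pa[i + lane] = tmp[lane];     /* store */
    }
}

/* Hypothetical overlap test: returns 1 if [pa, pa+n) and [pb, pb+n)
   share at least one element, otherwise 0. */
int ranges_overlap(const int *pa, const int *pb, size_t n)
{
    uintptr_t a = (uintptr_t)pa;
    uintptr_t b = (uintptr_t)pb;
    size_t bytes = n * sizeof(int);
    return (a < b + bytes) && (b < a + bytes);
}

/* Conceptual shape of the two-copy code: take the block path only
   when the test shows that it cannot change the result. */
void add_int_checked(int *pa, const int *pb, unsigned int n, int x)
{
    if ((n % 4) != 0 || ranges_overlap(pa, pb, n))
        add_scalar(pa, pb, n, x);
    else
        add_blocks_of_4(pa, pb, n, x);
}
```

For example, with a five-element buffer holding 1, 0, 0, 0, 0, calling add_scalar(buf + 1, buf, 4, 1) propagates each new value into the next iteration, whereas add_blocks_of_4 with the same arguments reads the old values first. Qualifying the pointers with restrict tells the compiler that this situation cannot occur.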

The final improvement you can make is to indicate the number of loop iterations. In the previous example, the number of iterations is not fixed and might not be a multiple of the number of elements that fit in a NEON register (four 32-bit int values in a 128-bit quadword register). This means that the compiler must test for remaining iterations and execute them using non-vectorized code. If you know that your iteration count is divisible by the number of elements that the NEON unit can operate on in parallel, you can indicate this to the compiler using the __promise intrinsic. The following example shows the final code that obtains the best performance from vectorization.

__inline int addition(int a, int b)
{
    return a + b;
}
void add_int(int * __restrict pa, int * __restrict pb, unsigned int n, int x)
{
    unsigned int i;
    __promise((n % 4) == 0);
    /* n is a multiple of 4 */
    for(i = 0; i < (n & ~3); i++) *(pa + i) = addition(*(pb + i),x);
}
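
The following portable C sketch shows the remainder handling that the promise makes unnecessary. It is an illustration under stated assumptions, not the code that armcc generates. The names add_int_split and add_int_multiple_of_4 are hypothetical. The standard assert macro is used here only to document the precondition. It is not an equivalent of __promise, because it checks the condition at run time rather than informing the optimizer.

```c
#include <assert.h>

/* Without a guarantee about n: a main body that handles four elements
   per step, followed by a scalar loop for the 0 to 3 leftover elements. */
void add_int_split(int *pa, const int *pb, unsigned int n, int x)
{
    unsigned int i;
    unsigned int main_count = n & ~3u;  /* n rounded down to a multiple of 4 */

    for (i = 0; i < main_count; i += 4) {
        pa[i]     = pb[i]     + x;
        pa[i + 1] = pb[i + 1] + x;
        pa[i + 2] = pb[i + 2] + x;
        pa[i + 3] = pb[i + 3] + x;
    }

    /* Remainder loop: needed only when n is not a multiple of 4. */
    for (; i < n; i++) pa[i] = pb[i] + x;
}

/* With the guarantee that n is a multiple of 4, the remainder loop
   and its test disappear. */
void add_int_multiple_of_4(int *pa, const int *pb, unsigned int n, int x)
{
    unsigned int i;
    assert((n % 4) == 0);  /* precondition check only, not a compiler hint */

    for (i = 0; i < n; i += 4) {
        pa[i]     = pb[i]     + x;
        pa[i + 1] = pb[i + 1] + x;
        pa[i + 2] = pb[i + 2] + x;
        pa[i + 3] = pb[i + 3] + x;
    }
}
```

When n is a multiple of 4, n & ~3u equals n, so both functions process the same elements and the remainder loop in add_int_split runs zero times.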

Non-Confidential ARM DUI0472M
Copyright © 2010-2016 ARM Limited or its affiliates. All rights reserved.