3.4.8. Example of improving performance by tuning source code

The compiler can provide diagnostic information that indicates where it successfully applied vectorization and where it could not. See --diag_suppress=optimizations and --diag_warning=optimizations for more information.

Example 3.14 shows two functions that add a value to each element of an array and store the results in a second array. This code does not vectorize.

Example 3.14. Non-vectorizable code

int addition(int a, int b)
{
    return a + b;
}

void add_int(int *pa, int *pb, unsigned int n, int x)
{
    unsigned int i;
    for(i = 0; i < n; i++) *(pa + i) = addition(*(pb + i),x);
}

Using the --diag_warning=optimizations option produces an optimization warning message for the addition() function.

Adding the __inline qualifier to the definition of addition() enables this code to vectorize, but it is still not optimal. Using the --diag_warning=optimizations option again produces optimization warning messages indicating that the loop vectorizes but that there might be a pointer aliasing problem.

Because the compiler cannot prove that pa and pb point to non-overlapping memory, it must generate a runtime test for aliasing and output both vectorized and scalar copies of the code. Example 3.15 shows how this can be improved using the restrict keyword if you know that the pointers are not aliased.

Example 3.15. Using restrict to improve vectorization performance

__inline int addition(int a, int b)
{
    return a + b;
}

void add_int(int * __restrict pa, int * __restrict pb, unsigned int n, int x)
{
    unsigned int i;
    for(i = 0; i < n; i++) *(pa + i) = addition(*(pb + i),x);
}
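
The runtime test that the compiler has to generate when the pointers are not marked __restrict can be pictured with the following sketch. It is illustrative only and is not the code that the compiler actually emits: ranges_overlap() and add_int_checked() are hypothetical names, and the four-at-a-time loop merely stands in for a NEON vector loop.

```c
#include <stdint.h>

/* Hypothetical helper: returns nonzero if the n-element ranges starting
   at pa and pb share any memory. This is a conservative test. */
static int ranges_overlap(const int *pa, const int *pb, unsigned int n)
{
    uintptr_t a = (uintptr_t)pa;
    uintptr_t b = (uintptr_t)pb;
    uintptr_t bytes = (uintptr_t)n * sizeof(int);
    return (a < b + bytes) && (b < a + bytes);
}

/* Hypothetical stand-in for what a compiler might do without __restrict */
void add_int_checked(int *pa, int *pb, unsigned int n, int x)
{
    unsigned int i = 0;

    if (!ranges_overlap(pa, pb, n)) {
        /* No aliasing: four elements per step, standing in for a vector loop */
        for (; n - i >= 4; i += 4) {
            pa[i]     = pb[i]     + x;
            pa[i + 1] = pb[i + 1] + x;
            pa[i + 2] = pb[i + 2] + x;
            pa[i + 3] = pb[i + 3] + x;
        }
    }

    /* Scalar copy: runs for the leftover elements, or for the whole
       array when the ranges overlap */
    for (; i < n; i++) pa[i] = pb[i] + x;
}
```

With __restrict, as in Example 3.15, you promise that the overlapping case never occurs, so neither the test nor the scalar copy for that case is required.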

The final improvement concerns the number of loop iterations. In Example 3.15, the iteration count is not fixed and might not be a multiple of the number of elements that fit exactly into a NEON register. This means that the compiler must test for remaining iterations and execute them using non-vectorized code. If you know that your iteration count is a suitable multiple, which is a multiple of four in Example 3.16, you can indicate this to the compiler. Example 3.16 shows this final improvement, which obtains the best performance from vectorization.

Example 3.16. Code tuned for best vectorization performance

__inline int addition(int a, int b)
{
    return a + b;
}

void add_int(int * __restrict pa, int * __restrict pb, unsigned int n, int x)
{
    unsigned int i;
    for(i = 0; i < (n & ~3); i++) *(pa + i) = addition(*(pb + i),x); 
    /* n is a multiple of 4 */
}
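
Note that n & ~3 rounds n down to a multiple of four, so the loop in Example 3.16 leaves the last n % 4 elements unprocessed if n is not in fact a multiple of four. If you cannot guarantee this, one possible approach is to keep the masked loop and handle the remainder explicitly. The following is a minimal sketch, assuming that a short scalar tail is acceptable. add_int_any_n() is a hypothetical name, and static is added to addition() here only so that the sketch builds as a standalone file with other C compilers.

```c
static __inline int addition(int a, int b)
{
    return a + b;
}

/* Hypothetical variant of add_int() that is correct for any n */
void add_int_any_n(int * __restrict pa, int * __restrict pb, unsigned int n, int x)
{
    unsigned int i;

    /* Main loop: the iteration count is visibly a multiple of 4 */
    for (i = 0; i < (n & ~3u); i++) *(pa + i) = addition(*(pb + i), x);

    /* Scalar tail: at most 3 remaining elements */
    for (i = n & ~3u; i < n; i++) *(pa + i) = addition(*(pb + i), x);
}
```

Whether a particular compiler vectorizes the main loop of this variant as well as it does Example 3.16 is not guaranteed, so check the optimization diagnostics.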

Copyright © 2007 ARM Limited. All rights reserved.
ARM DUI 0350A
Non-Confidential