3.4.5. Loop structure

The overall structure of a loop is important for obtaining the best performance from vectorization. Generally, it is best to write simple loops whose iteration counts are fixed at the start of the loop and which do not contain complex conditional statements or conditional exits. You might need to rewrite your loops to improve the vectorization performance of the code.

Exits from loops

Example 3.10 cannot be vectorized because it contains a conditional exit from the loop. In cases like this, you must rewrite the loop, if possible, so that vectorization can succeed.

Example 3.10.  Non-vectorizable loop

int a[99], b[99], c[99], i, n;
...
for (i = 0; i < n; i++)
{
    a[i] = b[i] + c[i];
    if (a[i] > 5) break;    /* data-dependent exit prevents vectorization */
}
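One possible rewrite, shown here only as a sketch, separates the arithmetic from the exit test. The function name add_then_find and the limit parameter are hypothetical and do not appear in the original example. Note that the behavior is not identical to Example 3.10: every element of a is written, including those after the point where the original loop would have exited, so this form is only suitable when those extra writes are acceptable. Whether a particular compiler vectorizes the first loop also depends on other factors, such as pointer aliasing.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the original example.
 * Computes every sum in a simple counted loop, then returns the index of
 * the first sum greater than limit, or n if there is no such element. */
size_t add_then_find(int *a, const int *b, const int *c, size_t n, int limit)
{
    size_t i;

    /* Simple loop: fixed iteration count and no exit, so it is a
     * candidate for vectorization. */
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    /* Separate scan that carries the data-dependent exit. */
    for (i = 0; i < n; i++)
        if (a[i] > limit)
            break;

    return i;
}
```

A caller that only needs the elements up to the exit point can use the returned index as the upper bound of the valid results.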

Loop iteration count

To be vectorized, a loop must have an iteration count that is fixed at the start of the loop. In Example 3.11 the iteration count is n, and it does not change during the course of the loop.

Example 3.11.  Vectorizable loop

int a[99], b[99], c[99], i, n;
...
for (i = 0; i < n; i++) a[i] = b[i] + c[i];

Example 3.12 has no fixed iteration count, because i is advanced by a value computed inside the loop, and so it cannot be vectorized automatically.

Example 3.12.  Non-vectorizable loop

int a[99], b[99], c[99], i, n;
...
while (i < n)
{
    a[i] = b[i] + c[i];
    i += a[i];              /* step depends on data computed in the loop */
}

The NEON unit can operate on elements in groups of 2, 4, 8, or 16. Where the iteration count is fixed at the start of the loop but its value is not known at compile time, the compiler might add a runtime test to check whether the iteration count is a multiple of the number of lanes that can be used for the appropriate data type in a NEON register. This increases the code size because additional non-vectorized code is generated to execute any remaining loop iterations.
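The shape of the code described above can be pictured in plain C. This is a hand-written illustration, not actual compiler output: the function name add_arrays is hypothetical, a group size of four is assumed, and the unrolled body merely stands in for a single vector operation on four lanes.

```c
/* Hypothetical illustration of a main loop plus a remainder loop.
 * Assumes n >= 0 and a group size of four elements. */
void add_arrays(int *a, const int *b, const int *c, int n)
{
    int i;
    int main_count = n & ~3;    /* largest multiple of 4 not exceeding n */

    /* Main part: four elements per step, standing in for vector code. */
    for (i = 0; i < main_count; i += 4)
    {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }

    /* Remainder: the extra non-vectorized code for leftover iterations. */
    for (; i < n; i++)
        a[i] = b[i] + c[i];
}
```

When n is known to be a multiple of four, the remainder loop never runs, which is why telling the compiler about the multiple can let it omit that code.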

If you know that your iteration count is always a multiple of the number of elements that NEON processes together, you can indicate this to the compiler. The most efficient way to do this is to divide the number of iterations by four in the caller and multiply by four in the function that you intend to vectorize. If you cannot modify all of the calling functions, you can use an appropriate expression for your loop limit test to indicate that the loop iteration count is a suitable multiple. For example, to indicate that your loop performs a multiple of four iterations, use:

for(i = 0; i < (n >> 2 << 2); i++)

or:

for(i = 0; i < (n & ~3); i++)

This reduces the size of the generated code and can give a performance improvement.
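The caller-side approach mentioned earlier, dividing by four in the caller and multiplying by four in the vectorized function, might look like the following sketch. The names add_quads and add_all are hypothetical, and the code assumes the caller guarantees that n is a non-negative multiple of four. If it is not, the trailing elements are silently left unprocessed.

```c
/* Hypothetical function intended for vectorization.
 * quads is the number of groups of four elements to process, so the
 * loop limit is visibly a multiple of four. */
void add_quads(int *a, const int *b, const int *c, int quads)
{
    int i;

    for (i = 0; i < quads * 4; i++)
        a[i] = b[i] + c[i];
}

/* Hypothetical caller. Assumes n is a non-negative multiple of four. */
void add_all(int *a, const int *b, const int *c, int n)
{
    add_quads(a, b, c, n / 4);
}
```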

Copyright © 2007 ARM Limited. All rights reserved. ARM DUI 0350A
Non-Confidential