3.3.3. Factors affecting vectorization performance

The automatic vectorization process and the performance of the generated code are affected by the following:

The way loops are organized

For best performance, the innermost loop in a loop nest must access arrays with a stride of one.
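As an illustration of stride-one access, the following sketch shows the same operation written with two loop orders. The function names and array dimensions are hypothetical and chosen only for this example. Because C stores arrays in row-major order, the first form steps through adjacent elements in the innermost loop, while the second steps through memory COLS elements at a time and is generally a poorer candidate for vectorization.

```c
#include <stddef.h>

#define ROWS 64   /* illustrative dimensions, not taken from the guide */
#define COLS 64

/* Stride-one form: the innermost loop varies the last subscript, so
   consecutive iterations touch adjacent elements of one row. */
void scale_rows(float a[ROWS][COLS], float k)
{
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            a[i][j] *= k;
}

/* Same result, but the innermost loop varies the first subscript, so
   consecutive iterations are COLS elements apart in memory. */
void scale_cols(float a[ROWS][COLS], float k)
{
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            a[i][j] *= k;
}
```

Both functions compute identical results. Only the order in which memory is visited differs.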

The way the data is structured

For example, a single structure containing arrays of data can be more efficient than an array of structures. The data type also dictates how many data elements can be held in a NEON register, and therefore, how many operations can be performed in parallel.
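The difference between the two layouts can be sketched as follows. The type names, function names, and the LANES_PER_Q macro are hypothetical and exist only for this example. The macro assumes a 128-bit NEON quadword register, so that narrower element types give more elements per register.

```c
#include <stddef.h>
#include <stdint.h>

#define N 1024   /* illustrative element count */

/* Array of structures: successive x values are separated by y and z,
   so a loop over x does not access memory with a stride of one. */
struct point { float x, y, z; };

void shift_x_aos(struct point p[N], float dx)
{
    for (size_t i = 0; i < N; i++)
        p[i].x += dx;
}

/* Single structure containing arrays: all x values are contiguous,
   so the same loop accesses memory with a stride of one. */
struct points { float x[N]; float y[N]; float z[N]; };

void shift_x_soa(struct points *p, float dx)
{
    for (size_t i = 0; i < N; i++)
        p->x[i] += dx;
}

/* Number of elements of a given type that fit in one 128-bit (16-byte)
   quadword register, for example 4 for int32_t and 16 for int8_t. */
#define LANES_PER_Q(type) (16u / sizeof(type))
```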

The iteration counts of loops

Longer iteration counts are generally better, because the loop overhead is amortized over more iterations. Tiny iteration counts, such as two or three elements, can be faster to process with non-vector instructions. Extremely long loops, accessing tens of thousands of array elements, can exceed the size of the cache and interfere with data reuse.
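One common way to keep long loops from defeating data reuse is to process the array in blocks, so that a second pass over a block finds its data still in the cache. This technique is not described in this section and is shown only as a sketch. The function name and the BLOCK value are hypothetical, and a suitable block size depends on the cache of the target system.

```c
#include <stddef.h>

#define BLOCK 1024   /* hypothetical block size, to be tuned per target */

/* Scales every element by k, then sums the scaled values. Both passes
   run over one block at a time, so the second pass reuses data that
   the first pass has just touched. */
float scale_then_sum_blocked(float *a, size_t n, float k)
{
    float sum = 0.0f;

    for (size_t base = 0; base < n; base += BLOCK) {
        size_t end = (n - base < BLOCK) ? n : base + BLOCK;

        for (size_t i = base; i < end; i++)   /* pass 1: stride-one scale */
            a[i] *= k;
        for (size_t i = base; i < end; i++)   /* pass 2: stride-one sum */
            sum += a[i];
    }
    return sum;
}
```

Each inner loop still has a long, stride-one iteration range, so the loop overhead remains amortized over many iterations.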

The data type of arrays

For example, NEON does not improve performance when double-precision floating-point arrays are used.

The use of memory hierarchy

Most current processors are relatively unbalanced between memory bandwidth and processor capacity. For example, performing relatively few arithmetic operations on large data sets retrieved from main memory is limited by the memory bandwidth of the system.
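The balance between arithmetic and memory traffic can be illustrated with two loops. The function names and polynomial coefficients are hypothetical. The first loop performs one addition for every two loads and one store, so on large arrays it is likely to be limited by memory bandwidth. The second performs eight arithmetic operations for every load and store, so it has more to gain from parallel arithmetic.

```c
#include <stddef.h>

/* Low arithmetic intensity: one add per two loads and one store. */
void add_arrays(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Higher arithmetic intensity: a degree-4 polynomial evaluated with
   Horner's rule, four multiplies and four adds per load and store. */
void poly4(float *y, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float v = x[i];
        y[i] = (((0.5f * v + 1.0f) * v + 2.0f) * v + 3.0f) * v + 4.0f;
    }
}
```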

Copyright © 2007 ARM Limited. All rights reserved. ARM DUI 0350A
Non-Confidential