ARM DUI0472J (Non-Confidential)
Factors affecting NEON vectorization performance
The automatic vectorization process and the performance of the generated code are affected by a number of criteria:
For best performance, the innermost loop in a loop nest must access arrays with a stride of one.
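As an illustration of stride-one access, the following sketch shows the same operation written with two loop orders. The function names, array dimensions, and the scaling operation are assumptions chosen for the example. C arrays are stored row-major, so the first form walks adjacent elements in the innermost loop, while the second form steps `COLS` elements between consecutive iterations.

```c
#include <stddef.h>

#define ROWS 64
#define COLS 64

/* Stride one: the innermost loop varies j, the last (contiguous) index
   of a row-major C array, so consecutive iterations touch adjacent
   elements. */
void scale_stride_one(float a[ROWS][COLS], float k)
{
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            a[i][j] *= k;
}

/* Stride of COLS: the innermost loop varies i, so consecutive
   iterations are COLS elements apart. The result is identical, but
   the access pattern is less favorable for vectorization. */
void scale_strided(float a[ROWS][COLS], float k)
{
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            a[i][j] *= k;
}
```

Both functions compute the same values. Only the order in which memory is visited differs.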
The data type dictates how many data elements can be held in a NEON register, and therefore how many operations can be performed in parallel.
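The relationship between element size and parallelism can be sketched as simple arithmetic. This example assumes the 128-bit NEON quadword register. The helper name `lanes_per_q_register` is hypothetical and exists only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Width of a NEON quadword (Q) register in bytes: 128 bits. */
#define NEON_Q_BYTES 16u

/* Hypothetical helper: the number of elements of the given size that
   fit in one Q register, which is the most operations a single vector
   instruction can perform in parallel on that type.
   elem_bytes must be non-zero. */
size_t lanes_per_q_register(size_t elem_bytes)
{
    return NEON_Q_BYTES / elem_bytes;
}
```

For example, `lanes_per_q_register(sizeof(int8_t))` gives 16, `lanes_per_q_register(sizeof(int16_t))` gives 8, and `lanes_per_q_register(sizeof(int32_t))` gives 4, so narrowing the element type where the algorithm allows it increases the available parallelism.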
Longer iteration counts are generally better, because the loop overhead is amortized over more iterations. Very short iteration counts, such as two or three elements, can be faster to process with non-vector instructions.
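The effect of the iteration count can be seen in a conceptual sketch of how a vectorized loop is structured. This is plain C written by hand, not the code the compiler actually generates. The function name `add_arrays` and the assumption of four 32-bit elements per vector are illustrative only.

```c
#include <stddef.h>

#define LANES 4  /* assumed: four 32-bit floats per 128-bit register */

/* Adds b to a element by element. The main loop handles LANES elements
   per trip, standing in for one vector instruction. The tail loop
   handles the remaining n % LANES elements one at a time. */
void add_arrays(float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Vectorizable body: runs n / LANES times. */
    for (; i + LANES <= n; i += LANES) {
        a[i + 0] += b[i + 0];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }

    /* Scalar tail: when n < LANES this is the only loop that runs. */
    for (; i < n; i++)
        a[i] += b[i];
}
```

With a count of two or three, the vector body never executes and only its setup cost remains, which is why such loops can be faster as ordinary scalar code.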
The data type of the arrays also matters. For example, NEON does not improve performance when double-precision floating-point arrays are used.
Most current processors are relatively unbalanced between memory bandwidth and processing capacity. For example, a loop that performs relatively few arithmetic operations on large data sets retrieved from main memory is limited by the memory bandwidth of the system, not by how fast the processor can compute.
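A rough way to reason about this balance is to compare the arithmetic a loop performs with the memory traffic it generates. The sketch below is an illustration only. The names `vec_add` and `ops_per_byte` are hypothetical, and the byte counts assume 4-byte `float` elements with no reuse from cache.

```c
#include <stddef.h>

/* One addition per iteration, against two 4-byte loads and one 4-byte
   store: 1 operation per 12 bytes of memory traffic. */
void vec_add(float *dst, const float *x, const float *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = x[i] + y[i];
}

/* Hypothetical helper: arithmetic operations per byte moved. A low
   value suggests the loop is limited by memory bandwidth rather than
   by arithmetic throughput. bytes must be non-zero. */
double ops_per_byte(double ops, double bytes)
{
    return ops / bytes;
}
```

For `vec_add`, `ops_per_byte(1.0, 12.0)` is roughly 0.083. When the arrays are too large to stay in cache, a loop like this is likely to wait on memory regardless of how quickly the additions themselves execute.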