3.6 Automatic vectorization

Automatic vectorization involves the high-level analysis of loops in your code. This is the most efficient way to map the majority of typical code onto the functionality of the NEON unit.

For most code, the gains that can be made with algorithm-dependent parallelism on a smaller scale are very small relative to the cost of automatic analysis of such opportunities. For this reason, the NEON unit is designed as a target for loop-based parallelism.

Vectorization is carried out in a way that ensures that optimized code gives the same results as nonvectorized code. In certain cases, to avoid the possibility of an incorrect result, vectorization of a loop is not carried out. This can lead to suboptimal code, and you might have to manually tune your code to make it more suitable for automatic vectorization.
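
As an illustration of a loop that a compiler might decline to vectorize for correctness reasons, consider possible overlap between arrays accessed through pointers. The following is a minimal sketch, not taken from the compiler documentation: the function names `scale_may_alias` and `scale_no_alias` are hypothetical, and the second version assumes a C99 mode in which the `restrict` qualifier is accepted.

```c
#include <stddef.h>

/* Hypothetical example. dst and src might overlap, so the compiler may
   have to assume that a store to dst[i] can change a later src[j].
   Under that assumption, processing several elements at once could give
   a different result from the scalar loop. */
void scale_may_alias(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * k;
    }
}

/* Same loop, manually tuned: restrict is a promise from the programmer
   that dst and src do not overlap (assumes C99). The results are the same
   as above whenever the promise holds. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    size_t n, float k)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * k;
    }
}
```

Both functions compute the same values when the arrays are distinct. Whether either loop is actually vectorized depends on the compiler, the target, and the options in use, so treat this only as a sketch of the kind of tuning that can help.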

Automatic vectorization is also often impeded by earlier manual optimization attempts, such as manual loop unrolling in the source code or complex array accesses. For optimal results, write code using simple loops so that the compiler can perform the optimization. For hand-optimized legacy code, it can be easier to rewrite critical portions of the code using simple loops, based on the original algorithm.
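
To make the contrast concrete, the sketch below shows the same element-wise addition written twice: once hand-unrolled by four, and once as a simple loop. The function names `add_unrolled` and `add_simple` are hypothetical, and the unrolled version assumes that `n` is a multiple of four.

```c
#include <stddef.h>

/* Hand-unrolled legacy style (hypothetical). Assumes n is a multiple
   of 4. The repeated statements and the stride of 4 can obscure the
   underlying one-element-per-iteration pattern from the compiler. */
void add_unrolled(int *out, const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
}

/* The same algorithm as a simple loop: one element per iteration,
   unit stride, no manual unrolling. This form leaves the choice of
   unrolling and vector width to the compiler. */
void add_simple(int *out, const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

The two functions produce identical results for any `n` that is a multiple of four. The simple form also handles other values of `n`, and it is typically the easier one for an automatic vectorizer to analyze, although the outcome for a particular compiler and target is not guaranteed.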

Writing vectorizable loops that the compiler maps onto the NEON extensions, instead of writing explicit NEON instructions, preserves code portability between processors. It also achieves performance similar to that of hand-coded vectorization with less effort.

Related concepts
3.8 Stride patterns and data accesses
3.9 Factors affecting NEON vectorization performance
3.12 Data dependency conflicts when vectorizing code
3.13 Carry-around scalar variables and vectorization
3.14 Reduction of a vector to a scalar
3.15 Vectorization on loops containing pointers
3.16 Nonvectorization on loops containing pointers and indirect addressing
3.17 Nonvectorization on conditional loop exits
3.18 Vectorizable loop iteration counts
3.19 Indicating loop iteration counts to the compiler with __promise(expr)
3.20 Grouping structure accesses for vectorization
3.21 Vectorization and struct member lengths
3.22 Nonvectorization of function calls to non-inline functions from within loops
3.23 Conditional statements and efficient vectorization
3.25 Vectorizable code example
3.26 DSP vectorizable code example
Related reference
8.189 --vectorize, --no_vectorize
3.7 Data references within a vectorizable loop
3.10 NEON vectorization performance goals
3.11 Recommended loop structure for vectorization
3.27 What can limit or prevent automatic vectorization
Non-Confidential ARM DUI0472J
Copyright © 2010-2013 ARM. All rights reserved.