4.6.4. Automatic vectorization

ARM Compiler and GCC can also vectorize C and C++ source code automatically. This lets you benefit from NEON performance without writing assembly code or using intrinsics, and the source code remains portable across different tools and target platforms.

Because the C language does not specify parallelizing behavior, the compiler cannot always determine that vectorizing a loop is safe. You can provide this information in the source code without compromising its portability among different platforms or toolchains.
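One reason the compiler must be cautious is pointer aliasing. The sketch below uses two hypothetical helpers: add_scalar follows the C semantics of a simple add loop, and add_4wide is a simplified model (not real NEON code) of a four-lane vector loop that loads four elements before storing any of them. When the destination overlaps the source, for example pa = pb + 1, the two give different results, so the compiler can only use the vector form when it knows the regions do not overlap.

```c
/* C semantics: each iteration sees the stores made by earlier iterations. */
void add_scalar(int *pa, const int *pb, unsigned int n, int x)
{
    for (unsigned int i = 0; i < n; i++)
        pa[i] = pb[i] + x;
}

/* Hypothetical model of a four-lane vector loop: load four, then store four.
   Assumes n is a multiple of four. */
void add_4wide(int *pa, const int *pb, unsigned int n, int x)
{
    for (unsigned int i = 0; i + 4 <= n; i += 4) {
        int lane[4];
        for (unsigned int k = 0; k < 4; k++)   /* all loads first */
            lane[k] = pb[i + k] + x;
        for (unsigned int k = 0; k < 4; k++)   /* then all stores */
            pa[i + k] = lane[k];
    }
}
```

Called with pa = buf + 1 and pb = buf on a zeroed nine-element buffer, n = 8 and x = 1, the scalar version writes 1, 2, ..., 8, while the four-lane model writes 1, 1, 1, 1, 2, 1, 1, 1.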

Example 4.10 shows a small function that the compiler can safely and optimally vectorize. Two pieces of information make this possible. The __restrict keyword tells the compiler that the pointers pa and pb do not address overlapping regions of memory. Masking off the bottom two bits of n in the limit test forces the for loop to execute a multiple of four iterations. This extra information makes it safe for the compiler to vectorize this function into NEON load and store operations.

Example 4.10. NEON vectorization

void add_ints(int * __restrict pa, int * __restrict pb, unsigned int n, int x)
{
	unsigned int i;
	/* n & ~3 clears the bottom two bits, so the trip count is a multiple of four */
	for(i = 0; i < (n & ~3); i++)
		pa[i] = pb[i] + x;
}
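Masking the limit with n & ~3 means that add_ints leaves up to three trailing elements unprocessed when n is not a multiple of four. If every element must be handled, one option is a short scalar remainder loop after the vectorizable one. The function add_ints_all is a hypothetical sketch of that approach, not part of the original example.

```c
/* Hypothetical variant of add_ints that processes all n elements. */
void add_ints_all(int * __restrict pa, int * __restrict pb, unsigned int n, int x)
{
    unsigned int i;
    unsigned int n4 = n & ~3u;   /* largest multiple of four not above n */

    for (i = 0; i < n4; i++)     /* multiple-of-four part, as in add_ints */
        pa[i] = pb[i] + x;

    for (; i < n; i++)           /* scalar tail: at most three iterations */
        pa[i] = pb[i] + x;
}
```

Only the first loop has the multiple-of-four trip count; the tail loop simply finishes the remaining zero to three elements.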

Compiling the example

Although ARM Compiler and GNU development tools support the same source syntax, the command-line syntax differs significantly between the two compilers.

Automatic vectorization with ARM Compiler

To enable automatic vectorization, specify a target processor that includes NEON technology, compile for optimization level -O2 or higher, and add -Otime and --vectorize to the command line. For example:

armcc --cpu=Cortex-A9 -O3 -Otime --vectorize -c vectorized.c


When you specify --vectorize, you must also specify -Otime and an optimization level of -O2 or -O3 to enable automatic vectorization.

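Vectorizing a floating-point accumulation changes the order in which values are added. Because floating-point addition is not associative, the result can differ from the scalar result; for example, parallel accumulation can lose the precision gained by sorting input data. Such accumulations are therefore not vectorized unless you specify --fpmode=fast on the command line.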
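The effect can be sketched in portable C by modeling a four-lane accumulation explicitly. The hypothetical sum_4lane keeps four partial sums, as a vectorized reduction would, and combines them at the end; sum_sequential adds in source order. This is an illustration of the reordering, not the code that either compiler generates.

```c
/* Adds the elements strictly in source order. */
float sum_sequential(const float *a, unsigned int n)
{
    float s = 0.0f;
    for (unsigned int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hypothetical model of a four-lane reduction: four partial sums,
   combined at the end. Assumes n is a multiple of four. */
float sum_4lane(const float *a, unsigned int n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (unsigned int i = 0; i + 4 <= n; i += 4)
        for (unsigned int k = 0; k < 4; k++)
            lane[k] += a[i + k];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

With the input {1e8f, 1.0f, 1.0f, 1.0f, -1e8f, 0.0f, 0.0f, 0.0f}, the sequential sum absorbs each 1.0f into 1e8f and returns 0.0f, while the four-lane model cancels 1e8f against -1e8f in one lane and returns 3.0f.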

You can request more verbose compiler output by adding --remarks to the command line. This provides additional information about many aspects of the compilation taking place. For NEON vectorization, this includes the following information:

  • Code that the compiler has vectorized.

  • Code that could not be vectorized, and hints of why this was not done.

This information can be used to modify the code into a format that the compiler can vectorize.
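As an illustration of the kind of rewrite this information can prompt (a sketch, not actual output from either compiler), the hypothetical fill_recurrence reads the element written by the previous iteration. Such a loop-carried dependence commonly prevents vectorization. The hypothetical fill_closed_form computes the same values with independent iterations.

```c
/* Recurrence form: each iteration depends on the previous store. */
void fill_recurrence(unsigned int *a, unsigned int n, unsigned int x)
{
    for (unsigned int i = 1; i < n; i++)
        a[i] = a[i - 1] + x;
}

/* Closed form: a[i] = a[0] + i * x, so iterations are independent.
   Unsigned arithmetic wraps modulo 2^32, so both forms agree even on overflow. */
void fill_closed_form(unsigned int *a, unsigned int n, unsigned int x)
{
    unsigned int base = a[0];
    for (unsigned int i = 1; i < n; i++)
        a[i] = base + i * x;
}
```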

Automatic vectorization with GCC

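To enable automatic vectorization, you must add -mfpu=neon and -ftree-vectorize to the GCC command line, together with an optimization level such as -O2, because GCC does not run the vectorizer at -O0. For example:

arm-none-linux-gnueabi-gcc -O2 -mfpu=neon -ftree-vectorize -c vectorized.c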

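Depending on your toolchain, you might also have to add -mfloat-abi=softfp. If the toolchain defaults to -mfloat-abi=soft, GCC does not generate NEON instructions; softfp permits them while still passing floating-point and NEON arguments in general-purpose registers.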

You can request more verbose compiler output by adding -ftree-vectorizer-verbose=1 to the command line. This gives the following compiler output:

  • Code that it has vectorized.

  • Code that it could not vectorize, and hints of why this was not done.

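You can use this information to modify the code into a format that the compiler can vectorize. Some versions of GCC support verbosity values higher than 1, providing even more details about vectorization. Later GCC releases deprecate -ftree-vectorizer-verbose in favor of -fopt-info-vec, and -fopt-info-vec-missed reports the loops that were not vectorized.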

Copyright © 2014 ARM. All rights reserved. ARM DAI0425