The ARM® compiler provides NEON intrinsics as an intermediate step for SIMD code generation, between relying on a vectorizing compiler and writing assembler code by hand. Intrinsics make it easier to write code that takes advantage of the NEON architecture than writing assembler directly.
The intrinsics described in this appendix map closely to NEON instructions in the ARMv7 architecture. Each section begins with a list of function prototypes, each with a comment specifying an equivalent assembler instruction. The compiler selects an instruction that has the required semantics, but there is no guarantee that it emits the listed instruction.
There is no support for NEON intrinsics for architectures before ARMv7. When building for earlier architectures, or for ARMv7 architecture profiles that do not include NEON, the compiler treats NEON intrinsics as ordinary function calls. This results in an error at link time.
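One way to keep a single source file buildable for targets with and without NEON is to guard the intrinsics with a feature-test macro and provide a scalar fallback. The following is a minimal sketch, not a documented recipe. It assumes that your compiler predefines `__ARM_NEON__` (or the later `__ARM_NEON` spelling) when NEON is available, which you should confirm against your toolchain's list of predefined macros. The names `HAVE_NEON` and `add4_s16` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Assumption: the compiler predefines one of these macros when the
 * selected --cpu/--fpu options include NEON. Check your toolchain. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical helper macro */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for four 16-bit lanes. */
void add4_s16(int16_t *dst, const int16_t *a, const int16_t *b)
{
#if HAVE_NEON
    /* load both operands, add lane by lane, store the result */
    vst1_s16(dst, vadd_s16(vld1_s16(a), vld1_s16(b)));
#else
    /* scalar fallback with the same wrap-around behavior per lane */
    int i;
    for (i = 0; i < 4; i++)
    {
        dst[i] = (int16_t)(a[i] + b[i]);
    }
#endif
}
```

With a guard like this, a build for a target without NEON never references the intrinsics, so the link-time error described above does not arise.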
For more information about NEON, see the RealView Compilation Tools v3.0 Assembler Guide.
The NEON intrinsics described in this appendix are defined in the arm_neon.h header file. The header file defines both the intrinsics and a set of vector types. See ARM DAI 0156A: Using Neon Intrinsics with RVDS 3.0 in install_directory\Documentation\Specifications for more information.
Example F.1 shows a short example using NEON intrinsics. To build the example:
Compile the C file with the following options:
armcc -c --debug --cpu Cortex-A8 neon_example.c
Link the image using the command:
armlink neon_example.o -o neon_example.axf
Use a compatible debugger, for example, RealView Debugger, to load and run the image.
Example F.1. NEON intrinsics
/* neon_example.c - Neon intrinsics example program */
#include <stdint.h>
#include <stdio.h>
#include <assert.h>
#include <arm_neon.h>
/* fill array with increasing integers beginning with 0 */
void fill_array(int16_t *array, int size)
{
    int i;
    for (i = 0; i < size; i++)
    {
        array[i] = i;
    }
}

/* return the sum of all elements in an array. This works by calculating
   4 totals (one for each lane) and adding those at the end to get the
   final total */
int sum_array(int16_t *array, int size)
{
    /* initialise the accumulator vector to zero */
    int16x4_t acc = vdup_n_s16(0);
    int32x2_t acc1;
    int64x1_t acc2;

    /* this implementation assumes the size of the array is a multiple of 4 */
    assert((size % 4) == 0);

    /* counting backwards gives better code */
    for (; size != 0; size -= 4)
    {
        int16x4_t vec;
        /* load 4 values in parallel from the array */
        vec = vld1_s16(array);
        /* increment the array pointer to the next element */
        array += 4;
        /* add the vector to the accumulator vector */
        acc = vadd_s16(acc, vec);
    }

    /* calculate the total */
    acc1 = vpaddl_s16(acc);
    acc2 = vpaddl_s32(acc1);

    /* return the total as an integer */
    return (int)vget_lane_s64(acc2, 0);
}

/* main function */
int main()
{
    int16_t my_array[100];
    fill_array(my_array, 100);
    printf("Sum was %d\n", sum_array(my_array, 100));
    return 0;
}
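Example F.1 accumulates in 16-bit lanes and requires the array size to be a multiple of 4. As a result, a lane total can wrap for larger inputs, and other sizes trip the assertion. The following variant is a hedged sketch of one way to relax both limits, not part of the original example. It widens each lane to 32 bits with `vaddw_s16` before accumulating, and a scalar loop handles any leftover elements. The names `sum_array_wide` and `HAVE_NEON` are hypothetical. The feature-test macros are assumed to be predefined by the compiler when NEON is available. Without NEON, the scalar loop computes the whole sum.

```c
#include <stdint.h>

/* Assumption: one of these macros is predefined when NEON is enabled. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical helper macro */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical variant of sum_array: accepts any non-negative size and
 * keeps 32-bit lane totals instead of 16-bit ones. */
int32_t sum_array_wide(const int16_t *array, int size)
{
    int32_t total = 0;
    int i = 0;

#if HAVE_NEON
    {
        /* four 32-bit lane totals, initialised to zero */
        int32x4_t acc = vdupq_n_s32(0);
        int64x2_t pairs;

        /* consume whole groups of 4 elements */
        for (; i + 4 <= size; i += 4)
        {
            /* widen the four 16-bit values to 32 bits and accumulate */
            acc = vaddw_s16(acc, vld1_s16(array + i));
        }

        /* pairwise add the lanes, then combine the two partial sums */
        pairs = vpaddlq_s32(acc);
        total = (int32_t)(vgetq_lane_s64(pairs, 0) + vgetq_lane_s64(pairs, 1));
    }
#endif

    /* scalar tail: the leftover 0 to 3 elements, or everything if no NEON */
    for (; i < size; i++)
    {
        total += array[i];
    }

    return total;
}
```

For the 100-element array filled by fill_array, this sketch returns 4950, the same total as sum_array. The 32-bit lanes can still overflow for very long arrays of large values, so the widening raises the limit rather than removing it.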