3.9.4. Example: Swapping color channels

The following example shows you the syntax and usage of NEON instructions.

This example involves a 24-bit RGB image. The red (R), green (G), and blue (B) pixels are arranged in memory in the sequence R0, G0, B0, R1, G1, B1, and so on. R0 is the first red pixel, R1 is the second red pixel, and so on.

This example shows you how to swap red and blue channels so that the sequence in memory becomes B0, G0, R0, B1, G1, R1, and so on. This is a simple signal processing operation, which NEON instructions can perform efficiently.

Figure 3.11 shows a normal load that pulls consecutive R, G, and B data from memory into registers. Code to swap channels based on this input requires mask, shift, and combine operations. It is not elegant and is unlikely to be efficient.

Figure 3.11. Loading RGB data with a linear load

NEON provides structure load and store instructions to help in these situations, as shown in Figure 3.12. These instructions pull in data from memory and simultaneously separate values into different registers. For this example, you can use VLD3 to split up red, green, and blue when they are loaded.

Structure load instructions read data from memory into 64-bit NEON registers, with optional de-interleaving. Structure store instructions work similarly, reinterleaving data from registers before writing it to memory.
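
The effect of a structure load and store on the RGB example can be modeled in portable C. The following is a minimal sketch, not NEON code. The function names model_vld3_8, model_vst3_8, and swap_red_blue are hypothetical, each 64-bit register is represented as an array of eight bytes, and the pixel count is assumed to be a multiple of eight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 };  /* a 64-bit register holds eight 8-bit lanes */

/* Scalar model of a three-register structure load such as
   VLD3.8 {d0, d1, d2}, [r0]: reads 24 consecutive bytes and
   de-interleaves them into three 8-lane registers. */
static void model_vld3_8(const uint8_t *src, uint8_t d0[LANES],
                         uint8_t d1[LANES], uint8_t d2[LANES])
{
    for (int lane = 0; lane < LANES; lane++) {
        d0[lane] = src[3 * lane + 0];
        d1[lane] = src[3 * lane + 1];
        d2[lane] = src[3 * lane + 2];
    }
}

/* Scalar model of the matching structure store such as
   VST3.8 {d0, d1, d2}, [r0]: re-interleaves three registers
   into 24 consecutive bytes. */
static void model_vst3_8(uint8_t *dst, const uint8_t d0[LANES],
                         const uint8_t d1[LANES], const uint8_t d2[LANES])
{
    for (int lane = 0; lane < LANES; lane++) {
        dst[3 * lane + 0] = d0[lane];
        dst[3 * lane + 1] = d1[lane];
        dst[3 * lane + 2] = d2[lane];
    }
}

/* Swaps the red and blue channels of an RGB image in place.
   Assumption of this sketch: n_pixels is a multiple of LANES. */
void swap_red_blue(uint8_t *rgb, size_t n_pixels)
{
    assert(n_pixels % LANES == 0);
    for (size_t i = 0; i < n_pixels; i += LANES) {
        uint8_t r[LANES], g[LANES], b[LANES];
        uint8_t *p = rgb + 3 * i;
        model_vld3_8(p, r, g, b);
        /* Storing with r and b exchanged does the swap; in NEON code a
           register swap such as VSWP could play the same role. */
        model_vst3_8(p, b, g, r);
    }
}
```

Because the load has already separated the channels, the swap itself needs no mask, shift, or combine steps.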

Figure 3.12. NEON structure loads and stores

Syntax

Figure 3.13 shows the syntax of the structure load and store instructions. The syntax consists of five parts.

Figure 3.13. The structure load and store syntax

  • The load instructions start with VLD, and the store instructions start with VST.

  • A numeric interleave pattern, from 1 to 4, gives the number of elements in each structure. This is also the gap, in elements, between corresponding elements of consecutive structures in memory.

  • An element size specifies the number of bits in the accessed elements.

  • A list of 64-bit NEON registers to load from or store to memory. The list can contain up to four registers, depending on the interleave pattern.

  • An ARM address register contains the memory location to load from or store to. The instruction can optionally update the address after the access.
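
To make the five parts concrete, the following sketch assembles the text of an instruction from them. The helper format_structure_access and its parameters are hypothetical and exist only for illustration. The sketch assumes the element widths of 8, 16, and 32 bits named in this section, and it assumes that a trailing "!" on the address operand requests the address update.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: builds the assembly text of a structure load or
   store from its five parts. Returns the value returned by snprintf. */
int format_structure_access(char *out, size_t out_size,
                            int is_load,          /* part 1: VLD or VST        */
                            int interleave,       /* part 2: pattern, 1 to 4   */
                            int element_bits,     /* part 3: element size      */
                            const char *reg_list, /* part 4: NEON registers    */
                            const char *addr_reg, /* part 5: address register  */
                            int writeback)        /* update address afterwards */
{
    assert(interleave >= 1 && interleave <= 4);
    assert(element_bits == 8 || element_bits == 16 || element_bits == 32);
    return snprintf(out, out_size, "%s%d.%d {%s}, [%s]%s",
                    is_load ? "VLD" : "VST", interleave, element_bits,
                    reg_list, addr_reg, writeback ? "!" : "");
}
```

For example, calling the helper with is_load = 1, interleave = 3, element_bits = 8, the register list "d0, d1, d2", the address register "r0", and writeback = 1 yields the text VLD3.8 {d0, d1, d2}, [r0]!.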

Interleave pattern

Instructions can load, store, and de-interleave structures that contain from one to four equally sized elements. These elements usually have NEON-supported widths of 8, 16, or 32 bits.

  • VLD1 is the simplest form. It loads one to four registers of data from memory, with no de-interleaving. Use it when processing an array of non-interleaved data.

  • VLD2 loads two or four registers of data, and de-interleaves even and odd elements into those registers. For example, use it to separate stereo audio data into left and right channels.

  • VLD3 loads three registers and de-interleaves. For example, use it to split RGB pixels into channels.

  • VLD4 loads four registers and de-interleaves. For example, use it to process ARGB image data.

Store instructions support the same options, but interleave the data from registers before writing them to memory.
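
The de-interleaving patterns above follow one rule: element e of structure s lands in register e, lane s. The sketch below models that rule in portable C for 8-bit elements and patterns 2 to 4, with one register per element. The names model_vldn_8 and model_vstn_8 are hypothetical, and the sketch does not cover VLD1 or the four-register form of VLD2.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 8, MAX_REGS = 4 };  /* eight 8-bit lanes per 64-bit register */

/* Scalar model of a de-interleaving load of n registers (n = 2, 3, or 4):
   element r of structure number `lane` goes to regs[r][lane]. */
void model_vldn_8(const uint8_t *src, int n, uint8_t regs[][LANES])
{
    assert(n >= 2 && n <= MAX_REGS);
    for (int lane = 0; lane < LANES; lane++)
        for (int r = 0; r < n; r++)
            regs[r][lane] = src[n * lane + r];
}

/* Scalar model of the matching store: interleaves the n registers
   back into consecutive structures in memory. */
void model_vstn_8(uint8_t *dst, int n, uint8_t regs[][LANES])
{
    assert(n >= 2 && n <= MAX_REGS);
    for (int lane = 0; lane < LANES; lane++)
        for (int r = 0; r < n; r++)
            dst[n * lane + r] = regs[r][lane];
}
```

With n = 2 the model separates even and odd bytes, as for stereo samples. With n = 4 it separates four-channel pixels. A load followed by a store with the same n reproduces the original memory contents.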

Element sizes

Load and store instructions interleave and de-interleave elements based on the element size specified in the instruction. For example, loading two NEON registers with VLD2.16 places four 16-bit elements in each register. Each adjacent pair of elements in memory is split so that the even-numbered element goes to the first register and the odd-numbered element goes to the second, as shown in Figure 3.14.

Figure 3.14. Loading and de-interleaving 16-bit data

Changing the element size to 32 bits causes the same amount of data to be loaded. However, only two elements make up each vector, because each element is 32 bits rather than 16 bits. This still separates the even and odd elements from memory into separate registers, as shown in Figure 3.15.
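
The difference between the 16-bit and 32-bit cases can be checked with a small portable C model. This is a sketch under stated assumptions: the name model_vld2 is hypothetical, each register is an array of eight bytes, and the model shows only which memory element lands in which register lane, not the byte order inside a real register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { REG_BYTES = 8 };  /* one 64-bit register */

/* Scalar model of a two-register de-interleaving load with a chosen
   element size: even-numbered elements go to d0, odd-numbered to d1.
   The same 16 bytes are read whatever the element size. */
void model_vld2(const uint8_t *src, int element_bits,
                uint8_t d0[REG_BYTES], uint8_t d1[REG_BYTES])
{
    assert(element_bits == 8 || element_bits == 16 || element_bits == 32);
    int size = element_bits / 8;   /* bytes per element   */
    int lanes = REG_BYTES / size;  /* elements per register */
    for (int lane = 0; lane < lanes; lane++) {
        memcpy(d0 + lane * size, src + (2 * lane) * size, (size_t)size);
        memcpy(d1 + lane * size, src + (2 * lane + 1) * size, (size_t)size);
    }
}
```

For sixteen source bytes numbered 0 to 15, a 16-bit element size gives d0 the bytes 0, 1, 4, 5, 8, 9, 12, 13, while a 32-bit element size gives d0 the bytes 0, 1, 2, 3, 8, 9, 10, 11.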

Figure 3.15. Loading and de-interleaving 32-bit data

Element size also affects endianness handling. In general, if you specify the correct element size in load and store instructions, bytes are read from memory in an appropriate order, and the same code works on little- and big-endian systems.

Finally, element size affects pointer alignment. Aligning addresses to the element size generally gives better performance, and your target operating system might require it. For example, when loading 32-bit elements, align the address of the first element to at least 32 bits.
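
An alignment check of this kind might be written as follows. This is a minimal sketch in portable C. The function name is_element_aligned is hypothetical, and the sketch assumes the element widths of 8, 16, and 32 bits used in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if ptr is aligned to the given element size, 0 otherwise.
   For 32-bit elements this checks that the address is a multiple of 4. */
int is_element_aligned(const void *ptr, int element_bits)
{
    assert(element_bits == 8 || element_bits == 16 || element_bits == 32);
    return ((uintptr_t)ptr % (uintptr_t)(element_bits / 8)) == 0;
}
```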

Single or multiple elements

Structure load instructions de-interleave from memory. In each NEON register, they support the following loading methods:

  • Load multiple lanes with different elements, as shown in Figure 3.12.

  • Load multiple lanes with the same element, as shown in Figure 3.16.

  • Load a single lane with a single element and leave other lanes unaffected, as shown in Figure 3.17.

Figure 3.16. Loading and de-interleaving to all vector lanes

When you construct a vector from data scattered in memory, it is efficient to load a single lane with a single element and leave other lanes unaffected.

Figure 3.17. Loading and de-interleaving to a single vector lane

Stores are similar, and they provide support for writing single or multiple elements with interleaving.
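
The all-lanes and single-lane loading methods can be modeled in portable C for a three-element structure of 8-bit values. This is a sketch, not NEON code: the function names are hypothetical, each register is an array of eight bytes, and the instruction forms quoted in the comments are given as illustrations of what the model corresponds to.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 8 };  /* eight 8-bit lanes per 64-bit register */

/* Model of an all-lanes load, such as VLD3.8 {d0[], d1[], d2[]}, [r0]:
   reads one 3-element structure and copies each element to every lane
   of its register. */
void model_vld3_8_all_lanes(const uint8_t *src, uint8_t d0[LANES],
                            uint8_t d1[LANES], uint8_t d2[LANES])
{
    for (int lane = 0; lane < LANES; lane++) {
        d0[lane] = src[0];
        d1[lane] = src[1];
        d2[lane] = src[2];
    }
}

/* Model of a single-lane load, such as VLD3.8 {d0[2], d1[2], d2[2]}, [r0]:
   reads one 3-element structure into the chosen lane and leaves every
   other lane unchanged. */
void model_vld3_8_one_lane(const uint8_t *src, int lane, uint8_t d0[LANES],
                           uint8_t d1[LANES], uint8_t d2[LANES])
{
    assert(lane >= 0 && lane < LANES);
    d0[lane] = src[0];
    d1[lane] = src[1];
    d2[lane] = src[2];
}
```

Calling the single-lane model once per lane, each time with a different source address, builds a vector from structures scattered in memory.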

Other loads and stores

In addition to structure loads and stores, NEON provides the following load and store instructions:

  • VLDR - Loads a single register as a 64-bit value.

  • VSTR - Stores a single register as a 64-bit value.

  • VLDM - Loads multiple registers as 64-bit values.

  • VSTM - Stores multiple registers as 64-bit values.

VLDM and VSTM are useful for storing and retrieving registers from the stack.

Copyright © 2014 ARM. All rights reserved. ARM DAI0425
Non-Confidential ID080414