16.6.7. Advanced SIMD load/store instructions

Advanced SIMD load/store instructions can be divided into the following subcategories:

VLDR and VSTR instructions transfer a single 64-bit register and require two issue cycles. Processor scheduling is static, and it is not possible to know the address alignment at schedule time. Therefore, scheduling of the VLDR and VSTR instructions must be done assuming the load/store address is not 128-bit aligned.

VLDM and VSTM instructions transfer multiple 64-bit registers. The number of registers in the register list determines the number of cycles required to execute a load or store multiple. The NEON unit can load or store two 64-bit registers in each cycle. The number of cycles required to execute a VLDM or VSTM instruction is given by the following formula:

(number of registers / 2, rounded down) + mod(number of registers, 2) + 1

For example, a VLDM or VSTM transfer of one or two registers requires two cycles, three or four registers require three cycles, five or six registers require four cycles, and 15 or 16 registers require nine cycles.
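The formula above can be checked against the listed examples with a short sketch. It assumes that the division in the formula is integer (truncating) division, which is the reading that reproduces the examples. The function name `vldm_vstm_cycles` and the 1 to 16 register bound are illustrative assumptions, not part of any ARM tool.

```python
def vldm_vstm_cycles(num_registers):
    """Cycles for a VLDM/VSTM that transfers num_registers 64-bit registers.

    Assumption: the '/' in the formula is integer (truncating) division.
    """
    # Assumed bound, taken from the range covered by the examples in the text.
    if not 1 <= num_registers <= 16:
        raise ValueError("expected a register count from 1 to 16")
    return num_registers // 2 + num_registers % 2 + 1


# Print the cycle count for the register counts used as examples in the text.
for n in (1, 2, 3, 4, 5, 6, 15, 16):
    print(n, vldm_vstm_cycles(n))
```

Under this reading, each pair of registers costs one cycle, an odd leftover register costs one more, and one further cycle is always added.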

VLD and VST element and structure load/store instructions transfer from one to four 64-bit registers. The number of cycles required to execute a VLD or VST instruction depends on both the number of registers in the register list and the alignment requirement. Typically, specifying a stronger alignment reduces the number of cycles. For example, a 2-register VLD2.16@64 requires two cycles, but a 2-register VLD2.16@128 requires only one cycle.
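As a rough illustration of the alignment effect, the sketch below transcribes the cycle counts of the VLD2 multiple-structure entries from Table 16.23 into a lookup table. The names `VLD2_MULTIPLE_CYCLES` and `vld2_cycles` are hypothetical and exist only for this example.

```python
# Cycle counts for VLD2 (multiple 2-element structures), transcribed from
# Table 16.23. Keys are (number of registers, alignment qualifier).
VLD2_MULTIPLE_CYCLES = {
    (2, "unaligned"): 2,
    (2, "@64"): 2,
    (2, "@128"): 1,
    (4, "unaligned"): 3,
    (4, "@64"): 3,
    (4, "@128"): 2,
    (4, "@256"): 2,
}


def vld2_cycles(num_registers, alignment="unaligned"):
    """Look up the cycle count for one VLD2 multiple-structure variant."""
    return VLD2_MULTIPLE_CYCLES[(num_registers, alignment)]


# The example from the text: a stronger alignment saves one cycle.
print(vld2_cycles(2, "@64"), vld2_cycles(2, "@128"))
```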

Table 16.23 shows the operation of the Advanced SIMD load/store instructions.

Table 16.23. Advanced SIMD load/store instructions

VLDR and VSTR register load/store[1]:

Instruction     | Register list (alignment) | Cycles | Source1 | Source2 | Source3 | Source4 | Result1 | Result2
VLDR Dd, <addr> | -                         | 1      | -       | -       | -       | -       | -       | -
                |                           | 2      | -       | -       | -       | -       | Dd:N1   | -
VSTR Dd, <addr> | -                         | 1      | Dd:N1   | -       | -       | -       | -       | -
                |                           | 2      | -       | -       | -       | -       | -       | -

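VLD and VST multiple 1-element or 2, 3, 4-element structure[2]:

Instruction | Register list (alignment) | Cycles | Source1 | Source2 | Source3 | Source4 | Result1 | Result2
VLD1        | 1-reg (unaligned)         | 1      | -       | -       | -       | -       | -       | -
            |                           | 2      | -       | -       | -       | -       | Dd:N1   | -
            | 1-reg (@64)               | 1      | -       | -       | -       | -       | Dd:N1   | -
            | 2-reg (unaligned, @64)    | 1      | -       | -       | -       | -       | -       | -
            |                           | 2      | -       | -       | -       | -       | Dd:N1   | Dd+1:N1
            | 2-reg (@128)              | 1      | -       | -       | -       | -       | Dd:N1   | Dd+1:N1
            | 3-reg (unaligned, @64)    | 1      | -       | -       | -       | -       | -       | -
            |                           | 2      | -       | -       | -       | -       | Dd:N1   | Dd+1:N1
            |                           | 3      | -       | -       | -       | -       | Dd+2:N1 | -
            | 4-reg (unaligned, @64)    | 1      | -       | -       | -       | -       | -       | -
            |                           | 2      | -       | -       | -       | -       | Dd:N1   | Dd+1:N1
            |                           | 3      | -       | -       | -       | -       | Dd+2:N1 | Dd+3:N1
            | 4-reg (@128, @256)        | 1      | -       | -       | -       | -       | Dd:N1   | Dd+1:N1
            |                           | 2      | -       | -       | -       | -       | Dd+2:N1 | Dd+3:N1
VLD2        | 2-reg (unaligned, @64)    | 1      | -       | -       | -       | -       | -       | -
            |                           | 2      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            | 2-reg (@128)              | 1      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            | 4-reg (unaligned, @64)    | 1      | -       | -       | -       | -       | -       | -
            |                           | 2      | -       | -       | -       | -       | Dd:N2   | Dd+2:N2
            |                           | 3      | -       | -       | -       | -       | Dd+1:N2 | Dd+3:N2
            | 4-reg (@128, @256)        | 1      | -       | -       | -       | -       | Dd:N2   | Dd+2:N2
            |                           | 2      | -       | -       | -       | -       | Dd+1:N2 | Dd+3:N2
VLD3        | 3-reg (unaligned, @64)    | 1      | -       | -       | -       | -       | -       | -
            |                           | 2      | -       | -       | -       | -       | -       | -
            |                           | 3      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            |                           | 4      | -       | -       | -       | -       | Dd+2:N2 | -
VLD4        | 4-reg (unaligned, @64)    | 1      | -       | -       | -       | -       | -       | -
            |                           | 2      | -       | -       | -       | -       | -       | -
            |                           | 3      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            |                           | 4      | -       | -       | -       | -       | Dd+2:N2 | Dd+3:N2
            | 4-reg (@128, @256)        | 1      | -       | -       | -       | -       | -       | -
            |                           | 2      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            |                           | 3      | -       | -       | -       | -       | Dd+2:N2 | Dd+3:N2
VST1        | 1-reg (unaligned)         | 1      | Dd:N1   | -       | -       | -       | -       | -
            |                           | 2      | -       | -       | -       | -       | -       | -
            | 1-reg (@64)               | 1      | Dd:N1   | -       | -       | -       | -       | -
            | 2-reg (unaligned, @64)    | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                           | 2      | -       | -       | -       | -       | -       | -
            | 2-reg (@128)              | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            | 3-reg (unaligned)         | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                           | 2      | Dd+2:N1 | -       | -       | -       | -       | -
            |                           | 3      | -       | -       | -       | -       | -       | -
            | 3-reg (@64)               | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                           | 2      | Dd+2:N1 | -       | -       | -       | -       | -
            | 4-reg (unaligned, @64)    | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                           | 2      | Dd+2:N1 | Dd+3:N1 | -       | -       | -       | -
            |                           | 3      | -       | -       | -       | -       | -       | -
            | 4-reg (@128, @256)        | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                           | 2      | Dd+2:N1 | Dd+3:N1 | -       | -       | -       | -
VST2        | 2-reg (unaligned, @64)    | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                           | 2      | -       | -       | -       | -       | -       | -
            | 2-reg (@128)              | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            | 4-reg (unaligned, @64)    | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                           | 2      | Dd+2:N1 | Dd+3:N1 | -       | -       | -       | -
            |                           | 3      | -       | -       | -       | -       | -       | -
            |                           | 4      | -       | -       | -       | -       | -       | -
            | 4-reg (@128, @256)        | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                           | 2      | Dd+2:N1 | Dd+3:N1 | -       | -       | -       | -
            |                           | 3      | -       | -       | -       | -       | -       | -
VST3        | 3-reg (unaligned)         | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                           | 2      | Dd+2:N1 | -       | -       | -       | -       | -
            |                           | 3      | -       | -       | -       | -       | -       | -
            |                           | 4      | -       | -       | -       | -       | -       | -
            | 3-reg (@64)               | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                           | 2      | Dd+2:N1 | -       | -       | -       | -       | -
            |                           | 3      | -       | -       | -       | -       | -       | -
VST4        | 4-reg (unaligned, @64)    | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                           | 2      | Dd+2:N1 | Dd+3:N1 | -       | -       | -       | -
            |                           | 3      | -       | -       | -       | -       | -       | -
            |                           | 4      | -       | -       | -       | -       | -       | -
            | 4-reg (@128, @256)        | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                           | 2      | Dd+2:N1 | Dd+3:N1 | -       | -       | -       | -
            |                           | 3      | -       | -       | -       | -       | -       | -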

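VLD and VST single 1-element or 2, 3, 4-element structure to one lane[3]:

Instruction | Register list (alignment)            | Cycles | Source1 | Source2 | Source3 | Source4 | Result1 | Result2
VLD1        | 1-reg (.8 unaligned, .16@16, .32@32) | 1      | Dd:N1   | -       | -       | -       | -       | -
            |                                      | 2      | -       | -       | -       | -       | Dd:N2   | -
            | 1-reg (.16 unaligned, .32 unaligned) | 1      | Dd:N1   | -       | -       | -       | -       | -
            |                                      | 2      | -       | -       | -       | -       | -       | -
            |                                      | 3      | -       | -       | -       | -       | Dd:N2   | -
VLD2        | 2-reg (unaligned)                    | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                                      | 2      | -       | -       | -       | -       | -       | -
            |                                      | 3      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            | 2-reg (.8@16, .16@32, .32@64)        | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                                      | 2      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
VLD3        | 3-reg (unaligned)                    | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                                      | 2      | Dd+2:N1 | -       | -       | -       | -       | -
            |                                      | 3      | -       | -       | -       | -       | -       | -
            |                                      | 4      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            |                                      | 5      | -       | -       | -       | -       | Dd+2:N2 | -
VLD4        | 4-reg (unaligned, .32@64)            | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                                      | 2      | Dd+2:N1 | Dd+3:N1 | -       | -       | -       | -
            |                                      | 3      | -       | -       | -       | -       | -       | -
            |                                      | 4      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            |                                      | 5      | -       | -       | -       | -       | Dd+2:N2 | Dd+3:N1
            | 4-reg (.8@32, .16@64, .32@128)       | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                                      | 2      | Dd+2:N1 | Dd+3:N1 | -       | -       | -       | -
            |                                      | 3      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            |                                      | 4      | -       | -       | -       | -       | Dd+2:N2 | Dd+3:N2
VST1        | 1-reg (.16 unaligned, .32 unaligned) | 1      | Dd:N1   | -       | -       | -       | -       | -
            |                                      | 2      | -       | -       | -       | -       | -       | -
            | 1-reg (.8 unaligned, .16@16, .32@32) | 1      | Dd:N1   | -       | -       | -       | -       | -
VST2        | 2-reg (unaligned)                    | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                                      | 2      | -       | -       | -       | -       | -       | -
            | 2-reg (.8@16, .16@32, .32@64)        | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
VST3        | 3-reg (unaligned)                    | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                                      | 2      | Dd+2:N1 | -       | -       | -       | -       | -
            |                                      | 3      | -       | -       | -       | -       | -       | -
VST4        | 4-reg (unaligned, .32@64)            | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                                      | 2      | Dd+2:N1 | Dd+3:N1 | -       | -       | -       | -
            |                                      | 3      | -       | -       | -       | -       | -       | -
            | 4-reg (.8@32, .16@64, .32@128)       | 1      | Dd:N1   | Dd+1:N1 | -       | -       | -       | -
            |                                      | 2      | Dd+2:N1 | Dd+3:N1 | -       | -       | -       | -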

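VLD single 1-element or 2, 3, 4-element structure to all lanes[2]:

Instruction | Register list (alignment)            | Cycles | Source1 | Source2 | Source3 | Source4 | Result1 | Result2
VLD1        | 1-reg (.16 unaligned, .32 unaligned) | 1      | -       | -       | -       | -       | -       | -
            |                                      | 2      | -       | -       | -       | -       | Dd:N2   | -
            | 1-reg (.8 unaligned, .16@16, .32@32) | 1      | -       | -       | -       | -       | Dd:N2   | -
            | 2-reg (.16 unaligned, .32 unaligned) | 1      | -       | -       | -       | -       | -       | -
            |                                      | 2      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            | 2-reg (.8 unaligned, .16@16, .32@32) | 1      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
VLD2        | 2-reg (unaligned)                    | 1      | -       | -       | -       | -       | -       | -
            |                                      | 2      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            | 2-reg (.8@16, .16@32, .32@64)        | 1      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
VLD3        | 3-reg (unaligned)                    | 1      | -       | -       | -       | -       | -       | -
            |                                      | 2      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            |                                      | 3      | -       | -       | -       | -       | Dd+2:N2 | -
VLD4        | 4-reg (unaligned, .32@64)            | 1      | -       | -       | -       | -       | -       | -
            |                                      | 2      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            |                                      | 3      | -       | -       | -       | -       | Dd+2:N2 | Dd+3:N2
            | 4-reg (.8@32, .16@64, .32@128)       | 1      | -       | -       | -       | -       | Dd:N2   | Dd+1:N2
            |                                      | 2      | -       | -       | -       | -       | Dd+2:N2 | Dd+3:N2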

[1] This table lists the VLDR instruction scheduling for little-endian mode. For VLDR in big-endian mode, results are available in N2 and not N1.

[2] This table lists the VLD instruction scheduling for little-endian mode. For VLD1 multiple 1-element in big-endian mode, results are available in N2 and not N1. For VLD2, VLD3, VLD4 results are available in N2 regardless of the endianness configuration. This table lists only the single-spaced register transfer variants. For single-spaced register transfer variants, the source and destination registers are Dd, Dd+1, Dd+2, and Dd+3. For double-spaced register transfer variants, the source and destination registers are Dd, Dd+2, Dd+4, and Dd+6.

[3] This table lists only the single-spaced register transfer variants. For single-spaced register transfer variants, the source and destination registers are Dd, Dd+1, Dd+2, and Dd+3. For double-spaced register transfer variants, the source and destination registers are Dd, Dd+2, Dd+4, and Dd+6.
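The register spacing described in notes [2] and [3] can be sketched as a small helper that expands a register list from its first register. The function name `register_list` and its parameters are hypothetical. The D0 to D31 bound reflects the 32 doubleword registers of the Advanced SIMD register file, and the one-to-four limit follows the register counts stated earlier for VLD and VST.

```python
def register_list(first, count, double_spaced=False):
    """Return the D registers of a VLD/VST register list starting at D<first>.

    Single-spaced lists step by one register, double-spaced lists by two.
    """
    if not 1 <= count <= 4:
        raise ValueError("a VLD/VST register list holds one to four registers")
    step = 2 if double_spaced else 1
    last = first + (count - 1) * step
    if first < 0 or last > 31:
        raise ValueError("register indices must stay within D0 to D31")
    return ["D%d" % (first + i * step) for i in range(count)]


print(register_list(0, 4))                      # single-spaced
print(register_list(0, 4, double_spaced=True))  # double-spaced
```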


Copyright © 2006-2009 ARM Limited. All rights reserved.
ARM DDI 0344I
Non-Confidential