C.1.1. Instruction execution overview

The instruction execution pipeline has four stages: Iss, Ex1, Ex2, and Wr.

Extensive forwarding to the end of the Iss, Ex1, and Ex2 stages enables many dependent instruction sequences to run without pipeline stalls. General forwarding occurs from the end of the Ex2 and Wr pipeline stages. In addition, the multiplier contains an internal multiply accumulate forwarding path. The address generation unit also contains an internal forwarding path.

Most instructions do not require a register until the Ex2 stage. All result latencies are given as the number of cycles until the register is available for a following instruction in the Ex2 stage. Most ALU operations require their source registers at the start of the Ex2 stage, and have a result latency of one. For example, the following sequence takes two cycles:

ADD R1,R3,R4                ;Result latency one
ADD R5,R2,R1                ;Register R1 required by ALU

The PC is the only register that result latency does not affect. An instruction that alters the PC never causes a pipeline stall because of interlocking with a subsequent instruction that reads the PC.

Most loads have a result latency of two or higher as they do not forward their results until the Wr stage. For example, the following sequence takes three cycles:

LDR R1, [R2]                  ;Result latency two
ADD R3, R3, R1                ;Register R1 required by ALU

If a subsequent instruction requires the register at the end of the Iss stage, an extra cycle must be added to the result latency of the instruction producing the required register. Instructions that require a register at the end of this stage are specified by describing that register as an Early Reg. The following sequence, requiring an Early Reg, takes four cycles:

LDR R1, [R2]                ;Result latency two
ADD R3, R3, R1, LSL #6      ;plus one because register R1 is Early

The following sequence where R1 is a Late Reg takes two cycles:

LDR R1, [R2]                ;Result latency two minus one cycles
STR R1, [R3]                ;no penalty because R1 is a Late register

The following sequence where R3 is a Very Early Reg takes four cycles:

ADD R3, R1, R2              ;Result latency one plus two cycles
LDR R4, [R3]                ;plus two because register R3 is Very Early
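
The arithmetic behind the example sequences above can be sketched as a small model. This is an illustrative sketch only, not part of the processor documentation: the names ADJUSTMENT and sequence_cycles are hypothetical, and the model assumes single issue, no other stall sources, and an effective latency that never drops below one cycle.

```python
# Hypothetical model of the result-latency arithmetic described above.
# ADJUSTMENT and sequence_cycles are illustrative names, not from the manual.

# Cycles added to the producer's result latency, keyed by when the
# consuming instruction needs the register.
ADJUSTMENT = {
    "normal": 0,      # register required at the start of Ex2
    "early": 1,       # Early Reg: plus one cycle
    "late": -1,       # Late Reg: minus one cycle
    "very_early": 2,  # Very Early Reg: plus two cycles
}


def sequence_cycles(result_latency, consumer_reg="normal"):
    """Cycles taken by a producer followed by one dependent consumer.

    Assumptions of this sketch: single issue, no other stalls, and an
    effective latency clamped to a minimum of one cycle.
    """
    effective = max(1, result_latency + ADJUSTMENT[consumer_reg])
    # The consumer itself occupies one cycle once the result is available.
    return effective + 1


if __name__ == "__main__":
    print(sequence_cycles(1))                # ADD then ADD
    print(sequence_cycles(2))                # LDR then ADD
    print(sequence_cycles(2, "early"))       # LDR then ADD with shifted operand
    print(sequence_cycles(2, "late"))        # LDR then STR of the loaded value
    print(sequence_cycles(1, "very_early"))  # ADD then LDR using the result as base
```

Under these assumptions the model gives the same cycle counts as the five sequences shown in this section. It does not cover the PC, which result latency does not affect.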

Copyright © 2006-2011 ARM Limited. All rights reserved. ARM DDI 0363G
Non-Confidential ID041111