18.2. Processor instruction timings

Table 18.1 shows the Thumb-2 subset supported in the ARMv7-M architecture. It provides cycle information including annotations to explain how instruction stream interactions affect timing. System effects, such as running code from slower memory, are also considered.

Table 18.1. Instruction timings

Instruction type | Size (bits) | Cycle count | Description
Data operations | 16 | 1 (+P[1] if PC is destination) | ADC, ADD, AND, ASR, BIC, CMN, CMP, CPY, EOR, LSL, LSR, MOV, MUL, MVN, NEG, ORR, ROR, SBC, SUB, TST, REV, REVH, REVSH, SXTB, SXTH, UXTB, and UXTH. MUL is one cycle.
Branches | 16 | 1+P[1] | B<cond>, B, BL, BX, and BLX. No BLX with immediate. If the branch is taken, the pipeline reloads (two cycles are added).
Load-store Single | 16 | 2[2] (+P[1] if PC is destination) | LDR, LDRB, LDRH, LDRSB, LDRSH, STR, STRB, STRH, and T variants.
Load-store Multiple | 16 | 1+N[2] (+P[1] if PC is loaded) | LDMIA, POP, PUSH, and STMIA.
Exception generating | 16 | - | BKPT stops in debug if debug is enabled, and faults if debug is disabled. SVC faults to the SVCall handler (see the ARMv7-M architecture specification for details).
Data operations with immediate | 32 | 1 (+P[1] if PC is destination) | ADC{S}, ADD{S}, CMN, RSB{S}, SBC{S}, SUB{S}, CMP, AND{S}, TST, BIC{S}, EOR{S}, TEQ, ORR{S}, MOV{S}, ORN{S}, and MVN{S}.
Data operations with large immediate | 32 | 1 | MOVW, MOVT, ADDW, and SUBW. MOVW and MOVT have a 16-bit immediate (so can replace literal loads from memory). ADDW and SUBW have a 12-bit immediate (so can also replace many literal loads from memory).
Bit-field operations | 32 | 1 | BFI, BFC, UBFX, and SBFX. These are bitwise operations that enable control of position and size in bits. They support C/C++ bit fields (in structs) as well as many compare and some AND/OR assignment expressions.
Data operations with 3 register | 32 | 1 (+P[1] if PC is destination) | ADC{S}, ADD{S}, CMN, RSB{S}, SBC{S}, SUB{S}, CMP, AND{S}, TST, BIC{S}, EOR{S}, TEQ, ORR{S}, MOV{S}, ORN{S}, and MVN{S}. No PKxxx instructions.
Shift operations | 32 | 1 | ASR{S}, LSL{S}, LSR{S}, ROR{S}, and RRX{S}.
Miscellaneous | 32 | 1 | REV, REVH, REVSH, RBIT, CLZ, SXTB, SXTH, UXTB, and UXTH. The extension instructions are the same as the corresponding ARMv6 16-bit instructions.
Table Branch | 16 | 4+P[1] | Table branches for switch/case use. These are LDR with shifts and then branch.
Multiply | 32 | 1 or 2 | MUL, MLA, and MLS. MUL is one cycle, and MLA and MLS are two cycles.
Multiply with 64-bit result | 32 | 3-7[3] | UMULL, SMULL, UMLAL, and SMLAL. Cycle count is based on input sizes. That is, ABS(inputs) < 64K terminates early.
Load-store addressing | 32 | - | Supports the formats PC+/-imm12, Rbase+imm12, Rbase+/-imm8, and adjusted register including shifts. T variants are used when in Privilege mode.
Load-store Single | 32 | 2[2] (+P[1] if PC is destination) | LDR, LDRB, LDRSB, LDRH, LDRSH, STR, STRB, STRH, and T variants. PLD and PLI are both hints and so act as a NOP.
Load-store Multiple | 32 | 1+N[2] (+P[1] if PC is loaded) | STM, LDM, LDRD, and STRD.
Load-store Special | 32 | 1+N[2] | LDREX, STREX, LDREXB, LDREXH, STREXB, STREXH, and CLREX. These fault if there is no local monitor (this is IMP DEF). LDREXD and STREXD are not included in this profile.
Branches | 32 | 1+P[1] | B, BL, and B<cond>. No BLX (1) because it always changes state. No BXJ.
System | 32 | 1-2 | MSR(2) and MRS(2) replace MSR/MRS but also do more. These access the other stacks and also the status registers. CPSIE/CPSID 32-bit forms are not supported. No RFE or SRS.
System | 16 | 1-2 | CPSIE and CPSID are quick versions of MSR(2) instructions and use the standard Thumb-2 encodings, but only permit use of i and f and not a.
Extended32 | 32 | 1 | NOP and YIELD (hinted NOP). No MRS (1), MSR (1), or SUBS (PC return link).
Combined Branch | 16 | 1+P[1] | CBZ.
Extended | 16 | 0-1[4] | IT and NOP (includes YIELD).
Divide | 32 | 2-12[5] | SDIV and UDIV. 32/32 divides, both signed and unsigned, with a 32-bit quotient result (no remainder, it can be derived by subtraction). The divide terminates early when dividend and divisor are close in size.
Sleep | 32 | 1+W[6] | WFI, WFE, and SEV are in the class of hinted NOP instructions that control sleep behavior.
Barriers | 16 | 1+B[7] | ISB, DSB, and DMB are barrier instructions that ensure certain actions have taken place before the next instruction is executed.
Saturation | 32 | 1 | SSAT and USAT perform saturation on a register. They perform three tasks: they normalize the value using shift, test for overflow from a selected bit position (the Q value), and set the xPSR Q bit. Saturation refers to the largest unsigned value or the largest/smallest signed value for the size selected.

[1] Branches take one cycle for the instruction and then a pipeline reload for the target instruction. Non-taken branches are 1 cycle total. Taken branches with an immediate are normally 1 cycle of pipeline reload (2 cycles total). Taken branches with a register operand are normally 2 cycles of pipeline reload (3 cycles total). Pipeline reload takes longer when branching to unaligned 32-bit instructions, and when accessing slower memory. A branch hint is emitted to the code bus that permits a slower system to pre-load. This can reduce the branch target penalty for slower memory, but never to less than the values shown here.
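The branch timings in this footnote can be sketched as a small C model. This is an illustrative sketch only: the names `branch_kind` and `branch_cycles` are hypothetical, and the result is the nominal best case, assuming an aligned target in zero-wait-state memory.

```c
#include <stdbool.h>

/* Hypothetical names, used for illustration only. */
typedef enum { BRANCH_IMMEDIATE, BRANCH_REGISTER } branch_kind;

/* Nominal cycle count for one branch: one cycle for the instruction
 * plus the pipeline reload P when the branch is taken.
 * Assumes an aligned target and no wait states (best case). */
static unsigned branch_cycles(branch_kind kind, bool taken)
{
    if (!taken) {
        return 1u;                      /* non-taken: 1 cycle total */
    }
    unsigned reload = (kind == BRANCH_IMMEDIATE) ? 1u : 2u;
    return 1u + reload;                 /* 2 (immediate) or 3 (register) */
}
```

For example, `branch_cycles(BRANCH_REGISTER, true)` models a taken BX and yields 3 cycles. Unaligned 32-bit targets or slower memory would add to this figure.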

[2] Generally, load-store instructions take two cycles for the first access and one cycle for each additional access. Stores with immediate offsets take one cycle.
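The rule in this footnote also explains the 1+N entries in the table: two cycles for the first access plus one for each of the remaining N-1 accesses gives 1+N. A minimal C sketch of that arithmetic follows. The function names are hypothetical, and the pipeline reload added when PC is loaded is deliberately left out.

```c
/* Hypothetical helper: nominal cycles for a load-store instruction that
 * performs `accesses` memory accesses (for example, N registers in an LDM).
 * Two cycles for the first access, one for each additional access.
 * Excludes any pipeline reload when PC is loaded, and any wait states. */
static unsigned load_store_cycles(unsigned accesses)
{
    if (accesses == 0u) {
        return 0u;                      /* nothing transferred; not a real encoding */
    }
    return 2u + (accesses - 1u);        /* equals 1 + N */
}

/* Hypothetical helper: a single store with an immediate offset
 * takes one cycle according to the footnote. */
static unsigned store_immediate_offset_cycles(void)
{
    return 1u;
}
```

Under these assumptions, a four-register LDMIA is modeled by `load_store_cycles(4)`, which yields 5 cycles.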

[3] UMULL/SMULL/UMLAL/SMLAL use early termination depending on the size of source values. These are interruptible (abandoned/restarted), with worst case latency of one cycle. MLAL versions take four to seven cycles and MULL versions take three to five cycles. For MLAL, the signed version is one cycle longer than the unsigned.

[4] IT instructions can be folded.

[5] DIV timings depend on dividend and divisor. DIV is interruptible (abandoned/restarted), with worst case latency of one cycle. When dividend and divisor are similar in size, divide terminates quickly. Minimum time is for cases of divisor larger than dividend and divisor of zero. A divisor of zero returns zero (not a fault), although a debug trap is available to catch this case.
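The divide-by-zero behavior, and the derivation of a remainder noted in the Divide row of the table, can be illustrated with a C model of the unsigned case. This is a sketch of the described results, not of the hardware algorithm or its timing. The names `udiv_model` and `urem_model` are hypothetical, and the model assumes the divide-by-zero trap is not enabled.

```c
#include <stdint.h>

/* Hypothetical model of the UDIV result: a divisor of zero returns zero
 * instead of faulting (assumes the divide-by-zero trap is disabled). */
static uint32_t udiv_model(uint32_t dividend, uint32_t divisor)
{
    return (divisor == 0u) ? 0u : dividend / divisor;
}

/* The instruction produces no remainder. It can be derived by
 * multiplying the quotient back and subtracting from the dividend. */
static uint32_t urem_model(uint32_t dividend, uint32_t divisor)
{
    return dividend - udiv_model(dividend, divisor) * divisor;
}
```

With a zero divisor the modeled quotient is 0, so the derived remainder is simply the dividend.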

[6] Sleep is one cycle for the instruction plus as many sleep cycles as appropriate. WFE only uses one cycle when the event has already passed. WFI is normally more than one cycle unless an interrupt happens to pend exactly when entering WFI.

[7] ISB takes one cycle (acts as branch). DMB and DSB take one cycle unless data is pending in the write buffer or LSU. If an interrupt comes in during a barrier, the barrier is abandoned/restarted.
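The Saturation row of Table 18.1 describes clamping to the largest/smallest value for a selected size and setting the Q bit. The C sketch below models only that clamping step, with the optional shift assumed to have been applied by the caller. The function names are hypothetical, and the `q` flag stands in for the xPSR Q bit, which is set on saturation and never cleared by these operations.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of SSAT: clamp `value` to a signed range of
 * `bits` bits (1..32). Sets *q when saturation occurs. */
static int32_t ssat_model(int32_t value, unsigned bits, bool *q)
{
    int64_t max = ((int64_t)1 << (bits - 1u)) - 1;
    int64_t min = -((int64_t)1 << (bits - 1u));
    if (value > max) { *q = true; return (int32_t)max; }
    if (value < min) { *q = true; return (int32_t)min; }
    return value;
}

/* Hypothetical model of USAT: clamp a signed `value` to an unsigned
 * range of `bits` bits (0..31). Sets *q when saturation occurs. */
static uint32_t usat_model(int32_t value, unsigned bits, bool *q)
{
    int64_t max = ((int64_t)1 << bits) - 1;
    if (value < 0)   { *q = true; return 0u; }
    if (value > max) { *q = true; return (uint32_t)max; }
    return (uint32_t)value;
}
```

For instance, saturating 200 to 8 signed bits yields 127 and sets the flag, while 100 passes through unchanged and leaves the flag as it was.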

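Cycle count information:

In the Cycle count column, P is the pipeline reload, N is the number of elements transferred, W is the sleep wait, and B is the barrier clearance.

In general, each instruction takes one cycle (one core clock) to start executing, as shown in Table 18.1. Additional cycles can be taken because of fetch stalls.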
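As a worked example of using the table, the nominal cycle count of a short instruction sequence is the sum of the per-instruction counts plus any fetch stalls. The C sketch below is illustrative only: `sequence_cycles` is a hypothetical helper, and the stall count is an input the caller must estimate for the memory system in use.

```c
#include <stddef.h>

/* Hypothetical helper: sum nominal per-instruction cycle counts taken
 * from Table 18.1, then add an estimated number of fetch stall cycles. */
static unsigned sequence_cycles(const unsigned *cycles, size_t count,
                                unsigned fetch_stalls)
{
    unsigned total = fetch_stalls;
    for (size_t i = 0; i < count; ++i) {
        total += cycles[i];
    }
    return total;
}
```

For a loop body of LDR (2), ADD (1), SUB (1), and a taken B<cond> with an immediate target (2), the array `{2, 1, 1, 2}` with no stalls gives 6 cycles per iteration under these assumptions.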

Copyright © 2005, 2006 ARM Limited. All rights reserved. ARM DDI 0337E
Non-Confidential