1.6. Branch target forwarding

The processor forwards certain branch types, meaning that the memory transaction of the branch is presented at least one cycle earlier than when the opcode reaches execute. Branch forwarding increases the performance of the core, because branches are a significant part of embedded controller applications. The branches affected are PC-relative branches with an immediate offset and branches that use LR as the target register. For forwarded conditional branches, whether conditional by opcode definition or by being within an IT block, the address must be presented speculatively because the condition evaluation is an internal critical path.

Branch forwarding loses a fetch opportunity if it speculates on a conditional opcode, but this loss is mitigated by the three-entry fetch queue, a mix of 16-bit and 32-bit opcodes, and a single-cycle ALU. The additional penalty is one cycle of pipeline stalling. The worst case is three 32-bit load/store single opcodes, with the instructions word-unaligned and no data wait states. The BRCHSTAT interface provides information on forwarded branches, including whether they are conditionally executed, the direction if conditional, and a trailing registered evaluation of the success of the preceding conditional opcode. For more information on BRCHSTAT, see Branch status interface.

A core with ICODE registered and with prefetch performs effectively the same as a core without the branch forwarding interface, that is, around 10% slower than a core that uses branch forwarding. Branch forwarding can be thought of as exposing the internal address generation logic before registration to the address interface. This gives the memory controller more flexibility if you have the timing budget to use the information a cycle sooner, for example in lower-MHz, power-sensitive targets in 0.13µm down to 65nm processes. Otherwise, you still have access to this early address in your memory controller for lookups before registration to the system.

Branch speculation is more costly against a wait-stated memory because of mispredictions. To avoid this overhead, the controller can apply a rule that conditional branches are not speculated but instead registered. This gives subroutine calls and returns the benefits of branch forwarding without the misprediction penalty. A refinement is to predict only backward conditional branches, to accelerate loops. Alternatively, because ARM compilers favour loops with an unconditional backward branch at the bottom and conditional forward branch tests on the loop limit, the core fetch queue being ahead at the start of the loop yields good behavior.
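
The controller rules described above can be sketched as a small decision function. This is an illustrative sketch only, not part of the processor or of any ARM-supplied controller. The type fwd_branch_t, the policy names, and the function use_forwarded_address are hypothetical, and the two flags stand for information that a controller is assumed to have already decoded from BRCHSTAT. The sketch does not model the actual BRCHSTAT encoding.

```c
#include <stdbool.h>

/* Hypothetical decoded view of a forwarded branch.
 * Assumption: the controller has already derived these flags from
 * BRCHSTAT; this is NOT the real BRCHSTAT bit encoding. */
typedef struct {
    bool conditional; /* conditional by opcode definition or IT block */
    bool backward;    /* branch target lies before the branch itself */
} fwd_branch_t;

/* Hypothetical controller policies for conditional forwarded branches. */
typedef enum {
    POLICY_SPECULATE_ALL,  /* always act on the early address */
    POLICY_NO_CONDITIONAL, /* register conditional branches instead */
    POLICY_BACKWARD_ONLY   /* speculate only on backward conditionals */
} spec_policy_t;

/* Returns true if the controller should start the access on the
 * forwarded (early) address, false if it should wait for the
 * registered address. Unconditional forwarded branches, such as
 * subroutine calls and returns, cannot mispredict, so they always
 * use the early address. */
static bool use_forwarded_address(spec_policy_t policy, fwd_branch_t b)
{
    if (!b.conditional) {
        return true;
    }
    switch (policy) {
    case POLICY_SPECULATE_ALL:
        return true;
    case POLICY_NO_CONDITIONAL:
        return false;
    case POLICY_BACKWARD_ONLY:
        return b.backward;
    }
    return false; /* unknown policy: take the conservative choice */
}
```

With POLICY_BACKWARD_ONLY, a conditional branch at the bottom of a loop is speculated, while a conditional forward test on the loop limit waits for the registered address. Which policy suits a given system is likely to depend on the number of wait states of the memory.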

BRCHSTAT also includes other information about the next opcode to reach execute. Unlike for forwarded branches, where BRCHSTAT is coincident with the transaction, BRCHSTAT with respect to execute opcodes is a hint that is unrelated to any transaction and can be asserted for multiple cycles. The controller can use this information to suppress additional prefetching because it knows that a branch will be taken shortly. This helps to prevent any trailing wait states of the controller prefetch from impacting the branch target when it is generated in execute.
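
The prefetch suppression described above can be sketched as a single gating condition. This is a minimal sketch under stated assumptions, not a definitive controller design. The function allow_prefetch and both of its parameters are hypothetical: branch_hint stands for a signal that the controller is assumed to decode from BRCHSTAT when a branch is about to reach execute, and buffer_has_room stands for the state of the controller's own prefetch buffer.

```c
#include <stdbool.h>

/* Decide, for the current cycle, whether the controller may issue a
 * new sequential prefetch.
 *
 * branch_hint:     hypothetical signal decoded from BRCHSTAT, true
 *                  while a branch is expected to be taken shortly.
 * buffer_has_room: hypothetical state of the controller's prefetch
 *                  buffer.
 *
 * The hint is treated as a level rather than an edge, because it can
 * be asserted for multiple cycles and is not tied to a transaction. */
static bool allow_prefetch(bool branch_hint, bool buffer_has_room)
{
    return buffer_has_room && !branch_hint;
}
```

Holding off new prefetches while the hint is asserted means that no prefetch is started whose wait states could still be outstanding when the branch target address arrives.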

The following scenarios show how you can use branch forwarding and the BRCHSTAT control to get the best performance from your memory system. The scenarios focus on the ideal Harvard setup, where instructions are fetched from ICODE, literals are loaded from DCODE (unified to ICODE), and stack, heap, and application data are accessed from SYSTEM.

Copyright © 2005, 2006 ARM Limited. All rights reserved. ARM DDI 0337E
Non-Confidential