2.1.1. Processor components

The following sections describe the main components and their functions:

Data Processing Unit

The Data Processing Unit (DPU) holds most of the program-visible state of the processor, such as general-purpose registers, status registers and control registers. It decodes and executes instructions, operating on data held in the registers in accordance with the ARM Architecture. Instructions are fed to the DPU from the Prefetch Unit (PFU). The DPU executes instructions that require data to be transferred to or from the memory system by interfacing to the Data Cache Unit (DCU), which manages all load and store operations. See Chapter 3 Programmers Model for more information.

System control coprocessor

The system control coprocessor, CP15, provides configuration and control of the memory system and its associated functionality.

See Chapter 4 System Control for more information.

Instruction side memory system

The instruction side memory system includes:

Instruction Cache Unit

The Instruction Cache Unit (ICU) contains the Instruction Cache controller and its associated linefill buffer. The Cortex-A7 MPCore ICache is two-way set associative and uses Virtually Indexed Physically Tagged (VIPT) cache lines holding up to 8 ARM or Thumb 32-bit instructions or up to 16 Thumb 16-bit instructions.
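The geometry implied by these figures can be sketched as follows. The line size follows from the instruction counts above (8 × 32-bit instructions = 32 bytes); the 32KB cache size is an assumed example configuration, not a value stated here.

```python
# Illustrative sketch: deriving ICache geometry from the figures above.
# The 32KB cache size is an assumption for the example.
LINE_BYTES = 8 * 4          # 8 ARM 32-bit instructions per line -> 32 bytes
WAYS = 2                    # two-way set associative
CACHE_BYTES = 32 * 1024     # assumed example size

sets = CACHE_BYTES // (WAYS * LINE_BYTES)    # 512 sets
offset_bits = LINE_BYTES.bit_length() - 1    # 5 bits of byte offset
index_bits = sets.bit_length() - 1           # 9 bits of set index

def split_vaddr(va):
    """Split a virtual address into (index, offset) as a VIPT cache would;
    the tag is compared against the *physical* address after translation."""
    offset = va & (LINE_BYTES - 1)
    index = (va >> offset_bits) & (sets - 1)
    return index, offset

print(sets, index_bits, offset_bits)   # 512 9 5
print(split_vaddr(0x80041234))         # (145, 20)
```

Because the index is taken from the virtual address, the lookup can start before address translation completes; only the tag compare needs the physical address.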

Prefetch Unit

The Prefetch Unit (PFU) obtains instructions from the instruction cache or from external memory and predicts the outcome of branches in the instruction stream, then passes the instructions to the DPU for processing. In any given cycle, up to four instructions can be fetched and up to two can be passed to the DPU.

Branch Target Instruction Cache

The PFU also contains a four-entry deep Branch Target Instruction Cache (BTIC). Each entry stores up to two instruction cache fetches and enables the branch shadow of predicted taken B and BL instructions to be eliminated. The BTIC implementation is architecturally transparent, so it does not have to be flushed on a context switch.

Branch Target Address Cache

The PFU also contains an eight-entry deep Branch Target Address Cache (BTAC) used to predict the target address of certain indirect branches. The BTAC implementation is architecturally transparent, so it does not have to be flushed on a context switch.

Branch prediction

The branch predictor is a global type that uses history registers and a 256-entry pattern history table.
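A global predictor of this shape can be sketched as below. The 256-entry table size matches the text; the use of 2-bit saturating counters and the XOR indexing hash (as in the well-known "gshare" scheme) are assumptions for illustration, since the exact hash is not specified here.

```python
# Minimal sketch of a global-history branch predictor with a 256-entry
# pattern history table (PHT). Counter width and indexing are assumptions.
PHT_SIZE = 256

class GlobalPredictor:
    def __init__(self):
        self.history = 0              # global branch history register
        self.pht = [1] * PHT_SIZE     # 2-bit counters, start weakly not-taken

    def _index(self, pc):
        # Assumed gshare-style hash: history XOR low PC bits.
        return (self.history ^ (pc >> 2)) & (PHT_SIZE - 1)

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2   # counter >= 2 predicts taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & (PHT_SIZE - 1)

p = GlobalPredictor()
for _ in range(12):              # train on a branch that is always taken
    p.update(0x8000, True)
print(p.predict(0x8000))         # True - predicted taken once history repeats
```

The history register makes the prediction depend on the recent pattern of branch outcomes, not just the branch address, which is what distinguishes a global predictor from a simple per-branch counter table.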

Return stack

The PFU includes an 8-entry return stack to accelerate returns from procedure calls. For each procedure call, the return address is pushed onto a hardware stack. When a procedure return is recognized, the address held in the return stack is popped, and the PFU uses it as the predicted return address. The return stack is architecturally transparent, so it does not have to be flushed on a context switch.
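The push/pop behaviour described above can be modelled as below. The 8-entry depth is from the text; the behaviour on overflow (discarding the oldest entry) is an assumption for illustration.

```python
# Sketch of the 8-entry return stack: calls push a return address,
# returns pop the predicted return address.
class ReturnStack:
    DEPTH = 8

    def __init__(self):
        self.stack = []

    def on_call(self, return_addr):
        if len(self.stack) == self.DEPTH:
            self.stack.pop(0)            # assumed: oldest prediction is lost
        self.stack.append(return_addr)

    def on_return(self):
        # Predicted return address; None models an empty stack (no prediction).
        return self.stack.pop() if self.stack else None

rs = ReturnStack()
rs.on_call(0x1000)               # outer call
rs.on_call(0x2000)               # nested call
print(hex(rs.on_return()))       # 0x2000 - innermost return predicted first
print(hex(rs.on_return()))       # 0x1000
```

Because a mispredicted return is corrected by the normal branch resolution path, the stack never needs to be architecturally visible, which is why it survives a context switch without flushing.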

Data side memory system

This section describes the following:

Data Cache Unit

The Data Cache Unit (DCU) consists of the following sub-blocks:

  • The Level 1 (L1) data cache controller, which generates the control signals for the associated embedded tag, data, and dirty memory (RAMs) and arbitrates between the different sources requesting access to the memory resources. The data cache is 4-way set associative and uses a Physically Indexed Physically Tagged (PIPT) scheme for lookup which enables unambiguous address management in the system.

  • The load/store pipeline that interfaces with the DPU and main TLB.

  • The system coprocessor controller that performs cache maintenance operations directly on the data cache and indirectly on the instruction cache through an interface with the ICU.

  • An interface to receive coherency requests from the Snoop Control Unit (SCU).
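The PIPT lookup mentioned in the first item can be sketched as below. The 4-way associativity is from the text; the 64-byte line size (stated later in this manual for L2) and the 32KB cache size are assumed example values, and the FIFO replacement here stands in for the real pseudo-random policy.

```python
# Minimal model of a PIPT lookup in a 4-way set associative data cache.
# Line and cache sizes are assumed example values.
LINE = 64
WAYS = 4
SIZE = 32 * 1024
SETS = SIZE // (WAYS * LINE)             # 128 sets

class PIPTCache:
    def __init__(self):
        # tags[set] holds up to WAYS physical tags present in that set
        self.tags = [[] for _ in range(SETS)]

    def _split(self, pa):
        index = (pa // LINE) % SETS      # set index from the physical address
        tag = pa // (LINE * SETS)        # remaining high bits form the tag
        return index, tag

    def lookup(self, pa):
        index, tag = self._split(pa)
        return tag in self.tags[index]   # hit if the tag matches in any way

    def fill(self, pa):
        index, tag = self._split(pa)
        ways = self.tags[index]
        if tag not in ways:
            if len(ways) == WAYS:
                ways.pop(0)              # stand-in for pseudo-random replacement
            ways.append(tag)

c = PIPTCache()
c.fill(0x40000040)
print(c.lookup(0x40000040))   # True  - same line
print(c.lookup(0x40000080))   # False - different line, not filled
```

Because both index and tag come from the physical address, two virtual aliases of the same physical line always resolve to the same cache location, which is the "unambiguous address management" the text refers to.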

The DCU contains a combined local and global exclusive monitor. This monitor can be set to the exclusive state only by a LDREX instruction executing on the local processor, and can be cleared to the open access state by:

  • A STREX instruction on the local processor or a store to the same shared cache line on another processor.

  • The cache line being evicted for other reasons.

  • A CLREX instruction.
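The monitor's behaviour, as listed above, amounts to a small state machine; the following sketch models it directly. The states and clearing events follow the text; the method names and address-granule handling are illustrative only.

```python
# Sketch of the exclusive monitor state machine described above.
OPEN, EXCLUSIVE = "open", "exclusive"

class ExclusiveMonitor:
    def __init__(self):
        self.state = OPEN
        self.addr = None

    def ldrex(self, addr):
        self.state, self.addr = EXCLUSIVE, addr   # only LDREX sets exclusive

    def strex(self, addr):
        # STREX succeeds (returns 0) only if the monitor is still exclusive.
        ok = self.state == EXCLUSIVE and self.addr == addr
        self.state, self.addr = OPEN, None        # STREX always clears it
        return 0 if ok else 1

    def clrex(self):
        self.state, self.addr = OPEN, None        # CLREX clears to open

    def remote_store(self, addr):
        # A store to the monitored shared line by another processor clears it.
        if self.addr == addr:
            self.state, self.addr = OPEN, None

m = ExclusiveMonitor()
m.ldrex(0x100)
print(m.strex(0x100))      # 0 - exclusive store succeeded

m.ldrex(0x100)
m.remote_store(0x100)      # another processor wrote the line
print(m.strex(0x100))      # 1 - exclusive store failed, retry the sequence
```

This is the mechanism that makes LDREX/STREX loops (for example, atomic increments and spinlocks) detect interference from other processors and retry.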

The Cortex-A7 MPCore processor uses the MOESI protocol, with ACE modified equivalents of the MOESI states, to maintain data coherency between multiple processors. MOESI describes the states that a shareable line in an L1 data cache can be in:

  • Modified/UniqueDirty (UD). The line is only in this cache and is dirty.

  • Owned/SharedDirty (SD). The line is possibly in more than one cache and is dirty.

  • Exclusive/UniqueClean (UC). The line is only in this cache and is clean.

  • Shared/SharedClean (SC). The line is possibly in more than one cache and is clean.

  • Invalid/Invalid (I). The line is not in this cache.
The DCU stores the MOESI state of the cache line in the tag and dirty RAMs.
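The five states above decompose into valid/unique/dirty attributes, which shows how a tag-plus-dirty-RAM encoding of the kind the DCU uses can represent them. The mapping below is illustrative; the actual RAM encoding is not specified here.

```python
# The five MOESI states above, keyed by the ACE names used in the text,
# decomposed into (valid, unique, dirty) attributes. Illustrative mapping.
MOESI = {
    # name           (valid, unique, dirty)
    "UniqueDirty":   (True,  True,  True),    # Modified
    "SharedDirty":   (True,  False, True),    # Owned
    "UniqueClean":   (True,  True,  False),   # Exclusive
    "SharedClean":   (True,  False, False),   # Shared
    "Invalid":       (False, False, False),
}

def may_write_without_broadcast(state):
    """Only a Unique line can be written without notifying other caches."""
    valid, unique, _ = MOESI[state]
    return valid and unique

print(may_write_without_broadcast("UniqueClean"))   # True
print(may_write_without_broadcast("SharedClean"))   # False
```

The Owned (SharedDirty) state is what distinguishes MOESI from MESI: one cache can supply dirty data to others without first writing it back to memory.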

Read allocate mode

The L1 data cache only supports a Write-Back policy. It normally allocates a cache line on either a read miss or a write miss. However, there are some situations where allocating on writes is undesirable, such as executing the C standard library memset() function to clear a large block of memory to a known value. Writing large blocks of data like this can pollute the cache with unnecessary data. It can also waste power and performance if a linefill must be performed only to discard the linefill data because the entire line was subsequently written by the memset().

To prevent this, the Bus Interface Unit (BIU) includes logic to detect when a full cache line has been written by the processor before the linefill has completed. If this situation is detected on three consecutive linefills, it switches into read allocate mode. When in read allocate mode, loads behave as normal and can still cause linefills, and writes still look up in the cache but, if they miss, they write out to L2 rather than starting a linefill.

The BIU continues in read allocate mode until it detects either a cacheable write burst to L2 that is not a full cache line, or a load to the same line that is currently being written to L2.
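The entry and exit conditions above can be sketched as a small detector. The threshold of three consecutive full-line writes is from the text; modelling the BIU as one call per completed write burst is an assumption, and the load-hazard exit condition is omitted for brevity.

```python
# Sketch of the read allocate mode switch: three consecutive full-cache-line
# writes enter the mode; a partial-line cacheable write burst leaves it.
class ReadAllocateDetector:
    THRESHOLD = 3

    def __init__(self):
        self.consecutive_full_lines = 0
        self.read_allocate = False

    def on_write_burst(self, full_cache_line):
        if full_cache_line:
            self.consecutive_full_lines += 1
            if self.consecutive_full_lines >= self.THRESHOLD:
                self.read_allocate = True    # write misses now go to L2
        else:
            self.consecutive_full_lines = 0
            self.read_allocate = False       # partial line exits the mode

b = ReadAllocateDetector()
for _ in range(3):
    b.on_write_burst(full_cache_line=True)   # e.g. a large memset()
print(b.read_allocate)                       # True
b.on_write_burst(full_cache_line=False)
print(b.read_allocate)                       # False
```

The same pattern applies to the secondary L2 mode described next, with a higher threshold and L3 as the write target.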

A secondary read allocate mode applies when the L2 cache is integrated. After 127 consecutive cache line sized writes to L2 are detected, L2 read allocate mode is entered. When in L2 read allocate mode, loads behave as normal and can still cause linefills, and writes still look up in the cache but, if they miss, they write out to L3 rather than starting a linefill. L2 read allocate mode continues until there is a cacheable write burst that is not a full cache line, or a load to the same line that is currently being written to L3.


The number of consecutive cache line sized writes to enter a secondary read allocate mode was 7 prior to product revision r0p3.

Data cache invalidate on reset

The ARMv7 Virtual Memory System Architecture (VMSA) does not support a CP15 operation to invalidate the entire data cache. If this function is required in software, it must be constructed by iterating over the cache geometry and executing a series of individual CP15 invalidate by set/way instructions.
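The set/way iteration described above can be sketched by computing the value written to the CP15 DCCISW/DCISW register for each (set, way) pair: the way number goes in the top bits, the set index is shifted by log2 of the line length, and the cache level occupies bits [3:1]. The geometry used (4 ways, 128 sets, 64-byte lines) is an assumed example; real code reads it from the Cache Size ID Register (CCSIDR).

```python
# Sketch of the set/way invalidate loop: compute the register value for
# each individual "invalidate by set/way" operation. Geometry is assumed.
WAYS, SETS, LINE_BYTES, LEVEL = 4, 128, 64, 0    # level 0 = L1

def dcisw_value(set_idx, way, level=LEVEL):
    L = LINE_BYTES.bit_length() - 1              # log2(line length) = 6
    A = (WAYS - 1).bit_length()                  # bits needed for the way = 2
    return (way << (32 - A)) | (set_idx << L) | (level << 1)

# One CP15 invalidate operation per set/way combination:
ops = [dcisw_value(s, w) for w in range(WAYS) for s in range(SETS)]
print(len(ops))                 # 512 individual invalidate operations
print(hex(dcisw_value(1, 3)))   # 0xc0000040: way 3 in the top bits, set 1 at bit 6
```

On real hardware each computed value is written with an `MCR p15, 0, <Rt>, c7, c6, 2` instruction; the loop structure and bit packing are the part shown here.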

In normal usage the only time the entire data cache must be invalidated is on reset. The processor provides this functionality by default. If it is not required on reset the invalidate operation can be disabled by asserting and holding the appropriate external L1RSTDISABLE signal for a processor when the corresponding reset signal is deasserted.

In parallel to the data cache invalidate, the DCU also sends an invalidate-all request to the ICU and the TLB, unless L1RSTDISABLE is asserted.


To synchronize the L1 tag RAMs and the L1 duplicate tag RAMs, which cannot be done in software, ensure L1RSTDISABLE is held LOW during the initial reset sequence for each individual processor.

Store Buffer

The Store Buffer (STB) holds store operations when they have left the load/store pipeline and have been committed by the DPU. From the STB, a store can request access to the cache RAMs in the DCU, request the BIU to initiate linefills, or request the BIU to write the data out on the external write channel. External data writes are through the SCU.

The STB can merge:

  • Several store transactions into a single transaction if they are to the same 64-bit aligned address.

  • Multiple writes into an AXI write burst.

The STB is also used to queue up CP15 maintenance operations before they are broadcast to other processors in the multiprocessor device.
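The 64-bit merging rule can be sketched as below. The slot layout and the modelling of each store as a byte string are assumptions for illustration; real hardware tracks per-byte strobes alongside the data.

```python
# Sketch of store merging: stores to the same 64-bit aligned address
# combine into one buffer slot (one eventual transaction).
class StoreBuffer:
    def __init__(self):
        self.slots = {}     # 64-bit aligned address -> bytearray of 8 bytes

    def store(self, addr, data):
        base = addr & ~0x7                     # 64-bit aligned slot address
        slot = self.slots.setdefault(base, bytearray(8))
        off = addr & 0x7
        slot[off:off + len(data)] = data       # later stores overwrite earlier ones

sb = StoreBuffer()
sb.store(0x1000, b"\xAA\xBB")      # two halves of the same 64-bit slot...
sb.store(0x1004, b"\xCC\xDD")
print(len(sb.slots))               # 1 - merged into a single transaction
sb.store(0x1008, b"\xEE")          # ...but a different slot stays separate
print(len(sb.slots))               # 2
```

Merging reduces the number of cache RAM accesses and external write transactions, which is the point of buffering committed stores before they leave the processor.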

Bus Interface Unit and SCU interface

The Bus Interface Unit (BIU) contains the SCU interface and buffers to decouple the interface from the cache and STB. The BIU interface and the SCU always operate at the processor frequency.

A write buffer is available to hold:

  • Data from cache evictions or non-cacheable write bursts before they are written out to the SCU.

  • The addresses of outstanding ACE write transactions to permit hazard checking against other outstanding requests in the system.

L1 memory system

The processor L1 memory system includes the following features:

  • Separate instruction and data caches.

  • Export of memory attributes for system caches.

The caches have the following features:

  • Support for instruction and data cache sizes between 8KB and 64KB.

  • Pseudo-random cache replacement policy.

  • Ability to disable each cache independently.

  • Streaming of sequential data from LDM and LDRD operations, and sequential instruction fetches.

  • Critical word first linefill on a cache miss.

  • Implementation of all the cache RAM blocks, and the associated tag and valid RAM blocks, using standard RAM compilers.

See Chapter 6 L1 Memory System for more information.

L2 memory system

The L2 memory system contains:

  • The SCU that connects one to four processors to the external memory system through the ACE master interface. The SCU maintains data cache coherency between the processors and arbitrates L2 requests from the processors.

    The SCU includes support for data security using the implemented Security Extensions.


    The SCU does not support hardware management of coherency of the instruction caches.

  • An optional L2 cache that:

    • Has configurable cache RAM sizes of 128KB, 256KB, 512KB, or 1MB.

    • Is 8-way set associative.

    • Supports 64-byte cache lines.

  • One ACE master interface. All transactions are routed through the interface.

See Chapter 7 L2 Memory System for more information.

Optional Generic Interrupt Controller

The optional integrated GIC manages interrupt sources and behavior, and can route interrupts to individual processors. It permits software to mask, enable and disable interrupts from individual sources, to prioritize, in hardware, individual sources and to generate software interrupts. It also provides support for the Security and Virtualization Extensions. The GIC accepts interrupts asserted at the system level and can signal them to each processor it is connected to. This can result in an IRQ or FIQ exception being taken.

See Chapter 8 Generic Interrupt Controller for more information.

Media Processing Engine

The optional Media Processing Engine (MPE) implements ARM NEON technology, a media and signal processing architecture that adds instructions targeted at audio, video, 3-D graphics, image, and speech processing. Advanced SIMD instructions are available in both ARM and Thumb states.

See the Cortex-A7 MPCore NEON Media Processing Engine Technical Reference Manual for more information.

Floating-Point Unit

The optional Floating-Point Unit (FPU) implements the ARMv7 VFPv4-D16 architecture and includes the VFP register file and status registers. It performs floating-point operations on the data held in the VFP register file.

See the Cortex-A7 MPCore Floating-Point Unit Technical Reference Manual for more information.


Debug

The Cortex-A7 MPCore processor has a CoreSight compliant Advanced Peripheral Bus version 3 (APBv3) debug interface. This permits system access to debug resources, for example, the setting of watchpoints and breakpoints. The processor provides extensive support for real-time debug and performance profiling.

See Chapter 10 Debug for more information.

Performance monitoring

The Cortex-A7 MPCore processor provides performance counters and event monitors that can be configured to gather statistics on the operation of the processor and the memory system. See Chapter 11 Performance Monitoring Unit for more information.

Copyright © 2011-2013 ARM. All rights reserved. ARM DDI 0464F