11.5. Cache maintenance

It is sometimes necessary for software to clean or invalidate a cache. This might be required when the contents of external memory have been changed and it is necessary to remove stale data from the cache. It can also be required after MMU-related activity such as changing access permissions, cache policies, or virtual to Physical Address mappings, or when I and D-caches must be synchronized for dynamically generated code such as JIT-compilers and dynamic library loaders.

For each of these operations, you can select which entries the operation applies to: all entries (which, as Table 11.1 shows, applies only to instruction cache invalidation), a single entry identified by its Virtual Address, or a single line identified by its cache Set and Way.

AArch64 cache maintenance operations are performed using instructions which have the following general form:

  <cache> <operation>{, <Xt>}

Here, <cache> is DC for data or unified cache operations or IC for instruction cache operations; the register operand <Xt> is omitted only for the operations that invalidate the entire instruction cache. The available operations are listed in Table 11.1.
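For example, an operation that takes an address operand and one that takes none are written as follows:

  DC CVAC, X0        // Data cache clean by Virtual Address (held in X0) to PoC
  IC IALLU           // Instruction cache invalidate all to PoU, no register operand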

Table 11.1. Data cache, instruction cache, and unified cache operations

Cache   Operation   Description                                                     AArch32 Equivalent

DC      CISW        Clean and invalidate by Set/Way                                 DCCISW
DC      CIVAC       Clean and invalidate by Virtual Address to Point of Coherency   DCCIMVAC
DC      CSW         Clean by Set/Way                                                DCCSW
DC      CVAC        Clean by Virtual Address to Point of Coherency                  DCCMVAC
DC      CVAU        Clean by Virtual Address to Point of Unification                DCCMVAU
DC      ISW         Invalidate by Set/Way                                           DCISW
DC      IVAC        Invalidate by Virtual Address, to Point of Coherency            DCIMVAC
DC      ZVA         Cache zero by Virtual Address                                   -
IC      IALLUIS     Invalidate all, to Point of Unification, Inner Shareable        ICIALLUIS
IC      IALLU       Invalidate all, to Point of Unification                         ICIALLU
IC      IVAU        Invalidate by Virtual Address to Point of Unification           ICIMVAU

Those instructions that accept an address argument take a 64-bit register which holds the Virtual Address to be maintained. No alignment restrictions apply to this address. Instructions that take a Set/Way/Level argument take a 64-bit register whose lower 32 bits follow the format described in the ARMv7 architecture. The AArch64 Data Cache Invalidate by Virtual Address instruction, DC IVAC, requires write permission to the address, otherwise a permission fault is generated.
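As a minimal sketch of building a Set/Way operand, assume a hypothetical level 1 data cache with 64-byte lines (so the set field starts at bit 6) and 4 ways (so the way field occupies bits [31:30]); the cache level, minus one, is encoded in bits [3:1], so level 1 contributes zero:

  MOV W0, #(1 << 30)            // Way 1 in the way field (bits [31:30] for 4 ways)
  ORR W0, W0, #(3 << 6)         // Set 3 in the set field (bit 6 upwards for 64-byte lines)
                                // Level 1 encodes as 0 in bits [3:1], so nothing to OR in
  DC CSW, X0                    // Clean the line at level 1, set 3, way 1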

All instruction cache maintenance instructions can execute in any order relative to other instruction cache maintenance instructions, data cache maintenance instructions, and loads and stores, unless a DSB is executed between the instructions.

Data cache operations, other than DC ZVA, that specify an address are only guaranteed to execute in program order relative to each other if they specify the same address. Those operations that specify an address execute in program order relative to all maintenance operations that do not specify an address.

Consider the following code sequence.

Example 11.1. Cache invalidate and clean to PoU

  IC IVAU, X0        // Instruction Cache Invalidate by address to Point of Unification
  DC CVAC, X0        // Data Cache Clean by address to Point of Coherency
  IC IVAU, X1        // Might be out of order relative to the previous operations if
                     // X0 and X1 differ

The first two instructions execute in order, as they refer to the same address. However, the final instruction might be re-ordered relative to the previous operations, as it refers to a different address.
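If the ordering matters, a DSB between the operations enforces it, as noted above. A minimal sketch, using DSB ISH as in Example 11.4:

  IC IVAU, X0        // Instruction Cache Invalidate by address to Point of Unification
  DC CVAC, X0        // Data Cache Clean by address to Point of Coherency
  DSB ISH            // Complete the operations above before those below are observed
  IC IVAU, X1        // Now cannot be reordered before the earlier operations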

Example 11.2. Cache invalidate to PoU

  IC IVAU, X0        // I cache Invalidate by address to Point of Unification 
  IC IALLU           // I cache Invalidate All to Point of Unification 
                     // Operations execute in order 

Note that this ordering applies only to the issue of the instructions; completion is guaranteed only after a DSB instruction.
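For example, to ensure that the invalidations in Example 11.2 have completed before the newly visible instructions are executed, a DSB followed by an ISB can be appended, as a minimal sketch:

  IC IVAU, X0        // I cache Invalidate by address to Point of Unification
  IC IALLU           // I cache Invalidate All to Point of Unification
  DSB ISH            // Wait for both invalidations to complete
  ISB                // Refetch the instruction stream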

The ability to preload the data cache with zero values using the DC ZVA instruction is new in ARMv8-A. Processors can operate significantly faster than external memory systems and it can sometimes take a long time to load a cache line from memory.

Cache line zeroing behaves in a similar fashion to a prefetch, in that it is a way of hinting to the processor that certain addresses are likely to be used in the future. However, a zeroing operation can be much quicker because there is no need to wait for external memory accesses to complete. Instead of reading the actual data from memory into the cache, you get cache lines filled with zeros. In effect, it tells the processor that the code is going to overwrite the cache line contents completely, so no initial read is needed.

Consider the case where you need a large temporary storage buffer or are initializing a new structure. You could have code simply start using the memory, or you could write code that prefetched it before using it. Both would use a lot of cycles and memory bandwidth in reading the initial contents to the cache. By using a cache zero option, you could potentially save this wasted bandwidth and execute the code faster.
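As a minimal sketch, assuming the buffer base address is in X0 and is aligned to the zeroing block size, the length in X1 is a non-zero multiple of that block size, and the label names are illustrative, a buffer could be zeroed like this:

        MRS  X2, DCZID_EL0       // Block size information for DC ZVA
        TBNZ X2, #4, slow_zero   // DZP set: DC ZVA prohibited, zero with stores instead
        AND  X2, X2, #0xF        // BS field = log2(block size in words)
        MOV  X3, #4              // Bytes per word
        LSL  X3, X3, X2          // X3 = zeroing block size in bytes
zero_loop:
        DC   ZVA, X0             // Fill one block with zeros without reading memory
        ADD  X0, X0, X3          // Advance to the next block
        SUBS X1, X1, X3          // Decrement the remaining length
        B.GT zero_loop
        B    zero_done
slow_zero:                       // Fallback loop using ordinary stores (not shown)
zero_done: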

How far through the memory system a cache maintenance instruction acts depends on whether it operates by VA or by Set/Way. Operations by VA act to a defined point in the memory system, whereas operations by Set/Way act on a line in an explicitly selected level of cache.

You can choose the scope, either PoC or PoU, and, for operations that can be broadcast (see Chapter 14 Multi-core processors), you can select the Shareability.
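As an illustration of the different scopes (a sketch of the distinction only): cleaning to the PoU makes a write visible to instruction fetches on the same core, cleaning to the PoC makes it visible to all observers of that location, and the Inner Shareable form of the instruction cache invalidate is broadcast to the other cores in the same Inner Shareable domain:

  DC CVAU, X0        // Clean by VA to the Point of Unification
  DC CVAC, X0        // Clean by VA to the Point of Coherency
  IC IALLUIS         // Invalidate all instruction caches to PoU, broadcast within the
                     // Inner Shareable domain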

The following example code illustrates a generic mechanism for cleaning the entire data or unified cache to the PoC.

Example 11.3. Cleaning to Point of Coherency

       MRS X0, CLIDR_EL1
       AND W3, W0, #0x07000000  // Get 2 x Level of Coherence
       LSR W3, W3, #23
       CBZ W3, Finished
       MOV W10, #0              // W10 = 2 x cache level
       MOV W8, #1               // W8 = constant 0b1
Loop1: ADD W2, W10, W10, LSR #1 // Calculate 3 x cache level
       LSR W1, W0, W2           // extract 3-bit cache type for this level
       AND W1, W1, #0x7
       CMP W1, #2
       B.LT Skip                // No data or unified cache at this level
       MSR CSSELR_EL1, X10      // Select this cache level
       ISB                      // Synchronize change of CSSELR
       MRS X1, CCSIDR_EL1       // Read CCSIDR
       AND W2, W1, #7           // W2 = log2(linelen)-4
       ADD W2, W2, #4           // W2 = log2(linelen)
       UBFX W4, W1, #3, #10     // W4 = max way number, right aligned
       CLZ W5, W4               // W5 = 32-log2(ways), bit position of way in DC operand
       LSL W9, W4, W5           // W9 = max way number, aligned to position in DC operand
       LSL W16, W8, W5          // W16 = amount to decrement way number per iteration
Loop2: UBFX W7, W1, #13, #15    // W7 = max set number, right aligned
       LSL W7, W7, W2           // W7 = max set number, aligned to position in DC operand
       LSL W17, W8, W2          // W17 = amount to decrement set number per iteration
Loop3: ORR W11, W10, W9         // W11 = combine way number and cache level...
       ORR W11, W11, W7         // ... and set number for DC operand
       DC CSW, X11              // Do data cache clean by set and way
       SUBS W7, W7, W17         // Decrement set number
       B.GE Loop3
       SUBS X9, X9, X16         // Decrement way number
       B.GE Loop2
Skip:  ADD W10, W10, #2         // Increment 2 x cache level
       CMP W3, W10
       DSB SY                   // Ensure completion of previous cache maintenance operations
       B.GT Loop1
Finished:

Note that this sequence cleans only the data and unified caches; instruction caches are not affected.

If software requires coherency between instruction execution and memory, it must manage this coherency using the ISB and DSB memory barriers and cache maintenance instructions. The code sequence shown in Example 11.4 can be used for this purpose.

Example 11.4. Cleaning a line of self-modifying code

  /* Coherency example for data and instruction accesses within the same Inner
     Shareable domain. Enter this code with <Wt> containing a new 32-bit instruction,
     to be held in Cacheable space at a location pointed to by Xn. */
  STR Wt, [Xn]
  DC CVAU, Xn     // Clean data cache by VA to point of unification (PoU)
  DSB ISH         // Ensure visibility of the data cleaned from cache
  IC IVAU, Xn     // Invalidate instruction cache by VA to PoU
  DSB ISH         // Ensure completion of the invalidations
  ISB             // Synchronize the fetched instruction stream

This code sequence is only valid for an instruction sequence that fits into a single I or D-cache line.

The code in Example 11.5 cleans the data cache and invalidates the instruction cache by Virtual Address for a region whose base address is passed in X0 and whose length is passed in X1.

Example 11.5.  Cleaning by Virtual Address

  //
  // X0 = base address
  // X1 = length (we assume the length is not 0)
  //

  // Calculate end of the region
  ADD X1, X1, X0                // X1 = end of the region (base address + length)

  //
  // Clean the data cache by MVA
  //
  MRS X2, CTR_EL0               // Read Cache Type Register

  //
  // Get the minimum data cache line size
  //
  UBFX X4, X2, #16, #4          // Extract DminLine (log2 of the line size in words)
  MOV X3, #4                    // Bytes per word
  LSL X3, X3, X4                // X3 = data cache line size in bytes
  SUB X4, X3, #1                // Mask for the cache line

  BIC X4, X0, X4                // Align the base address to the cache line
clean_data_cache:
  DC CVAU, X4                   // Clean data cache line by VA to PoU
  ADD X4, X4, X3                // Next cache line
  CMP X4, X1                    // Is the current cache line still below the end
                                // of the region?
  B.LT clean_data_cache         // while (address < end_address)

  DSB ISH                       // Ensure visibility of the data cleaned from cache
  
  //
  // Invalidate the instruction cache by VA
  //
  // Get the minimum instruction cache line size (X2 still holds CTR_EL0)
  AND X2, X2, #0xF             // Extract IminLine (log2 of the line size in words)
  MOV X3, #4                   // Bytes per word
  LSL X3, X3, X2               // X3 = instruction cache line size in bytes
  SUB X4, X3, #1               // Mask for the cache line

  BIC X4, X0, X4               // Align the base address to the cache line
invalidate_instruction_cache:
  IC IVAU, X4                  // Invalidate instruction cache line by VA to PoU
  ADD X4, X4, X3               // Next cache line
  CMP X4, X1                   // Is the current cache line still below the end
                               // of the region?
  B.LT invalidate_instruction_cache  // while (address < end_address)

  DSB ISH                      // Ensure completion of the invalidations
  ISB                          // Synchronize the fetched instruction stream
