ARM Technical Support Knowledge Articles

In what situations might I need to insert memory barrier instructions?

Applies to: ARM Architecture and Instruction Sets, ARMv6 Architecture, ARMv7 Architecture, DS-5, RealView Development Suite (RVDS)

Answer

This FAQ aims to help users understand when, why and how to use memory barrier instructions.

Classic ARM processors, such as the ARM7TDMI, execute instructions and complete data accesses in program order.  The latest ARM processors can optimize the order of instruction execution and data accesses.  For example, an ARM architecture v6 or v7 processor could optimize the following sequence of instructions:

      LDR r0, [r1]   ; Load from Normal/Cacheable memory leads to a cache miss
      STR r2, [r3]   ; Store to Normal/Non-cacheable memory

If the first load misses in the cache, it causes a cache linefill, which typically takes many cycles to complete.  Classic (cached) ARM processors, for example the ARM926EJ-S, would wait for the load to complete before executing the store instruction.  ARM architecture v6/v7 based processors can recognize that the next instruction does not depend on the result of the load (in register r0) and can execute the store instruction before the load instruction completes.

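In some circumstances, processor optimizations such as speculative reads or out-of-order execution (as in the example above) are undesirable and can lead to unintended program behaviour.  In such situations it is necessary to insert barrier instructions into code where there is a requirement for stricter, 'Classic ARM processor-like' behaviour.  There are three types of barrier instructions.  For simplicity, note that the descriptions below are for a uni-processor environment:

Data Memory Barrier (DMB): ensures that all explicit memory accesses before the DMB are observed before any explicit memory accesses after the DMB.  It does not affect the ordering of other instructions.

Data Synchronization Barrier (DSB): ensures that all explicit memory accesses before the DSB have completed before any instruction after the DSB executes.

Instruction Synchronization Barrier (ISB): flushes the pipeline and prefetch buffer, so that all instructions after the ISB are fetched from cache or memory after the ISB has completed.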

Note that the CP15 equivalent barrier instructions available in ARMv6 are deprecated in ARMv7. Therefore, if possible, it is recommended that any code that uses these instructions is migrated to use the new barrier instructions described above instead.
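The load/store pair from the start of this article can also be written in C.  The following is a minimal sketch using C11 atomics, not ARM's own example: the function name load_then_store is hypothetical, and it assumes a C11 compiler.  On ARMv7 targets, compilers such as GCC typically lower a sequentially consistent fence to a DMB, but that mapping is a toolchain detail and should be checked in the generated code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: load from *src, then store value to *dst, with a
 * full fence in between so the store cannot be observed ahead of the load. */
uint32_t load_then_store(_Atomic uint32_t *src, _Atomic uint32_t *dst,
                         uint32_t value)
{
    assert(src != NULL && dst != NULL);

    /* Counterpart of: LDR r0, [r1] */
    uint32_t loaded = atomic_load_explicit(src, memory_order_relaxed);

    /* Counterpart of a barrier placed between the two accesses */
    atomic_thread_fence(memory_order_seq_cst);

    /* Counterpart of: STR r2, [r3] */
    atomic_store_explicit(dst, value, memory_order_relaxed);

    return loaded;
}
```

Without the fence, the relaxed load and store are independent, so both the compiler and the processor would be free to reorder them.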

Mutexes

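It is architecturally defined that software must perform a Data Memory Barrier (DMB) operation between acquiring a resource (for example, by locking a mutex) and accessing that resource, and again before making the resource available (for example, by unlocking the mutex).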

The following examples show an implementation of a simple blocking mutex:

      LOCKED   EQU 1
      UNLOCKED EQU 0

lock_mutex
      ; Is mutex locked?
      LDREX r1, [r0]         ; Check if locked
      CMP r1, #LOCKED        ; Compare with "locked"

      WFEEQ                  ; Mutex is locked, go into standby

      BEQ lock_mutex         ; On waking re-check the mutex                          

      ; Attempt to lock mutex
      MOV r1, #LOCKED
      STREX r2, r1, [r0]     ; Attempt to lock mutex
      CMP r2, #0x0           ; Check whether store completed
      BNE lock_mutex         ; If store failed, try again

      DMB                    ; Required before accessing protected resource

      BX lr

unlock_mutex
      DMB                    ; Ensure accesses to protected resource have completed
      MOV r1, #UNLOCKED      ; Write "unlocked" into lock field
      STR r1, [r0]

      DSB                    ; Ensure update of the mutex occurs before other CPUs wake

      SEV                    ; Send event to other CPUs, wakes any CPU waiting in WFE

      BX lr

The DSB ensures that the update to the synchronization variable is visible to all processors before SEV is executed.
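The same barrier placement can be sketched in C11.  This is an illustrative sketch under stated assumptions, not ARM's implementation: the type simple_mutex and its functions are hypothetical names, the acquire and release fences stand in for the two DMB instructions, and the WFE/SEV power-saving handshake has no portable C equivalent, so the lock simply busy-waits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { UNLOCKED = 0, LOCKED = 1 };

typedef struct { atomic_int state; } simple_mutex;   /* hypothetical type */

/* Try once to take the mutex; returns true on success. */
bool simple_mutex_trylock(simple_mutex *m)
{
    assert(m != NULL);
    int expected = UNLOCKED;

    /* The relaxed compare-exchange plays the role of the LDREX/STREX pair. */
    bool ok = atomic_compare_exchange_strong_explicit(
        &m->state, &expected, LOCKED,
        memory_order_relaxed, memory_order_relaxed);

    if (ok) {
        /* Counterpart of the DMB before accessing the protected resource. */
        atomic_thread_fence(memory_order_acquire);
    }
    return ok;
}

void simple_mutex_lock(simple_mutex *m)
{
    while (!simple_mutex_trylock(m)) {
        /* Busy-wait; the assembly version sleeps in WFE instead. */
    }
}

void simple_mutex_unlock(simple_mutex *m)
{
    assert(m != NULL);

    /* Counterpart of the DMB that precedes the unlocking store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&m->state, UNLOCKED, memory_order_relaxed);
}
```

Keeping the fences separate from the relaxed atomic operations mirrors the assembly, where the DMB is a distinct instruction next to the exclusive access or the plain store.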

Memory Remapping

Consider a situation where your reset handler/boot code lives in Flash memory (ROM), which is aliased to address 0x0 to ensure that your program boots correctly from the vector table, which normally resides at the bottom of memory (see left-hand-side memory map).

After you have initialized your system, you may wish to turn off the Flash memory alias so that you can use the bottom portion of memory for RAM (see right-hand-side memory map).  The following code (running from the permanent Flash memory region) disables the Flash alias, before calling a memory block copying routine (e.g., memcpy) to copy some data to the bottom portion of memory (RAM).

     MOV r0, #0
     LDR r1, =REMAP_REG               ; Load address of remap register
     STR r0, [r1]                     ; Disable Flash alias
     DMB                              ; Ensure completion with Data Memory Barrier      

     BL block_copy_routine            ; Block copy code into RAM
     DSB                              ; Ensure block copy is completed with Data Synchronization Barrier
     ISB                              ; Ensure pipeline flush with Instruction Synchronization Barrier

     BL copied_routine                ; Execute copied routine (now in RAM)

Without the DMB between the store (STR) and the branch with link (BL) instructions, there is no guarantee that the store out to memory will complete before the block copying routine starts writing to the bottom portion of memory, because the block copying routine can execute while the data is draining through the write buffer.  The DSB causes all instructions before the DSB to complete.  The ISB prevents instructions being fetched from RAM before the block copying routine completes.
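The first half of this sequence (the remap write, the barrier, then the copy) might look as follows in C.  This is a rough sketch under assumptions: disable_alias_and_copy and ordering_barrier are hypothetical names, the remap register address is passed in because it is board-specific, and the inline DMB is only used when building for ARMv7 or later with a GNU-compatible compiler.  On other targets a C11 fence is used as a stand-in so that the sketch still builds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper: a real DMB on ARMv7 and later (GNU-style inline
 * assembly), otherwise a C11 fence as a host-side stand-in. */
static inline void ordering_barrier(void)
{
#if defined(__GNUC__) && defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    __asm__ volatile("dmb" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Hypothetical helper: disable the Flash alias, then copy into low RAM. */
void disable_alias_and_copy(volatile uint32_t *remap_reg,
                            void *low_ram, const void *src, size_t len)
{
    assert(remap_reg != NULL && low_ram != NULL && src != NULL);

    *remap_reg = 0;              /* Disable Flash alias (the STR above)   */
    ordering_barrier();          /* Remap write ordered before the copy   */
    memcpy(low_ram, src, len);   /* Block copy into the bottom of memory  */
}
```

The DSB and ISB that follow the copy are not shown here because ISB has no portable C equivalent; before executing the copied code they would still have to be issued through inline assembly or compiler intrinsics.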

Self-modifying code

Self-modifying code sequences must be preceded by an ISB, because the prefetch unit pipeline and the core pipeline may contain out-of-date instructions.

The example below shows a block of code being copied from ROM to RAM, and then branched to and executed. 

Overlay_manager
      ; ...
      BL block_copy                    ; Copy new routine from ROM to RAM

      B relocated_code                 ; Branch to new routine

Due to speculative prefetching the processor might attempt to fetch instructions from the relocated region before the block copy has completed.  To ensure that this optimization does not occur, you should insert an ISB before any newly relocated code begins executing, to ensure that the prefetch buffer is flushed before the processor continues fetching instructions:

Overlay_manager
      ; ...
      BL block_copy                    ; Copy new routine from ROM to RAM
     
      DSB                              ; Ensure block copy has completed
      ISB                              ; Ensure processor fetches new instructions
      B relocated_code                 ; Branch to new routine

If the memory targeted by the block copy is marked as 'cacheable', the instruction cache must be invalidated so that the processor does not execute stale 'cached' code.  For "write-back" regions, the data cache must be cleaned before the instruction cache is invalidated.

Overlay_manager
      ; ...
      BL block_copy                    ; Copy new routine from ROM to RAM
      data_cache_clean                 ; Clean the cache so that the new routine is written out to memory
      DSB                              ; Ensure data cache clean has completed
      icache_and_pb_invalidate         ; Invalidate the instruction cache and branch predictor so that the old routine is no longer cached
     
      DSB                              ; Ensure cache and branch predictor invalidation has completed
      ISB                              ; Ensure processor fetches new instructions

      B relocated_code                 ; Branch to new routine
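When the code is written in C and built with GCC or Clang, the clean, invalidate and barrier steps are usually requested through the compiler builtin __builtin___clear_cache rather than written by hand.  The sketch below assumes such a compiler; copy_code_and_sync is a hypothetical name.  What the builtin actually does depends on the target and operating system, and on a bare-metal system the explicit sequence shown above may still be required.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy len bytes of code to dst and ask the toolchain
 * to make the new bytes visible to instruction fetch.  Returns dst. */
void *copy_code_and_sync(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    memcpy(dst, src, len);                 /* The block copy */

    char *begin = (char *)dst;
    /* GCC/Clang builtin: flushes the instruction cache for the range
     * from begin (inclusive) to begin + len (exclusive). */
    __builtin___clear_cache(begin, begin + len);

    return dst;
}
```

The destination must also be mapped as executable before it is branched to; that step is outside the scope of this sketch.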

Similar situations where you may require a barrier are:

Further Reading:

There are many more examples of where barriers are needed. For more information on the use of barriers see:

Attachments

memory_barriers_1.PNG

Copyright © 2011 ARM Limited. All rights reserved. External (Open), Non-Confidential