ARM Technical Support Knowledge Articles

What is the fastest way to copy memory on a Cortex-A8?

Applies to: Cortex-A8, RealView Development Suite (RVDS)


Many applications frequently copy substantial amounts of data from one area of memory to another, typically using the memcpy() C library function. As this can be quite time-consuming, it may be worth spending some time optimizing the functions that do this. There is no single ‘best method’ for implementing a copy routine, as the performance will depend upon many factors (see below). This article looks at seven possible schemes, and compares their performance in a particular system.

Target system and conditions

The various schemes were tested using a Beagle Board from Texas Instruments. This board is built around an OMAP 3530 SoC, which contains an ARM Cortex-A8 processor. The implementations have been written with the assumption that the source address, destination address and number of bytes to transfer are all multiples of the level 1 cache line size (64 bytes). However, the tests all copied 16 Mbytes of data, so the overhead of checking alignment and ensuring the main loop could assume 64-byte alignment would be insignificant by comparison (this may not be the case for smaller memory copies). The execution time was measured using the performance counters integrated into the Cortex-A8 processor, which provide a highly accurate measure.
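
In C, the 64-byte alignment assumption can be tested cheaply before committing to the aligned fast path. A minimal sketch (the helper name is illustrative, not part of the original code):

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* Cortex-A8 level 1 cache line size in bytes */

/* Nonzero if dst, src and len are all multiples of the cache line size,
   so a 64-byte-aligned copy loop can be used directly. */
static int fast_path_ok(const void *dst, const void *src, size_t len)
{
    return (((uintptr_t)dst | (uintptr_t)src | (uintptr_t)len)
            & (CACHE_LINE - 1u)) == 0;
}
```

A single OR-and-mask covers all three operands at once, which keeps the check down to a few instructions on the slow path into the routine.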

For all tests, the L1NEON bit was set, meaning that loads using the NEON instruction set can cause an L1 data cache linefill. Both the level 1 and level 2 caches were enabled, with the code and data memory regions in use marked as cacheable. The MMU and branch prediction were also enabled.

Some of the routines make use of the preload instruction (PLD). This instruction causes the level 2 cache to start loading the data some time before the processor executes the instructions that access it. This effectively starts the memory request early, so the processor may not have to wait as long for the data to be available.
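
Compilers such as GCC and Arm Compiler expose PLD to C code through the __builtin_prefetch() builtin. A sketch of a word copy that prefetches the source 256 bytes ahead (the distance is illustrative and would need tuning on a real system):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy 'len' bytes (a multiple of 4) word by word, prefetching the source
   256 bytes ahead. On ARM targets __builtin_prefetch() compiles to PLD;
   on targets without a prefetch instruction it is simply a no-op. */
void copy_words_prefetch(uint32_t *dst, const uint32_t *src, size_t len)
{
    for (size_t i = 0; i < len / 4; i++) {
        /* 256 bytes ahead; prefetching past the end is a harmless hint */
        __builtin_prefetch((const char *)src + i * 4 + 256);
        dst[i] = src[i];
    }
}
```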


  1. Word by Word memory copy

    This is a very simple loop which reads one word from the source, writes it to the destination and decrements a counter. The performance of this function is taken as the reference for the other tests.

      WordCopy
          LDR r3, [r1], #4
          STR r3, [r0], #4
          SUBS r2, r2, #4
          BGT WordCopy
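
    The same loop can be written in C (r0, r1 and r2 in the assembly correspond to dst, src and len here):

```c
#include <stddef.h>
#include <stdint.h>

/* Word-by-word copy: len is the byte count and must be a multiple of 4. */
void word_copy(uint32_t *dst, const uint32_t *src, size_t len)
{
    while (len > 0) {
        *dst++ = *src++;  /* LDR r3, [r1], #4 / STR r3, [r0], #4 */
        len -= 4;         /* SUBS r2, r2, #4 */
    }
}
```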

  2. Load-Multiple memory copy

    The previous example is modified to use LDM and STM instructions, transferring 8 words per iteration. Because extra registers are used, they must be saved to the stack on entry and restored afterwards.

          PUSH {r4-r10}
      LDMloop
          LDMIA r1!, {r3 - r10}
          STMIA r0!, {r3 - r10}
          SUBS r2, r2, #32
          BGT LDMloop
          POP {r4-r10}
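
    The effect of LDM/STM can be approximated in C by copying eight words per iteration (a hand-unrolled sketch, not the code the compiler necessarily generates):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy eight words (32 bytes) per iteration, mirroring the LDM/STM pair.
   len is the byte count and must be a multiple of 32. */
void block_copy32(uint32_t *dst, const uint32_t *src, size_t len)
{
    while (len > 0) {
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];
        uint32_t e = src[4], f = src[5], g = src[6], h = src[7];
        dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;
        dst[4] = e; dst[5] = f; dst[6] = g; dst[7] = h;
        src += 8; dst += 8;
        len -= 32;
    }
}
```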

  3. NEON memory copy

    This implementation uses NEON load and store instructions.

      NEONCopy
          VLDM r1!, {d0-d7}
          VSTM r0!, {d0-d7}
          SUBS r2, r2, #0x40
          BGT NEONCopy

  4. Word by Word memory copy with preload

    The word-by-word loop is extended with a PLD of the source 256 bytes (0x100) ahead; an inner loop copies 16 words, so one preload is issued per 64-byte block.

      WordCopyPLD
          PLD [r1, #0x100]
          MOV r12, #16
      WordCopyPLD1
          LDR r3, [r1], #4
          STR r3, [r0], #4
          SUBS r12, r12, #1
          BNE WordCopyPLD1
          SUBS r2, r2, #0x40
          BNE WordCopyPLD

  5. Load-Multiple memory copy with preload

    The Load-Multiple loop is extended with a PLD of the source 128 bytes (0x80) ahead, copying 64 bytes (two LDM/STM pairs) per iteration.

          PUSH {r4-r10}
      LDMloopPLD
          PLD [r1, #0x80]
          LDMIA r1!, {r3 - r10}
          STMIA r0!, {r3 - r10}
          LDMIA r1!, {r3 - r10}
          STMIA r0!, {r3 - r10}
          SUBS r2, r2, #0x40
          BGT LDMloopPLD
          POP {r4-r10}

  6. NEON memory copy with preload

    The NEON loop is extended with a PLD of the source 192 bytes (0xC0) ahead.

      NEONCopyPLD
          PLD [r1, #0xC0]
          VLDM r1!, {d0-d7}
          VSTM r0!, {d0-d7}
          SUBS r2, r2, #0x40
          BGT NEONCopyPLD

  7. Mixed ARM and NEON memory copy with preload

    This final implementation interleaves ARM and NEON multiple load and store instructions, as well as using the preload instruction.

          PUSH {r4-r11}
          MOV r3, r0
      ARMNEON
          PLD [r1, #192]
          PLD [r1, #256]
          VLD1.64 {d0-d3}, [r1@128]!
          VLD1.64 {d4-d7}, [r1@128]!
          VLD1.64 {d16-d19}, [r1@128]!
          LDM r1!, {r4-r11}
          SUBS r2, r2, #128
          VST1.64 {d0-d3}, [r3@128]!
          VST1.64 {d4-d7}, [r3@128]!
          VST1.64 {d16-d19}, [r3@128]!
          STM r3!, {r4-r11}
          BGT ARMNEON
          POP {r4-r11}


Test                                      Cycles      Time (seconds)   Mbytes per second   Relative
Word by Word memory copy                  52401776    0.104804         152.6665814         100%
Load-Multiple memory copy                 47235011    0.094470         169.3658968         111%
NEON memory copy                          52389453    0.104779         152.7024915         100%
Word by Word memory copy with PLD         68774347    0.137549         116.3224421          76%
Load-Multiple memory copy with PLD        53277011    0.106554         150.1585740          98%
NEON memory copy with PLD                 35102279    0.070205         227.9054303         149%
Mixed ARM and NEON memory copy with PLD   46742131    0.093484         171.1518031         112%
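
The Cycles and Time columns are consistent with a 500 MHz core clock (inferred from the data; the article does not state the clock frequency), and the Mbytes-per-second column is the 16 Mbytes copied divided by the time. A quick check of the first row:

```c
/* Throughput in Mbytes/s for a 16-Mbyte copy, given a cycle count and
   an assumed core clock frequency in Hz. */
double copy_rate_mbytes(double cycles, double clock_hz)
{
    const double bytes = 16.0 * 1024 * 1024;  /* 16 Mbytes copied */
    double seconds = cycles / clock_hz;
    return bytes / seconds / (1024 * 1024);
}
```

With cycles = 52401776 and clock_hz = 500e6 this reproduces the 0.104804 s and 152.67 Mbytes/s of the word-by-word row.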

Some of these results may be surprising.

The Load-Multiple routine offers only 11% better performance, despite executing far fewer instructions and far fewer branches. The gain is limited because the processor achieves an effectively 100% instruction cache hit rate, so instructions can be fetched as quickly as they can be executed, and branch prediction works efficiently in this loop, negating the cost of the extra branches. The merging write buffer also means that the memory system sees burst writes even for the single-word stores, so the multiple-register stores gain little on the write side.

The NEON memory copy routine has a few benefits that are not visible in the performance of the copy itself. Firstly, the loop executes without corrupting the contents of most of the ARM (integer) registers. This would be of more benefit for a small memory copy, where the overhead of stacking and restoring those registers would be significant. The Cortex-A8 processor can also be configured (though it was not for these tests) to allocate NEON loads only into the level 2 cache; this would prevent the memory copy routine replacing existing data in the level 1 cache with data that will not be reused. However, the results show that the copy performance itself is the same as with ARM code.

A large gain is achieved by using the PLD instruction within the NEON code loop. The preload allows the processor to issue the request for a future load to the memory system before the data is required, meaning the memory controller can start accessing those locations early. The nature of SDRAM means that (with a suitable controller) having multiple outstanding requests to work on can hide the long latency of the first access in a burst. This routine is similar to the one used by the ARM compiler when generating code for the Cortex-A8.

Factors affecting memory copy performance

Other methods to copy memory

The C function memcpy() is blocking; once it is called it does not return until the memory copy is complete. This can result in the processor pipeline being idle for many cycles while waiting for memory accesses. It is often more efficient to perform the memory copy in the background, allowing the processor pipeline to be used for other tasks. However, this is likely to be more complex to control, requiring the application either to poll for completion or to handle an interrupt on completion.
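
As an illustration of that control flow (a real system would typically hand the transfer to a DMA controller rather than a second software thread), a background copy can be sketched with POSIX threads and a completion flag. All of the names here are illustrative:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>

/* Illustrative background copy: the copy runs on a worker thread while the
   caller continues, then polls the flag or joins to wait for completion. */
struct bg_copy {
    void *dst;
    const void *src;
    size_t len;
    atomic_int done;
    pthread_t thread;
};

static void *bg_copy_worker(void *arg)
{
    struct bg_copy *c = arg;
    memcpy(c->dst, c->src, c->len);   /* the actual copy, off the main thread */
    atomic_store(&c->done, 1);        /* signal completion */
    return NULL;
}

void bg_copy_start(struct bg_copy *c, void *dst, const void *src, size_t len)
{
    c->dst = dst; c->src = src; c->len = len;
    atomic_init(&c->done, 0);
    pthread_create(&c->thread, NULL, bg_copy_worker, c);
}

int bg_copy_poll(struct bg_copy *c)   /* nonzero once the copy is complete */
{
    return atomic_load(&c->done);
}

void bg_copy_wait(struct bg_copy *c)  /* block until the copy is complete */
{
    pthread_join(c->thread, NULL);
}
```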

Copyright © 2011 ARM Limited. All rights reserved. External (Open), Non-Confidential