4.6. Optimize loops

Loops are used for intensive computations in both application code and shaders, and they can take up a large share of processing time. You can make application and shader code faster by optimizing loops.

Generally, the key to loop optimization is to do as little as possible in each iteration so that every iteration runs faster. If a loop runs ten thousand times, removing one instruction from the loop body removes ten thousand executed instructions.

Move repeated computations

If there are computations in a loop that can be done before the loop, move these computations to outside the loop.

Look for instructions that compute the same value over and over again. You can move this computation out of the loop.
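As a minimal sketch of this idea in C, the two functions below scale an array by the same factor. The names `scale_in_loop`, `scale_hoisted`, `exposure`, and `gain` are hypothetical and only serve the illustration. An optimizing compiler might hoist the invariant expression by itself, so measure before assuming a gain.

```c
#include <stddef.h>

/* Before: the factor does not depend on i, yet it is recomputed
   on every iteration. All names here are hypothetical. */
void scale_in_loop(float *pixels, size_t count, float exposure, float gain)
{
    for (size_t i = 0; i < count; i++) {
        pixels[i] = pixels[i] * (exposure * gain + 1.0f);
    }
}

/* After: the invariant factor is computed once, before the loop. */
void scale_hoisted(float *pixels, size_t count, float exposure, float gain)
{
    const float factor = exposure * gain + 1.0f;

    for (size_t i = 0; i < count; i++) {
        pixels[i] = pixels[i] * factor;
    }
}
```

Both functions produce the same results. Only the amount of work done per iteration differs.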

Move unused computations

If there are computations in the loop that generate results that are not used within the loop, move these computations to outside the loop.
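The following C sketch shows one possible case, assuming a mean value that is only needed after the loop has finished. The function names `mean_in_loop` and `mean_after_loop` are hypothetical.

```c
#include <stddef.h>

/* Before: the mean is updated on every iteration, but only the
   final value is ever used. Names are hypothetical. */
float mean_in_loop(const float *values, size_t count)
{
    float sum = 0.0f;
    float mean = 0.0f;

    for (size_t i = 0; i < count; i++) {
        sum += values[i];
        mean = sum / (float)count;  /* result not used inside the loop */
    }
    return mean;
}

/* After: the loop only accumulates, and the division happens once. */
float mean_after_loop(const float *values, size_t count)
{
    float sum = 0.0f;

    for (size_t i = 0; i < count; i++) {
        sum += values[i];
    }
    return (count > 0) ? sum / (float)count : 0.0f;
}
```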

Avoid computations in the iteration test

Every time a loop is iterated, a conditional test determines whether another iteration is required. Make this test as simple as possible. Consider whether the entire computation must be performed each time. If possible, move any pre-computable parts outside the loop.
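A common instance in C is calling `strlen` in the loop condition. In the sketch below, the loop writes to the string, so the compiler generally cannot prove that the length is unchanged and may call `strlen` on every iteration. The function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Before: strlen() is evaluated as part of every iteration test. */
void spaces_to_underscores_slow(char *text)
{
    for (size_t i = 0; i < strlen(text); i++) {
        if (text[i] == ' ') {
            text[i] = '_';
        }
    }
}

/* After: the length is computed once, so the iteration test is a
   single comparison. Replacing a space never changes the length. */
void spaces_to_underscores_fast(char *text)
{
    const size_t length = strlen(text);

    for (size_t i = 0; i < length; i++) {
        if (text[i] == ' ') {
            text[i] = '_';
        }
    }
}
```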

Simplify code

Avoid complex code constructs. It is easier for the compiler to optimize if your code is simpler.

Avoid cross-iteration dependencies

Try to keep the computations in iterations independent of other iterations. This makes a number of optimizations possible in the compiler.
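As an illustrative sketch in C, the first function below carries a value from one iteration to the next, while the second computes each element from the loop index alone. The names `ramp_dependent` and `ramp_independent` are hypothetical, and whether the second form is faster depends on the compiler and target.

```c
#include <stddef.h>

/* Dependent: each iteration needs the value produced by the
   previous iteration. Names are hypothetical. */
void ramp_dependent(int *out, size_t count, int start, int step)
{
    int value = start;

    for (size_t i = 0; i < count; i++) {
        out[i] = value;
        value += step;  /* carried into the next iteration */
    }
}

/* Independent: each element depends only on the index i, so the
   iterations can be reordered, unrolled, or vectorized freely. */
void ramp_independent(int *out, size_t count, int start, int step)
{
    for (size_t i = 0; i < count; i++) {
        out[i] = start + (int)i * step;
    }
}
```

Integers are used here so that both versions give identical results. With floating-point values, the two forms can round differently.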

Work on small blocks of data

Ensure the inner loops work on small blocks of data. Smaller blocks of data are more cacheable.
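One way to apply this is loop tiling. The C sketch below transposes a row-major matrix one tile at a time so that the inner loops touch a small region of both arrays. The function name `transpose_tiled` and the tile size of 8 are assumptions for illustration. A suitable tile size depends on the cache of the target processor and should be found by testing.

```c
#include <stddef.h>

/* Hypothetical tile edge length, not a recommended value. */
#define TILE 8

/* Transpose a rows x cols row-major matrix into dst (cols x rows),
   working on one TILE x TILE block at a time. */
void transpose_tiled(const float *src, float *dst, size_t rows, size_t cols)
{
    for (size_t r0 = 0; r0 < rows; r0 += TILE) {
        for (size_t c0 = 0; c0 < cols; c0 += TILE) {
            /* Clip the tile at the matrix edges. */
            const size_t r_end = (r0 + TILE < rows) ? r0 + TILE : rows;
            const size_t c_end = (c0 + TILE < cols) ? c0 + TILE : cols;

            for (size_t r = r0; r < r_end; r++) {
                for (size_t c = c0; c < c_end; c++) {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
        }
    }
}
```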

Minimize code size

Keeping loops small, especially inner loops, makes their instructions more likely to be cached. This can have a significant impact on performance.

Unroll loops

You can take the computations from many loop iterations and combine them into a single, larger iteration. This reduces the number of iteration tests that are executed, so it reduces the computational load.

Test how well unrolling the loop works at different sizes. Over a certain threshold the loop becomes too big for efficient caching and performance drops.

On some microprocessors, loop unrolling can be done automatically, so you do not have to do it manually in your code. However, automatic unrolling is limited. If your CPU supports automatic unrolling, keeping the code small is the better option. Test to see what works best.
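A manual unroll can be sketched in C as follows. The function name `sum_unrolled` and the unroll factor of 4 are assumptions for illustration, and the factor is something to tune by measurement.

```c
#include <stddef.h>

/* Sum an array with the loop body unrolled four times.
   The unroll factor of 4 is an arbitrary illustrative choice. */
int sum_unrolled(const int *values, size_t count)
{
    int sum = 0;
    size_t i = 0;

    /* Main loop: one iteration test for every four elements. */
    for (; i + 4 <= count; i += 4) {
        sum += values[i] + values[i + 1] + values[i + 2] + values[i + 3];
    }

    /* Remainder loop: handles the last 0 to 3 elements. */
    for (; i < count; i++) {
        sum += values[i];
    }
    return sum;
}
```

The remainder loop is needed whenever the element count is not a multiple of the unroll factor.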

Avoid branches in loops

A conditional test in a loop generates at least one branch instruction. Branch instructions can slow a microprocessor down and are especially costly in loops. Avoid branches where possible, especially in inner loops.
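When the condition does not change between iterations, one option is to test it once and run a separate loop for each outcome. The C sketch below shows this, with hypothetical function names and a hypothetical `invert` flag. It trades larger code for fewer branches inside the loop, so test whether it helps on your target.

```c
#include <stddef.h>

/* Before: the same flag is tested on every iteration. */
void adjust_branch_inside(int *data, size_t count, int invert)
{
    for (size_t i = 0; i < count; i++) {
        if (invert) {
            data[i] = 255 - data[i];
        } else {
            data[i] = data[i] / 2;
        }
    }
}

/* After: the flag is tested once, and each loop body is branch-free. */
void adjust_branch_hoisted(int *data, size_t count, int invert)
{
    if (invert) {
        for (size_t i = 0; i < count; i++) {
            data[i] = 255 - data[i];
        }
    } else {
        for (size_t i = 0; i < count; i++) {
            data[i] = data[i] / 2;
        }
    }
}
```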

Do not make function calls in inner loops

Function calls in loops generate at least two branches and can initiate program reads from a different part of memory. If possible, avoid making function calls in loops and especially in inner loops.

Consider whether the data can be partly processed in the loop first, with the call then made once outside the loop.

Some function calls can be avoided by copying the contents of the function into the loop. The compiler might do this automatically.
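The following C sketch illustrates partly processing the data in the loop and making a single call afterwards. The callee `add_to_total` and the counters `g_total` and `g_calls` are hypothetical stand-ins for any function that is comparatively expensive to call.

```c
#include <stddef.h>

/* Hypothetical state and callee, used only to make the number of
   calls visible. */
static long g_total = 0;
static unsigned g_calls = 0;

static void add_to_total(long amount)
{
    g_total += amount;
    g_calls++;
}

/* Before: one function call per element. */
void accumulate_call_per_item(const int *values, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        add_to_total(values[i]);
    }
}

/* After: the loop builds a partial result locally, and the function
   is called once, outside the loop. */
void accumulate_call_once(const int *values, size_t count)
{
    long partial = 0;

    for (size_t i = 0; i < count; i++) {
        partial += values[i];
    }
    add_to_total(partial);
}
```

Both versions leave the same total, but the second makes one call regardless of the element count.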

Make use of tools

Various CPU tools can help with optimizing applications. Use what is available.

Use vector instructions if possible

Vector instructions process multiple data items at once, so you can use them to speed up a loop or reduce the number of iterations. For more information, see Use vector instructions.
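As a hedged sketch, assuming an application processor with NEON and a compiler that provides `arm_neon.h`, the C function below adds two arrays four floats at a time and falls back to a scalar loop elsewhere. The function name `add_arrays` is hypothetical.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Add two float arrays element by element. On targets with NEON,
   four elements are processed per iteration. */
void add_arrays(const float *a, const float *b, float *out, size_t count)
{
    size_t i = 0;

#if defined(__ARM_NEON)
    for (; i + 4 <= count; i += 4) {
        const float32x4_t va = vld1q_f32(a + i);  /* load 4 floats */
        const float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));    /* add and store */
    }
#endif

    /* Scalar loop: the remainder, or everything without NEON. */
    for (; i < count; i++) {
        out[i] = a[i] + b[i];
    }
}
```

The vector loop runs a quarter as many iterations as the scalar loop for the same data.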


If you are processing a very small number of items, it might be faster not to use a loop.

Copyright © 2011 ARM. All rights reserved. ARM DUI 0555A