ARM Technical Support Knowledge Articles
The ARM compiler (armcc) optimizes aggressively by default, with an emphasis on small code size. The main optimizations carried out by the compiler are listed below.
1. Common Subexpression Elimination (CSE)
The compiler identifies common sub-expressions in the code and reuses the result for each instance, rather than re-evaluating them each time. For example, code may use the expression 'a+1' in several places, re-evaluating it at each use. The compiler will identify this, evaluate it once, and use that value several times. These expressions can be very complex. This is one of the most effective optimizations in the compilers.
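A minimal C illustration of this (the function names here are hypothetical, chosen for the example): the second function shows the form the compiler effectively produces from the first.

```c
/* Written naively, 'base + i' is a common subexpression,
   evaluated twice: */
int pick(const int *a, int base, int i)
{
    return a[base + i] + (base + i);
}

/* Equivalent code after CSE, roughly as the compiler sees it: */
int pick_cse(const int *a, int base, int i)
{
    int t = base + i;   /* evaluated once */
    return a[t] + t;    /* result reused */
}
```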
2. Loop Invariant Motion (Expression Lifting)
This is the 'lifting' of expressions out of loops. The compiler can identify that a particular expression inside a loop does not change while the loop is running. Continuously re-evaluating that expression would be costly, so the compiler evaluates it only once, before the loop.
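A sketch in C (function names are hypothetical): 'scale * scale' does not change inside the loop, so the compiler lifts it out, as the second function shows.

```c
/* The invariant 'scale * scale' is re-evaluated on every pass: */
int sum_scaled(const int *a, int n, int scale)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * (scale * scale);
    return sum;
}

/* After loop invariant motion, as the compiler transforms it: */
int sum_scaled_hoisted(const int *a, int n, int scale)
{
    int s2 = scale * scale;   /* evaluated once, before the loop */
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * s2;
    return sum;
}
```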
3. Live range splitting (for dynamic register allocation)
This is the identification of the 'live' state of variables within a program section. For example, a variable may be used first as a counter for a loop, and later as a working variable within a calculation. If these two uses are completely unrelated, they can be allocated to different registers. Additionally, when a variable is dead (its value will not be used again), the register to which it was assigned can be reused.
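A small C example of the situation described (the function is hypothetical): one source variable has two live ranges that do not overlap, so each range can be placed in a different register.

```c
/* 'n' is one C variable with two independent live ranges. */
int demo(const int *a, int len)
{
    int n;
    int sum = 0;
    for (n = 0; n < len; n++)   /* live range 1: loop counter   */
        sum += a[n];
    n = sum * 2;                /* live range 2: working value  */
    return n + 1;               /* the two ranges never overlap */
}
```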
4. Constant Folding
Replacement of constant expressions with the value the compiler evaluates for them at compile time.
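For example (a hypothetical function), the constant expression below is folded at compile time, so no multiply instructions are generated for it:

```c
/* 24 * 60 * 60 is folded to 86400 by the compiler;
   the generated code simply computes days * 86400. */
int seconds(int days)
{
    return days * (24 * 60 * 60);
}
```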
5. Tail Call Optimization and Tail Recursion
A tail call is a call made immediately before a return. Normally the callee is called and, when it returns to the caller, the caller returns in turn. Tail call optimization avoids this by restoring the saved registers before branching to the tail call. The called function then returns directly to the caller's caller, saving a return sequence.
The compiler also supports tail recursion, which is possible when the tail call is made to the same function. In this case the entry and exit sequences can be skipped altogether, converting the call into a loop.
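A classic illustration of tail recursion (the functions are examples, not from the article): the recursive call is the last action, so the compiler can turn it into the loop shown second.

```c
/* Tail-recursive form: the recursive call is the final action. */
int gcd(int a, int b)
{
    if (b == 0)
        return a;
    return gcd(b, a % b);   /* tail call, convertible to a loop */
}

/* The loop the compiler effectively produces: */
int gcd_loop(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}
```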
6. Cross Jump Elimination
This is the combining of two or more instances of identical code. For example, multiple returns from a function often generate identical code, and will be optimized into a single return sequence. This optimization mainly saves space, and is disabled when optimizing for time.
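A hypothetical function with this shape: all three return statements end in an identical function-exit sequence, which the compiler can merge into one shared epilogue that the other paths branch to.

```c
/* Three returns, one shared return sequence after the
   cross jump elimination optimization. */
int clamp(int x, int lo, int hi)
{
    if (x < lo)
        return lo;
    if (x > hi)
        return hi;
    return x;
}
```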
7. Table Driven Peepholing
During the compiler's processing of source code into ARM or Thumb code, there is a point at which commonly found code sequences can be replaced with known optimal versions. This is achieved by viewing the code through a window (of some number of instructions) called a 'peephole', and then replacing identified instruction sequences with a 'hand crafted' version. The table of 'peepholes' is constantly growing as 'optimal' sequences are identified and added by ARM engineers.
8. Structure Splitting
Structure splitting is the action of dividing structures into their components. Once accomplished, the components may then be assigned to registers, for faster access. This is a particular advantage when returning a structure from a function, where the whole structure can be returned in registers rather than on the stack.
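A sketch of the kind of code that benefits (the struct and function are illustrative): a small structure whose fields can each be held in a register, and which can be returned in registers rather than via the stack.

```c
struct point { int x, y; };

struct point mid(struct point a, struct point b)
{
    struct point m;
    m.x = (a.x + b.x) / 2;   /* each field can live in a register */
    m.y = (a.y + b.y) / 2;
    return m;                /* small struct returned in registers */
}
```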
9. Conditional Execution (or Branch Elimination)
The compiler uses conditional execution to avoid branches. Conditional execution saves both space and execution time, as many conditional branches can be removed. The Thumb instruction set has limited support for conditional execution, so when building with --thumb the compiler is not always able to use this optimization. When compiling for a processor like the Cortex-M3, which supports the Thumb-2 If-Then (IT) instruction, an application can still benefit from this optimization. When compiling for a processor like the Cortex-M0, which supports only a limited number of 32-bit Thumb instructions, or a traditional ARM7TDMI processor, which supports none, this optimization is not available.
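For instance, a short selection like the following (a hypothetical function) can compile to conditionally executed move instructions in ARM state, or to an IT block in Thumb-2, rather than a branch:

```c
/* Compiles to e.g. CMP + conditional MOVs instead of a
   compare-and-branch sequence. */
int max2(int a, int b)
{
    return (a >= b) ? a : b;
}
```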
10. Inlining (including auto-inlining of functions not specifically marked)
Function inlining offers a trade-off between code size and performance. By default, the compiler decides for itself whether to inline code or not. As a general rule, the compiler makes sensible decisions about inlining with a view to producing code of minimal size. This is because code size for embedded systems is of fundamental importance.
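A typical inlining candidate, sketched in C (the functions are illustrative): the body of square() is smaller than the call overhead, so the compiler will normally substitute it at each call site.

```c
/* A small static helper is a natural inlining candidate. */
static int square(int x) { return x * x; }

int sum_of_squares(int a, int b)
{
    /* The compiler typically inlines square() here,
       eliminating two call/return sequences. */
    return square(a) + square(b);
}
```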
11. Auto-vectorization for NEON
Automatic vectorization involves the high-level analysis of loops in your code. This is the most efficient way to map the majority of typical code onto the functionality of the NEON unit. For most code, the gains that can be made with algorithm-dependent parallelism on a smaller scale are very small relative to the cost of automatic analysis of such opportunities. For this reason, the NEON unit is designed as a target for loop-based parallelism.
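A loop of the following shape (an illustrative function, built with the armcc --vectorize option at a suitable optimization level) is a good candidate: every iteration is independent, so several element additions can be mapped onto one NEON operation.

```c
/* Independent per-element work: vectorizable for NEON. */
void add_arrays(int *dst, const int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```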
12. Loop restructuring
Small loops can be unrolled for higher performance, at the cost of increased code size. When a loop is unrolled, the loop counter needs to be updated less often and fewer branches are executed. If the loop iterates only a few times, it can be fully unrolled, so that the loop overhead disappears completely. The ARM compiler unrolls loops automatically at -O3 -Otime. Otherwise, any unrolling must be done in source code.
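A small example of full unrolling (hypothetical functions): at -O3 -Otime the compiler can transform the first function into the second, removing the counter and all branches.

```c
/* A fixed four-iteration loop, a candidate for full unrolling: */
int sum4(const int *a)
{
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += a[i];
    return sum;
}

/* The fully unrolled equivalent: no counter, no branches. */
int sum4_unrolled(const int *a)
{
    return a[0] + a[1] + a[2] + a[3];
}
```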
13. Instruction Scheduling
Instruction scheduling is enabled at optimization level -O1 and higher. Instructions are re-ordered to suit the processor that the code is compiled for. This can improve throughput by minimizing interlocks, and can also make use of processors that have features such as dual issue.
It is possible to restrict some of the optimizations performed by the compiler by using the --retain compiler option, described in the ARM Compiler toolchain Compiler Reference document, if a particular optimization is undesirable.
Article last edited on: 2011-10-28 14:50:32