Non-Confidential | PDF version | ARM DUI0472J | ||

| ||||

Home > Compiler Coding Practices > Loop unrolling in C code |

Loops are a common construct in most programs. Because a significant amount of execution time is often spent in loops, it is worthwhile paying attention to time-critical loops.

Small loops can be unrolled for higher performance, with the disadvantage of
increased code size. When a loop is unrolled, a loop counter needs to be updated
less often and fewer branches are executed. If the loop iterates only a few times,
it can be fully unrolled so that the loop overhead completely disappears. The
compiler unrolls loops automatically at `-O3`

`-Otime`

. Otherwise, any unrolling must be done in source code.

Manual unrolling of loops might hinder the automatic re-rolling of loops and other loop optimizations by the compiler.

The advantages and disadvantages of loop unrolling can be illustrated using the two sample routines shown in the following table. Both routines efficiently test a single bit by extracting the lowest bit and counting it, after which the bit is shifted out.

The first implementation uses a loop to count bits. The second routine is the first
implementation unrolled four times, with an optimization applied by combining the
four shifts of `n`

into one shift.

Unrolling frequently provides new opportunities for optimization.

**Table 5-3 C code for rolled and unrolled bit-counting loops**

Bit-counting loop | Unrolled bit-counting loop |
---|---|

int countbit1(unsigned int n) { int bits = 0; while (n != 0) { if (n & 1) bits++; n >>= 1; } return bits; } |
int countbit2(unsigned int n) { int bits = 0; while (n != 0) { if (n & 1) bits++; if (n & 2) bits++; if (n & 4) bits++; if (n & 8) bits++; n >>= 4; } return bits; } |

The following table shows the corresponding disassembly of the machine code produced
by the compiler for each of the sample implementations above, where the C code for
each implementation has been compiled using the option `-O2`

.

**Table 5-4 Disassembly for rolled and unrolled bit-counting loops**

Bit-counting loop | Unrolled bit-counting loop |
---|---|

countbit1 PROC MOV r1, #0 B |L1.20| |L1.8| TST r0, #1 ADDNE r1, r1, #1 LSR r0, r0, #1 |L1.20| CMP r0, #0 BNE |L1.8| MOV r0, r1 BX lr ENDP |
countbit2 PROC MOV r1, r0 MOV r0, #0 B |L1.48| |L1.12| TST r1, #1 ADDNE r0, r0, #1 TST r1, #2 ADDNE r0, r0, #1 TST r1, #4 ADDNE r0, r0, #1 TST r1, #8 ADDNE r0, r0, #1 LSR r1, r1, #4 |L1.48| CMP r1, #0 BNE |L1.12| BX lr ENDP |

On the ARM9 processor, checking a single bit takes six cycles in the disassembly of the bit-counting loop shown in the leftmost column. The code size is only nine instructions. The unrolled version of the bit-counting loop checks four bits at a time per loop iteration, taking on average only three cycles per bit. However, the cost is the larger code size of fifteen instructions.