Non-Confidential | PDF version | ARM 100066_0608_00_en | ||

| ||||

Home > Coding Considerations > Loop unrolling in C code |

Loops are a common construct in most programs. Because a significant amount of execution time is often spent in loops, it is worthwhile paying attention to time-critical loops.

Small loops can be unrolled for higher performance, with the
disadvantage of increased code size. When a loop is unrolled, the loop counter
requires updating less often and fewer branches are executed. If the loop iterates
only a few times, it can be fully unrolled so that the loop overhead completely
disappears. The compiler unrolls loops automatically at `-O3`

. Otherwise, any unrolling must be done in source code.

Manual unrolling of loops might hinder the automatic re-rolling of loops and other loop optimizations by the compiler.

The advantages and disadvantages of loop unrolling can be illustrated using the two sample routines shown in the following table. Both routines efficiently test a single bit by extracting the lowest bit and counting it, after which the bit is shifted out.

The first implementation uses a loop to count bits. The second routine is the first
implementation unrolled four times, with an optimization applied by combining the
four shifts of `n`

into one shift.

Unrolling frequently provides new opportunities for optimization.

**Table 7-3 C code for rolled and unrolled bit-counting loops**

Bit-counting loop | Unrolled bit-counting loop |
---|---|

int countbit1(unsigned int n) { int bits = 0; while (n != 0) { if (n & 1) bits++; n >>= 1; } return bits; } |
int countbit2(unsigned int n) { int bits = 0; while (n != 0) { if (n & 1) bits++; if (n & 2) bits++; if (n & 4) bits++; if (n & 8) bits++; n >>= 4; } return bits; } |

The following table shows the corresponding disassembly of the
machine code produced by the compiler for each of the sample implementations above,
where the C code for each implementation has been compiled using ```
armclang -Os -S --target=arm-arm-none-eabi
-march=armv8-a
```

.

**Table 7-4 Disassembly for rolled and unrolled bit-counting loops**

Bit-counting loop | Unrolled bit-counting loop |
---|---|

countbit1: mov r1, r0 mov r0, #0 cmp r1, #0 bxeq lr mov r2, #0 .LBB0_1: and r3, r1, #1 cmp r2, r1, lsr #1 add r0, r0, r3 lsr r3, r1, #1 mov r1, r3 bne .LBB0_1 bx lr |
countbit2: mov r1, r0 mov r0, #0 cmp r1, #0 bxeq lr mov r2, #0 .LBB1_1: and r3, r1, #1 cmp r2, r1, lsr #4 add r0, r0, r3 ubfx r3, r1, #1, #1 add r0, r0, r3 ubfx r3, r1, #2, #1 add r0, r0, r3 ubfx r3, r1, #3, #1 add r0, r0, r3 lsr r3, r1, #4 mov r1, r3 bne .LBB1_1 bx lr |

The unrolled version of the bit-counting loop is faster than the original version, but has a larger code size.