| |||

Home > Coding Practices > Optimizing code > Optimizing loops |

Loops are a common construct in most programs. Because a significant amount of execution time is often spent in loops, it is worthwhile paying attention to time-critical loops.

The loop termination condition can cause significant overhead if written without caution. Where possible:

always write count-down-to-zero loops and use simple termination conditions

always use a counter of type unsigned int, and test for not equal to zero.

Table 4.1 shows
two sample implementations of a routine to calculate `n!`

that
together illustrate loop termination overhead. The first implementation
calculates `n!`

using an incrementing loop, while
the second routine calculates `n!`

using a decrementing
loop.

**Table 4.1. C code for incrementing and decrementing loops**

Incrementing loop | Decrementing loop |
---|---|

int fact1(int n) { int i, fact = 1; for (i = 1; i <= n; i++) fact *= i; return (fact); } | int fact2(int n) { unsigned int i, fact = 1; for (i = n; i != 0; i--) fact *= i; return (fact); } |

Table 4.2 shows
the corresponding disassembly of the machine code produced by the compiler
for each of the sample implementations of Table 4.1, where the C code for both implementations
has been compiled using the options `-O2`

`-Otime`

.

**Table 4.2. C Disassembly for incrementing and decrementing loops**

Incrementing loop | Decrementing loop |
---|---|

fact1 PROC MOV r2, r0 MOV r0, #1 CMP r2, #1 MOV r1, r0 BXLT lr |L1.20| MUL r0, r1, r0 ADD r1, r1, #1 CMP r1, r2 BLE |L1.20| BX lr ENDP | fact2 PROC MOVS r1, r0 MOV r0, #1 BXEQ lr |L1.12| MUL r0, r1, r0 SUBS r1, r1, #1 BNE |L1.12| BX lr ENDP |

Comparing the disassemblies of Table 4.2 shows that the `ADD`

/`CMP`

instruction
pair in the incrementing loop disassembly has been replaced with
a single `SUBS`

instruction in the decrementing
loop disassembly. This is because a compare with zero can be optimized away.

In addition to saving an instruction in the loop, the variable `n`

does
not need to be saved across the loop, so the use of a register is
also saved in the decrementing loop disassembly. This eases register
allocation.

The technique of initializing the loop counter to the number of iterations required, and then decrementing down to zero, also applies to while and do statements.

Small loops can be unrolled for higher performance, with the
disadvantage of increased code size. When a loop is unrolled, a
loop counter needs to be updated less often and fewer branches are
executed. If the loop iterates only a few times, it can be fully unrolled,
so that the loop overhead completely disappears. The ARM compiler
unrolls loops automatically at `-O3`

`-Otime`

.
Otherwise, any unrolling must be done in source code.

Manual unrolling of loops might hinder the automatic re-rolling of loops and other loop optimizations by the compiler.

The advantages and disadvantages of loop unrolling can be illustrated using the two sample routines shown in Table 4.3. Both routines efficiently test a single bit by extracting the lowest bit and counting it, after which the bit is shifted out.

The first implementation uses a loop to count bits. The second
routine is the first unrolled four times, with an optimization applied
by combining the four shifts of `n`

into one. Unrolling
frequently provides new opportunities for optimization.

**Table 4.3. C code for rolled and unrolled bit-counting loops**

Bit-counting loop | Unrolled bit-counting loop |
---|---|

int countbit1(unsigned int n) { int bits = 0; while (n != 0) { if (n & 1) bits++; n >>= 1; } return bits; } | int countbit2(unsigned int n) { int bits = 0; while (n != 0) { if (n & 1) bits++; if (n & 2) bits++; if (n & 4) bits++; if (n & 8) bits++; n >>= 4; } return bits; } |

Table 4.4 shows
the corresponding disassembly of the machine code produced by the compiler
for each of the sample implementations of Table 4.3, where the
C code for each implementation has been compiled using the option `-O2`

.

**Table 4.4. Disassembly for rolled and unrolled bit-counting loops**

Bit-counting loop | Unrolled bit-counting loop |
---|---|

countbit1 PROC MOV r1, #0 B |L1.20| |L1.8| TST r0, #1 ADDNE r1, r1, #1 LSR r0, r0, #1 |L1.20| CMP r0, #0 BNE |L1.8| MOV r0, r1 BX lr ENDP | countbit2 PROC MOV r1, r0 MOV r0, #0 B |L1.48| |L1.12| TST r1, #1 ADDNE r0, r0, #1 TST r1, #2 ADDNE r0, r0, #1 TST r1, #4 ADDNE r0, r0, #1 TST r1, #8 ADDNE r0, r0, #1 LSR r1, r1, #4 |L1.48| CMP r1, #0 BNE |L1.12| BX lr ENDP |

On the ARM7, checking a single bit takes six cycles in the disassembly of the bit-counting loop shown in the leftmost column. The code size is only nine instructions. The unrolled version of the bit-counting loop checks four bits at a time, taking on average only three cycles per bit. However, the cost is the larger code size of fifteen instructions.