7.2 Loop unrolling in C code

Loops are a common construct in most programs. Because a significant amount of execution time is often spent in loops, it is worthwhile paying attention to time-critical loops.

Small loops can be unrolled for higher performance, with the disadvantage of increased code size. When a loop is unrolled, the loop counter requires updating less often and fewer branches are executed. If the loop iterates only a few times, it can be fully unrolled so that the loop overhead completely disappears. The compiler unrolls loops automatically at -O3. Otherwise, any unrolling must be done in source code.

Note:

Manual unrolling of loops might hinder the automatic re-rolling of loops and other loop optimizations by the compiler.

The advantages and disadvantages of loop unrolling can be illustrated using the two sample routines shown in the following table. Both routines efficiently test a single bit by extracting the lowest bit and counting it, after which the bit is shifted out.

The first implementation uses a loop to count bits. The second routine is the first implementation unrolled four times, with an optimization applied by combining the four shifts of n into one shift.

Unrolling frequently provides new opportunities for optimization.

Table 7-3 C code for rolled and unrolled bit-counting loops

Bit-counting loop Unrolled bit-counting loop
int countbit1(unsigned int n)
{
    int bits = 0;
    while (n != 0)
    {
        if (n & 1) bits++;
        n >>= 1;
    }
    return bits;
}
int countbit2(unsigned int n)
{
    int bits = 0;
    while (n != 0)
    {
        if (n & 1) bits++;
        if (n & 2) bits++;
        if (n & 4) bits++;
        if (n & 8) bits++;
        n >>= 4;
    }
    return bits;
}

The following table shows the corresponding disassembly of the machine code produced by the compiler for each of the sample implementations above, where the C code for each implementation has been compiled using armclang -Os -S --target=arm-arm-none-eabi -march=armv8-a.

Table 7-4 Disassembly for rolled and unrolled bit-counting loops

Bit-counting loop Unrolled bit-counting loop
countbit1:                              
        mov     r1, r0
        mov     r0, #0
        cmp     r1, #0
        bxeq    lr
        mov     r2, #0
.LBB0_1:                                
        and     r3, r1, #1
        cmp     r2, r1, lsr #1
        add     r0, r0, r3
        lsr     r3, r1, #1
        mov     r1, r3
        bne     .LBB0_1
        bx      lr
countbit2:                              
        mov     r1, r0
        mov     r0, #0
        cmp     r1, #0
        bxeq    lr
        mov     r2, #0
.LBB1_1:                                
        and     r3, r1, #1
        cmp     r2, r1, lsr #4
        add     r0, r0, r3
        ubfx    r3, r1, #1, #1
        add     r0, r0, r3
        ubfx    r3, r1, #2, #1
        add     r0, r0, r3
        ubfx    r3, r1, #3, #1
        add     r0, r0, r3
        lsr     r3, r1, #4
        mov     r1, r3
        bne     .LBB1_1
        bx      lr

The unrolled version of the bit-counting loop is faster than the original version, but has a larger code size.

Non-ConfidentialPDF file icon PDF versionARM 100066_0608_00_en
Copyright © 2014–2017 ARM Limited or its affiliates. All rights reserved.