2.4.4. Example of the use of conditional execution

This example uses two implementations of the Greatest Common Divisor (gcd) algorithm (Euclid). It demonstrates how you can use conditional execution to improve code density and execution speed. The detailed analysis of execution speed only applies to an ARM7™ processor. The code density calculations apply to all ARM processors.

In C the algorithm can be expressed as:

int gcd(int a, int b)
{
    while (a != b)
      {
        if (a > b)
            a = a - b;
        else
            b = b - a;
      }
    return a;
}

You can implement the gcd function with conditional execution of branches only, in the following way:

gcd     CMP      r0, r1
        BEQ      end
        BLT      less
        SUBS     r0, r0, r1  ; could be SUB r0, r0, r1 for ARM
        B        gcd
less
        SUBS     r1, r1, r0  ; could be SUB r1, r1, r0 for ARM
        B        gcd
end

The code is seven instructions long because of the number of branches. Every time a branch is taken, the processor must refill the pipeline and continue from the new location. The other instructions and non-executed branches use a single cycle each.

By using the conditional execution feature of the ARM instruction set, you can implement the gcd function in only four instructions:

gcd
        CMP      r0, r1
        SUBGT    r0, r0, r1
        SUBLE    r1, r1, r0
        BNE      gcd

In addition to improving code size, this code executes faster in most cases. Table 2.3 and Table 2.4 show the number of cycles used by each implementation for the case where r0 equals 1 and r1 equals 2. In this case, replacing branches with conditional execution of all instructions saves three cycles.

The conditional version of the code executes in the same number of cycles for any case where r0 equals r1. In all other cases, the conditional version of the code executes in fewer cycles.

Table 2.3. Conditional branches only

r0: ar1: bInstructionCycles (ARM7)
12CMP r0, r11
12BEQ end1 (not executed)
12BLT less3
12SUB r1, r1, r01
12B gcd3
11CMP r0, r11
11BEQ end3
   Total = 13

Table 2.4. All instructions conditional

r0: ar1: bInstructionCycles (ARM7)
12CMP r0, r11
12SUBGT r0,r0,r11 (not executed)
11SUBLT r1,r1,r01
11BNE gcd3
11CMP r0,r11
11SUBGT r0,r0,r11 (not executed)
11SUBLT r1,r1,r01 (not executed)
11BNE gcd1 (not executed)
   Total = 10

Pre-Thumb-2 Thumb version of gcd

Because B is the only pre-Thumb-2 Thumb instruction that can be executed conditionally, the gcd algorithm must be written with conditional branches.

Like the ARM conditional branch implementation, the pre-Thumb-2 Thumb code requires seven instructions. The overall code size is 14 bytes, compared to 16 bytes for the smaller ARM implementation.

In addition, on a system using 16-bit memory this version runs faster than the second ARM implementation because only one memory access is required for each 16-bit Thumb instruction, whereas each ARM 32-bit instruction requires two fetches.

Thumb-2 version of gcd

You can convert the ARM version of this code into Thumb-2 code, by making the SUB instructions conditional using an IT instruction:

gcd
        CMP     r0, r1
        ITE     GT 
        SUBGT   r0, r0, r1
        SUBLE   r1, r1, r0
        BNE     gcd

This assembles equally well to ARM or Thumb-2 code. The assembler checks the IT instructions, but omits them on assembly to ARM code. (You can omit the IT instructions. The assembler inserts them for you when assembling to Thumb-2 code.)

It requires one more instruction in Thumb-2 code than in ARM code, but the overall code size is 10 bytes in Thumb-2 code compared with 16 bytes in ARM code.

Execution speed

To optimize code for execution speed you must have detailed knowledge of the instruction timings, branch prediction logic, and cache behavior of your target system. See ARM Architecture Reference Manual and the technical reference manuals for individual processors for full information.

Copyright © 2002-2010 ARM. All rights reserved.ARM DUI 0204J
Non-ConfidentialID101213