### 5.4.3. Using conditional execution in ARM state

You can use conditional execution of ARM instructions to reduce the number of branch instructions in your code.

Branch instructions are expensive in both code density and processor cycles. Typically it takes three processor cycles to refill the processor pipeline each time a branch is taken. (The cost is less on ARM processors that have branch prediction hardware.)

Example 5.4. : Euclid's Greatest Common Divisor

The following example uses two implementations of Euclid’s Greatest Common Divisor algorithm to demonstrate how you can use conditional execution to improve code density and execution speed. In pseudo-code the algorithm can be expressed as:

function gcd (integer a, integer b) : result is integer
while (a <> b) do
if (a > b) then
a = a - b
else
b = b - a
endif
endwhile
result = a

You can implement the gcd function with conditional execution of branches only, in the following way:

```gcd
CMP 		r0, r1
BEQ 		end
BLT 		less
SUB 		r0, r0, r1
B 		gcd
less
SUB 		r1, r1, r0
B 		gcd
end
```

Because of the number of branches, the code is seven instructions long. Every time a branch is taken, the processor must refill the pipeline and continue from the new location. The other instructions and non-executed branches use a single cycle each.

By using the conditional execution feature of the ARM instruction set, you can implement the gcd function in only four instructions:

```gcd
CMP	 	r0, r1
SUBGT		r0, r0, r1
SUBLT		r1, r1, r0
BNE 		gcd
```

In addition to improving code size, this code executes faster in most cases. Table 5.2 and Table 5.3 show the number of cycles used by each implementation for the case where r0 equals 1 and r1 equals 2. In this case, replacing branches with conditional execution of all instructions saves three cycles.

The conditional version of the code executes in the same number of cycles for any case where r0 equals r1. In all other cases the conditional version of the code executes in fewer cycles.

Table 5.2. Conditional branches only

r0: ar1: bInstructionCycles
12`CMP r0, r1`1
12`BEQ end`1 (Not executed)
12`BLT less`3
12`SUB r1, r1, r0`1
12`B gcd`3
11`CMP r0, r1`1
11`BEQ end`3
Total = 13

Table 5.3. All instructions conditional

r0: ar1: bInstructionCycles
12`CMP r0, r1`1
12`SUBGT r0,r0,r1`1 (Not executed)
11`SUBLT r1,r1,r0`1
11`BNE gcd`3
11`CMP r0,r1`1
11`SUBGT r0,r0,r1`1 (Not executed)
11`SUBLT r1,r1,r0`1 (Not executed)
11`BNE gcd`1 (Not executed)
Total = 10

#### Converting to Thumb

Because `B` is the only Thumb instruction that can be executed conditionally, the Greatest Common Divisor algorithm in Example 5.4 must be written with conditional branches in Thumb code.

Like the ARM conditional branch implementation, the Thumb code requires seven instructions. However, because Thumb instructions are only 16-bits long, the overall code size is 14 bytes, compared to 16 bytes for the smaller ARM implementation.

In addition, on a system using 16-bit memory the Thumb version runs faster than the second ARM implementation because only one memory access is required for each Thumb instruction, whereas each ARM instruction requires two fetches.