Non-Confidential | DUI0801J


Example showing the benefits of conditional instructions in A32 and T32 code

Using conditional instructions rather than conditional branches can save both code size and cycles.

This example shows the difference between using branches and using conditional instructions. It uses the Euclid algorithm for the Greatest Common Divisor (gcd) to show how conditional instructions improve code size and speed.

In C the gcd algorithm can be expressed as:

    int gcd(int a, int b)
    {
        while (a != b)
        {
            if (a > b)
                a = a - b;
            else
                b = b - a;
        }
        return a;
    }

The following examples show implementations of the gcd algorithm with and without conditional instructions.

This example is an A32 code implementation of the gcd algorithm. It achieves conditional execution by using conditional branches, rather than individual conditional instructions:

    gcd
            CMP      r0, r1
            BEQ      end
            BLT      less
            SUBS     r0, r0, r1  ; could be SUB r0, r0, r1 for A32
            B        gcd
    less
            SUBS     r1, r1, r0  ; could be SUB r1, r1, r0 for A32
            B        gcd
    end

The code is seven instructions long because of the number of branches. Every time a branch is taken, the processor must refill the pipeline and continue from the new location. The other instructions and non-executed branches use a single cycle each.

The following table shows the number of cycles this implementation uses on an Arm7™ processor when R0 equals 1 and R1 equals 2.

**Table 7-4 Conditional branches only**

| R0: a | R1: b | Instruction | Cycles (Arm7) |
|---|---|---|---|
| 1 | 2 | `CMP r0, r1` | 1 |
| 1 | 2 | `BEQ end` | 1 (not executed) |
| 1 | 2 | `BLT less` | 3 |
| 1 | 2 | `SUB r1, r1, r0` | 1 |
| 1 | 1 | `B gcd` | 3 |
| 1 | 1 | `CMP r0, r1` | 1 |
| 1 | 1 | `BEQ end` | 3 |
| | | Total | 13 |
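
The total in the table can be reproduced with a small cycle model. The sketch below is illustrative only: `gcd_branch_cycles` is a hypothetical helper, and it assumes the same simplified Arm7 costs that the table uses (3 cycles for a taken branch, 1 cycle for any other instruction). The `assert` on positive inputs is an added assumption, because the subtraction loop does not terminate otherwise.

```c
#include <assert.h>

/* Cycle costs as used in the table: a taken branch refills the
 * pipeline (3 cycles); every other instruction takes 1 cycle. */
enum { CYCLES_TAKEN_BRANCH = 3, CYCLES_OTHER = 1 };

/* Hypothetical helper: cycles used by the branch-only gcd loop. */
int gcd_branch_cycles(int a, int b)
{
    int cycles = 0;

    assert(a > 0 && b > 0);                  /* assumption: positive inputs */
    for (;;) {
        cycles += CYCLES_OTHER;              /* CMP r0, r1 */
        if (a == b) {
            cycles += CYCLES_TAKEN_BRANCH;   /* BEQ end, taken */
            return cycles;
        }
        cycles += CYCLES_OTHER;              /* BEQ end, not taken */
        if (a < b) {
            cycles += CYCLES_TAKEN_BRANCH;   /* BLT less, taken */
            b -= a;                          /* SUBS r1, r1, r0 */
        } else {
            cycles += CYCLES_OTHER;          /* BLT less, not taken */
            a -= b;                          /* SUBS r0, r0, r1 */
        }
        cycles += CYCLES_OTHER;              /* cost of the SUBS itself */
        cycles += CYCLES_TAKEN_BRANCH;       /* B gcd, always taken */
    }
}
```

Under these assumptions, `gcd_branch_cycles(1, 2)` gives the 13 cycles shown in the table.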

This example is an A32 code implementation of the gcd algorithm that uses individual conditional instructions. The gcd algorithm takes only four instructions:

    gcd
            CMP      r0, r1
            SUBGT    r0, r0, r1
            SUBLE    r1, r1, r0
            BNE      gcd

In addition to improving code size, in most cases this code executes faster than the version that uses only branches.
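
To make the control flow of the four-instruction loop easier to follow, it can be modelled in C. This is only a sketch: `gcd_predicated` is a hypothetical name, the flags are modelled as two local variables, and positive inputs are assumed. The point it illustrates is that only `CMP` sets the flags, so both `SUB` instructions and the `BNE` all test the result of the same comparison.

```c
#include <assert.h>

/* Hypothetical C model of the conditional-instruction loop. */
int gcd_predicated(int a, int b)
{
    int gt, ne;

    assert(a > 0 && b > 0);   /* assumption: positive inputs */
    do {
        /* CMP r0, r1: the only instruction in the loop that sets flags */
        gt = (a > b);
        ne = (a != b);

        if (gt)
            a -= b;           /* SUBGT r0, r0, r1 */
        if (!gt)
            b -= a;           /* SUBLE r1, r1, r0 (LE also holds when a == b) */
    } while (ne);             /* BNE gcd, tests the flags from CMP */

    return a;                 /* the result is left in r0 */
}
```

For example, `gcd_predicated(12, 18)` returns 6, the same result as the C version of the algorithm.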

The following table shows the number of cycles this implementation uses on an Arm7 processor when R0 equals 1 and R1 equals 2.

**Table 7-5 All instructions conditional**

| R0: a | R1: b | Instruction | Cycles (Arm7) |
|---|---|---|---|
| 1 | 2 | `CMP r0, r1` | 1 |
| 1 | 2 | `SUBGT r0, r0, r1` | 1 (not executed) |
| 1 | 2 | `SUBLE r1, r1, r0` | 1 |
| 1 | 1 | `BNE gcd` | 3 |
| 1 | 1 | `CMP r0, r1` | 1 |
| 1 | 1 | `SUBGT r0, r0, r1` | 1 (not executed) |
| 1 | 1 | `SUBLE r1, r1, r0` | 1 |
| 1 | 0 | `BNE gcd` | 1 (not executed) |
| | | Total | 10 |

Comparing this with the example that uses only branches:

- For R0 equal to 1 and R1 equal to 2, replacing branches with conditional execution of all instructions saves three cycles (10 instead of 13).
- Where R0 equals R1, both implementations execute in the same number of cycles. For all other cases, the implementation that uses conditional instructions executes in fewer cycles than the implementation that uses branches only.
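
The comparison can be checked for other inputs with a short calculation. The sketch below is an illustration, not a timing tool: `count_gcd_cycles` and `struct gcd_cycles` are hypothetical names, and the per-pass costs assume the simplified Arm7 figures from the two tables (3 cycles for a taken branch, 1 cycle otherwise) and positive inputs.

```c
#include <assert.h>

/* Cycle totals for the two A32 implementations. */
struct gcd_cycles {
    int branches;      /* implementation using conditional branches only */
    int conditional;   /* implementation using conditional instructions */
};

/* Hypothetical helper: totals per-pass costs for both implementations. */
struct gcd_cycles count_gcd_cycles(int a, int b)
{
    struct gcd_cycles c = { 0, 0 };

    assert(a > 0 && b > 0);                   /* assumption: positive inputs */
    while (a != b) {
        if (a > b) {
            c.branches += 1 + 1 + 1 + 1 + 3;  /* CMP, BEQ, BLT (not taken), SUBS, B */
            a -= b;
        } else {
            c.branches += 1 + 1 + 3 + 1 + 3;  /* CMP, BEQ, BLT (taken), SUBS, B */
            b -= a;
        }
        c.conditional += 1 + 1 + 1 + 3;       /* CMP, SUBGT, SUBLE, BNE (taken) */
    }
    c.branches += 1 + 3;                      /* final CMP, BEQ (taken) */
    c.conditional += 1 + 1 + 1 + 1;           /* final CMP, SUBGT, SUBLE, BNE (not taken) */
    return c;
}
```

In this model each pass through the loop costs 6 cycles with conditional instructions and 7 or 9 cycles with branches, while the final pass costs 4 cycles in both, which matches the two observations in the list above.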

You can use the `IT` instruction to write conditional instructions in T32 code. The T32 code implementation of the gcd algorithm using conditional instructions is similar to the implementation in A32 code. The implementation in T32 code is:

    gcd
            CMP      r0, r1
            ITE      GT
            SUBGT    r0, r0, r1
            SUBLE    r1, r1, r0
            BNE      gcd

These instructions assemble equally well to A32 or T32 code. The assembler checks the `IT` instructions, but omits them on assembly to A32 code.

The T32 implementation requires one more instruction than the A32 implementation (the `IT` instruction), but the overall code size is 10 bytes in T32 code, compared with 16 bytes in A32 code.

In architectures before Arm®v6T2, there is no `IT` instruction and therefore T32 instructions cannot be executed conditionally except for the `B` branch instruction. The gcd algorithm must be written with conditional branches and is similar to the A32 code implementation using branches, without conditional instructions.

The T32 code implementation of the gcd algorithm without conditional instructions requires seven instructions, for an overall code size of 14 bytes. This is even smaller than the A32 implementation that uses conditional instructions, which occupies 16 bytes.

In addition, on a system using 16-bit memory this T32 implementation runs faster than both A32 implementations because only one memory access is required for each 16-bit T32 instruction, whereas each 32-bit A32 instruction requires two fetches.
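
The size and fetch figures quoted above follow from simple arithmetic, sketched below. The helper names are hypothetical, and the sketch assumes that every T32 instruction in these listings uses a 16-bit encoding, which is consistent with the 10-byte and 14-byte totals given in the text.

```c
#include <assert.h>

/* Instruction widths: A32 is always 32 bits; the T32 instructions in
 * these listings are assumed to use 16-bit encodings. */
enum { A32_INSTR_BYTES = 4, T32_INSTR_BYTES = 2 };

/* Code size in bytes for a sequence of fixed-width instructions. */
int code_size_bytes(int instructions, int bytes_per_instruction)
{
    assert(instructions >= 0 && bytes_per_instruction > 0);
    return instructions * bytes_per_instruction;
}

/* Memory accesses needed to fetch one instruction over a bus of the
 * given width, rounded up to a whole number of accesses. */
int fetches_per_instruction(int bytes_per_instruction, int bus_bytes)
{
    assert(bytes_per_instruction > 0 && bus_bytes > 0);
    return (bytes_per_instruction + bus_bytes - 1) / bus_bytes;
}
```

For example, `code_size_bytes(5, T32_INSTR_BYTES)` gives the 10 bytes of the `IT` version, `code_size_bytes(4, A32_INSTR_BYTES)` gives 16 bytes, and on 16-bit (2-byte) memory `fetches_per_instruction` gives two accesses for an A32 instruction and one for a T32 instruction.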