ARM Technical Support Knowledge Articles

How do I get the best performance when compiling floating point code for Cortex-M4F?

Applies to: Cortex-M4, DS-5, RealView Development Suite (RVDS)

Answer

The Cortex-M4F has much better single precision floating point performance than other ARMv6-M and ARMv7-M based processors; however, its double precision performance might be slightly worse.

Cortex-M4F has single precision floating point hardware. The types of floating point linkage are:

 - software floating point linkage, in which floating point arguments and results are passed in integer registers
 - hardware floating point linkage, in which floating point arguments and results are passed in floating point registers

Further information about the types of floating point linkage is available in the ARM Compiler toolchain document Using the ARM Compiler.
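
As an illustration (not part of the original example), the sketch below shows how the choice of linkage affects the call interface of a simple single precision function; the register assignments in the comments match the behaviour shown in the disassemblies later in this article.

/* Illustrative sketch: a simple single precision function. */
float scale(float x, float y)
{
    return x * y;
}

/*
 * With software floating point linkage (for example --fpu=softvfp):
 *   x is passed in r0, y in r1, and the result is returned in r0;
 *   the multiply itself is performed by a library routine (__aeabi_fmul).
 *
 * With hardware floating point linkage (for example --cpu=Cortex-M4.fp):
 *   x is passed in s0, y in s1, and the result is returned in s0;
 *   the multiply is a single VMUL.F32 instruction.
 */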

Example: Cortex-M3 vs Cortex-M4F

If the example below is compiled with --cpu=Cortex-M3, the default option --fpu=softvfp is implied, and software floating point library code is pulled in by the linker to handle the floating point operations. If the example is instead compiled for Cortex-M4F using the option --cpu=Cortex-M4.fp, the compiler is permitted to exploit the single precision hardware.
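
For reference, the two compiler invocations compared in this example are (assuming the source file is named double.c, as in the comment at the top of the example):

    armcc -c --cpu=Cortex-M3 double.c
    armcc -c --cpu=Cortex-M4.fp double.c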

The example below uses the type double. If the example is targeted at Cortex-M4F, double precision values are still passed between functions in floating point registers (as required by the hardware floating point procedure call standard). Because there is no double precision floating point hardware, when actual computation is required the values must be moved to integer registers so that the operation can be performed in software, in the same manner as when compiling for software floating point. This is an overhead compared with compiling for software floating point, where no such shuffling is required.

/* armcc -c --cpu=cortex-m4.fp double.c */

#include <math.h>

/* Double precision add: Cortex-M4F has no double precision hardware,
   so this is performed in software by calling __aeabi_dadd. */
double f(double a, double b) { return a + b; }

int main(int argc, char *argv[])
{
  return f(4.0, 5.0);
}

This generates the following code (disassembled with fromelf -c double.o):

    f
        0x00000000:    b510        ..      PUSH     {r4,lr}
        0x00000002:    ec532b11    S..+    VMOV     r2,r3,d1
        0x00000006:    ec510b10    Q...    VMOV     r0,r1,d0
        0x0000000a:    f7fffffe    ....    BL       __aeabi_dadd
        0x0000000e:    ec410b10    A...    VMOV     d0,r0,r1
        0x00000012:    bd10        ..      POP      {r4,pc}
    main
        0x00000014:    b510        ..      PUSH     {r4,lr}
        0x00000016:    ed9f1b06    ....    VLDR     d1,[pc,#24] ; [0x30] = 0
        0x0000001a:    ed9f0b07    ....    VLDR     d0,[pc,#28] ; [0x38] = 0
        0x0000001e:    f7fffffe    ....    BL       f ; 0x0 Section #1
        0x00000022:    e8bd4010    ...@    POP      {r4,lr}
        0x00000026:    ec510b10    Q...    VMOV     r0,r1,d0
        0x0000002a:    f7ffbffe    ....    B.W      __aeabi_d2iz
    $d
        0x0000002e:    0000        ..      DCW    0
        0x00000030:    00000000    ....    DCD    0
        0x00000034:    40140000    ...@    DCD    1075052544
        0x00000038:    00000000    ....    DCD    0
        0x0000003c:    40100000    ...@    DCD    1074790400

This code uses the hardware floating point call convention, so double precision representations of 4.0 and 5.0 are loaded into registers d0 and d1, then passed to f. There is no hardware double precision add instruction, so f has to perform the computation in software. It does this by unpacking d0 and d1 into r0-r3, calling __aeabi_dadd (which has the software floating point calling convention), then packing the result back into d0.

In contrast, the code for Cortex-M3, or for Cortex-M4 without the FPU, is:

    f
        0x00000000:    f7ffbffe    ....    B.W      __aeabi_dadd
    main
        0x00000004:    b510        ..      PUSH     {r4,lr}
        0x00000006:    2200        ."      MOVS     r2,#0
        0x00000008:    4b04        .K      LDR      r3,[pc,#16] ; [0x1c] = 0x40140000
        0x0000000a:    4610        .F      MOV      r0,r2
        0x0000000c:    4904        .I      LDR      r1,[pc,#16] ; [0x20] = 0x40100000
        0x0000000e:    f7fffffe    ....    BL       f ; 0x0 Section #1
        0x00000012:    e8bd4010    ...@    POP      {r4,lr}
        0x00000016:    f7ffbffe    ....    B.W      __aeabi_d2iz
    $d
        0x0000001a:    0000        ..      DCW    0
        0x0000001c:    40140000    ...@    DCD    1075052544
        0x00000020:    40100000    ...@    DCD    1074790400

On the other hand, if the single precision type float is used instead, the computation is performed in hardware when compiling for Cortex-M4F:

    f
        0x00000000:    ee300a20    0. .    VADD.F32 s0,s0,s1
        0x00000004:    4770        pG      BX       lr
    main
        0x00000006:    b500        ..      PUSH     {lr}
        0x00000008:    eef10a04    ....    VMOV.F32 s1,#5.00000000
        0x0000000c:    eeb10a00    ....    VMOV.F32 s0,#4.00000000
        0x00000010:    f7fffffe    ....    BL       f ; 0x0 Section #1
        0x00000014:    eebd0ac0    ....    VCVT.S32.F32 s0,s0
        0x00000018:    ee100a10    ....    VMOV     r0,s0
        0x0000001c:    bd00        ..      POP      {pc}

This is more efficient than the same code compiled for Cortex-M3, which generates a call to a software floating point library routine to handle the add operation:

    f
        0x00000000:    f7ffbffe    ....    B.W      __aeabi_fadd
    main
        0x00000004:    2009        .       MOVS     r0,#9
        0x00000006:    4770        pG      BX       lr
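
For reference, the single precision version of the source is the same as the double precision example with double replaced by float (a sketch; the exact source is in the attached example.zip, and the f suffixes on the literals are an assumption):

/* Compiled as before: armcc -c --cpu=cortex-m4.fp for the Cortex-M4F listing,
   or armcc -c --cpu=Cortex-M3 for the software floating point listing. */

float f(float a, float b) { return a + b; }

int main(int argc, char *argv[])
{
  return f(4.0f, 5.0f);
}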

Summary

When targeting a processor with single precision hardware:

 - use the float type rather than double wherever single precision is sufficient, so that the computation can be performed by the floating point hardware
 - remember that double precision operations are still performed in software, and with hardware floating point linkage they carry the additional overhead of moving values between floating point and integer registers
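
As a practical illustration (not taken from the original article), the sketch below shows a common way double precision can creep into code intended for single precision hardware: unsuffixed floating point literals and double precision math library functions. The function names are standard C; the second version keeps the whole computation in single precision.

#include <math.h>

/* Accidentally double precision: 0.5 is a double literal and sqrt() operates
   on double, so the expression is evaluated with software double precision
   routines on Cortex-M4F, with the register shuffling overhead shown above. */
float slow_scale(float x)
{
    return 0.5 * sqrt(x);
}

/* Single precision throughout: the f suffix keeps the literal as a float and
   sqrtf() is the single precision math function, so the computation can stay
   in the floating point hardware. */
float fast_scale(float x)
{
    return 0.5f * sqrtf(x);
}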

Attachments: example.zip

Article last edited on: 2012-01-24 14:54:35

Copyright © 2011 ARM Limited. All rights reserved. External (Open), Non-Confidential