B.21. Floating-point single-precision data processing instructions

This section describes the cycle timing behavior of all single-precision VFP CDP instructions. These include arithmetic instructions such as VMUL.F32, data and immediate move instructions such as "VMOV.F32 <Sd>, #<imm>", VABS.F32, VNEG.F32, and "VMOV <Sd>, <Sm>", comparison instructions, and conversion instructions.

Table B.26 shows the cycle timing behavior of the floating-point single-precision data processing instructions.

Table B.26. Floating-point single-precision data processing instructions cycle timing behavior

| Example instruction | Cycles | Early Reg | Result latency |
|---|---|---|---|
| VMLA.F32 <Sd>, <Sn>, <Sm> [a] | 1 [b] | <Sn>, <Sm> | 5 [c] |
| VADD.F32 <Sd>, <Sn>, <Sm> [d] | 1 | <Sn>, <Sm> | 2 |
| VDIV.F32 <Sd>, <Sn>, <Sm> | 2 | <Sn>, <Sm> | 16 |
| VSQRT.F32 <Sd>, <Sm> | 2 | <Sm> | 16 |
| VMOV.F32 <Sd>, #<imm> | 1 | - | 1 |
| VMOV.F32 <Sd>, <Sm> [e] | 1 | - | 1 |
| VCMP.F32 <Sd>, <Sm> [f] | 1 | <Sd>, <Sm> | - |
| VCMP.F32 <Sd>, #0.0 [f] | 1 | <Sd> | - |
| VCVT.F32.U32 <Sd>, <Sm> [g] | 1 | <Sm> | 2 |
| VCVT.F32.U32 <Sd>, <Sd>, #<fbits> [h] | 1 | <Sd> | 2 |
| VCVTR.U32.F32 <Sd>, <Sm> [i] | 1 | <Sm> | 2 |
| VCVT.U32.F32 <Sd>, <Sd>, #<fbits> [j] | 1 | <Sd> | 2 |
| VCVT.F64.F32 <Dd>, <Sn> | 3 | <Sm> | 5 |

[a] Also VMLS.F32, VNMLS.F32, and VNMLA.F32.

[b] VMLA.F32 completes out-of-order, and can take an extra cycle (two in total) if an add instruction (VADD) or certain dual-issued instruction pairs are in the iss-stage when the instruction completes.

[c] Except when the instruction dependent on the result <Sd> is another VMLA.F32 instruction, and the dependent operand is the accumulate operand, <Sd>. In this case, the result latency is reduced to 3 cycles.

[d] Also VSUB.F32, VMUL.F32, and VNMUL.F32.

[e] Also VABS.F32 and VNEG.F32.

[f] Also VCMPE.F32.

[g] Also VCVT.F32.S32.

[h] Also VCVT.F32.U16, VCVT.F32.S32, and VCVT.F32.S16.

[i] Also VCVT.U32.F32, VCVTR.S32.F32, and VCVT.S32.F32.

[j] Also VCVT.U16.F32, VCVT.S32.F32, and VCVT.S16.F32.
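
As an illustration of how the figures in Table B.26 might be applied, the following Python sketch estimates the latency of a chain of dependent instructions, including the reduced accumulate latency described in footnote [c]. It is a simplified model, not a cycle-accurate simulator. The names TIMING, result_latency, and chain_latency are hypothetical and belong to no ARM tool. The sketch assumes that result latency is the number of cycles from the issue of an instruction until a dependent instruction can use its result, adds those latencies serially, and ignores Early Reg requirements, dual issue, and the possible extra cycle in footnote [b]. It applies the footnote [c] exception only to VMLA.F32, because the table does not say whether the variants in footnote [a] behave the same way.

```python
from typing import NamedTuple, Optional, Sequence


class Timing(NamedTuple):
    cycles: int                    # "Cycles" column of Table B.26
    result_latency: Optional[int]  # None where the table shows "-"


# Values transcribed from Table B.26, keyed by the example mnemonic only.
TIMING = {
    "VMLA.F32": Timing(1, 5),
    "VADD.F32": Timing(1, 2),
    "VDIV.F32": Timing(2, 16),
    "VSQRT.F32": Timing(2, 16),
    "VMOV.F32": Timing(1, 1),
    "VCMP.F32": Timing(1, None),
    "VCVT.F32.U32": Timing(1, 2),
    "VCVTR.U32.F32": Timing(1, 2),
    "VCVT.U32.F32": Timing(1, 2),
    "VCVT.F64.F32": Timing(3, 5),
}


def result_latency(producer: str,
                   consumer: Optional[str] = None,
                   via_accumulate: bool = False) -> int:
    """Result latency of `producer` as seen by `consumer`.

    Footnote [c]: when a VMLA.F32 result feeds the accumulate operand
    of another VMLA.F32, the latency drops from 5 cycles to 3.
    """
    timing = TIMING[producer]
    if timing.result_latency is None:
        raise ValueError(f"{producer} has no result latency in Table B.26")
    if producer == "VMLA.F32" and consumer == "VMLA.F32" and via_accumulate:
        return 3
    return timing.result_latency


def chain_latency(mnemonics: Sequence[str],
                  via_accumulate: bool = False) -> int:
    """Sum of result latencies along a chain in which every instruction
    consumes the result of the one before it (assumed serial model)."""
    total = 0
    for i, producer in enumerate(mnemonics):
        # The last instruction has no consumer, so it uses the default latency.
        consumer = mnemonics[i + 1] if i + 1 < len(mnemonics) else None
        total += result_latency(producer, consumer, via_accumulate)
    return total


if __name__ == "__main__":
    # Four VMLA.F32 instructions accumulating into the same <Sd>.
    print(chain_latency(["VMLA.F32"] * 4, via_accumulate=True))
    # The same four instructions chained through a multiply operand instead.
    print(chain_latency(["VMLA.F32"] * 4))
```

Under these assumptions, a chain of four VMLA.F32 instructions that accumulate into the same register costs 3 + 3 + 3 + 5 = 14 cycles of result latency, compared with 20 cycles when each result feeds a multiply operand of the next instruction. This is why long dot products are usually arranged to chain through the accumulate operand.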

Copyright © 2010-2011 ARM. All rights reserved. ARM DDI 0460C