16.7.1. VFP instruction execution in the VFP coprocessor

The VFP coprocessor is a nonpipelined floating-point execution engine that can execute any VFPv3 data-processing instruction. Each instruction runs to completion before the next instruction can issue, and there is no forwarding of VFP results to other instructions. Two cycles of decode, stages M2 and M3, are required between consecutive VFP instructions. These decode cycles are included in the cycle timing of this section.

The number of cycles required to complete an instruction depends on both the instruction and the input data operands. Floating-point operands can be divided into three broad categories:

Normal: Most numbers are normal and have an internal format that consists of a sign, a fractional number between one and two, and an exponent.

Subnormal: Subnormal numbers are too small to represent in the normal space. A subnormal number consists of a sign, a fractional number between zero and one, and a zero in the exponent field.

Special: Special numbers are zeros, NaNs, and infinities.
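As an illustration of these three categories, the following minimal sketch classifies a value by inspecting its IEEE 754 single-precision bit pattern. The function name `classify_single` is hypothetical and not part of any ARM tool. The sketch assumes the standard single-precision layout of 1 sign bit, 8 exponent bits, and 23 fraction bits.

```python
import struct


def classify_single(x: float) -> str:
    """Classify x, rounded to IEEE 754 single precision, into one of the
    three categories used in this section (hypothetical helper)."""
    # Reinterpret the single-precision encoding as a 32-bit unsigned integer.
    # Note: struct.pack raises OverflowError for finite values that are too
    # large to represent in single precision.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF  # 8-bit exponent field
    fraction = bits & 0x7FFFFF      # 23-bit fraction field

    if exponent == 0xFF:
        return "special"            # infinity or NaN
    if exponent == 0:
        # A zero exponent field means zero (special) or subnormal.
        return "special" if fraction == 0 else "subnormal"
    return "normal"
```

For example, `classify_single(1.5)` falls in the normal category, while `classify_single(1e-40)` is below the smallest normal single-precision value and is classified as subnormal.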

Table 16.25 shows the range of cycle times for VFPv3 data-processing instructions with normal numbers. Subnormal numbers usually take more time, as the Subnormal penalty column in Table 16.25 shows. Special numbers are handled by separate logic, and usually take less time than the table indicates.

Table 16.25. VFP Instruction cycle counts

Instruction  Single precision cycles  Double precision cycles  Subnormal penalty
FADD         9-10                     9-10                     operand/result
FSUB         9-10                     9-10                     operand/result
FMUL         10-12                    11-17                    operand/result
FNMUL        10-12                    11-17                    operand/result
FMAC         18-21                    19-26                    operand/intermediate/result
FNMAC        18-21                    19-26                    operand/intermediate/result
FMSC         18-21                    19-26                    operand/intermediate/result
FNMSC        18-21                    19-26                    operand/intermediate/result
FDIV         20-37                    29-65                    operand/result
FSQRT        19-33                    29-60                    operand
FCONST       4                        4                        none
FABS         4                        4                        none
FCPY         4                        4                        none
FNEG         4                        4                        none
FCMP         4 or 7                   4 or 7                   none
FCMPE        4 or 7                   4 or 7                   none
FCMPZ        4 or 7                   4 or 7                   none
FCMPEZ       4 or 7                   4 or 7                   none
FCVTDS       5                        -                        operand
FCVTSD       -                        7                        intermediate
FSITO        9                        9                        none
FUITO        9                        9                        none
FTOSI        8                        8                        none
FTOUI        8                        8                        none
FTOSIZ       8                        8                        none
FTOUIZ       8                        8                        none
FSHTO        9                        9                        none
FUHTO        9                        9                        none
FSLTO        9                        9                        none
FULTO        9                        9                        none
FTOSH        6                        8                        none
FTOUH        6                        8                        none
FTOSL        6                        8                        none
FTOUL        6                        8                        none

The Instruction column of Table 16.25 indicates the specific VFPv3 data-processing instruction. The Single precision cycles column indicates the number of cycles required for normal single-precision inputs of the associated instruction. The Double precision cycles column indicates the number of cycles required for normal double-precision inputs of the associated instruction. For example, a double-precision FMUL instruction takes anywhere between 11 and 17 cycles, depending on the data. A single- or double-precision FCMP instruction takes either four or seven cycles, depending on the data.
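One way to use these columns is to bound the cost of a short instruction sequence. The sketch below copies a few rows of Table 16.25 into a dictionary and sums the best and worst cases. The names `CYCLES` and `sequence_cycles` are hypothetical. The sketch assumes normal operands and assumes that, because the coprocessor is nonpipelined, the cycle counts of back-to-back VFP instructions simply add. It ignores any stalls outside the VFP coprocessor.

```python
# Subset of Table 16.25: (minimum, maximum) cycles for normal operands.
# FCMP takes either 4 or 7 cycles, so (4, 7) lists its two possible values.
CYCLES = {
    "FADD": {"single": (9, 10), "double": (9, 10)},
    "FMUL": {"single": (10, 12), "double": (11, 17)},
    "FDIV": {"single": (20, 37), "double": (29, 65)},
    "FCMP": {"single": (4, 7), "double": (4, 7)},
}


def sequence_cycles(instructions, precision):
    """Return (best, worst) total cycles for a list of instruction names,
    assuming each instruction runs to completion before the next issues."""
    best = sum(CYCLES[name][precision][0] for name in instructions)
    worst = sum(CYCLES[name][precision][1] for name in instructions)
    return best, worst
```

Under these assumptions, `sequence_cycles(["FMUL", "FADD"], "double")` gives a bound of 20 to 27 cycles.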

The range of cycles required for normal data is wide because the VFP coprocessor can detect when a given problem does not require additional computation. For example, if the VFP coprocessor multiplies 3 times 3, the operation takes less time than when it multiplies π times π.

The Subnormal penalty column indicates whether additional cycles are required for subnormal operands, subnormal intermediate values, or subnormal final results. This penalty only applies when the VFP coprocessor has flush-to-zero mode disabled.

For operations that have the result penalty, six to seven additional cycles are required to format the final result.

For operations that have the operand penalty:

All 3-input operations (FMAC, FNMAC, FMSC, and FNMSC) are variations of multiply-add, that is, a multiplication followed by an addition. The multiplication produces an intermediate result that might itself be subnormal. This intermediate subnormal has a penalty that is the same as the result penalty (applied to the multiply) plus the operand penalty (applied to the addition), which amounts to an additional 11-13 cycles.

A slightly simpler way to look at 3-input operations is to split them into equivalent multiply and add instructions. A 3-input operation takes the same amount of time as its component multiplication and addition, usually minus one cycle.

An FMAC operation with three normal operands might have a multiplication that takes 12 cycles and an addition that takes nine cycles. The FMAC instruction, which performs the multiply followed by the add, then takes:

12 + 9 - 1 = 20 cycles
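The same rule of thumb can be written as a small helper. This is only a sketch of the estimate described above, not a cycle-accurate model. The function name `fmac_estimate` and the `overlap` parameter are hypothetical, and the default of one cycle reflects the "usually minus one cycle" wording, so real timings can differ.

```python
def fmac_estimate(mul_cycles: int, add_cycles: int, overlap: int = 1) -> int:
    """Estimate cycles for a 3-input multiply-add from the cycles of its
    component multiplication and addition (hypothetical helper)."""
    # The combined instruction usually saves one cycle over the two
    # separate instructions.
    return mul_cycles + add_cycles - overlap


# The worked example: a 12-cycle multiply and a 9-cycle add.
example = fmac_estimate(12, 9)  # 12 + 9 - 1 = 20

# Applying the estimate to the single-precision FMUL (10-12) and FADD (9-10)
# ranges in Table 16.25 reproduces the single-precision FMAC range of 18-21.
single_range = (fmac_estimate(10, 9), fmac_estimate(12, 10))
```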

Consider a multiplication of a normal number by a subnormal number that produces a subnormal product. This operation incurs an operand penalty and a result penalty, and takes a total of 21 to 25 cycles. The subnormal product is then added to another subnormal number, resulting in a normal sum. This addition incurs two operand penalties, and takes a total of 18 to 20 cycles. The total time the two operations take is between:

10 + 5 + 6 + 18 = 39 cycles and 12 + 6 + 7 + 20 = 45 cycles

The corresponding FMAC instruction, which performs the multiply followed by the add, has two operand penalties of nine to 10 cycles, an intermediate penalty of 11 to 13 cycles, and the cost of the multiply-add of 18 to 21 cycles. The total time is between:

9 + 11 + 18 = 38 cycles and 10 + 13 + 21 = 44 cycles
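The two totals can be checked by adding the (minimum, maximum) ranges term by term. In the sketch below, the helper `add_ranges` and all constant names are hypothetical. The per-term values are read off from the sums in this example, including the 5 to 6 cycle operand penalty for the multiply, and are not taken from a separate specification.

```python
def add_ranges(*ranges):
    """Add (min, max) cycle ranges term by term."""
    return (sum(r[0] for r in ranges), sum(r[1] for r in ranges))


# Separate multiply and add instructions.
FMUL_SINGLE = (10, 12)             # normal single-precision multiply
MUL_OPERAND_PENALTY = (5, 6)       # one subnormal operand (from the sums above)
MUL_RESULT_PENALTY = (6, 7)        # subnormal product
FADD_WITH_TWO_PENALTIES = (18, 20) # add including two operand penalties

separate = add_ranges(FMUL_SINGLE, MUL_OPERAND_PENALTY,
                      MUL_RESULT_PENALTY, FADD_WITH_TWO_PENALTIES)

# Single FMAC instruction on the same data.
TWO_OPERAND_PENALTIES = (9, 10)
INTERMEDIATE_PENALTY = (11, 13)
FMAC_SINGLE = (18, 21)

fused = add_ranges(TWO_OPERAND_PENALTIES, INTERMEDIATE_PENALTY, FMAC_SINGLE)
```

Here `separate` evaluates to (39, 45) and `fused` to (38, 44), so the FMAC form is one cycle shorter at both ends of the range, in line with the "minus one cycle" rule for 3-input operations.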
Copyright © 2006-2010 ARM Limited. All rights reserved. ARM DDI 0344K
Non-Confidential ID060510