5.5.1. Basic data types

ARM floating-point values are stored in one of two data types, single precision and double precision. In this document these are called float and double, after the corresponding C types.

Single precision

A float value is 32 bits wide. The structure is shown in Figure 5.3.

Figure 5.3. IEEE 754 single-precision floating-point format

The S field gives the sign of the number. It is 0 for positive, or 1 for negative.

The Exp field gives the exponent of the number, as a power of two. It is biased by 0x7F (127), so that very small numbers have exponents near zero and very large numbers have exponents near 0xFF (255). So, for example:

  • if Exp = 0x7D (125), the number is between 0.25 and 0.5 (not including 0.5)

  • if Exp = 0x7E (126), the number is between 0.5 and 1.0 (not including 1.0)

  • if Exp = 0x7F (127), the number is between 1.0 and 2.0 (not including 2.0)

  • if Exp = 0x80 (128), the number is between 2.0 and 4.0 (not including 4.0)

  • if Exp = 0x81 (129), the number is between 4.0 and 8.0 (not including 8.0).
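The field boundaries can be checked from C. The following is a minimal sketch, not part of the ARM libraries. It assumes that float is a 32-bit IEEE 754 value laid out in the standard way (S in bit 31, Exp in bits 30 to 23, Frac in bits 22 to 0), and the helper names float_bits, float_sign, float_exp, and float_frac are hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the 32 bits of a float into an integer.
   Assumes float is a 32-bit IEEE 754 single-precision value. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* avoids pointer-cast aliasing problems */
    return u;
}

/* S field: bit 31 (assumed standard layout). */
unsigned float_sign(float f)
{
    return (unsigned)(float_bits(f) >> 31);
}

/* Exp field: bits 30 to 23, still biased by 0x7F. */
unsigned float_exp(float f)
{
    return (unsigned)((float_bits(f) >> 23) & 0xFF);
}

/* Frac field: bits 22 to 0, without the implicit leading 1. */
uint32_t float_frac(float f)
{
    return float_bits(f) & 0x7FFFFF;
}
```

For example, float_exp(0.75f) should return 0x7E, because 0.75 lies between 0.5 and 1.0.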

The Frac field gives the fractional part of the number. It usually has an implicit 1 bit on the front that is not stored to save space. So if Exp is 0x7F, for example:

  • if Frac = 00000000000000000000000 (binary), the number is 1.0

  • if Frac = 10000000000000000000000 (binary), the number is 1.5

  • if Frac = 01000000000000000000000 (binary), the number is 1.25

  • if Frac = 11000000000000000000000 (binary), the number is 1.75.

So in general, the numeric value of a bit pattern in this format is given by the formula:

(-1)^S * 2^(Exp - 0x7F) * (1 + Frac * 2^-23)

Numbers stored in the above form are called normalized numbers.
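As an illustration, the formula for normalized numbers can be evaluated directly from a bit pattern. This is a sketch only. The function names two_to_the and decode_normalized_float are hypothetical, the field positions are assumed to be the standard IEEE 754 ones, and the result is returned as a double so that every float value is held exactly:

```c
#include <stdint.h>

/* Hypothetical helper: 2 raised to a small integer power, by repeated
   multiplication or division, so that no math library is needed. */
double two_to_the(int e)
{
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* Evaluate (-1)^S * 2^(Exp - 0x7F) * (1 + Frac * 2^-23).
   Valid only for normalized patterns, that is, Exp from 1 to 254. */
double decode_normalized_float(uint32_t bits)
{
    unsigned s    = (unsigned)(bits >> 31);        /* sign bit        */
    int      exp  = (int)((bits >> 23) & 0xFF);    /* biased exponent */
    uint32_t frac = bits & 0x7FFFFF;               /* 23-bit fraction */

    /* 8388608 is 2^23, so this is 1 + Frac * 2^-23. */
    double mantissa = 1.0 + (double)frac / 8388608.0;
    double value    = mantissa * two_to_the(exp - 0x7F);

    return s ? -value : value;
}
```

With Exp = 0x7F and Frac = 10000000000000000000000 (binary), the pattern is 0x3FC00000 and the function returns 1.5, matching the example above.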

The minimum and maximum exponent values, 0 and 255, are special cases. Exponent 255 is used to represent infinity and to store Not a Number (NaN) values. Infinity can occur as a result of dividing a nonzero number by zero, or as a result of computing a value that is too large to store in this format. NaN values are used for special purposes, such as marking the result of an invalid operation like dividing zero by zero. Infinity is stored by setting Exp to 255 and Frac to all zeros. If Exp is 255 and Frac is nonzero, the bit pattern represents a NaN.
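The rules for exponent 255 can be expressed as a small classification routine. This is a sketch that works on the raw bit pattern only. The enumeration and the function name classify_float_bits are hypothetical, and the standard field positions are assumed:

```c
#include <stdint.h>

/* Hypothetical classification of a single-precision bit pattern. */
enum float_class { FC_FINITE, FC_INFINITY, FC_NAN };

enum float_class classify_float_bits(uint32_t bits)
{
    uint32_t exp  = (bits >> 23) & 0xFF;   /* Exp field  */
    uint32_t frac = bits & 0x7FFFFF;       /* Frac field */

    if (exp != 0xFF)
        return FC_FINITE;                  /* normalized, denormal, or zero */

    /* Exp is 255: all-zero Frac means infinity, anything else is a NaN. */
    return frac == 0 ? FC_INFINITY : FC_NAN;
}
```

The sign bit plays no part in the classification, so both 0x7F800000 and 0xFF800000 are reported as infinity.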

Exponent 0 is used to represent very small numbers in a special way. If Exp is zero, then the Frac field has no implicit 1 on the front. This means that the format can store 0.0, by setting both Exp and Frac to all 0 bits. It also means that numbers that are too small to store using Exp >= 1 are stored with less precision than the ordinary 23 bits. These are called denormals.
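A denormal can be decoded in the same spirit. The sketch below relies on a detail taken from IEEE 754 rather than from the text above: when Exp is zero, the value is Frac * 2^-23 scaled by 2^-126, which is Frac * 2^-149. The function names are hypothetical, and the hexadecimal floating-point constant requires a C99 compiler:

```c
#include <stdint.h>

/* Returns 1 if the pattern is a denormal: Exp is zero and Frac is nonzero. */
int is_denormal_float_bits(uint32_t bits)
{
    return ((bits >> 23) & 0xFF) == 0 && (bits & 0x7FFFFF) != 0;
}

/* Decode a pattern whose Exp field is zero (a denormal or zero).
   There is no implicit leading 1, so the value is Frac * 2^-149
   (per IEEE 754; this scaling is not stated in the text above). */
double decode_denormal_float(uint32_t bits)
{
    unsigned s    = (unsigned)(bits >> 31);
    uint32_t frac = bits & 0x7FFFFF;
    double value  = (double)frac * 0x1p-149;   /* C99 hex float constant */

    return s ? -value : value;
}
```

The smallest pattern, 0x00000001, has only one significant bit, which shows how precision falls away as the values approach zero.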

Double precision

A double value is 64 bits wide. Figure 5.4 shows its structure.

Figure 5.4. IEEE 754 double-precision floating-point format

As before, S is the sign, Exp the exponent, and Frac the fraction. Most of the discussion of float values remains true, except that:

  • The Exp field is biased by 0x3FF (1023) instead of 0x7F, so numbers between 1.0 and 2.0 have an Exp field of 0x3FF.

  • The Exp value used to represent infinity and NaNs is 0x7FF (2047) instead of 0xFF.
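The same field extraction can be sketched for double. The code assumes the standard IEEE 754 field widths (1 sign bit, 11 exponent bits, 52 fraction bits) and that the host stores a double as a plain 64-bit value in the same byte and word order as a 64-bit integer, which might not hold on every ARM floating-point configuration. The helper names are hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the 64 bits of a double into an integer.
   Assumes double and uint64_t share the same byte and word order. */
uint64_t double_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* S field: bit 63. */
unsigned double_sign(double d)
{
    return (unsigned)(double_bits(d) >> 63);
}

/* Exp field: bits 62 to 52, still biased by 0x3FF. */
unsigned double_exp(double d)
{
    return (unsigned)((double_bits(d) >> 52) & 0x7FF);
}

/* Frac field: bits 51 to 0, without the implicit leading 1. */
uint64_t double_frac(double d)
{
    return double_bits(d) & ((((uint64_t)1) << 52) - 1);
}
```

Under these assumptions, double_exp(1.0) returns 0x3FF, in line with the bias described above.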

Sample values

Some sample float and double bit patterns, together with their mathematical values, are given in Table 5.12 and Table 5.13.

Table 5.12. Sample single-precision floating-point values

Float value   S   Exp    Frac        Mathematical value   Notes
0x3F800001    0   0x7F   000...001   1.000 000 119        [1]
0x7F800000    0   0xFF   000...000   Plus infinity        -
0xFF800000    1   0xFF   000...000   Minus infinity       -
0x7F800001    0   0xFF   000...001   Signalling NaN       [6]
0x7FC00000    0   0xFF   100...000   Quiet NaN            [6]

[1] The smallest representable number that is greater than 1.0. The amount by which it differs from 1.0 is known as the machine epsilon. This is 0.000 000 119 in float, and 0.000 000 000 000 000 222 in double. The machine epsilon gives a rough idea of the number of significant decimal digits the format can keep track of: six or seven for float, and fifteen or sixteen for double.

[2] The smallest value that can be represented as a normalized number in each format. Numbers smaller than this can be stored as denormals, but are not held with as much precision.

[3] The smallest positive number that can be distinguished from zero. This is the absolute lower limit of the format.

[4] The largest finite number that can be stored. Attempting to increase this number by addition or multiplication causes overflow and generates infinity (in general).

[5] Zero. Strictly speaking, the table entries show plus zero. Zero with a sign bit of 1, minus zero, is treated differently by some operations, although the comparison operations (for example == and !=) report that the two types of zero are equal.

[6] There are two types of NaNs, signalling NaNs and quiet NaNs. Quiet NaNs have a 1 in the first bit of Frac, and signalling NaNs have a zero there. The difference is that signalling NaNs cause an exception (see Exceptions) when used, whereas quiet NaNs do not.
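Several of these notes can be cross-checked against the limits in the standard header float.h. The following sketch assumes an IEEE 754 host on which float and double have the same byte order as the corresponding integer types. The function names are hypothetical, and 0x80000000 is used as the single-precision minus zero pattern described in note [5]:

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: reinterpret a bit pattern as a floating-point value. */
float float_from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

double double_from_bits(uint64_t u)
{
    double d;
    memcpy(&d, &u, sizeof d);
    return d;
}

/* Returns 1 if the sample patterns agree with the float.h limits:
   notes [1] (machine epsilon), [2] (smallest normalized number),
   and [4] (largest finite number). */
int samples_match_limits(void)
{
    return float_from_bits(0x3F800001u) == 1.0f + FLT_EPSILON
        && double_from_bits(0x3FF0000000000001ULL) == 1.0 + DBL_EPSILON
        && double_from_bits(0x0010000000000000ULL) == DBL_MIN
        && double_from_bits(0x7FEFFFFFFFFFFFFFULL) == DBL_MAX;
}

/* Returns 1 if plus zero and minus zero compare equal, as note [5] says. */
int zeros_compare_equal(void)
{
    float plus  = float_from_bits(0x00000000u);
    float minus = float_from_bits(0x80000000u);   /* sign bit set, rest zero */
    return plus == minus;
}
```

On a conforming host both functions are expected to return 1.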

Table 5.13. Sample double-precision floating-point values

Double value          S   Exp     Frac        Mathematical value          Notes
0x3FF00000 00000000   0   0x3FF   000...000   1.0                         -
0xBFF00000 00000000   1   0x3FF   000...000   -1.0                        -
0x3FF00000 00000001   0   0x3FF   000...001   1.000 000 000 000 000 222   [1]
0x3FE80000 00000000   0   0x3FE   100...000   0.75                        -
0x00100000 00000000   0   0x001   000...000   2.23*10^-308                [2]
0x00000000 00000001   0   0x000   000...001   4.94*10^-324                [3]
0x7FEFFFFF FFFFFFFF   0   0x7FE   111...111   1.80*10^308                 [4]
0x7FF00000 00000000   0   0x7FF   000...000   Plus infinity               -
0xFFF00000 00000000   1   0x7FF   000...000   Minus infinity              -
0x00000000 00000000   0   0x000   000...000   0.0                         [5]
0x7FF00000 00000001   0   0x7FF   000...001   Signalling NaN              [6]
0x7FF80000 00000000   0   0x7FF   100...000   Quiet NaN                   [6]

[1] to [6]. For footnotes, see Table 5.12.

Copyright © 1999-2001 ARM Limited. All rights reserved. ARM DUI 0067D