ARM floating-point values are stored in one of two data types, single precision and double precision. In this document these are called float and double, which are the names of the corresponding C types.
A float value is 32 bits wide. The structure is shown in Figure 5.3.
The S field gives the sign of the number. It is 0 for positive, or 1 for negative.
The Exp field gives the exponent of the number, as a power of two. It is biased by 0x7F (127), so that very small numbers have exponents near zero and very large numbers have exponents near 0xFF (255). So, for example:
if Exp = 0x7D (125), the number is between 0.25 and 0.5 (not including 0.5)
if Exp = 0x7E (126), the number is between 0.5 and 1.0 (not including 1.0)
if Exp = 0x7F (127), the number is between 1.0 and 2.0 (not including 2.0)
if Exp = 0x80 (128), the number is between 2.0 and 4.0 (not including 4.0)
if Exp = 0x81 (129), the number is between 4.0 and 8.0 (not including 8.0).
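The ranges above can be checked on a host machine. The following is a minimal sketch, assuming that the C `float` type uses the 32-bit format described here and that the Exp field occupies bits 30 to 23 (the layout of Figure 5.3, which is not reproduced in this text). The helper names `float_bits` and `float_exp_field` are hypothetical and are not part of any ARM library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the bit pattern of a float into a 32-bit integer.
   Assumption: float is the 32-bit single-precision format. */
uint32_t float_bits(float f)
{
    uint32_t u;
    assert(sizeof f == sizeof u);
    memcpy(&u, &f, sizeof u);   /* avoids pointer-cast aliasing problems */
    return u;
}

/* Return the biased exponent (the Exp field), assumed to be bits 30..23. */
unsigned float_exp_field(float f)
{
    return (unsigned)((float_bits(f) >> 23) & 0xFFu);
}
```

With these helpers, `float_exp_field(0.3f)` returns 0x7D and `float_exp_field(5.0f)` returns 0x81, matching the first and last ranges in the list.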
The Frac field gives the fractional part of the number. It usually has an implicit 1 bit on the front that is not stored to save space. So if Exp is 0x7F, for example:
if Frac = 00000000000000000000000 (binary), the number is 1.0
if Frac = 10000000000000000000000 (binary), the number is 1.5
if Frac = 01000000000000000000000 (binary), the number is 1.25
if Frac = 11000000000000000000000 (binary), the number is 1.75.
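The Frac examples can be reproduced by assembling a bit pattern from its three fields. This is a sketch under the same assumptions as the field layout in Figure 5.3 (S in bit 31, Exp in bits 30 to 23, Frac in bits 22 to 0); `float_from_fields` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build a float from its S, Exp and Frac fields.
   Assumption: float is the 32-bit single-precision format. */
float float_from_fields(unsigned s, unsigned exp, uint32_t frac)
{
    uint32_t u = ((uint32_t)(s & 1u) << 31)      /* sign, 1 bit       */
               | ((uint32_t)(exp & 0xFFu) << 23) /* exponent, 8 bits  */
               | (frac & 0x7FFFFFu);             /* fraction, 23 bits */
    float f;
    assert(sizeof f == sizeof u);
    memcpy(&f, &u, sizeof f);
    return f;
}
```

The 23-bit binary value 10000000000000000000000 is 0x400000, so `float_from_fields(0, 0x7F, 0x400000)` gives 1.5, and 0x200000 and 0x600000 give 1.25 and 1.75 in the same way.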
So in general, the numeric value of a bit pattern in this format is given by the formula:
(-1)^{S} * 2^{(Exp - 0x7F)} * (1 + Frac * 2^{-23})
Numbers stored in the above form are called normalized numbers.
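The formula for normalized numbers can be evaluated directly from a bit pattern. The sketch below assumes the field positions of Figure 5.3 and is valid only when Exp is between 1 and 254. Both function names are hypothetical. The scaling loop stands in for `ldexp()` from `<math.h>` only to keep the example free of a maths-library dependency.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply x by 2^n using repeated doubling or halving, which is exact
   for the ranges used here. ldexp() from <math.h> does the same job. */
double scale_by_pow2(double x, int n)
{
    for (; n > 0; n--) x *= 2.0;
    for (; n < 0; n++) x /= 2.0;
    return x;
}

/* Evaluate (-1)^S * 2^(Exp - 0x7F) * (1 + Frac * 2^-23).
   Valid only for normalized numbers, that is 1 <= Exp <= 254. */
double normalized_float_value(uint32_t bits)
{
    unsigned s    = (unsigned)(bits >> 31);
    int      exp  = (int)((bits >> 23) & 0xFFu);
    uint32_t frac = bits & 0x7FFFFFu;

    assert(exp >= 1 && exp <= 254);

    double significand = 1.0 + scale_by_pow2((double)frac, -23);
    double magnitude   = scale_by_pow2(significand, exp - 0x7F);
    return s ? -magnitude : magnitude;
}
```

For example, the pattern 0x41200000 has S = 0, Exp = 0x82 and Frac = 0x200000, so the function returns 2^3 * 1.25 = 10.0.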
The minimum and maximum exponent values, 0 and 255, are special cases. Exponent 255 is used to represent infinity and to store Not a Number (NaN) values. Infinity can occur as a result of dividing by zero, or as a result of computing a value that is too large to store in this format. NaN values are used for special purposes. Infinity is stored by setting Exp to 255 and Frac to all zeros. If Exp is 255 and Frac is nonzero, the bit pattern represents a NaN.
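The rules for Exp = 255 translate into a short classification routine. This is a sketch only, assuming the field positions of Figure 5.3; the `float_kind` type and `classify_float_bits` function are hypothetical names introduced for illustration.

```c
#include <stdint.h>

typedef enum { KIND_FINITE, KIND_INFINITY, KIND_NAN } float_kind;

/* Classify a single-precision bit pattern using the Exp = 255 rules:
   Frac == 0 means infinity, Frac != 0 means NaN. */
float_kind classify_float_bits(uint32_t bits)
{
    uint32_t exp  = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;

    if (exp != 0xFFu)
        return KIND_FINITE;          /* normalized, denormal or zero */
    return (frac == 0) ? KIND_INFINITY : KIND_NAN;
}
```

The sign bit plays no part in the test, so 0x7F800000 and 0xFF800000 are both reported as infinity.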
Exponent 0 is used to represent very small numbers in a special way. If Exp is zero, then the Frac field has no implicit 1 on the front. This means that the format can store 0.0, by setting both Exp and Frac to all 0 bits. It also means that numbers that are too small to store using Exp >= 1 are stored with less precision than the ordinary 23 bits. These are called denormals.
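The value of a denormal can be computed in a similar way to a normalized number, but without the implicit 1. The sketch below assumes the IEEE 754 rule that a denormal uses the same power of two as Exp = 1, giving a value of Frac * 2^-23 * 2^-126, that is Frac * 2^-149. The text above states only that the implicit 1 is absent, so that scale factor is an assumption taken from the standard. `denormal_float_value` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate a single-precision bit pattern whose Exp field is zero.
   Assumed value: (-1)^S * Frac * 2^-149 (no implicit leading 1). */
double denormal_float_value(uint32_t bits)
{
    unsigned s    = (unsigned)(bits >> 31);
    uint32_t exp  = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;

    assert(exp == 0);

    double magnitude = (double)frac;
    for (int i = 0; i < 149; i++)
        magnitude /= 2.0;            /* exact: double has ample range */
    return s ? -magnitude : magnitude;
}
```

A Frac of all zeros gives 0.0, and a Frac of 1 gives the smallest positive value the format can hold, roughly 1.40*10^{-45}.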
A double value is 64 bits wide. Figure 5.4 shows its structure.
As before, S is the sign, Exp the exponent, and Frac the fraction. Most of the discussion of float values remains true, except that:
The Exp field is biased by 0x3FF (1023) instead of 0x7F, so numbers between 1.0 and 2.0 have an Exp field of 0x3FF.
The Exp value used to represent infinity and NaNs is 0x7FF (2047) instead of 0xFF.
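The same field extraction works for double values. The sketch below assumes the IEEE 754 double layout of 1 sign bit, 11 Exp bits and 52 Frac bits (the layout of Figure 5.4, which is not reproduced in this text) and that the C `double` type uses it. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the bit pattern of a double into a 64-bit integer.
   Assumption: double is the 64-bit double-precision format. */
uint64_t double_bits(double d)
{
    uint64_t u;
    assert(sizeof d == sizeof u);
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Biased exponent, assumed to be bits 62..52 (bias 0x3FF). */
unsigned double_exp_field(double d)
{
    return (unsigned)((double_bits(d) >> 52) & 0x7FFu);
}

/* Fraction, assumed to be bits 51..0. */
uint64_t double_frac_field(double d)
{
    return double_bits(d) & UINT64_C(0xFFFFFFFFFFFFF);
}
```

Any value between 1.0 and 2.0, such as 1.5, yields an Exp field of 0x3FF, while 0.75 yields 0x3FE.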
Some sample float and double bit patterns, together with their mathematical values, are given in Table 5.12 and Table 5.13.
Table 5.12. Sample single-precision floating-point values
Float value | S | Exp | Frac | Mathematical value | Notes |
---|---|---|---|---|---|
0x3F800000 | 0 | 0x7F | 000...000 | 1.0 | - |
0xBF800000 | 1 | 0x7F | 000...000 | -1.0 | - |
0x3F800001 | 0 | 0x7F | 000...001 | 1.000 000 119 | ^{[1]} |
0x3F400000 | 0 | 0x7E | 100...000 | 0.75 | - |
0x00800000 | 0 | 0x01 | 000...000 | 1.18*10^{-38} | ^{[2]} |
0x00000001 | 0 | 0x00 | 000...001 | 1.40*10^{-45} | ^{[3]} |
0x7F7FFFFF | 0 | 0xFE | 111...111 | 3.40*10^{38} | ^{[4]} |
0x7F800000 | 0 | 0xFF | 000...000 | Plus infinity | - |
0xFF800000 | 1 | 0xFF | 000...000 | Minus infinity | - |
0x00000000 | 0 | 0x00 | 000...000 | 0.0 | ^{[5]} |
0x7F800001 | 0 | 0xFF | 000...001 | Signalling NaN | ^{[6]} |
0x7FC00000 | 0 | 0xFF | 100...000 | Quiet NaN | ^{[6]} |
^{[1]} The smallest representable number that can be seen to be greater than 1.0. The amount by which it differs from 1.0 is known as the machine epsilon. This is 0.000 000 119 in float, and 0.000 000 000 000 000 222 in double. The machine epsilon gives a rough idea of the number of decimal places the format can keep track of. float can do six or seven places. double can do fifteen or sixteen.
^{[2]} The smallest value that can be represented as a normalized number in each format. Numbers smaller than this can be stored as denormals, but are not held with as much precision.
^{[3]} The smallest positive number that can be distinguished from zero. This is the absolute lower limit of the format.
^{[4]} The largest finite number that can be stored. Attempting to increase this number by addition or multiplication causes overflow and generates infinity (in general).
^{[5]} Zero. Strictly speaking, the tables show plus zero. Zero with a sign bit of 1, minus zero, is treated differently by some operations, although the comparison operations treat the two types of zero as equal.
^{[6]} There are two types of NaNs, signalling NaNs and quiet NaNs. Quiet NaNs have a 1 in the first bit of Frac, and signalling NaNs have a zero there. The difference is that signalling NaNs cause an exception (see Exceptions) when used, whereas quiet NaNs do not.
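Several of the single-precision rows can be cross-checked against the limits that a C implementation publishes in `<float.h>`. This is a sketch, assuming the C `float` type uses the format described here and that the host does not flush values to zero. The function names are hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float.
   Assumption: float is the 32-bit single-precision format. */
float float_from_bits(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Nonzero if the table rows for notes 1, 2 and 4 agree with <float.h>. */
int limits_match_float_h(void)
{
    return float_from_bits(0x3F800001u) - 1.0f == FLT_EPSILON  /* note 1 */
        && float_from_bits(0x00800000u) == FLT_MIN             /* note 2 */
        && float_from_bits(0x7F7FFFFFu) == FLT_MAX;            /* note 4 */
}

/* Nonzero if plus zero and minus zero compare equal (note 5). */
int zeros_compare_equal(void)
{
    return float_from_bits(0x00000000u) == float_from_bits(0x80000000u);
}
```

Comparing against `<float.h>` rather than hard-coded decimal constants avoids rounding the table values, which are quoted to only three significant figures.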
Table 5.13. Sample double-precision floating-point values
Double value | S | Exp | Frac | Mathematical value | Notes |
---|---|---|---|---|---|
0x3FF00000 00000000 | 0 | 0x3FF | 000...000 | 1.0 | - |
0xBFF00000 00000000 | 1 | 0x3FF | 000...000 | -1.0 | - |
0x3FF00000 00000001 | 0 | 0x3FF | 000...001 | 1.000 000 000 000 000 222 | ^{[1]} |
0x3FE80000 00000000 | 0 | 0x3FE | 100...000 | 0.75 | - |
0x00100000 00000000 | 0 | 0x001 | 000...000 | 2.23*10^{-308} | ^{[2]} |
0x00000000 00000001 | 0 | 0x000 | 000...001 | 4.94*10^{-324} | ^{[3]} |
0x7FEFFFFF FFFFFFFF | 0 | 0x7FE | 111...111 | 1.80*10^{308} | ^{[4]} |
0x7FF00000 00000000 | 0 | 0x7FF | 000...000 | Plus infinity | - |
0xFFF00000 00000000 | 1 | 0x7FF | 000...000 | Minus infinity | - |
0x00000000 00000000 | 0 | 0x000 | 000...000 | 0.0 | ^{[5]} |
0x7FF00000 00000001 | 0 | 0x7FF | 000...001 | Signalling NaN | ^{[6]} |
0x7FF80000 00000000 | 0 | 0x7FF | 100...000 | Quiet NaN | ^{[6]} |
^{[1]} to ^{[6]}: for footnotes, see Table 5.12.