Binary Floating-Point Numbers
By Stephen Bucaro
A computer with an n-bit word size can handle unsigned integers in the
range 0 to 2^n - 1 in a single word. For a 32-bit computer this means
numbers up to 4,294,967,295. Floating-point numbers allow you to use the
very large, and very small, numbers commonly found in scientific
calculations. In fact, floating-point notation is sometimes called
"scientific notation".
A floating-point number has two parts, the number part and the exponent
part. For example, the mass of the sun is 1.989 × 10^30 kg. The diameter
of a red blood cell is 3 × 10^-4 inches. The 1.989 and the 3 are the
number parts; the 10^30 and the 10^-4 are the exponent parts (powers of
the base, or radix, 10).
Note, some displays aren't capable of displaying superscripts, so they use
a capital E to indicate that the following number is an exponent. For example,
the mass of the sun can be expressed as 1.989E30 kg. The diameter of a
red blood cell can be expressed as 3E-4 inches. Or a very limited display
might express it as 3 × 10-4, leaving out both the superscript and the E.
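This E notation is exactly what most programming languages accept for
floating-point literals. A quick sketch in Python (the variable names are
my own; any language with float literals behaves similarly):

```python
# The article's example values, written in E notation.
sun_mass_kg = 1.989E30        # 1.989 × 10^30
red_cell_diameter_in = 3E-4   # 3 × 10^-4

# Python prints very large floats back in E notation automatically.
print(sun_mass_kg)            # 1.989e+30
```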
A binary floating-point number consists of three parts: the sign bit, the
mantissa and the exponent. A sign bit of 1 indicates a negative number; a
sign bit of 0 indicates a positive number. In a 32-bit system, the
exponent is the 8 bits following the sign bit, and the mantissa is the
remaining 23 bits. In a 64-bit system, the exponent is 11 bits and the
mantissa is 52 bits. A 64-bit floating-point number is called
double-precision, as opposed to 32-bit, which is referred to as
single-precision. The mantissa is also called the significand because its
size controls the accuracy of the number.
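These field widths can be checked directly. The sketch below (Python,
using the standard struct module; the helper name is my own) unpacks a
value into the three fields of a 32-bit float:

```python
import struct

def float32_fields(x):
    """Split a number into the three fields of a 32-bit float."""
    # Pack as big-endian single precision, then view the raw bits as an int.
    bits = int.from_bytes(struct.pack(">f", x), "big")
    sign     = bits >> 31            # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits (biased)
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits
    return sign, exponent, mantissa

print(float32_fields(-1.0))   # (1, 127, 0): sign 1, biased exponent 127, mantissa 0
```

The biased exponent of 127 in the output is explained in the biasing
section below: the actual exponent of -1.0 is 0, stored as 0 + 127.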
Normalizing a Binary Floating-Point Number
Before a binary floating-point number can be correctly stored, it must be
normalized. Normalizing means moving the radix point (the binary
equivalent of the decimal point) so that only one digit appears to its
left. This creates the mantissa. The exponent then becomes the number of
positions the radix point was moved. Moving the point left creates a
positive exponent; moving it right creates a negative exponent.
For example, the binary floating-point number 1101.101 is normalized by
moving the radix point 3 places to the left. The mantissa becomes
1.101101 and the exponent becomes 3, giving the normalized binary
floating-point number 1.101101 × 2^3.
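We can verify that normalization preserved the value. A small sketch in
Python (the variable names are my own; 1101.101 binary is 13.625 decimal):

```python
# 1101.101 binary: the digits 1101101 with the point shifted 3 places,
# i.e. the integer 1101101 (binary) divided by 2^3.
original = int("1101101", 2) / 2**3

# 1.101101 × 2^3: the mantissa 1.101101 is 1101101 (binary) over 2^6,
# then multiplied back by the exponent 2^3.
normalized = int("1101101", 2) / 2**6 * 2**3

print(original, normalized)   # 13.625 13.625
```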
Biasing a Binary Floating-Point Number
We could store negative exponents as two's complement binary numbers, but
this would make it more difficult (programmatically) to compare numbers
(<, ==, >). For this reason a biasing constant is added to the exponent
to ensure the stored exponent is never negative.
The value of the biasing constant depends upon the number of bits available
for the exponent. For a 32-bit system, the bias is 127 decimal, which is
01111111 binary. So, for example:
If the exponent is 5, the biased exponent is 5 + 127 = 132 decimal = 10000100 binary.
If the exponent is -5, the biased exponent is -5 + 127 = 122 decimal = 01111010 binary.
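The same arithmetic can be sketched in a few lines of Python (`biased` is
a hypothetical helper name, not a library function):

```python
BIAS = 127  # single-precision exponent bias

def biased(exponent):
    """Return the biased exponent as an 8-bit binary string."""
    return format(exponent + BIAS, "08b")

print(biased(5))    # 10000100  (132 decimal)
print(biased(-5))   # 01111010  (122 decimal)
```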
Although exponent biasing makes number comparisons faster and easier,
there are always trade-offs. The actual exponent is found by subtracting
the bias from the stored exponent. This means the exponent range for a
32-bit system is -127 to +128 (in practice IEEE 754 reserves the
all-zeros and all-ones patterns, leaving -126 to +127 for normalized
numbers). If you can't fit your number within that range, you need to use
a double-precision binary floating-point number. A double-precision
number is biased by adding 1023 decimal, for an exponent range of
-1022 to +1023.
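Recovering the actual exponent of a double works the same way: read the
11 stored exponent bits and subtract the bias of 1023. A sketch using
Python's standard struct module (the function name is my own):

```python
import struct

def double_exponent(x):
    """Actual (unbiased) exponent of a 64-bit double."""
    bits = int.from_bytes(struct.pack(">d", x), "big")
    stored = (bits >> 52) & 0x7FF   # the 11-bit biased exponent field
    return stored - 1023            # subtract the double-precision bias

print(double_exponent(1.0))       # 0
print(double_exponent(2.0**300))  # 300 -- far beyond single-precision range
```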
