Lecture 05 Floating Point

Joseph Haugh

University of New Mexico

Learning Objectives

  • After the floating point sessions students should be able to:
    • Describe how fractional binary numbers are interpreted.
    • Describe the IEEE floating point standard representation used to approximate real numbers.
    • Perform conversions between IEEE floating point and real numbers in decimal.
    • Define and perform the operations of rounding, addition, and multiplication on floating point numbers.
    • Describe how floating point numbers are represented in C.

Fractional Binary Numbers

  • What is $1011.101_2$?
  • Your first question should be: what interpretation are you using?
  • Fractional binary notation?
  • IEEE floating point notation?
  • Something else?
  • Let's start with fractional binary notation.

Recall: Base Ten Decimal Notation

  • What does $1023.45_{10}$ mean?
  • $1 * 10^3 + 0 * 10^2 + 2 * 10^1 + 3 * 10^0 + 4 * 10^{-1} + 5 * 10^{-2}$
  • We can easily then use this same idea for base 2:

Visualized: Fractional Binary Numbers

$b = \sum_{k = -j}^{i} b_k * 2^k$
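As a concrete companion to this formula, here is a minimal C sketch (my own illustration, not from the slides) that evaluates a fractional binary string such as "1011.101" by summing $b_k * 2^k$; the function name and string format are assumptions made for the example.

#include <stdio.h>

/* Sketch: evaluate a fractional binary string like "1011.101"
 * by applying b = sum(b_k * 2^k) from the formula above. */
static double frac_binary_value(const char *s)
{
    double value = 0.0;
    /* Integer part: each new bit doubles what we have so far. */
    while (*s && *s != '.')
        value = 2.0 * value + (*s++ - '0');
    if (*s == '.') {
        double weight = 0.5;                 /* 2^-1, 2^-2, ... */
        for (s++; *s; s++, weight /= 2.0)
            value += weight * (*s - '0');
    }
    return value;
}

int main(void)
{
    printf("%g\n", frac_binary_value("1011.101"));  /* 11.625 = 11 5/8 */
    return 0;
}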

Fractional Binary Numbers

  • What number does $1011.101_2$ represent in base 10?
  • $11\frac{5}{8}$
  • Left part: $1011_2 = 1 * 2^3 + 1 * 2^1 + 1 * 2^0 = 11$
  • Right part: $0.101_2 = 1 * 2^{-1} + 1 * 2^{-3} = \frac{5}{8}$

Practice: Fractional Binary Numbers

  • $0.111_2 = ?_{10}$
  • $101.1_2 = ?_{10}$
  • $0.00110011_2 = ?_{10}$

Practice: Fractional Binary Numbers

  • $0.111_2 = \frac{7}{8}$
  • $101.1_2 = ?_{10}$
  • $0.00110011_2 = ?_{10}$

Practice: Fractional Binary Numbers

  • $0.111_2 = \frac{7}{8}$
  • $101.1_2 = 5\frac{1}{2}$
  • $0.00110011_2 = ?_{10}$

Practice: Fractional Binary Numbers

  • $0.111_2 = \frac{7}{8}$
  • $101.1_2 = 5\frac{1}{2}$
  • $0.00110011_2 = \frac{51}{256}$

Addition

  • Does addition still work?
  • $0.111_2 (\frac{7}{8}_{10}) + 101.1_2 (5\frac{1}{2}_{10}) = ?_2 = ?_{10}$

Addition

  • Does addition still work?
  • $0.111_2 (\frac{7}{8}_{10}) + 101.1_2 (5\frac{1}{2}_{10}) = 110.011_2 = 6\frac{3}{8}_{10}$

Shifting

Left Shift (binary point moves left; the value is halved each step):

Base 2   Base 10           Effect
101.011  $5\frac{3}{8}$
10.1011  $2\frac{11}{16}$  /2
1.01011  $1\frac{11}{32}$  /2

Right Shift (binary point moves right; the value is doubled each step):

Base 2   Base 10           Effect
101.011  $5\frac{3}{8}$
1010.11  $10\frac{3}{4}$   *2
10101.1  $21\frac{1}{2}$   *2

From Base 10 to Base 2

  1. Convert the number from a fraction to decimal-point form
  2. Split the number at the decimal point
    • Whole number part: convert normally
    • Fractional part: follow these steps
  3. Multiply the fractional part by 2
  4. The whole number part of the product (always 0 or 1) is the next digit of the result
  5. The fractional part of the product is carried into the next iteration
  6. Repeat until either:
    • The fractional part is 0
    • A cycle is detected (the result of the multiplication has been seen before)
  7. The result is the converted whole number part, followed by the binary point and the digits from step 4 read from top to bottom (a small C sketch of the fractional part follows this list)
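The following minimal C sketch (my own addition, not from the slides) implements steps 3–6 for the fractional part. For simplicity it caps the number of output digits instead of detecting cycles, since 0.1 is not stored exactly as a double.

#include <stdio.h>

/* Sketch of the fractional-part conversion: repeatedly multiply by 2
 * and peel off the whole-number digit. Hypothetical helper for
 * illustration; caps output length rather than detecting cycles. */
static void frac_to_binary(double frac, int max_digits)
{
    printf(".");
    for (int i = 0; i < max_digits && frac != 0.0; i++) {
        frac *= 2.0;            /* step 3: multiply by 2                */
        int digit = (int)frac;  /* step 4: whole part is the next digit */
        printf("%d", digit);
        frac -= digit;          /* step 5: carry the fractional part    */
    }
    printf("\n");
}

int main(void)
{
    frac_to_binary(0.1, 12);    /* prints roughly .000110011001 */
    return 0;
}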

Demo: From Base 10 to Base 2

$12.1_{10} = ?_2$

Frac. Part * 2 = Digit  New Frac. Part
0.1 * 2 = 0 .2
0.2 * 2 = 0 .4
0.4 * 2 = 0 .8
0.8 * 2 = 1 .6
0.6 * 2 = 1 .2
0.2 * 2 = 0 .4

$12.1_{10} = 1100.0\overline{0011}_2$

Practice: From Base 10 to Base 2

$123.4_{10} = ?_2$

Frac. Part * 2 = Digit  New Frac. Part
0.4 * 2 = 0 .8
0.8 * 2 = 1 .6
0.6 * 2 = 1 .2
0.2 * 2 = 0 .4
0.4 * 2 = 0 .8

$123.4_{10} = 1111011.\overline{0110}_2$

Limitations: Representable Numbers

  • Can only represent numbers of the form: $\frac{x}{2^k}$
    • $\frac{1}{2}, \frac{1}{4}, \frac{5}{16},$ etc.
  • Other rational numbers must be represented with repeating representations:
    • $\frac{1}{3} = 0.0101010101[01]$
    • $\frac{1}{5} = 0.001100110011[0011]$
    • $\frac{1}{10} = 0.0001100110011[0011]$
  • Limited range of numbers
    • Very small and very large numbers require a lot of bits
  • Inefficiently stores numbers with lots of zeroes
  • There must be a better way!

Insight: Scientific Notation

  • What if, instead of storing the number literally, we stored it in scientific notation?
  • Recall: $12345_{10} = 1.2345 * 10^4$
  • Base 2 version: $n_2 = x_2 * 2^y$
  • Instead of storing n directly we instead store x and y!

IEEE Floating Point

  • IEEE Standard 754
    • Established in 1985 as uniform standard for floating point arithmetic
    • Before that there were many idiosyncratic formats; each computer manufacturer had their own
    • Supported by all major CPUs
  • Driven by numerical concerns
    • Nice standards for rounding, overflow, underflow
    • Hard to make fast in hardware
    • Numerical analysts predominated over hardware designers in defining the standard

Floating Point Representation

  • High level interpretation: $(-1)^s * M * 2^E$
    • s: sign bit
    • exp: exponent field; weights the value by a power of 2 and encodes E, which is written in biased notation
    • frac: fraction field; encodes the mantissa M, which lies in [1, 2) for normalized values
  • Remember that what is stored in exp and frac encodes E and M respectively

Aside: Mantissa

  • An old mathematical term
  • It is simply the significand (the leading digits) of a number written in scientific notation
  • For example:
    • $6230_{10} = 6.23 * 10^3$
    • 6.23 is the mantissa

Floating Point Precision

  • Single Precision: 32 bits (1 sign bit, 8 exponent bits, 23 fraction bits)

  • Double Precision: 64 bits (1 sign bit, 11 exponent bits, 52 fraction bits)
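As an aside (my own sketch, not part of the slides), the single-precision layout can be inspected directly in C by copying a float's bits into an integer; the helper name below is made up for illustration.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Sketch: split a 32-bit float into its 1/8/23-bit fields. */
static void show_fields(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);        /* reinterpret the bits, no conversion */
    uint32_t sign = bits >> 31;
    uint32_t exp  = (bits >> 23) & 0xFF;   /* 8 exponent bits  */
    uint32_t frac = bits & 0x7FFFFF;       /* 23 fraction bits */
    printf("%g: sign=%u exp=%u frac=0x%06X\n", f, sign, exp, frac);
}

int main(void)
{
    show_fields(13.0f);   /* matches the worked example below: exp=130, frac=0x500000 */
    return 0;
}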

Caveats

$(-1)^s * M * 2^E$

  • Of course nothing is as simple as it seems and floating point is no exception
  • We have 3 cases for interpreting a floating point number:
    • Case 1: Normalized values
    • Case 2: Denormalized values, used for numbers close to 0
    • Case 3: Special values, used for infinity and NaN

Aside: Bias Values

  • We are all familiar with using two's complement to represent negative numbers
  • But can you think of another way we could represent them?
  • One other way is to use biased notation, also called offset binary
  • General form: n − K, where n is the unsigned value of the bit pattern and K is a constant (the bias)

Aside: Bias Values

  • For example a 4 bit biased binary number with K = 8:
Dec. Offset (K = 8) Two’s Comp.
7 1111 0111
6 1110 0110
5 1101 0101
4 1100 0100
3 1011 0011
2 1010 0010
1 1001 0001
0 1000 0000
Dec. Offset (K = 8) Two’s Comp.
-1 0111 1111
-2 0110 1110
-3 0101 1101
-4 0100 1100
-5 0011 1011
-6 0010 1010
-7 0001 1001
-8 0000 1000

Practice: Bias Notation

  • Convert the following into 6 bit binary with K = 32
  • $0_{10} = ?_2$
  • $31_{10} = ?_2$
  • $-15_{10} = ?_2$

Practice: Bias Notation

  • Convert the following into 6 bit binary with K = 32
  • $0_{10} = 100000_2$
  • $31_{10} = ?_2$
  • $-15_{10} = ?_2$

Practice: Bias Notation

  • Convert the following into 6 bit binary with K = 32
  • $0_{10} = 100000_2$
  • $31_{10} = 111111_2$
  • $-15_{10} = ?_2$

Practice: Bias Notation

  • Convert the following into 6 bit binary with K = 32
  • $0_{10} = 100000_2$
  • $31_{10} = 111111_2$
  • $-15_{10} = 010001_2$

Normalized Values

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • exp ≠ 000...0 and exp ≠ 111...1
  • exp is interpreted as a biased value
    • $E = exp - bias$
    • $bias = 2^{k-1} - 1$, where k is the number of exponent bits
      • Single precision: 127 (exp: 1...254, E: −126...127)
      • Double precision: 1023 (exp: 1...2046, E: −1022...1023)
  • M, the mantissa, is encoded with an implied leading 1:
    • $M = 1.x_1x_2x_3...x_j$
    • $x_1x_2x_3...x_j$ is what is stored in the frac field
    • Minimum: 000...0, M = 1
    • Maximum: 111...1, M = 2 − ϵ

Example: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Convert $13.0_{10}$ into 32 bit floating point

  • Unsigned binary: $1101_2$

  • Scientific notation: $1.101_2 * 2^3$

  • Exponent:

    $$ \begin{aligned} E & {}= 3 \\ bias & {}= 127 \quad (k = 8,\ 2^{k-1} - 1 = 128 - 1) \\ exp & {}= 130 \quad (E + bias) = 10000010_2 \end{aligned} $$

Example: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Convert $13.0_{10}$ into 32 bit floating point

  • Unsigned binary: $1101_2$

  • Scientific notation: $1.101_2 * 2^3$

  • Mantissa:

    $$ \begin{aligned} M & {}= 1.101_2 \\ frac & {}= \phantom{0.}10100000000000000000000_2 \end{aligned} $$

Example: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Convert $13.0_{10}$ into 32 bit floating point

  • Unsigned binary: $1101_2$

  • Scientific notation: $1.101_2 * 2^3$

  • Exponent: $10000010_2$

  • Fraction: $10100000000000000000000_2$

    sign exp      frac
    0    10000010 10100000000000000000000

Example: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Convert $15213.0_{10}$ into 32 bit floating point

  • Unsigned binary: $11101101101101_2$

  • Scientific notation: $1.1101101101101_2 * 2^{13}$

  • Exponent:

    $$ \begin{aligned} E & {}= 13 \\ bias & {}= 127 \quad (k = 8,\ 2^{k-1} - 1 = 128 - 1) \\ exp & {}= 140 \quad (E + bias) = 10001100_2 \end{aligned} $$

Example: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Convert $15213.0_{10}$ into 32 bit floating point

  • Unsigned binary: $11101101101101_2$

  • Scientific notation: $1.1101101101101_2 * 2^{13}$

  • Mantissa:

    $$ \begin{aligned} M & {}= 1.1101101101101_2 \\ frac & {}= \phantom{0.}11011011011010000000000_2 \end{aligned} $$

Example: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Convert $15213.0_{10}$ into 32 bit floating point

  • Unsigned binary: $11101101101101_2$

  • Scientific notation: $1.1101101101101_2 * 2^{13}$

  • Exponent: $10001100_2$

  • Fraction: $11011011011010000000000_2$

    sign exp      frac
    0    10001100 11011011011010000000000

Practice: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • $12.0_{10} = ?_2$
  • $100.0_{10} = ?_2$
  • $10111110111010000000000000000000_2 = ?_{10}$

Practice: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • $12.0_{10} = ?_2$
    • $01000001010000000000000000000000_2$
  • $100.0_{10} = ?_2$
  • $10111110111010000000000000000000_2 = ?_{10}$

Practice: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • $12.0_{10} = ?_2$
    • $01000001010000000000000000000000_2$
  • $100.0_{10} = ?_2$
    • $01000010110010000000000000000000_2$
  • $10111110111010000000000000000000_2 = ?_{10}$

Practice: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • $12.0_{10} = ?_2$
    • $01000001010000000000000000000000_2$
  • $100.0_{10} = ?_2$
    • $01000010110010000000000000000000_2$
  • $10111110111010000000000000000000_2 = ?_{10}$
    • $-\frac{29}{64}_{10} = -0.453125_{10}$

Denormalized Values

$(-1)^s * M * 2^E$ where $M = 0.xxx...x_2$ and $E = 1 - bias$

  • When exp = 000...0
  • E = 1 − bias instead of 0 − bias
  • Mantissa encoded with a leading 0 (instead of 1): $M = 0.xxx...x_2$
  • Cases:
    • exp = 000...0 and frac = 000...0
      • Represents zero
      • +0 and -0 both exist
    • exp = 000...0 and frac ≠ 000...0
      • Numbers closest to zero
      • Equidistant (evenly spaced)
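A small C sketch (my own addition, not from the slides) showing the boundary between normalized and denormalized single-precision values; FLT_MIN and nextafterf are standard library facilities.

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    float min_norm   = FLT_MIN;                  /* smallest positive normalized float, 2^-126   */
    float min_denorm = nextafterf(0.0f, 1.0f);   /* smallest positive denormalized float, 2^-149 */
    float half       = min_denorm / 2.0f;        /* typically underflows to 0.0                  */
    printf("smallest normalized:   %e\n", min_norm);
    printf("smallest denormalized: %e\n", min_denorm);
    printf("half of that:          %e\n", half);
    return 0;
}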

Special Values

  • When exp = 111...1
  • Cases:
    • exp = 111...1 and frac = 000...0
      • ±∞
      • Operation that overflows
      • $\frac{1.0}{0.0} = \frac{-1.0}{-0.0} = +\infty$
      • $\frac{-1.0}{0.0} = \frac{1.0}{-0.0} = -\infty$
    • exp = 111...1 and frac ≠ 000...0
      • Not-a-Number (NaN)
      • Means no numeric value can be determined
      • $\sqrt{-1}$, $\infty - \infty$, and $\infty * 0$ all produce NaN
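The following C sketch (not from the slides) produces these special values at run time; dividing by a variable holding 0.0 avoids a compile-time warning.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double zero = 0.0;
    double pos_inf = 1.0 / zero;        /* +infinity              */
    double neg_inf = -1.0 / zero;       /* -infinity              */
    double nan1 = sqrt(-1.0);           /* NaN: no numeric value  */
    double nan2 = pos_inf - pos_inf;    /* inf - inf is also NaN  */
    printf("%f %f %f %f\n", pos_inf, neg_inf, nan1, nan2);
    printf("isinf: %d  isnan: %d\n", isinf(pos_inf), isnan(nan2));
    return 0;
}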

Table: Floating Point Cases

Type           Sign  Exp                 Frac
Normalized     0/1   (000...0, 111...1)  [000...0, 111...1]
Denormalized   0/1   000...0             [000...0, 111...1]
Positive Zero  0     000...0             000...0
Negative Zero  1     000...0             000...0
Special        0/1   111...1             [000...0, 111...1]
Infinities     0/1   111...1             000...0
NaN            0/1   111...1             (000...0, 111...1]

Visualize: Floating Point

Simplifying: Floating Point

  • 8-bit floating point representation: 1 sign bit, 4 exponent bits, 3 fraction bits
  • Same general form as the IEEE format, just smaller

8-bit Floating Point Positive Range

Practice: 8-bit Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • What is the bias?

Practice: 8-bit Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • What is the bias? $2^{4-1} - 1 = 7$
  • $2.5_{10} = ?_2$
  • $01100011_2 = ?_{10}$
  • $10000111_2 = ?_{10}$

Practice: 8-bit Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • What is the bias? $2^{4-1} - 1 = 7$
  • $2.5_{10} = ?_2$
    • $01000010_2$
  • $01100011_2 = ?_{10}$
  • $10000111_2 = ?_{10}$

Practice: 8-bit Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • What is the bias? $2^{4-1} - 1 = 7$
  • $2.5_{10} = ?_2$
    • $01000010_2$
  • $01100011_2 = ?_{10}$
    • $44_{10}$
  • $10000111_2 = ?_{10}$

Practice: 8-bit Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • What is the bias? $2^{4-1} - 1 = 7$
  • $2.5_{10} = ?_2$
    • $01000010_2$
  • $01100011_2 = ?_{10}$
    • $44_{10}$
  • $10000111_2 = ?_{10}$
    • $-1 * \frac{7}{8} * \frac{1}{64} = -\frac{7}{512} = -0.013671875_{10}$

Dichotomy: 8-bit vs 32-bit

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Represent $\frac{1}{512}_{10}$ in 8-bit and 32-bit
  • Normalized or denormalized?

Example: Denormalized Value 8-bit

$(-1)^s * M * 2^E$ where $M = 0.xxx...x_2$ and $E = 1 - bias$

  • Convert $\frac{1}{512}_{10}$

  • Unsigned binary: $0.000000001_2$

  • Scientific notation: $1.0 * 2^{-9}$

  • Exponent:

    $$ \begin{aligned} E & {}= -9 \\ bias & {}= 7 \quad (k = 4,\ 2^{k-1} - 1 = 8 - 1) \\ exp & {}= -2 \quad (E + bias) = ?_2 \end{aligned} $$

Example: Denormalized Value 8-bit

$(-1)^s * M * 2^E$ where $M = 0.xxx...x_2$ and $E = 1 - bias$

  • Convert $\frac{1}{512}_{10}$

  • Exponent:

    $$ \begin{aligned} E & {}= -9 \\ bias & {}= 7 \quad (k = 4,\ 2^{k-1} - 1 = 8 - 1) \\ exp & {}= -2 \quad (E + bias) = ?_2 \end{aligned} $$

  • exp cannot be negative! This means $\frac{1}{512}_{10}$ is denormalized in the 8-bit format

Example: Denormalized Value 8-bit

$(-1)^s * M * 2^E$ where $M = 0.xxx...x_2$ and $E = 1 - bias$

  • Convert $\frac{1}{512}_{10}$

  • Unsigned binary: $0.000000001_2$

  • Scientific notation: $1.0 * 2^{-9}$

  • Exponent:

    $$ \begin{aligned} exp & {}= 0 \quad (\text{denormalized}) \\ bias & {}= 7 \quad (k = 4,\ 2^{k-1} - 1 = 8 - 1) \\ E & {}= -6 = 1 - 7 \quad (\text{fixed for denormalized values}) \end{aligned} $$

Example: Denormalized Value 8-bit

$(-1)^s * M * 2^E$ where $M = 0.xxx...x_2$ and $E = 1 - bias$

  • Convert $\frac{1}{512}_{10}$
  • Unsigned binary: $0.000000001_2$
  • Scientific notation: $1.0 * 2^{-9}$
  • Mantissa: this becomes more tricky now
    • We must write the number with a leading zero, and E is fixed at −6
    $$ \begin{aligned} M * 2^{-6} & {}= 1.0 * 2^{-9} \\ M & {}= 2^{-3} {}= 0.001_2 {}= 0.125_{10} {}= \frac{1}{8} \\ frac & {}= 001 \end{aligned} $$

Example: Denormalized Value 8-bit

$(-1)^s * M * 2^E$ where $M = 0.xxx...x_2$ and $E = 1 - bias$

  • Convert $\frac{1}{512}_{10}$

  • Unsigned binary: $0.000000001_2$

  • Scientific notation: $1.0 * 2^{-9}$

  • Exponent: 0000

  • Fraction: 001

    sign exp  frac
    0    0000 001

Example: Normalized Value 32-bit

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Convert $\frac{1}{512}_{10}$ (normalized in the 32-bit format)

  • Unsigned binary: $0.000000001_2$

  • Scientific notation: $1.0 * 2^{-9}$

  • Exponent:

    $$ \begin{aligned} E & {}= -9 \\ bias & {}= 127 \quad (k = 8,\ 2^{k-1} - 1 = 128 - 1) \\ exp & {}= 118 \quad (E + bias) = 01110110_2 \end{aligned} $$

Example: Normalized Value 32-bit

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Convert $\frac{1}{512}_{10}$

  • Unsigned binary: $0.000000001_2$

  • Scientific notation: $1.0 * 2^{-9}$

  • Mantissa:

    $$ \begin{aligned} M & {}= 1.0 \\ frac & {}= \phantom{1.}000...0 \end{aligned} $$

Example: Normalized Value 32-bit

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Convert $\frac{1}{512}_{10}$

  • Unsigned binary: $0.000000001_2$

  • Scientific notation: $1.0 * 2^{-9}$

  • Exponent: 01110110

  • Fraction: 00000000000000000000000

    sign exp      frac
    0    01110110 00000000000000000000000

Distribution of Values

  • Notice that the values are denser toward zero. Why is that?
  • Negative exponents squeeze the representable numbers closer and closer together
  • Positive exponents spread them out more and more

Distribution of Values

Nice Properties of IEEE Floating Point

  • Floating point +0 has the same bit pattern as integer 0 (all bits 0)
  • Comparisons can almost always be done with unsigned integer comparison
    • Must consider the sign bit first
    • Must treat −0 = 0
    • NaN is problematic
      • Its bit pattern is greater than that of any other value
      • What should a comparison with NaN yield?
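A minimal C sketch (my own addition) of the bit-comparison idea for the simple case: for non-negative, non-NaN floats, comparing the raw bit patterns as unsigned integers gives the same ordering as comparing the floats themselves.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a float as an unsigned integer. */
static uint32_t bits_of(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

int main(void)
{
    float a = 1.5f, b = 2.25f;
    printf("float compare: %d, bit compare: %d\n",
           a < b, bits_of(a) < bits_of(b));   /* both print 1 */
    return 0;
}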

Floating Point Operations

  • Only approximates real numbers (sec 2.4.4)
    • $x +_f y = round(x + y)$
    • $x *_f y = round(x * y)$
  • Basic Idea:
    1. Compute exact result
    2. Make it fit into desired precision
    • Possibility of overflow if exponent is too large
    • Possibility of rounding to fit into fraction part

Rounding

Value   To-Even (default)  Toward-zero  Round down (−∞)  Round up (+∞)
$1.40   $1                 $1           $1               $2
$1.60   $2                 $1           $1               $2
$1.50   $2                 $1           $1               $2
$2.50   $2                 $2           $2               $3
-$1.50  -$2                -$1          -$2              -$1

Round-To-Even

  • Default rounding mode
    • Hard to get anything else without using assembly directly
  • Applying it to other decimal places/bit positions
    • When exactly halfway between two possible values:
      • Round so that the least significant digit is even
    • For example, rounding to the nearest hundredth:
    Value      Rounded Value  Note
    7.8949999  7.89           (less than halfway)
    7.8950001  7.90           (greater than halfway)
    7.8950000  7.90           (halfway, round up to the even digit)
    7.8850000  7.88           (halfway, round down to the even digit)
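A quick C check (my own addition) of round-to-even, assuming the default rounding mode has not been changed with fesetround: rint() rounds to the nearest integer using the current mode, which defaults to round-to-nearest-even on IEEE systems.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Ties go to the even neighbor under the default rounding mode. */
    printf("%.0f %.0f %.0f %.0f\n",
           rint(1.5), rint(2.5), rint(-1.5), rint(0.5));
    /* expected: 2 2 -2 0 */
    return 0;
}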

Rounding Binary Numbers

  • Binary fractional numbers
    • "even" means the least significant bit of the rounded result is 0
    • "halfway" means the bits to the right of the rounding position are $100..._2$
  • For example, rounding to the nearest quarter ($\frac{1}{4}$, 2 fractional bits):
Value             Binary        Rounded    Action                   Rounded Value
$2 \frac{3}{32}$  $10.00011_2$  $10.00_2$  (< $\frac{1}{2}$, down)  2
$2 \frac{3}{16}$  $10.00110_2$  $10.01_2$  (> $\frac{1}{2}$, up)    $2 \frac{1}{4}$
$2 \frac{7}{8}$   $10.11100_2$  $11.00_2$  ($\frac{1}{2}$, up)      3
$2 \frac{5}{8}$   $10.10100_2$  $10.10_2$  ($\frac{1}{2}$, down)    $2 \frac{1}{2}$

Practice: Round To Even

  • Round to the nearest $\frac{1}{2}$
  • $10.010_2 = ?_2$
  • $10.011_2 = ?_2$
  • $10.110_2 = ?_2$
  • $11.001_2 = ?_2$

Practice: Round To Even

  • Round to the nearest $\frac{1}{2}$
  • $10.010_2 = 10.0_2$
  • $10.011_2 = ?_2$
  • $10.110_2 = ?_2$
  • $11.001_2 = ?_2$

Practice: Round To Even

  • Round to the nearest $\frac{1}{2}$
  • $10.010_2 = 10.0_2$
  • $10.011_2 = 10.1_2$
  • $10.110_2 = ?_2$
  • $11.001_2 = ?_2$

Practice: Round To Even

  • Round to the nearest $\frac{1}{2}$
  • $10.010_2 = 10.0_2$
  • $10.011_2 = 10.1_2$
  • $10.110_2 = 11.0_2$
  • $11.001_2 = ?_2$

Practice: Round To Even

  • Round to the nearest $\frac{1}{2}$
  • $10.010_2 = 10.0_2$
  • $10.011_2 = 10.1_2$
  • $10.110_2 = 11.0_2$
  • $11.001_2 = 11.0_2$

Floating Point Multiplication (sec 2.4.5)

  • $(-1)^{s_1} M_1 2^{E_1} * (-1)^{s_2} M_2 2^{E_2}$

  • Exact Result: $(-1)^{s} M 2^{E}$

    Part        Result
    Sign s      $s_1$ ^ $s_2$
    Mantissa M  $M_1 * M_2$
    Exponent E  $E_1 + E_2$
  • If M ≥ 2, shift M right and increment E

  • If E is out of range, overflow; otherwise round M to fit the frac precision

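A small worked example of these rules (operands chosen for illustration, not from the slides): multiplying $1.1_2 * 2^2$ by $1.1_2 * 2^3$.

    $$ \begin{aligned} M & {}= 1.1_2 * 1.1_2 = 10.01_2 \\ E & {}= 2 + 3 = 5 \\ M \ge 2 & \Rightarrow M = 1.001_2,\ E = 6 \\ result & {}= 1.001_2 * 2^6 = 1001000_2 = 72_{10} \end{aligned} $$

Decimal check: the operands are $6_{10}$ and $12_{10}$, and $6 * 12 = 72$.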

Floating Point Addition

  • $(-1)^{s_1} M_1 2^{E_1} + (-1)^{s_2} M_2 2^{E_2}$
  • Assume $E_1 > E_2$
  • Exact Result: $(-1)^{s} M 2^{E}$
  • Sign s, mantissa M:
    • Result of signed align & add
    • Exponent E: $E_1$
  • Fixing
    • If M ≥ 2, shift M right and increment E
    • If M < 1, shift M left k positions and decrement E by k
    • If E is out of range, overflow; otherwise round M to fit the frac precision
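A small worked example (operands chosen for illustration, not from the slides): adding $1.101_2 * 2^3$ and $1.1_2 * 2^1$.

    $$ \begin{aligned} align & {}: 1.1_2 * 2^1 = 0.011_2 * 2^3 \\ add & {}: 1.101_2 + 0.011_2 = 10.000_2 \\ M \ge 2 & \Rightarrow M = 1.000_2,\ E = 4 \\ result & {}= 1.0_2 * 2^4 = 16_{10} \end{aligned} $$

Decimal check: the operands are $13_{10}$ and $3_{10}$, and $13 + 3 = 16$.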

Mathematical Properties of Floating Point: Add

  • $x +_f y = round(x + y)$
  • Closed under addition? Yes
    • Could still generate infinity or NaN though
  • Commutative? Yes
  • Associative? No!
    • Due to overflow and inexactness of rounding
    • (3.14 + 1e10) − 1e10 = 0, 3.14 + (1e10 − 1e10) = 3.14
  • 0 is additive identity? Yes
  • Every element has additive inverse? Almost!
    • Everything except infinities and NaNs
  • Monotonicity ($a \ge b \Rightarrow a +_f c \ge b +_f c$)? Almost!
    • Everything except infinities and NaNs
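The non-associativity example from the slide can be checked directly in C (my own sketch, assuming float expressions are evaluated in single precision, as on typical x86-64 systems):

#include <stdio.h>

int main(void)
{
    float a = (3.14f + 1e10f) - 1e10f;   /* 3.14 is absorbed by 1e10, then cancelled: 0.0 */
    float b = 3.14f + (1e10f - 1e10f);   /* 3.14 + 0.0 = 3.14                             */
    printf("%f %f\n", a, b);             /* typically prints 0.000000 3.140000            */
    return 0;
}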

Mathematical Properties of Floating Point: Mul

  • Closed under multiplication? Yes
    • Could still generate infinity or NaN though
  • Commutative? Yes
  • Associative? No!
    • Due to overflow and inexactness of rounding
    • (1e20 * 1e20) * 1e-20 = inf, 1e20 * (1e20 * 1e-20) = 1e20
  • 1 is multiplicative identity? Yes
  • Multiplication distributes over addition? No!
    • Due to overflow and inexactness of rounding
    • 1e20 * (1e20 − 1e20) = 0.0, 1e20 * 1e20 − 1e20 * 1e20 = NaN
  • Monotonicity ($a \ge b\ \&\ c \ge 0 \Rightarrow a *_f c \ge b *_f c$)? Almost!
    • Everything except infinities and NaNs
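The distributivity example can likewise be checked in C (my own sketch; single precision is used so that 1e20 * 1e20 overflows, and float expressions are assumed to be evaluated in single precision):

#include <stdio.h>

int main(void)
{
    float a = 1e20f * (1e20f - 1e20f);       /* 1e20 * 0.0 = 0.0                   */
    float b = 1e20f * 1e20f - 1e20f * 1e20f; /* inf - inf = NaN                    */
    printf("%f %f\n", a, b);                 /* typically prints 0.000000 and nan  */
    return 0;
}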

Why You Need To Know

  • Rounding and overflows are a fact of life and need to be mitigated
  • Addition and multiplication are not associative or distributive!
  • Some things aren't exact, such as 0.1:
for (double i = 0.0; i < 1.0; i += 0.1)
  printf("%.19f ", i);
0.0000000000000000000  0.1000000000000000056  0.2000000000000000111  
0.3000000000000000444  0.4000000000000000222  0.5000000000000000000
0.5999999999999999778  0.6999999999999999556  0.7999999999999999334 
0.8999999999999999112  0.9999999999999998890
  • Every number has an additive and multiplicative inverse
  • Sometimes you need to be very careful about the order things are done

Floating Point in C (sec 2.4.6)

  • C guarantees two levels
    • float (single precision)
    • double (double precision)
  • Conversions/Casting
    • Casting between int, float, and double changes the bit representation
    • double/float → int
      • Truncates the fractional part
      • Behaves like rounding toward zero
      • Not defined when out of range or NaN, but generally sets the result to TMin
    • int → double
      • Exact conversion
    • int → float
      • Will round according to the rounding mode
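A short C sketch of these conversions (my own example values, not from the slides):

#include <stdio.h>

int main(void)
{
    double d = -2.7;
    int i = (int)d;            /* truncates toward zero: -2                          */

    int big = 1234567891;      /* needs 31 significant bits                          */
    float f = (float)big;      /* rounded: float has only 24 bits of mantissa        */
    double d2 = (double)big;   /* exact: double's 52-bit frac easily holds an int    */

    printf("%d\n%.1f\n%.1f\n", i, f, d2);
    return 0;
}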

Summary

  • IEEE Floating Point has clear mathematical properties
  • Represents numbers of the form $M * 2^E$
  • One can reason about operations independent of implementation
    • As if computed with perfect precision and then rounded
  • Not the same as real arithmetic
    • Violates associativity/distributivity
    • Makes life difficult for compilers & serious numerical applications programmers