Lecture 05 Floating Point
Joseph Haugh
University of New Mexico
Learning Objectives
- After the floating point sessions, students should be able to:
- Describe how fractional binary numbers are interpreted.
- Describe how the IEEE floating point standard represents approximations of real numbers.
- Perform conversions between IEEE floating point and real numbers in decimal.
- Define and perform rounding, addition, and multiplication of floating point numbers.
- Describe how floating point numbers are represented in C.
Fractional Binary Numbers
- What is $1011.101_2$?
- Your first question should be: which interpretation are you using?
- Fractional binary notation?
- IEEE floating point notation?
- Something else?
- Let's start with fractional binary notation.
Recall: Base Ten Decimal Notation
- What does $1023.45_{10}$ mean?
- $1 * 10^3 + 2 * 10^1 + 3 * 10^0 + 4 * 10^{-1} + 5 * 10^{-2}$
- We can easily then use this for base 2:
Visualized: Fractional Binary Numbers
![Fractional Binary Representation]()
$b = \sum_{k = -j}^{i} b_k * 2^k$
Fractional Binary Numbers
- What number does $1011.101_2$ represent in base 10?
- $11\frac{5}{8}$
- Left Part: $1011_2 = 1 * 2^3 + 1 * 2^1 + 1 * 2^0 = 11$
- Right Part: $.101_2 = 1 * 2^{-1} + 1 * 2^{-3} = \frac{5}{8}$
Practice: Fractional Binary Numbers
- $0.111_2 = ?_{10}$
- $101.1_2 = ?_{10}$
- $0.00110011_2 = ?_{10}$
Practice: Fractional Binary Numbers
- $0.111_2 = \frac{7}{8}$
- $101.1_2 = 5\frac{1}{2}$
- $0.00110011_2 = \frac{51}{256}$
Addition
- Does addition still work?
- $0.111_2 (\frac{7}{8}_{10}) + 101.1_2 (5\frac{1}{2}_{10}) = ?_2 = ?_{10}$
Addition
- Does addition still work?
- $0.111_2 (\frac{7}{8}_{10}) + 101.1_2 (5\frac{1}{2}_{10}) = 110.011_2 = 6\frac{3}{8}_{10}$
Shifting
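Shifting the binary point left (the bits move right) divides by 2:

| Binary | Value | Operation |
| --- | --- | --- |
| $101.011_2$ | $5\frac{3}{8}$ | |
| $10.1011_2$ | $2\frac{11}{16}$ | /2 |
| $1.01011_2$ | $1\frac{11}{32}$ | /2 |

Shifting the binary point right (the bits move left) multiplies by 2:

| Binary | Value | Operation |
| --- | --- | --- |
| $101.011_2$ | $5\frac{3}{8}$ | |
| $1010.11_2$ | $10\frac{3}{4}$ | *2 |
| $10101.1_2$ | $21\frac{1}{2}$ | *2 |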
From Base 10 to Base 2
- Write the number in decimal point form (convert from a fraction if needed)
- Split the number at the decimal point
- Whole number part: convert normally
- Fractional part: follow these steps
  - Multiply the fractional part by 2
  - The whole number part of the product (always 0 or 1) is the next digit of the result
  - The fractional part of the product is carried into the next repetition
  - Repeat until either:
    - The fractional part is 0
    - A cycle is detected (the fractional part has been encountered before)
- Result: the converted whole number part, then the binary point, then the digits produced by the fractional steps, read from top to bottom
Demo: From Base 10 to Base 2
$12.1_{10} = ?_2$

| Multiply | Whole part (next digit) | Fractional part (carried forward) |
| --- | --- | --- |
| $0.1 * 2 =$ | 0 | .2 |
| $0.2 * 2 =$ | 0 | .4 |
| $0.4 * 2 =$ | 0 | .8 |
| $0.8 * 2 =$ | 1 | .6 |
| $0.6 * 2 =$ | 1 | .2 |
| $0.2 * 2 =$ | 0 | .4 |
$12.1_{10} = 1100.0\overline{0011}_2$
Practice: From Base 10 to Base 2
$123.4_{10} = ?_2$

| Multiply | Whole part (next digit) | Fractional part (carried forward) |
| --- | --- | --- |
| $0.4 * 2 =$ | 0 | .8 |
| $0.8 * 2 =$ | 1 | .6 |
| $0.6 * 2 =$ | 1 | .2 |
| $0.2 * 2 =$ | 0 | .4 |
| $0.4 * 2 =$ | 0 | .8 |
$123.4_{10} = 1111011.\overline{0110}_2$
Limitations: Representable Numbers
- Can only represent numbers of the form: $\frac{x}{2^k}$
- $\frac{1}{2}, \frac{1}{4}, \frac{5}{16},$ etc.
- Other rational numbers must be represented with repeating representations:
- $\frac{1}{3} = 0.0101010101[01]$
- $\frac{1}{5} = 0.001100110011[0011]$
- $\frac{1}{10} = 0.0001100110011[0011]$
- Limited range of numbers
- Very small and very large numbers require a lot of space
- Numbers with lots of zeroes are stored inefficiently
- There must be a better way!
Insight: Scientific Notation
- What if, instead of storing the number literally, we stored it in scientific notation?
- Recall: $12345_{10} = 1.2345 * 10^4$
- Base 2 Version: $n_2 = x_2 * 2^y$
- Instead of storing n directly we instead store x and y!
IEEE Floating Point
- IEEE Standard 754
- Established in 1985 as uniform standard for floating point arithmetic
- Before that, many idiosyncratic formats, each computer manufacturer had their own
- Supported by all major CPUs
- Driven by numerical concerns
- Nice standards for rounding, overflow, underflow
- Hard to make fast in hardware
- Numerical analysts predominated over hardware designers in defining the standard
Floating Point Representation
![Floating Point Bit Pattern]()
- High level interpretation: $(-1)^s * M * 2^E$
- s: sign bit
- exp: encodes the exponent E, which weights the value by a power of 2; E is stored in biased notation
- frac: encodes the mantissa M, a fractional value normally in the range [1, 2)
- Remember that exp and frac are the stored bit fields; they encode E and M respectively
Aside: Mantissa
- An old mathematical term
- Is simply the decimal/fractional part of a number written in scientific notation
- For example:
- $6230_{10} = 6.23 * 10^3$
- 6.23 is the mantissa
Floating Point Precision
- Single Precision: 32 bits
![Floating Point 32 Bits]()
- Double Precision: 64 bits
![Floating Point 64 Bits]()
Caveats
$(-1)^s * M * 2^E$
- Of course nothing is as simple as it seems and floating point is no exception
- We have 3 cases for interpreting a floating point number:
- Case 1: Normalized values
- Case 2: Denormalized values, used for numbers close to 0
- Case 3: Special values, used for infinity and NaN
Aside: Bias Values
- We are all familiar with using two's complement for representing negative numbers
- But could you think of another way we could represent them?
- One other way is to use biased notation also called offset binary
- General form: n − K where n is an unsigned number and K is a constant (bias)
Aside: Bias Values
- For example, a 4 bit biased binary number with K = 8:

| Value | Biased (K = 8) | Two's complement |
| --- | --- | --- |
| 7 | 1111 | 0111 |
| 6 | 1110 | 0110 |
| 5 | 1101 | 0101 |
| 4 | 1100 | 0100 |
| 3 | 1011 | 0011 |
| 2 | 1010 | 0010 |
| 1 | 1001 | 0001 |
| 0 | 1000 | 0000 |
| -1 | 0111 | 1111 |
| -2 | 0110 | 1110 |
| -3 | 0101 | 1101 |
| -4 | 0100 | 1100 |
| -5 | 0011 | 1011 |
| -6 | 0010 | 1010 |
| -7 | 0001 | 1001 |
| -8 | 0000 | 1000 |
Practice: Bias Notation
- Convert the following into 6 bit biased binary with K = 32
- $0_{10} = ?_2$
- $31_{10} = ?_2$
- $-15_{10} = ?_2$
Practice: Bias Notation
- Convert the following into 6 bit biased binary with K = 32
- $0_{10} = 100000_2$
- $31_{10} = 111111_2$
- $-15_{10} = 010001_2$
Normalized Values
$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$
- $exp \neq 000...0$ and $exp \neq 111...1$
- exp is interpreted as a biased value
- $E = exp - bias$
- $bias = 2^{k-1} - 1$, where k is the number of exponent bits
- Single precision: 127 (exp: 1...254, E: −126...127)
- Double precision: 1023 (exp: 1...2046, E: −1022...1023)
- M, the mantissa, is encoded with an implied leading 1:
- $M = 1.x_1x_2x_3...x_j$
- $x_1x_2x_3...x_j$ is what is stored in frac
- Minimum: frac = 000...0, M = 1
- Maximum: frac = 111...1, $M = 2 - \epsilon$
Example: Floating Point
$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$
Convert $13.0_{10}$ into 32 bit floating point
Unsigned binary: $1101_2$
Scientific notation: $1.101_2 * 2^3$
Exponent:
$$
\begin{aligned}
E & {}= 3 \\
bias & {}= 127 \quad (k = 8,\ 2^{k-1} - 1 = 128 - 1) \\
exp & {}= 130 \quad (E + bias) = 10000010_2
\end{aligned}
$$
Mantissa:
$$
\begin{aligned}
M & {}= 1.101_2 \\
frac & {}= \phantom{0.}10100000000000000000000_2
\end{aligned}
$$
Result:

| s | exp | frac |
| --- | --- | --- |
| 0 | 10000010 | 10100000000000000000000 |
Example: Floating Point
$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$
Convert $15213.0_{10}$ into 32 bit floating point
Unsigned binary: $11101101101101_2$
Scientific notation: $1.1101101101101_2 * 2^{13}$
Exponent:
$$
\begin{aligned}
E & {}= 13 \\
bias & {}= 127 \quad (k = 8,\ 2^{k-1} - 1 = 128 - 1) \\
exp & {}= 140 \quad (E + bias) = 10001100_2
\end{aligned}
$$
Mantissa:
$$
\begin{aligned}
M & {}= 1.1101101101101_2 \\
frac & {}= \phantom{0.}11011011011010000000000_2
\end{aligned}
$$
Result:

| s | exp | frac |
| --- | --- | --- |
| 0 | 10001100 | 11011011011010000000000 |
Practice: Floating Point
$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$
- $12.0_{10} = ?_2$
- $100.0_{10} = ?_2$
- $10111110111010000000000000000000_2 = ?_{10}$
Practice: Floating Point
$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$
- $12.0_{10} = 01000001010000000000000000000000_2$
- $100.0_{10} = 01000010110010000000000000000000_2$
- $10111110111010000000000000000000_2 = -\frac{29}{64}_{10} = -0.453125_{10}$
Denormalized Values
$(-1)^s * M * 2^E$ where $M = 0.xxx...x_2$ and $E = 1 - bias$
- When exp = 000...0
- E = 1 − bias instead of 0 − bias
- Mantissa encoded with leading 0 (instead of 1): M = 0.xxx...x2
- Cases:
- exp = 000...0 and frac = 000...0
- Represents zero
- +0 and -0 exist
- exp = 000...0 and frac ≠ 000...0
- Numbers closest to zero
- Equispaced: consecutive denormalized values are all the same distance apart
Special Values
- When exp = 111...1
- Cases:
- exp = 111...1 and frac = 000...0
- ±∞
- Operation that overflows
- $\frac{1.0}{0.0} = \frac{-1.0}{-0.0} = +\infty$
- $\frac{-1.0}{0.0} = \frac{1.0}{-0.0} = -\infty$
- exp = 111...1 and frac ≠ 000...0
- Not-a-Number (NaN)
- Means no numeric value can be determined
- $\sqrt{-1} = \infty - \infty = \infty * 0 = NaN$
Table: Floating Point Cases
| Category | s | exp | frac |
| --- | --- | --- | --- |
| Normalized | 0/1 | (000...0, 111...1) | [000...0, 111...1] |
| Denormalized | 0/1 | 000...0 | [000...0, 111...1] |
| Positive Zero | 0 | 000...0 | 000...0 |
| Negative Zero | 1 | 000...0 | 000...0 |
| Special | 0/1 | 111...1 | [000...0, 111...1] |
| Infinities | 0/1 | 111...1 | 000...0 |
| NaN | 0/1 | 111...1 | (000...0, 111...1] |
Visualize: Floating Point
![Visualizing Floating Point]()
Simplifying: Floating Point
![8-bit Floating Point]()
- 8-bit floating point representation
- Same general form as IEEE format just smaller
8-bit Floating Point Positive Range
![8-bit Floating Point Positive Range]()
Practice: 8-bit Floating Point
![8-bit Floating Point]()
$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$
- What is the bias? $2^{4-1} - 1 = 7$
- $2.5_{10} = ?_2$
- $01100011_2 = ?_{10}$
- $10000111_2 = ?_{10}$
Practice: 8-bit Floating Point
![8-bit Floating Point]()
$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$
- What is the bias? $2^{4-1} - 1 = 7$
- $2.5_{10} = ?_2$
- $01100011_2 = ?_{10}$
- $10000111_2 = ?_{10}$
- exp = 0000, so this one is denormalized: $-1 * \frac{7}{8} * \frac{1}{64} = -\frac{7}{512} = -0.013671875_{10}$
Dichotomy: 8-bit vs 32-bit
$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$
- Represent $\frac{1}{512}_{10}$ in 8-bit and 32-bit
- Normalized or denormalized?
Example: Denormalized Value 8-bit
$(-1)^s * M * 2^E$ where $M = 0.xxx...x_2$ and $E = 1 - bias$
Convert $\frac{1}{512}_{10}$
Unsigned binary: $0.000000001_2$
Scientific notation: $1.0 * 2^{-9}$
Exponent, first attempt (treating the value as normalized):
$$
\begin{aligned}
E & {}= -9 \\
bias & {}= 7 \quad (k = 4,\ 2^{k-1} - 1 = 8 - 1) \\
exp & {}= -2 \quad (E + bias) = ?_2
\end{aligned}
$$
exp cannot be negative! This means $\frac{1}{512}_{10}$ is denormalized
Exponent (denormalized):
$$
\begin{aligned}
exp & {}= 0 \quad (denorm) \\
bias & {}= 7 \quad (k = 4,\ 2^{k-1} - 1 = 8 - 1) \\
E & {}= 1 - 7 = -6 \quad (\text{fixed for denorm})
\end{aligned}
$$
Mantissa: becomes more tricky now
Must write the number with a leading zero and E is fixed at −6
$$
\begin{aligned}
M * 2^{-6} & {}= 1.0 * 2^{-9} \\
M & {}= 2^{-3} {}= 0.001_2 {}= 0.125_{10} {}= \frac{1}{8} \\
frac & {}= 001
\end{aligned}
$$
Exponent: 0000
Fraction: 001

| s | exp | frac |
| --- | --- | --- |
| 0 | 0000 | 001 |
Example: Normalized Value 32-bit
$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$
Convert $\frac{1}{512}_{10}$
Unsigned binary: $0.000000001_2$
Scientific notation: $1.0 * 2^{-9}$
Exponent:
$$
\begin{aligned}
E & {}= -9 \\
bias & {}= 127 \quad (k = 8,\ 2^{k-1} - 1 = 128 - 1) \\
exp & {}= 118 \quad (E + bias) = 01110110_2
\end{aligned}
$$
exp = 118 lies in the normalized range 1...254, so in 32 bits the same value is normalized
Mantissa:
$$
\begin{aligned}
M & {}= 1.0 \\
frac & {}= \phantom{1.}000...0
\end{aligned}
$$
Exponent: 01110110
Fraction: 00000000000000000000000

| s | exp | frac |
| --- | --- | --- |
| 0 | 01110110 | 00000000000000000000000 |
Distribution of Values
![6-bit Floating Point]()
![6-bit Floating Point Graphed]()
- Notice that the distribution is denser towards zero, why is that?
- Negative exponents make the number smaller and smaller
- Whereas positive ones spread the numbers out more and more
Distribution of Values
![6-bit Floating Point]()
![6-bit Floating Point Around 0]()
Nice Properties of IEEE Floating Point
- Floating point +0 has the same bit pattern as integer zero (all bits = 0)
- Comparisons can almost always be done by treating the bit patterns as unsigned integers
- The sign bit must be considered first
- −0 and +0 must compare as equal
- NaN is problematic
- Its bit pattern is greater than that of any other value
- What should a comparison yield?
Floating Point Operations
- Only approximates real numbers (sec 2.4.4)
- $x +_f y = round(x + y)$
- $x *_f y = round(x * y)$
- Basic Idea:
- Compute exact result
- Make it fit into desired precision
- Possibility of overflow if exponent is too large
- Possibility of rounding to fit into fraction part
Rounding
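| Value | Round-to-even (nearest) | Round toward zero | Round down (toward −∞) | Round up (toward +∞) |
| --- | --- | --- | --- | --- |
| \$1.40 | \$1 | \$1 | \$1 | \$2 |
| \$1.60 | \$2 | \$1 | \$1 | \$2 |
| \$1.50 | \$2 | \$1 | \$1 | \$2 |
| \$2.50 | \$2 | \$2 | \$2 | \$3 |
| -\$1.50 | -\$2 | -\$1 | -\$2 | -\$1 |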
Round-To-Even
- Default rounding mode
- Hard to get anything else without using assembly directly (C99 added fesetround in <fenv.h>, where the compiler and library support it)
- Applying to other decimal places/bit positions
- When exactly halfway between two possible values:
- Round so the least significant digit is even
- For example, rounding to the nearest hundredth:

| Value | Rounded | Reason |
| --- | --- | --- |
| 7.8949999 | 7.89 | Less than halfway |
| 7.8950001 | 7.90 | Greater than halfway |
| 7.8950000 | 7.90 | Halfway, round up |
| 7.8850000 | 7.88 | Halfway, round down |
Rounding Binary Numbers
- Binary fractional numbers
- “Even” is when the least significant bit is 0
- “Halfway” is when the bits to the right of the rounding position are $100..._2$
- For example, rounding to the nearest quarter ($\frac{1}{4}$, 2 bits right of the binary point):

| Value | Binary | Rounded | Action | Rounded value |
| --- | --- | --- | --- | --- |
| $2\frac{3}{32}$ | $10.00011_2$ | $10.00_2$ | less than $\frac{1}{2}$, round down | $2$ |
| $2\frac{3}{16}$ | $10.00110_2$ | $10.01_2$ | greater than $\frac{1}{2}$, round up | $2\frac{1}{4}$ |
| $2\frac{7}{8}$ | $10.11100_2$ | $11.00_2$ | exactly $\frac{1}{2}$, round up | $3$ |
| $2\frac{5}{8}$ | $10.10100_2$ | $10.10_2$ | exactly $\frac{1}{2}$, round down | $2\frac{1}{2}$ |
Practice: Round To Even
- Round to the nearest $\frac{1}{2}$ (1 bit right of the binary point)
- $10.010_2 = ?_2$
- $10.011_2 = ?_2$
- $10.110_2 = ?_2$
- $11.001_2 = ?_2$
Practice: Round To Even
- Round to the nearest $\frac{1}{2}$ (1 bit right of the binary point)
- $10.010_2 = 10.0_2$
- $10.011_2 = 10.1_2$
- $10.110_2 = 11.0_2$
- $11.001_2 = 11.0_2$
Floating Point Multiplication (sec 2.4.5)
$(-1)^{s_1} M_1 2^{E_1} * (-1)^{s_2} M_2 2^{E_2}$
Exact Result: $(-1)^s M 2^E$

| Component | Value |
| --- | --- |
| Sign s | $s_1$ ^ $s_2$ |
| Mantissa M | $M_1 * M_2$ |
| Exponent E | $E_1 + E_2$ |

If M ≥ 2, shift M right, increment E
If E is out of range, overflow
Round M to fit frac precision
Helpful Site
Floating Point Addition
- $(-1)^{s_1} M_1 2^{E_1} + (-1)^{s_2} M_2 2^{E_2}$
- Assume $E_1 > E_2$
- Exact Result: $(-1)^s M 2^E$
- Sign s, mantissa M:
- Result of signed align & add
- Exponent E: $E_1$
- Fixing
- If M ≥ 2, shift M right, increment E
- If M < 1, shift M left k positions, decrement E by k
- If E is out of range, overflow
- Round M to fit frac precision
- Helpful Site
Mathematical Properties of Floating Point: Add
- $x +_f y = round(x + y)$
- Closed under addition? Yes
- Could still generate infinity or NaN though
- Commutative? Yes
- Associative? No!
- Due to overflow and inexactness of rounding
- In single precision: (3.14 + 1e10) − 1e10 = 0, but 3.14 + (1e10 − 1e10) = 3.14
- 0 is additive identity? Yes
- Every element has additive inverse? Almost!
- Everything except infinities and NaNs
- Monotonicity ($a \geq b \implies a +_f c \geq b +_f c$)? Almost!
- Everything except infinities and NaNs
Mathematical Properties of Floating Point: Mul
- Closed under multiplication? Yes
- Could still generate infinity or NaN though
- Commutative? Yes
- Associative? No!
- Due to overflow and inexactness of rounding
- In single precision: (1e20 * 1e20) * 1e-20 = inf, but 1e20 * (1e20 * 1e-20) = 1e20
- 1 is multiplicative identity? Yes
- Multiplication distributes over addition? No!
- Due to overflow and inexactness of rounding
- In single precision: 1e20 * (1e20 − 1e20) = 0.0, but 1e20 * 1e20 − 1e20 * 1e20 = NaN
- Monotonicity ($a \geq b$ and $c \geq 0 \implies a * c \geq b * c$)? Almost!
- Everything except infinities and NaNs
Why You Need To Know
- Rounding and overflows are a fact of life and need to be mitigated
- Addition and multiplication are not associative or distributive!
- Some things aren’t exact such as 0.1:
for (double i = 0.0; i < 1.0; i += 0.1)
    printf("%.19f ", i);
0.0000000000000000000 0.1000000000000000056 0.2000000000000000111
0.3000000000000000444 0.4000000000000000222 0.5000000000000000000
0.5999999999999999778 0.6999999999999999556 0.7999999999999999334
0.8999999999999999112 0.9999999999999998890
- Not every value has an additive and multiplicative inverse (infinities and NaNs do not)
- Sometimes you need to be very careful about the order things are done
Floating Point in C (sec 2.4.6)
- C guarantees two levels
- float (single precision)
- double (double precision)
- Conversions/Casting
- Casting between int, float, and double changes bit pattern
- double/float → int
- Truncates fractional part
- Behaves like rounding toward zero
- Not defined when out of range or NaN, but generally sets to TMin
- int → double
- Exact conversion, since a 32-bit int fits within double's 53 bits of precision
- int → float
- Will round according to rounding mode
Summary
- IEEE Floating Point has clear mathematical properties
- Represents numbers of the form $M * 2^E$
- One can reason about operations independent of implementation
- As if computed with perfect precision and then rounded
- Not the same as real arithmetic
- Violates associativity/distributivity
- Makes life difficult for compilers & serious numerical applications programmers