Lecture 05 Floating Point

Joseph Haugh

University of New Mexico

Learning Objectives

  • After the Floating point sessions students should be able to:
    • Describe how fractional binary numbers are interpreted.
    • Describe the IEEE floating point standard representation to approximate real numbers.
    • Perform conversions between IEEE fp and real numbers in decimal.
    • Define and perform the operations of rounding, addition, and multiplication of floating point.
    • Describe how floating point numbers are represented in C.

Fractional Binary Numbers

  • What is $1011.101_2$?
  • Your first question should be what interpretation are you using?
  • Fractional binary notation?
  • IEEE floating point notation?
  • Something else?
  • Let's start with fractional binary notation.

Recall: Base Ten Decimal Notation

  • What does $1023.45_{10}$ mean?
  • $1 * 10^3 + 2 * 10^1 + 3 * 10^0 + 4 * 10^{-1} + 5 * 10^{-2}$
  • We can easily then use this for base 2:

Visualized: Fractional Binary Numbers

$b = \sum_{k = -j}^{i} b_k * 2^k$

Fractional Binary Numbers

  • What number does $1011.101_2$ represent in base 10?
  • $11\frac{5}{8}$
  • Left Part: $1011_2 = 1 * 2^3 + 1 * 2^1 + 1 * 2^0 = 11$
  • Right Part: $.101_2 = 1 * 2^{-1} + 1 * 2^{-3} = \frac{5}{8}$
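The sum $b = \sum b_k * 2^k$ can be evaluated mechanically. Below is a minimal sketch in C; `frac_binary_value` is a hypothetical helper name, and it assumes the input contains only '0', '1', and at most one '.'.

```c
#include <stddef.h>

/* Evaluate a fractional binary string such as "1011.101" as a double.
 * Hypothetical helper for illustration; assumes well-formed input. */
double frac_binary_value(const char *s)
{
    double value = 0.0;
    double weight = 0.5;   /* weight of the first bit after the point: 2^-1 */
    int seen_point = 0;

    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == '.') {
            seen_point = 1;
        } else if (!seen_point) {
            value = value * 2.0 + (s[i] - '0');   /* left part: shift in one bit */
        } else {
            value += (s[i] - '0') * weight;       /* right part: b_k * 2^k, k < 0 */
            weight /= 2.0;
        }
    }
    return value;
}
```

Calling `frac_binary_value("1011.101")` yields 11.625, which is $11\frac{5}{8}$.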

Practice: Fractional Binary Numbers

  • $0.111_2 = ?_{10}$
  • $101.1_2 = ?_{10}$
  • $0.00110011_2 = ?_{10}$

Practice: Fractional Binary Numbers

  • $0.111_2 = \frac{7}{8}$
  • $101.1_2 = 5\frac{1}{2}$
  • $0.00110011_2 = \frac{51}{256}$


  • Does addition still work?
  • $0.111_2 (\frac{7}{8}_{10}) + 101.1_2 (5\frac{1}{2}_{10}) = ?_2 = ?_{10}$


  • Does addition still work?
  • $0.111_2 (\frac{7}{8}_{10}) + 101.1_2 (5\frac{1}{2}_{10}) = 110.011_2 = 6\frac{3}{8}_{10}$


Shifting the Binary Point Left (divide by 2):

Base 2     Base 10            Effect
101.011    $5\frac{3}{8}$
10.1011    $2\frac{11}{16}$   /2
1.01011    $1\frac{11}{32}$   /2

Shifting the Binary Point Right (multiply by 2):

Base 2     Base 10            Effect
101.011    $5\frac{3}{8}$
1010.11    $10\frac{3}{4}$    *2
10101.1    $21\frac{1}{2}$    *2

From Base 10 to Base 2

  1. Convert number from fraction to a decimal point number
  2. Split the number at the decimal point
    • Whole number part: convert normally
    • Fractional part: follow these steps
  3. Multiply fractional part by 2
  4. Whole number part (always 0 or 1) is the next digit of the result
  5. Fractional part is carried into the next repetition
  6. Repeat until either:
    • Fractional part is 0
    • Cycle is detected (result of multiplication has been encountered before)
  7. Result is the converted whole number part, then the binary point, then the digits from step 4 read from top to bottom
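The multiply-by-2 loop for the fractional part can be sketched in C as follows. To stay exact, this sketch keeps the fraction as a pair of integers `num/den` rather than a double; `frac_to_binary` is a hypothetical name, and cycle detection is left out, so the caller caps the number of digits instead.

```c
#include <stddef.h>

/* Write up to max_digits binary digits of the fraction num/den
 * (assumes 0 <= num < den) into out, which must hold max_digits + 1 chars.
 * Stops early when the fractional part reaches 0. Returns the digit count. */
size_t frac_to_binary(unsigned num, unsigned den, char *out, size_t max_digits)
{
    size_t n = 0;
    while (n < max_digits && num != 0) {
        num *= 2;                              /* multiply fractional part by 2 */
        out[n++] = (char)('0' + num / den);    /* whole part (0 or 1) is the next digit */
        num %= den;                            /* keep only the fractional part */
    }
    out[n] = '\0';
    return n;
}
```

For 1/10 with a cap of 13 digits this produces 0001100110011, the start of the repeating pattern.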

Demo: From Base 10 to Base 2

$12.1_{10} = ?_2$

Frac. Part   Base   Res. Digit   Frac. Part
0.1          * 2    = 0          .2
0.2          * 2    = 0          .4
0.4          * 2    = 0          .8
0.8          * 2    = 1          .6
0.6          * 2    = 1          .2
0.2          * 2    = 0          .4

$12.1_{10} = 1100.0\overline{0011}_2$

Practice: From Base 10 to Base 2

$123.4_{10} = ?_2$

Frac. Part   Base   Res. Digit   Frac. Part
0.4          * 2    = 0          .8
0.8          * 2    = 1          .6
0.6          * 2    = 1          .2
0.2          * 2    = 0          .4
0.4          * 2    = 0          .8

$123.4_{10} = 1111011.\overline{0110}_2$

Limitations: Representable Numbers

  • Can only exactly represent numbers of the form: $\frac{x}{2^k}$
    • $\frac{1}{2}, \frac{1}{4}, \frac{5}{16},$ etc.
  • Other rational numbers have repeating bit representations:
    • $\frac{1}{3} = 0.0101010101[01]_2$
    • $\frac{1}{5} = 0.001100110011[0011]_2$
    • $\frac{1}{10} = 0.0001100110011[0011]_2$
  • Limited range of numbers
    • Very small and very large numbers require a lot of space
  • Inefficiently stores numbers with lots of zeroes
  • There must be a better way!

Insight: Scientific Notation

  • What if, instead of storing the number literally, we stored it in scientific notation?
  • Recall: $12345_{10} = 1.2345 * 10^4$
  • Base 2 Version: $n_2 = x_2 * 2^y$
  • Instead of storing n directly, we store x and y!

IEEE Floating Point

  • IEEE Standard 754
    • Established in 1985 as uniform standard for floating point arithmetic
    • Before that, many idiosyncratic formats, each computer manufacturer had their own
    • Supported by all major CPUs
  • Driven by numerical concerns
    • Nice standards for rounding, overflow, underflow
    • Hard to make fast in hardware
    • Numerical analysts predominated over hardware designers in defining the standard

Floating Point Representation

  • High level interpretation: $(-1)^s * M * 2^E$
    • s: sign bit
    • exp: encodes the exponent E, which weights the value by a power of 2 and is written in bias notation
    • frac: encodes the mantissa M, normally a fractional value in [1, 2)
  • Remember that what is stored in exp and frac encodes E and M respectively

Aside: Mantissa

  • An old mathematical term
  • Here it means the significand of a number written in scientific notation (the part multiplied by the power of the base)
  • For example:
    • $6230_{10} = 6.23 * 10^3$
    • 6.23 is the mantissa

Floating Point Precision

  • Single Precision: 32 bits (1 sign bit, 8 exp bits, 23 frac bits)

  • Double Precision: 64 bits (1 sign bit, 11 exp bits, 52 frac bits)


$(-1)^s * M * 2^E$

  • Of course nothing is as simple as it seems and floating point is no exception
  • We have 3 cases for interpreting a floating point number:
    • Case 1: Normalized values
    • Case 2: Denormalized values, used for numbers close to 0
    • Case 3: Special values, used for infinity and NaN

Aside: Bias Values

  • We are all familiar with using two's complement for representing negative numbers
  • But could you think of another way we could represent them?
  • One other way is to use biased notation also called offset binary
  • General form: n − K where n is an unsigned number and K is a constant (bias)

Aside: Bias Values

  • For example a 4 bit biased binary number with K = 8:

Dec.   Offset (K = 8)   Two's Comp.
 7     1111             0111
 6     1110             0110
 5     1101             0101
 4     1100             0100
 3     1011             0011
 2     1010             0010
 1     1001             0001
 0     1000             0000
-1     0111             1111
-2     0110             1110
-3     0101             1101
-4     0100             1100
-5     0011             1011
-6     0010             1010
-7     0001             1001
-8     0000             1000

Practice: Bias Notation

  • Convert the following into 6 bit binary with K = 32
  • $0_{10} = ?_2$
  • $31_{10} = ?_2$
  • $-15_{10} = ?_2$

Practice: Bias Notation

  • Convert the following into 6 bit binary with K = 32
  • $0_{10} = 100000_2$
  • $31_{10} = 111111_2$
  • $-15_{10} = 010001_2$
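Biased notation is a single addition or subtraction, which the following sketch shows in C. The function names are hypothetical, and the caller is assumed to pass a value that fits the chosen bit width.

```c
/* Biased (offset binary) encoding: store value + K as an unsigned number.
 * Hypothetical helpers for illustration; no range checking is done. */
unsigned bias_encode(int value, unsigned K)
{
    return (unsigned)(value + (int)K);
}

/* Decoding applies the general form n - K to the stored unsigned number n. */
int bias_decode(unsigned stored, unsigned K)
{
    return (int)stored - (int)K;
}
```

With K = 32, `bias_encode(-15, 32)` gives 17, which is 010001 in 6 bits.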

Normalized Values

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • exp ≠ 000...0 and exp ≠ 111...1
  • exp is interpreted as a biased value
    • E = exp − bias
    • $bias = 2^{k-1} - 1$, where k is the number of exponent bits
      • Single precision: 127 (exp: 1...254, E: −126...127)
      • Double precision: 1023 (exp: 1...2046, E: −1022...1023)
  • M, mantissa, encoded with implied leading 1:
    • $M = 1.x_1x_2x_3...x_j$
    • $x_1x_2x_3...x_j$ is what is stored in frac
    • Minimum: frac = 000...0, M = 1
    • Maximum: frac = 111...1, M = 2 − ϵ

Example: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Convert $13.0_{10}$ into 32 bit floating point

  • Unsigned binary: $1101_2$

  • Scientific notation: $1.101_2 * 2^3$

  • Exponent:

    $$ \begin{aligned} E & {}= 3 \\ bias & {}= 127 \ (k = 8,\ 2^{k-1} - 1 = 128 - 1) \\ exp & {}= 130 \ (E + bias) = 10000010_2 \end{aligned} $$

  • Mantissa:

    $$ \begin{aligned} M & {}= 1.101_2 \\ frac & {}= \phantom{0.}10100000000000000000000_2 \end{aligned} $$

  • Result:

    sign exp      frac
    0    10000010 10100000000000000000000

Example: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Convert $15213.0_{10}$ into 32 bit floating point

  • Unsigned binary: $11101101101101_2$

  • Scientific notation: $1.1101101101101_2 * 2^{13}$

  • Exponent:

    $$ \begin{aligned} E & {}= 13 \\ bias & {}= 127 \ (k = 8,\ 2^{k-1} - 1 = 128 - 1) \\ exp & {}= 140 \ (E + bias) = 10001100_2 \end{aligned} $$

  • Mantissa:

    $$ \begin{aligned} M & {}= 1.1101101101101_2 \\ frac & {}= \phantom{0.}11011011011010000000000_2 \end{aligned} $$

  • Result:

    sign exp      frac
    0    10001100 11011011011010000000000
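Hand conversions like the ones above can be checked by pulling the three fields out of a real `float`. The sketch below assumes `float` is the IEEE single precision format (true on mainstream platforms, but not required by the C standard); `struct float_fields` and `split_float` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical struct for illustration: the three fields of a 32-bit float. */
struct float_fields {
    uint32_t sign;   /* 1 bit */
    uint32_t exp;    /* 8 bits, biased by 127 */
    uint32_t frac;   /* 23 bits, without the implied leading 1 */
};

struct float_fields split_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* copy the bit pattern, no numeric conversion */

    struct float_fields r;
    r.sign = bits >> 31;
    r.exp  = (bits >> 23) & 0xFF;
    r.frac = bits & 0x7FFFFF;
    return r;
}
```

For `13.0f` this yields exp = 130 and frac = 0x500000, which is 101 followed by twenty zeros.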

Practice: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • $12.0_{10} = ?_2$
  • $100.0_{10} = ?_2$
  • $10111110111010000000000000000000_2 = ?_{10}$

Practice: Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • $12.0_{10} = ?_2$
    • $01000001010000000000000000000000_2$
  • $100.0_{10} = ?_2$
    • $01000010110010000000000000000000_2$
  • $10111110111010000000000000000000_2 = ?_{10}$
    • $-\frac{29}{64}_{10} = -0.453125_{10}$

Denormalized Values

$(-1)^s * M * 2^E$ where $M = 0.xxx...x_2$ and $E = 1 - bias$

  • When exp = 000...0
  • E = 1 − bias instead of 0 − bias
  • Mantissa encoded with leading 0 (instead of 1): $M = 0.xxx...x_2$
  • Cases:
    • exp = 000...0 and frac = 000...0
      • Represents zero
      • +0 and -0 exist
    • exp = 000...0 and frac ≠ 000...0
      • Numbers closest to zero
      • Equally spaced

Special Values

  • When exp = 111...1
  • Cases:
    • exp = 111...1 and frac = 000...0
      • ±∞
      • Operation that overflows
      • $\frac{1.0}{0.0} = \frac{-1.0}{-0.0} = +\infty$
      • $\frac{-1.0}{0.0} = \frac{1.0}{-0.0} = -\infty$
    • exp = 111...1 and frac ≠ 000...0
      • Not-a-Number (NaN)
      • Means no numeric value can be determined
      • $\sqrt{-1} = \infty - \infty = \infty * 0 = NaN$
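These rules can be observed from C. The sketch below assumes the platform follows IEEE 754 for `double` and does not trap on division by zero; `special_values_hold` is a hypothetical name, and the `isinf`/`isnan` macros come from `<math.h>`.

```c
#include <math.h>

/* Returns 1 if the special-value rules hold on this platform, else 0.
 * volatile keeps the compiler from folding the divisions at compile time. */
int special_values_hold(void)
{
    volatile double zero = 0.0, one = 1.0;
    double pos_inf = one / zero;               /* 1.0 / 0.0  = +infinity */
    double neg_inf = -one / zero;              /* -1.0 / 0.0 = -infinity */
    double not_a_number = pos_inf - pos_inf;   /* infinity - infinity = NaN */

    return isinf(pos_inf) && pos_inf > 0
        && isinf(neg_inf) && neg_inf < 0
        && isnan(not_a_number)
        && not_a_number != not_a_number;       /* NaN compares unequal to itself */
}
```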

Table: Floating Point Cases

Type            Sign   Exp                  Frac
Normalized      0/1    (000...0, 111...1)   [000...0, 111...1]
Denormalized    0/1    000...0              [000...0, 111...1]
Positive Zero   0      000...0              000...0
Negative Zero   1      000...0              000...0
Special         0/1    111...1              [000...0, 111...1]
Infinities      0/1    111...1              000...0
NaN             0/1    111...1              (000...0, 111...1]

Visualize: Floating Point

Simplifying: Floating Point

  • 8-bit floating point representation: 1 sign bit, 4 exp bits, 3 frac bits
  • Same general form as IEEE format just smaller

8-bit Floating Point Positive Range

Practice: 8-bit Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • What is the bias?
  • $2.5_{10} = ?_2$
  • $01100011_2 = ?_{10}$
  • $10000111_2 = ?_{10}$

Practice: 8-bit Floating Point

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • What is the bias? $2^{4-1} - 1 = 7$
  • $2.5_{10} = ?_2$
    • $01000010_2$
  • $01100011_2 = ?_{10}$
    • $44_{10}$
  • $10000111_2 = ?_{10}$
    • $-1 * \frac{7}{8} * \frac{1}{64} = -\frac{7}{512} = -0.013671875_{10}$
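The practice answers can be checked with a small decoder for the 8-bit format. This is a sketch under the layout used in these slides (1 sign bit, 4 exp bits, 3 frac bits, bias 7); `decode_fp8` is a hypothetical name, and the special values (exp = 1111) are deliberately not handled.

```c
#include <stdint.h>

/* Decode the 8-bit format: sign (1 bit), exp (4 bits), frac (3 bits), bias 7.
 * Hypothetical helper; infinities and NaN (exp == 1111) are not handled. */
double decode_fp8(uint8_t bits)
{
    int sign = bits >> 7;
    int exp  = (bits >> 3) & 0xF;
    int frac = bits & 0x7;
    const int bias = 7;

    double M;
    int E;
    if (exp == 0) {              /* denormalized: M = 0.frac, E = 1 - bias */
        M = frac / 8.0;
        E = 1 - bias;
    } else {                     /* normalized: M = 1.frac, E = exp - bias */
        M = 1.0 + frac / 8.0;
        E = exp - bias;
    }

    double value = M;
    for (int i = 0; i < E; i++) value *= 2.0;   /* scale by 2^E when E > 0 */
    for (int i = 0; i > E; i--) value /= 2.0;   /* scale by 2^E when E < 0 */
    return sign ? -value : value;
}
```

For example, `decode_fp8(0x63)` (binary 01100011) returns 44.0.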

Dichotomy: 8-bit vs 32-bit

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Represent $\frac{1}{512}_{10}$ in 8-bit and 32-bit
  • Normalized or denormalized?

Example: Denormalized Value 8-bit

$(-1)^s * M * 2^E$ where $M = 0.xxx...x_2$ and $E = 1 - bias$

  • Convert $\frac{1}{512}_{10}$

  • Unsigned binary: $0.000000001_2$

  • Scientific notation: $1.0 * 2^{-9}$

  • Exponent, trying the normalized encoding first:

    $$ \begin{aligned} E & {}= -9 \\ bias & {}= 7 \ (k = 4,\ 2^{k-1} - 1 = 8 - 1) \\ exp & {}= -2 \ (E + bias) = ?_2 \end{aligned} $$

  • exp cannot be negative! This means $\frac{1}{512}_{10}$ is denormalized

  • Exponent, denormalized:

    $$ \begin{aligned} exp & {}= 0 \ \text{(denorm)} \\ bias & {}= 7 \ (k = 4,\ 2^{k-1} - 1 = 8 - 1) \\ E & {}= 1 - 7 = -6 \ \text{(fixed for denorm)} \end{aligned} $$

  • Mantissa: becomes more tricky now
    • Must write the number with a leading zero and E is fixed at −6
    $$ \begin{aligned} M * 2^{-6} & {}= 1.0 * 2^{-9} \\ M & {}= 2^{-3} {}= 0.001_2 {}= 0.125_{10} {}= \frac{1}{8} \\ frac & {}= 001 \end{aligned} $$

  • Result:

    sign exp  frac
    0    0000 001

Example: Normalized Value 32-bit

$(-1)^s * M * 2^E$ where $M = 1.xxx...x_2$ and $E = exp - bias$

  • Convert $\frac{1}{512}_{10}$

  • Unsigned binary: $0.000000001_2$

  • Scientific notation: $1.0 * 2^{-9}$

  • Exponent:

    $$ \begin{aligned} E & {}= -9 \\ bias & {}= 127 \ (k = 8,\ 2^{k-1} - 1 = 128 - 1) \\ exp & {}= 118 \ (E + bias) = 01110110_2 \end{aligned} $$

  • exp is in the range 1...254, so in 32 bits $\frac{1}{512}_{10}$ is normalized

  • Mantissa:

    $$ \begin{aligned} M & {}= 1.0 \\ frac & {}= \phantom{1.}000...0 \end{aligned} $$

  • Result:

    sign exp      frac
    0    01110110 00000000000000000000000

Distribution of Values

  • Notice that the distribution is denser towards zero, why is that?
  • Negative exponents make the number smaller and smaller
  • Whereas positive ones spread the numbers out more and more

Distribution of Values

Nice Properties of IEEE Floating Point

  • Floating point 0 has the same bit pattern as integer 0 (all bits = 0)
  • Comparisons can almost always be done by treating the bit patterns as unsigned integers
    • Need to consider the sign bit first
    • Need to treat −0 = 0
    • NaN is problematic
      • Its bit pattern is greater than that of any other value
      • What should comparison yield?
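A minimal sketch of the comparison trick in C, assuming `float` is IEEE single precision. Both function names are hypothetical, and the comparison helper only covers the easy case of non-negative, non-NaN inputs.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a 32-bit float. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Compare two non-negative, non-NaN floats using only their bit patterns.
 * Hypothetical helper; negative values and NaN need the extra care noted above. */
int less_than_nonneg(float a, float b)
{
    return float_bits(a) < float_bits(b);
}
```

Note that `float_bits(-0.0f)` is 0x80000000 while `float_bits(0.0f)` is 0, even though `-0.0f == 0.0f` is true, which is why −0 needs special handling.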

Floating Point Operations

  • Only approximates real numbers (sec 2.4.4)
    • $x +_f y = round(x + y)$
    • $x *_f y = round(x * y)$
  • Basic Idea:
    1. Compute exact result
    2. Make it fit into desired precision
    • Possibility of overflow if exponent is too large
    • Possibility of rounding to fit into fraction part


Value     To-Even (default)   Toward-zero   Round down (−∞)   Round up (+∞)
$1.40     $1                  $1            $1                $2
$1.60     $2                  $1            $1                $2
$1.50     $2                  $1            $1                $2
$2.50     $2                  $2            $2                $3
-$1.50    -$2                 -$1           -$2               -$1

  • Default rounding mode
    • Hard to get anything else without using assembly directly
  • Applying to other decimal places/bit positions
    • When exactly halfway between two possible values:
      • Round so the least significant digit is even
    • For example rounding to the nearest hundredth:
    Value       Rounded Value   Note
    7.8949999   7.89            (Less than half way)
    7.8950001   7.90            (Greater than half way)
    7.8950000   7.90            (Half way, round up)
    7.8850000   7.88            (Half way, round down)

Rounding Binary Numbers

  • Binary fractional numbers
    • “Even” is when the least significant bit is 0
    • “Halfway” is when the bits to the right of the rounding position are $100..._2$
  • For example rounding to the nearest quarter ($\frac{1}{4}$, 2 bits to the right of the binary point)
Value             Binary         Rounded     Action                   Rounded Value
$2 \frac{3}{32}$  $10.00011_2$   $10.00_2$   (<$\frac{1}{2}$, down)   2
$2 \frac{3}{16}$  $10.00110_2$   $10.01_2$   (>$\frac{1}{2}$, up)     $2 \frac{1}{4}$
$2 \frac{7}{8}$   $10.11100_2$   $11.00_2$   ($\frac{1}{2}$, up)      3
$2 \frac{5}{8}$   $10.10100_2$   $10.10_2$   ($\frac{1}{2}$, down)    $2 \frac{1}{2}$
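The even/halfway test can be written directly with bit operations. The sketch below treats the number as an unsigned fixed-point bit pattern and discards `drop` low bits; `round_to_even` is a hypothetical name, and it assumes 0 < drop < 32.

```c
#include <stdint.h>

/* Round x using round-to-even while discarding its low `drop` bits.
 * Hypothetical helper for illustration; assumes 0 < drop < 32.
 * Returns the kept bits after rounding. */
uint32_t round_to_even(uint32_t x, unsigned drop)
{
    uint32_t kept = x >> drop;                 /* bits left of the rounding position */
    uint32_t rest = x & ((1u << drop) - 1);    /* bits right of the rounding position */
    uint32_t half = 1u << (drop - 1);          /* the halfway pattern 100...0 */

    if (rest > half || (rest == half && (kept & 1)))
        kept++;                                /* above halfway, or halfway and odd */
    return kept;
}
```

Taking $10.11100_2$ as the integer 0x5C (five fractional bits) and dropping 3 bits gives 12, which read in quarters is $11.00_2 = 3$.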

Practice: Round To Even

  • Round to the nearest $\frac{1}{2}$ (1 bit to the right of the binary point)
  • $10.010_2 = ?_2$
  • $10.011_2 = ?_2$
  • $10.110_2 = ?_2$
  • $11.001_2 = ?_2$

Practice: Round To Even

  • Round to the nearest $\frac{1}{2}$ (1 bit to the right of the binary point)
  • $10.010_2 = 10.0_2$
  • $10.011_2 = 10.1_2$
  • $10.110_2 = 11.0_2$
  • $11.001_2 = 11.0_2$

Floating Point Multiplication (sec 2.4.5)

  • $(-1)^{s_1} M_1 2^{E_1} * (-1)^{s_2} M_2 2^{E_2}$

  • Exact Result: $(-1)^s M 2^E$

    Part         Result
    Sign s       s1 ^ s2
    Mantissa M   M1 * M2
    Exponent E   E1 + E2
  • Fixing
    • If M ≥ 2, shift M right, increment E
    • If E is out of range, overflow
    • Round M to fit frac precision


Floating Point Addition

  • $(-1)^{s_1} M_1 2^{E_1} + (-1)^{s_2} M_2 2^{E_2}$
  • Assume $E_1 > E_2$
  • Exact Result: $(-1)^s M 2^E$
    • Sign s, mantissa M: result of signed align & add
    • Exponent E: $E_1$
  • Fixing
    • If M ≥ 2, shift M right, increment E
    • If M < 1, shift M left k positions, decrement E by k
    • If E is out of range, overflow
    • Round M to fit frac precision

Mathematical Properties of Floating Point: Add

  • $x +_f y = round(x + y)$
  • Closed under addition? Yes
    • Could still generate infinity or NaN though
  • Commutative? Yes
  • Associative? No!
    • Due to overflow and inexactness of rounding
    • In single precision: (3.14 + 1e10) − 1e10 = 0, 3.14 + (1e10 − 1e10) = 3.14
  • 0 is additive identity? Yes
  • Every element has additive inverse? Almost!
    • Everything except infinities and NaNs
  • Monotonicity ($a \geq b \implies a +_f c \geq b +_f c$)? Almost!
    • Everything except infinities and NaNs
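The associativity example can be reproduced in C with single precision values. In this sketch the function names are hypothetical, and `volatile` is used so that each intermediate result is actually rounded to `float` rather than kept in a wider register.

```c
/* Both functions add the same three single precision values;
 * only the grouping differs. */
float add_left_first(void)
{
    volatile float big = 1e10f, small = 3.14f;
    volatile float sum = small + big;   /* 3.14 is lost when rounding to float */
    return sum - big;                   /* (3.14 + 1e10) - 1e10 */
}

float add_right_first(void)
{
    volatile float big = 1e10f, small = 3.14f;
    volatile float diff = big - big;    /* exactly 0 */
    return small + diff;                /* 3.14 + (1e10 - 1e10) */
}
```

Near 1e10 adjacent `float` values are 1024 apart, so adding 3.14 rounds straight back to 1e10.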

Mathematical Properties of Floating Point: Mul

  • Closed under multiplication? Yes
    • Could still generate infinity or NaN though
  • Commutative? Yes
  • Associative? No!
    • Due to overflow and inexactness of rounding
    • In single precision: (1e20 * 1e20) * 1e-20 = inf, 1e20 * (1e20 * 1e-20) = 1e20
  • 1 is multiplicative identity? Yes
  • Multiplication distributes over addition? No!
    • Due to overflow and inexactness of rounding
    • In single precision: 1e20 * (1e20 − 1e20) = 0.0, 1e20 * 1e20 − 1e20 * 1e20 = NaN
  • Monotonicity (a ≥ b & c ≥ 0 ⟹ a * c ≥ b * c)? Almost!
    • Everything except infinities and NaNs

Why You Need To Know

  • Rounding and overflows are a fact of life and need to be mitigated
  • Addition and multiplication are not associative or distributive!
  • Some things aren’t exact such as 0.1:
for (double i = 0.0; i < 1.0; i += 0.1)
  printf("%.19f ", i);
0.0000000000000000000  0.1000000000000000056  0.2000000000000000111  
0.3000000000000000444  0.4000000000000000222  0.5000000000000000000
0.5999999999999999778  0.6999999999999999556  0.7999999999999999334 
0.8999999999999999112  0.9999999999999998890
  • Not every value has an additive or multiplicative inverse (infinities and NaNs do not)
  • Sometimes you need to be very careful about the order things are done

Floating Point in C (sec 2.4.6)

  • C guarantees two levels
    • float (single precision)
    • double (double precision)
  • Conversions/Casting
    • Casting between int, float, and double changes the bit pattern
    • double/float → int
      • Truncates fractional part
      • Behaves like rounding toward zero
      • Not defined when out of range or NaN, but generally sets to TMin
    • int → double
      • Exact conversion, as long as int has at most 53 bits
    • int → float
      • Will round according to rounding mode
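The casting rules can be probed with a few small functions. This sketch assumes a 32-bit `int` and IEEE `float`/`double`; the function names are hypothetical, and inputs near `INT_MAX` should be avoided in the round-trip helpers because converting an out-of-range value back to `int` is undefined.

```c
/* double -> int: the fractional part is truncated (rounds toward zero). */
int truncate_to_int(double d)
{
    return (int)d;
}

/* int -> float -> int: returns 1 if the value survives unchanged.
 * float has only 24 bits of precision, so large ints may be rounded. */
int survives_float_round_trip(int i)
{
    volatile float f = (float)i;
    return (int)f == i;
}

/* int -> double -> int: double has 53 bits of precision, enough for 32-bit int. */
int survives_double_round_trip(int i)
{
    volatile double d = (double)i;
    return (int)d == i;
}
```

For instance, 16777217 ($2^{24} + 1$) does not survive the trip through `float` but does survive the trip through `double`.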


  • IEEE Floating Point has clear mathematical properties
  • Represents numbers of form $M \times 2^E$
  • One can reason about operations independent of implementation
    • As if computed with perfect precision and then rounded
  • Not the same as real arithmetic
    • Violates associativity/distributivity
    • Makes life difficult for compilers & serious numerical applications programmers