Sunday, 13 October 2013

Floating Point Representation

Computers represent real values in a form similar to that of scientific notation. Consider the value
1.23 x 10^4
The number has a sign (+ in this case)
The significand (1.23) is written with one non-zero digit to the left of the decimal point.
The base (radix) is 10.
The exponent (an integer value) is 4. It too must have a sign.
There are standards which define what the representation means, so that there is consistency across computers.
Note that this is not the only way to represent floating point numbers; it is just the IEEE standard way of doing it.
Here is what we do:
the representation has three fields:
     ----------------------------
     | S |   E     |     F      |
     ----------------------------

S is one bit representing the sign of the number
E is an 8-bit biased integer representing the exponent
F is an unsigned integer
The decimal value represented is:
     (-1)^S x f x 2^e
where
     e = E - bias

     f = ( F/(2^n) ) + 1
for single precision representation (the emphasis in this class)
n = 23
bias = 127
for double precision representation (a 64-bit representation)
n = 52 (there are 52 bits for the mantissa field)
bias = 1023 (there are 11 bits for the exponent field)
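
As a rough illustration of the formula above, the following Python sketch pulls S, E, and F out of a 32-bit pattern and evaluates (-1)^S x f x 2^e. The function name `decode_single` is made up for this example, and the sketch only covers normalized values (E not all 0s and not all 1s).

```python
import struct

def decode_single(bits):
    """Split a 32-bit pattern into S, E, F and evaluate (-1)^S x f x 2^e."""
    n, bias = 23, 127
    S = (bits >> 31) & 0x1        # 1 sign bit
    E = (bits >> 23) & 0xFF       # 8-bit biased exponent
    F = bits & 0x7FFFFF           # 23-bit fraction field
    e = E - bias
    f = F / 2**n + 1
    value = (-1)**S * f * 2.0**e
    return S, E, F, value

bits = 0x42C80000                 # 0 10000101 10010000000000000000000
S, E, F, value = decode_single(bits)
print(S, E, F, value)             # 0 133 4718592 100.0

# Cross-check against the machine's own interpretation of the same bits.
reference = struct.unpack(">f", struct.pack(">I", bits))[0]
print(value == reference)
```

The cross-check with `struct` is only there to show that the hand-written formula agrees with the hardware format for this one pattern.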

Biased Integer Representation

Since floating point representations use biased integer representations for the exponent field, here is a brief discussion of biased integers.
A biased integer representation skews the bit patterns so that they look just like unsigned values but can actually represent negative numbers.
It represents a range of values (different from unsigned representation) using the unsigned representation. Another way of saying this: biased representation is a re-mapping of the unsigned integers.
visual example (of the re-mapping):
        bit pattern:        000  001  010  011  100  101  110  111

        unsigned value:      0    1    2    3    4    5    6    7

        biased-2 value:      -2   -1   0    1    2    3    4    5
This is biased-2. Note the dash character in the name of this representation. It is not a negative sign. Example:
Given 4 bits, bias values by 2**3 = 8
(This choice of bias results in approximately half the represented values being negative.)
   TRUE VALUE to be represented      3
   add in the bias                  +8
      ----
   unsigned value                   11

   so the 4-bit, biased-8 representation of the value 3
   will be  1011
Example:
   Going the other way, suppose we were given a
   4-bit, biased-8 representation of   0110

   unsigned 0110  represents 6
   subtract out the bias   - 8
      ----
   TRUE VALUE represented   -2
On choosing a bias:
The bias chosen is most often based on the number of bits available for representing an integer. To get an approx. equal distribution of values above and below 0, the bias should be
      2 ^ (n-1)      or   (2^(n-1)) - 1
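
The re-mapping can be sketched in a few lines of Python. The helper names `to_biased` and `from_biased` are invented for this illustration; they simply add or subtract the bias around an ordinary unsigned conversion.

```python
def to_biased(value, bits, bias):
    """Return the bit pattern (as a string) for value in a biased representation."""
    unsigned = value + bias                   # add in the bias
    if not 0 <= unsigned < 2**bits:
        raise ValueError("value out of range for this representation")
    return format(unsigned, "0{}b".format(bits))

def from_biased(pattern, bias):
    """Return the true value represented by a biased bit pattern."""
    return int(pattern, 2) - bias             # subtract out the bias

print(to_biased(3, 4, 8))        # 1011
print(from_biased("0110", 8))    # -2
print([from_biased(format(i, "03b"), 2) for i in range(8)])
```

The last line reproduces the biased-2 row of the 3-bit table above.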

Now, what does all this mean?
  • S, E, F all represent fields within a representation. Each is just a bunch of bits.
  • S is just a sign bit. 0 for positive, 1 for negative. This is the sign of the number.
  • E is an exponent field. The E field is a biased-127 integer representation. So, the true exponent represented is (E - bias).
  • The radix for the number is ALWAYS 2.
    Note: Computers that did not use this representation, like those built before the standard, did not always use a radix of 2. For example, some IBM machines had radix of 16.
  • F is the mantissa (significand). It is in a somewhat modified form. There are 23 bits available for the mantissa. It turns out that if floating point numbers are always stored in their normalized form, then the leading bit (the one on the left, or MSB) is ALWAYS a 1. So, why store it at all? It gets put back into the number (giving 24 bits of precision for the mantissa) for any calculation, but we only have to store 23 bits.
    This MSB is called the hidden bit.
An example: Put the decimal number 64.2 into the IEEE standard single precision floating point representation.
 first step:
   get a binary representation for 64.2
   to do this, get unsigned binary representations for the stuff to the left
   and right of the decimal point separately.

   64  is   1000000

   .2 can be gotten using the algorithm:

   .2 x 2 =  0.4      0
   .4 x 2 =  0.8      0
   .8 x 2 =  1.6      1
   .6 x 2 =  1.2      1

   .2 x 2 =  0.4      0  now this whole pattern (0011) repeats.
   .4 x 2 =  0.8      0
   .8 x 2 =  1.6      1
   .6 x 2 =  1.2      1
     

     so a binary representation for .2  is    .001100110011. . .

                 ----
     or  .0011  (The bar over the top shows which bits repeat.)


 Putting the halves back together again:
    64.2  is     1000000.0011001100110011. . .


      second step:
 Normalize the binary representation. (make it look like
 scientific notation)

 1.000000 00110011. . . x 2^6

      third step:
 6 is the true exponent.  For the standard form, it needs to
 be in 8-bit, biased-127 representation.

       6
   + 127
   -----
     133

 133 in 8-bit, unsigned representation is 1000 0101

 This is the bit pattern used for E in the standard form.

      fourth step:
 the mantissa stored (F) is the stuff to the right of the radix point
 in the normalized form.  We need 23 bits of it.

   000000 00110011001100110


      put it all together (and include the correct sign bit):

  S     E               F
  0  10000101  00000000110011001100110

      the values are often given in hex, so here it is

  0100 0010 1000 0000 0110 0110 0110 0110

     0x   4    2    8    0    6    6    6    6
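
The four steps can be followed mechanically. Below is a minimal Python sketch of them, assuming a positive value with an integer part of at least 1 and simple truncation of the extra fraction bits (as done in the example). The names `fraction_bits` and `encode_single_truncated` are hypothetical helpers, not part of any library.

```python
import struct
from fractions import Fraction

def fraction_bits(frac, count):
    """Repeatedly double the fraction; each integer part is the next bit."""
    bits = []
    for _ in range(count):
        frac *= 2
        bit = int(frac)              # 0 or 1
        bits.append(str(bit))
        frac -= bit
    return "".join(bits)

def encode_single_truncated(int_part, frac_part):
    """Encode int_part + frac_part (positive, int_part >= 1), truncating F."""
    int_bits = format(int_part, "b")           # 64 -> "1000000"
    true_exp = len(int_bits) - 1               # places the radix point moves
    needed = 23 - true_exp                     # fraction bits still required
    F = int_bits[1:] + fraction_bits(frac_part, needed)   # drop the hidden bit
    E = format(true_exp + 127, "08b")          # biased-127 exponent
    return "0" + E + F

pattern = encode_single_truncated(64, Fraction(2, 10))
print(pattern[0], pattern[1:9], pattern[9:])
print(hex(int(pattern, 2)))                    # 0x42806666

# The machine's own conversion of 64.2 to single precision, for comparison.
machine = struct.unpack(">I", struct.pack(">f", 64.2))[0]
print(hex(machine))
```

`Fraction` is used so that the doubling steps are exact. For 64.2 the truncated pattern happens to match the machine's round-to-nearest result; that will not be true for every value.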

Some extra details:
  • Since floating point numbers are always stored in a normalized form, how do we represent 0? We take the bit patterns 0x0000 0000 and 0x8000 0000 to represent the value 0.
    (What floating point numbers cannot be represented because of this?)
    Note that the hardware that does arithmetic on floating point numbers must be constantly checking to see if it needs to use a hidden bit of a 1 or a hidden bit of 0 (for 0.0).
    Values that are very close to 0.0, and would require the hidden bit to be a zero are called denormalized or subnormal numbers.
                    S         E           F
       0.0         0 or 1   00000000   00000000000000000000000
                   (hidden bit is a 0)
    
       subnormal   0 or 1   00000000   not all zeros
                   (hidden bit is a 0)
    
       normalized  0 or 1   1 to 254   any bit pattern
                   (hidden bit is a 1)
    
  • Other special values:
                                  S  E        F
           +infinity              0 11111111 00000... (0x7f80 0000)
           -infinity              1 11111111 00000... (0xff80 0000)
    
           NaN (Not a Number)     ? 11111111 ?????... 
       (S is either 0 or 1, E=0xff, and F is anything but all zeros)
    
  • Single precision representation is 32 bits.
    Double precision representation is 64 bits. For double precision:
    • S is the sign bit (same as for single precision).
    • E is an 11-bit, biased-1023 integer for the exponent.
    • F is a 52-bit mantissa, using the same method as single precision (the hidden bit is not explicit in the representation).
     
  • arithmetic operations on floating point numbers consist of
     addition, subtraction, multiplication and division
    
    the operations are done with algorithms similar to those used
      on sign magnitude integers (because of the similarity of
      representation) -- example, only add numbers of the same
      sign.  If the numbers are of opposite sign, must do subtraction.
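
The zero, subnormal, normalized, infinity, and NaN encodings tabulated above can be told apart just by looking at the E and F fields. Here is a small Python sketch of that test; `classify_single` is a name chosen for this example.

```python
def classify_single(bits):
    """Classify a 32-bit single precision pattern by its E and F fields."""
    E = (bits >> 23) & 0xFF
    F = bits & 0x7FFFFF
    if E == 0:                       # hidden bit is a 0
        return "zero" if F == 0 else "subnormal"
    if E == 0xFF:                    # all ones: special values
        return "infinity" if F == 0 else "NaN"
    return "normalized"              # hidden bit is a 1

for bits in (0x00000000, 0x80000000, 0x00000001,
             0x7F800000, 0xFF800000, 0x7FC00000, 0x42806666):
    print(format(bits, "08x"), classify_single(bits))
```

The sign bit is ignored on purpose, since every category exists with S = 0 and S = 1.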
    
    

    Addition

     example on decimal value given in scientific notation:
    
           3.25 x 10 ** 3
         + 2.63 x 10 ** -1
         -----------------
    
         first step:  align decimal points
         second step:  add
    
          
           3.25     x 10 ** 3
         + 0.000263 x 10 ** 3
         --------------------
           3.250263 x 10 ** 3
     (presumes use of infinite precision, without regard for accuracy)
    
         third step:  normalize the result (already normalized!)
    
    
     example on fl pt. value given in binary:
    
             S    E       F
     .25 =   0 01111101 00000000000000000000000
    
     100 =   0 10000101 10010000000000000000000
    
    
        to add these fl. pt. representations,
        step 1:  align radix points
    
    
      shifting the mantissa LEFT by 1 bit DECREASES THE EXPONENT by 1
    
      shifting the mantissa RIGHT by 1 bit INCREASES THE EXPONENT by 1
    
      we want to shift the mantissa right, because the bits that
      fall off the end should come from the least significant end
      of the mantissa
    
           -> choose to shift the .25, since we want to increase its exponent.
           -> shift by
                  10000101
                - 01111101
                ----------
                  00001000   (8) places.
    
           with hidden bit and radix point shown, for clarity:
                0 01111101 1.00000000000000000000000 (original value)
                0 01111110 0.10000000000000000000000 (shifted 1 place)
             (note that hidden bit is shifted into msb of mantissa)
                0 01111111 0.01000000000000000000000 (shifted 2 places)
                0 10000000 0.00100000000000000000000 (shifted 3 places)
                0 10000001 0.00010000000000000000000 (shifted 4 places)
                0 10000010 0.00001000000000000000000 (shifted 5 places)
                0 10000011 0.00000100000000000000000 (shifted 6 places)
                0 10000100 0.00000010000000000000000 (shifted 7 places)
                0 10000101 0.00000001000000000000000 (shifted 8 places)
    
    
        step 2: add (don't forget the hidden bit for the 100)
    
             0 10000101 1.10010000000000000000000  (100)
          +  0 10000101 0.00000001000000000000000  (.25)
          ---------------------------------------
             0 10000101 1.10010001000000000000000
    
    
    
        step 3:  normalize the result (get the "hidden bit" to be a 1)
    
          it already is for this example.
    
       result is
      0 10000101 10010001000000000000000
    
       suppose that the result of an addition of aligned mantissas
       gives
      10.11110000000000000000000
       and the exponent to go with this is 10000000.
    
    
       We must put the mantissa back in the normalized form.
    
       Shift the mantissa to the right by one place, and increase the
       exponent by 1.
    
       The exponent and mantissa become
          10000001   1.01111000000000000000000    0 (1 bit is lost off the least
                                                    significant end)
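
The align / add / normalize steps can be written with integer operations only. This is a simplified Python sketch, assuming both operands are normalized, have the same sign, and that the result does not overflow; bits shifted off the end are simply dropped (no rounding). `add_same_sign` is a hypothetical name.

```python
def add_same_sign(a, b):
    """Add two normalized single precision patterns of the same sign."""
    sign = a >> 31
    ea, eb = (a >> 23) & 0xFF, (b >> 23) & 0xFF
    ma = (a & 0x7FFFFF) | 0x800000      # put the hidden bit back
    mb = (b & 0x7FFFFF) | 0x800000
    # Step 1: align radix points by shifting the smaller value's mantissa right.
    if ea < eb:
        ea, eb, ma, mb = eb, ea, mb, ma
    mb >>= (ea - eb)
    # Step 2: add the aligned mantissas.
    total = ma + mb
    # Step 3: normalize. A carry out (10.xxx) means shift right, exponent + 1.
    if total & 0x1000000:
        total >>= 1
        ea += 1
    return (sign << 31) | (ea << 23) | (total & 0x7FFFFF)

print(hex(add_same_sign(0x3E800000, 0x42C80000)))   # .25 + 100 -> 0x42c88000
print(hex(add_same_sign(0x3FC00000, 0x3FC00000)))   # 1.5 + 1.5 -> 0x40400000
```

The first call reproduces the .25 + 100 example; the second shows the carry-out case where the sum must be shifted right by one place.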
    
    
    

    Subtraction

         like addition as far as alignment of radix points
    
         then the algorithm for subtraction of sign mag. numbers takes over.
    
    
         before subtracting,
           compare magnitudes (don't forget the hidden bit!)
           change sign bit if order of operands is changed.
    
         don't forget to normalize number afterward.
    
         EXAMPLE:
    
             0 10000001 10010001000000000000000 (the representations)
           - 0 10000000 11100000000000000000000
          ---------------------------------------
    
    
        step 1:  align radix points
        
             0 10000000 11100000000000000000000
      becomes
             0 10000001 11110000000000000000000 (notice hidden bit shifted in)
    
    
             0 10000001 1.10010001000000000000000
           - 0 10000001 0.11110000000000000000000
          ---------------------------------------
    
        step 2:  subtract mantissa
    
              1.10010001000000000000000
           -  0.11110000000000000000000
          -------------------------------
              0.10100001000000000000000
    
        step 3:  put result in normalized form
    
            Shift mantissa left by 1 place, implying a subtraction of 1 from
     the exponent.
    
             0 10000000 01000010000000000000000
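
Here is the same subtraction as a Python sketch using only integer operations. It assumes normalized operands of the same sign and a result that stays normalized (no subnormal handling, no rounding). The function name `subtract_same_sign` is invented for the illustration.

```python
def subtract_same_sign(a, b):
    """Compute a - b for normalized single precision patterns of the same sign."""
    sign = a >> 31
    ea, eb = (a >> 23) & 0xFF, (b >> 23) & 0xFF
    ma = (a & 0x7FFFFF) | 0x800000      # don't forget the hidden bit
    mb = (b & 0x7FFFFF) | 0x800000
    # Step 1: align radix points on the larger exponent.
    exp = max(ea, eb)
    ma >>= exp - ea
    mb >>= exp - eb
    # Compare magnitudes; change the sign bit if the operands are swapped.
    if ma < mb:
        ma, mb = mb, ma
        sign ^= 1
    # Step 2: subtract mantissas.
    diff = ma - mb
    if diff == 0:
        return 0                         # exact zero
    # Step 3: normalize by shifting left until the hidden bit is a 1.
    while not diff & 0x800000:
        diff <<= 1
        exp -= 1
    return (sign << 31) | (exp << 23) | (diff & 0x7FFFFF)

print(hex(subtract_same_sign(0x40C88000, 0x40700000)))   # 0x40210000
print(hex(subtract_same_sign(0x40700000, 0x40C88000)))   # 0xc0210000
```

The first call is the worked example (result 0 10000000 01000010000000000000000); the second swaps the operands to show the sign bit flipping.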
    
    
    

    Multiplication

     example on decimal values given in scientific notation:
    
           3.0 x 10 ** 1
         x 0.5 x 10 ** 2
         -----------------
    
         algorithm:  multiply mantissas (use unsigned multiplication)
                     add exponents
    
           3.0 x 10 ** 1
         x 0.5 x 10 ** 2
         -----------------
          1.50 x 10 ** 3
    
    
     example in binary:    use a mantissa that is only 4 bits as an example
    
    
         0 10000100 0100
       x 1 00111100 1100
       -----------------
    
    
       The multiplication is unsigned multiplication (NOT two's complement),
       so, make sure that all bits are included in the answer.  Do NOT
       do any sign extension!
    
       mantissa multiplication:           1.0100
        (don't forget hidden bit)       x 1.1100
                                        --------
                                           00000
                                          00000
                                         10100
                                        10100
                                       10100
                                      ----------
                                      1000110000
                          becomes   10.00110000
    
    
        add exponents:       always add true exponents
        (otherwise the bias gets added in twice)
    
         biased:
         10000100
       + 00111100
       ----------
    
    
       10000100         01111111  (switch the order of the subtraction,
     - 01111111       - 00111100   so that we can get a negative value)
     ----------       ----------
       00000101         01000011
       true exp         true exp
         is 5.           is -67
    
    
         add true exponents      5 + (-67) is -62.
    
         re-bias exponent:     -62 + 127 is 65.
       unsigned representation for 65 is  01000001.
    
    
    
         put the result back together (and add sign bit).
    
    
         1 01000001  10.00110000
    
    
         normalize the result:
      (moving the radix point one place to the left increases
       the exponent by 1. Can also think of this as a logical
       right shift.)
    
         1 01000001  10.00110000
           becomes
         1 01000010  1.000110000
    
    
         this is the value stored (not the hidden bit!):
         1 01000010  000110000
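
The multiplication steps can be sketched in Python for the same toy format (8-bit biased-127 exponent, 4-bit mantissa field). The function `multiply` and its tuple-based interface are assumptions made for this example; the result mantissa is returned at full width, without rounding it back to 4 bits.

```python
def multiply(a, b, mant_bits=4, bias=127):
    """Multiply two (S, E, F) triples; return (S, E, fraction bits as a string)."""
    sa, ea, fa = a
    sb, eb, fb = b
    sign = sa ^ sb                           # signs differ -> negative
    ma = fa | (1 << mant_bits)               # don't forget the hidden bit
    mb = fb | (1 << mant_bits)
    product = ma * mb                        # unsigned multiplication
    frac_bits = 2 * mant_bits                # bits right of the radix point
    # Add TRUE exponents, then re-bias once.
    exp = (ea - bias) + (eb - bias) + bias
    # Normalize: if the product is 10.xxx or 11.xxx, move the radix point left.
    if product >> (frac_bits + 1):
        frac_bits += 1
        exp += 1
    fraction = product & ((1 << frac_bits) - 1)   # strip the hidden bit
    return sign, exp, format(fraction, "0{}b".format(frac_bits))

S, E, F = multiply((0, 0b10000100, 0b0100), (1, 0b00111100, 0b1100))
print(S, format(E, "08b"), F)    # 1 01000010 000110000
```

This reproduces the stored value from the example. A real implementation would go on to round the 9 fraction bits down to the 4 that the format can hold.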
    
    
    

    Division

       similar to multiplication.
    
       true division:
       do unsigned division on the mantissas (don't forget the hidden bit)
       subtract TRUE exponents
    
    
       The IEEE standard is very specific about how all this is done.
       Unfortunately, the hardware to do all this is pretty slow.
    
       Some comparisons of approximate times:
           2's complement integer add      1 time unit
           fl. pt add                      4 time units
           fl. pt multiply                 6 time units
           fl. pt. divide                 13 time units
    
       There is a faster way to do division.  It is called
       division by reciprocal approximation.  It takes about the same
       time as a fl. pt. multiply.  Unfortunately, the results are
       not always correct.
    
       Division by reciprocal approximation:
    
    
          instead of doing     a / b
    
          they do   a x  1/b.
    
          figure out a reciprocal for b, and then use the fl. pt.
          multiplication hardware.
    
    
      example of a result that isn't the same as with true division.
    
           true division:     3/3 = 1  (exactly)
    
    
           reciprocal approx:   1/3 = .33333333
         
         3 x .33333333 =  .99999999, not 1
    
        It is not always possible to get a perfectly accurate reciprocal.
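
The effect is easy to observe in software. The Python sketch below uses Python's floats (double precision, so the specific values that fail differ from the decimal illustration above) and compares true division with multiplication by a rounded reciprocal. `divide_by_reciprocal` is a made-up helper name.

```python
def divide_by_reciprocal(a, b):
    """Approximate a / b as a x (1/b), rounding the reciprocal first."""
    reciprocal = 1.0 / b
    return a * reciprocal

# For which b in 1..100 does b x (1/b) fail to give exactly 1?
mismatches = [b for b in range(1, 101)
              if divide_by_reciprocal(float(b), float(b)) != 1.0]
print(mismatches)
print(divide_by_reciprocal(49.0, 49.0))   # 0.9999999999999999
print(49.0 / 49.0)                        # 1.0
```

In double precision 3 x (1/3) happens to round back to exactly 1, while 49 x (1/49) does not, so whether the shortcut is "correct" depends on the operands.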
    
    
    Current fastest (and correct) division algorithm is called SRT
    
        Sweeney, Robertson, and Tocher
    
        Uses redundant quotient representation
        E.g., base 4 usually has digits {0,1,2,3}
        SRT's redundant base 4 has digits  {-2,-1,0,+1,+2}
    
        Allows division algorithm to guess digits approximately
        with a table lookup.
        
        Approximations are fixed up when less-significant digits
        are calculated
    
        Final result in completely-accurate binary.
    
        In 1994, Intel got a few corner cases of the table wrong
        Maximum error less than one part in 10,000
        Nevertheless, Intel took a $475M write-off to replace the chip
    
        Compare with software bugs that give the wrong answer
        and the customer pays for the upgrade
    
    
    

    Issues in Floating Point

      note: this discussion only touches the surface of some issues that
      people deal with.  Entire courses are taught on floating point
      arithmetic.
    
    
    
    use of standards
    ----------------
    --> allows all machines following the standard to exchange data
        and to calculate the exact same results.
    
    --> IEEE fl. pt. standard sets
     parameters of data representation (# bits for mantissa vs. exponent)
    
    --> MIPS architecture follows the standard
        (Nearly all current general-purpose architectures follow the standard.)
    
    
    overflow and underflow
    ----------------------
    Just as with integer arithmetic, floating point arithmetic operations
    can cause overflow.  Detection of overflow in fl. pt. comes by checking
    exponents before/during normalization.
    
    Once overflow has occurred, an infinity value can be represented and
    propagated through a calculation.
    
    
    
    Underflow occurs in fl. pt. representations when a number is
    too small (close to 0) to be represented.  (show number line!)
    
    if a fl. pt. value cannot be normalized
        (getting a 1 just to the left of the radix point would cause
         the exponent field to be all 0's)
        then underflow occurs.
    
    Underflow may result in the representation of denormalized values.
    These are ones in which the hidden bit is a 0.  The exponent field
    will also be all 0s in this case.  Note that there would be a
    reduced precision for representing the mantissa (less than 23 bits
    to the right of the radix point).
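
To make the limits concrete, this Python sketch builds a few boundary bit patterns for single precision and also shows overflow and underflow happening in Python's own floats (which are double precision). `single_from_bits` is a helper written for this example.

```python
import math
import struct

def single_from_bits(bits):
    """Interpret a 32-bit pattern as a single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

largest_finite = single_from_bits(0x7F7FFFFF)       # E = 254, F all ones
smallest_normal = single_from_bits(0x00800000)      # E = 1,   F = 0
smallest_subnormal = single_from_bits(0x00000001)   # E = 0,   F = 1

print(largest_finite, smallest_normal, smallest_subnormal)

# Overflow and underflow in double precision:
overflowed = 1e308 * 10.0       # too large: becomes infinity
underflowed = 5e-324 / 2.0      # smaller than the smallest subnormal: becomes 0
print(overflowed, underflowed)
```

Between `smallest_normal` and zero only subnormal values exist, and they carry fewer significant bits the closer they get to zero.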
    
    Defining the results of certain operations
    ------------------------------------------
    Many operations result in values that cannot be represented.
    Other operations have undefined results.  Here is a list
    of some operations and their results as defined in the standard.
    
    1/0 = infinity
    any positive value / 0 = positive infinity
    any negative value / 0 = negative infinity
    infinity * x = infinity   (x nonzero and not NaN; sign by the usual rules)
    infinity * infinity = infinity
    1/infinity = 0
    
    0/0 = NaN
    0 * infinity = NaN
    infinity - infinity = NaN
    infinity/infinity = NaN
    
    x + NaN = NaN
    NaN + x = NaN
    x - NaN = NaN
    NaN - x = NaN
    sqrt(negative value) = NaN
    NaN * x = NaN
    NaN * 0 = NaN
    1/NaN = NaN
                        (Any operation on a NaN produces a NaN.)
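
Several of these rules can be checked directly in Python, whose floats follow the IEEE double precision format. One caveat for this sketch: Python raises `ZeroDivisionError` for `1.0 / 0.0` instead of returning infinity, so the division-by-zero rows are not exercised here.

```python
import math

inf = math.inf
nan = math.nan

results = {
    "1/inf":     1.0 / inf,
    "inf - inf": inf - inf,
    "0 * inf":   0.0 * inf,
    "inf/inf":   inf / inf,
    "nan + 1":   nan + 1.0,
    "nan * 0":   nan * 0.0,
    "1/nan":     1.0 / nan,
}
for name, value in results.items():
    print(name, "=", value)

# A NaN compares unequal to everything, including itself,
# so math.isnan() is the reliable way to test for it.
print(nan == nan, math.isnan(nan))
```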
    
    
    HW vs. SW computing
    -------------------
    floating point operations can be done by hardware (circuitry)
    or by software (program code).
    
    -> a programmer won't know which is occurring, without prior knowledge
       of the HW.
    
    -> SW is much slower than HW, by approx. 1000 times.
    
    A difficult (but good) exercise for students would be to design
    a SW algorithm for doing fl. pt. addition using only integer
    operations.
    
    SW to do fl. pt. operations is tedious.  It takes lots of shifting
    and masking to get the data in the right form to use integer arithmetic
    operations to get a result -- and then more shifting and masking to put
    the number back into fl. pt. format.
    
    A common thing for manufacturers to do is to offer 2 versions of the
    same architecture, one with HW, and the other with SW fl. pt. ops.
    
    

    on Rounding

    arithmetic operations on fl. pt. values often compute results that cannot
    be represented in the given amount of precision.  So, we round results.
    
    There are MANY ways of rounding.  They each have "correct" uses, and
    exist for different reasons.  The goal in a computation is to have the
    computer round such that the end result is as "correct" as possible.
    There are even arguments as to what is really correct.
    
    
    First, how do we get more digits (bits) than were in the representation?
      Hardware typically carries 3 extra bits of less significance
      than the 24 bits (of mantissa) implied in the single precision
      representation.  (The IEEE standard specifies the correctly rounded
      result; these 3 bits are the usual way to obtain it.)
    
      mantissa format plus extra bits:
    
         1.XXXXXXXXXXXXXXXXXXXXXXX   0   0   0
    
         ^         ^                 ^   ^   ^
         |         |                 |   |   |
         |         |                 |   |   -  sticky bit (s)
         |         |                 |   -  round bit (r)
         |         |                 -  guard bit (g)
         |         -  23 bit mantissa from a representation
         -  hidden bit
    
      This is the format used internally (on many, not all processors) for
      a single precision floating point value.
    
      When a mantissa is to be shifted in order to align radix points,
      the bits that fall off the least significant end of the mantissa
      go into these extra bits (guard, round, and sticky bits).
      These bits can also be set by the normalization step in multiplication,
      and by extra bits of quotient (remainder?) in division.
    
      The guard and round bits are just 2 extra bits of precision that
      are used in calculations.  The sticky bit is an indication of
      what is/could be in lesser significant bits that are not kept.
      If a value of 1 ever is shifted into the sticky bit position,
      that sticky bit remains a 1 ("sticks" at 1), despite further shifts.
    
      Example:
        mantissa from representation,  11000000000000000000100
        must be shifted by 8 places (to align radix points)
    
                                                       g r s
        Before first shift:  1.11000000000000000000100 0 0 0
        After 1 shift:       0.11100000000000000000010 0 0 0
        After 2 shifts:      0.01110000000000000000001 0 0 0
        After 3 shifts:      0.00111000000000000000000 1 0 0
        After 4 shifts:      0.00011100000000000000000 0 1 0
        After 5 shifts:      0.00001110000000000000000 0 0 1
        After 6 shifts:      0.00000111000000000000000 0 0 1
        After 7 shifts:      0.00000011100000000000000 0 0 1
        After 8 shifts:      0.00000001110000000000000 0 0 1
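
The shifting behaviour in this example can be sketched in Python as follows. The function `shift_right_with_grs` is a hypothetical helper; it treats the mantissa as a 24-bit integer (hidden bit included) and tracks the three extra bits separately.

```python
def shift_right_with_grs(mantissa, places):
    """Shift a 24-bit mantissa right, collecting guard, round, and sticky bits."""
    g = r = s = 0
    for _ in range(places):
        s = s | r              # once a 1 reaches the sticky bit, it stays
        r = g
        g = mantissa & 1       # bit falling off the end goes into the guard bit
        mantissa >>= 1
    return mantissa, g, r, s

start = 0b111000000000000000000100      # 1.11000000000000000000100
for places in range(9):
    m, g, r, s = shift_right_with_grs(start, places)
    print(places, format(m, "024b"), g, r, s)
```

The printed rows match the table above, with the first digit of each 24-bit string being the bit to the left of the radix point.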
    
    
    The IEEE standard for floating point arithmetic requires that the
    programmer be allowed to choose 1 of 4 methods for rounding:
    
    Method 1.    round toward 0 (also called truncation)
         figure out how many bits (digits) are available.  Take that many
         bits (digits) for the result and throw away the rest.
         This has the effect of making the value represented closer
         to 0.0
    
         example in decimal:
           .7783      if 3 decimal places available, .778
                      if 2 decimal places available, .77
                      if 1 decimal place available,  .7
    
         examples in binary, where only 2 bits are available to the right
         of the radix point:
           (underlined value is the representation chosen)
    
                    1.1101
                      |
              1.11    |    10.00
              ----

                    1.001
                      |
              1.00    |    1.01
              ----


                   -1.1101
                      |
             -10.00   |    -1.11
                           -----

                   -1.001
                      |
             -1.01    |   -1.00
                          -----
    
       With results from floating point calculations that generate
       guard, round, and sticky bits, just leave them off.  Note that
       this is VERY easy to implement!
    
       examples in the floating point format with guard, round and sticky bits:
                                    g r s
          1.11000000000000000000100 0 0 0
          1.11000000000000000000100  (mantissa used) 
    
          1.11000000000000000000110 1 1 0
          1.11000000000000000000110  (mantissa used)
    
          1.00000000000000000000111 0 1 1
          1.00000000000000000000111  (mantissa used)
    
    
    Method 2.  round toward positive infinity
        regardless of the value, round towards +infinity.
    
         example in decimal:
      1.23       if 2 digits are available, 1.3
     -2.86       if 2 digits are available, -2.8

         examples in binary, where only 2 bits are available to the right
         of the radix point:

                    1.1101
                      |
              1.11    |    10.00
                           -----

                    1.001
                      |
              1.00    |    1.01
                           ----
    
       examples in the floating point format with guard, round and sticky bits:
                                    g r s
          1.11000000000000000000100 0 0 0
          1.11000000000000000000100  (mantissa used, exact representation) 
    
          1.11000000000000000000100 1 0 0
          1.11000000000000000000101  (rounded "up")
    
         -1.11000000000000000000100 1 0 0
         -1.11000000000000000000100  (rounded "up")
    
          1.11000000000000000000001 0 1 0
          1.11000000000000000000010  (rounded "up")
    
          1.11000000000000000000001 0 0 1
          1.11000000000000000000010  (rounded "up")
    
    
    Method 3.  round toward negative infinity 
        regardless of the value, round towards -infinity.
    
         example in decimal:
      1.23       if 2 digits are available, 1.2
     -2.86       if 2 digits are available, -2.9

         examples in binary, where only 2 bits are available to the right
         of the radix point:

                    1.1101
                      |
              1.11    |    10.00
              ----

                    1.001
                      |
              1.00    |    1.01
              ----
    
       examples in the floating point format with guard, round and sticky bits:
                                    g r s
          1.11000000000000000000100 0 0 0
          1.11000000000000000000100  (mantissa used, exact representation) 
    
          1.11000000000000000000100 1 0 0
          1.11000000000000000000100  (rounded "down")
    
         -1.11000000000000000000100 1 0 0
         -1.11000000000000000000101  (rounded "down")
    
    
    Method 4.  round to nearest
         use representation NEAREST to the desired value.
         This works fine in all but 1 case:  where the desired value
         is exactly half way between the two possible representations.
    
         The halfway case:
         when the bits to the right of those being kept are exactly
         1000..., round to nearest uses the representation that has
         zero as its least significant bit.
    
         Examples:
    
                    1.1111   (3/4 of the way between; 10.00 is nearest)
                      |
              1.11    |    10.00
                           -----


                    1.1101   (1/4 of the way between; 1.11 is nearest)
                      |
              1.11    |    10.00
              ----


                    1.001  (the case of exactly halfway between)
                      |
              1.00    |    1.01
              ----


                   -1.1101  (1/4 of the way between; -1.11 is nearest)
                      |
             -10.00   |    -1.11
                           -----


                   -1.001  (the case of exactly halfway between)
                      |
             -1.01    |   -1.00
                          -----
    
     NOTE: this is a bit different than the "round to nearest" algorithm
     (for the "tie" case, .5) learned in elementary school for decimal numbers.
    
     
    
       examples in the floating point format with guard, round and sticky bits:
                                    g r s
          1.11000000000000000000100 0 0 0
          1.11000000000000000000100  (mantissa used, exact representation) 
    
          1.11000000000000000000000 1 1 0
          1.11000000000000000000001
    
          1.11000000000000000000000 0 1 0
          1.11000000000000000000000
    
          1.11000000000000000000000 1 1 1
          1.11000000000000000000001
    
          1.11000000000000000000000 0 0 1
          1.11000000000000000000000
    
          1.11000000000000000000000 1 0 0 (the "halfway" case)
          1.11000000000000000000000  (lsb is a zero)
    
          1.11000000000000000000001 1 0 0 (the "halfway" case)
          1.11000000000000000000010  (lsb is a zero)
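
The four methods differ only in when they add 1 to the kept mantissa. Here is a compact Python sketch of that decision, assuming the mantissa is a 24-bit magnitude with separate guard, round, and sticky bits. `round_mantissa` and its mode names are invented for this example, and a carry out of the mantissa (which would need renormalizing) is not handled.

```python
def round_mantissa(sign, mantissa, g, r, s, mode):
    """Return the rounded mantissa magnitude for one of the four methods."""
    inexact = g | r | s                  # 1 if any discarded bit is set
    if mode == "zero":                   # Method 1: truncate
        increment = 0
    elif mode == "+inf":                 # Method 2: only positives move away from 0
        increment = inexact if sign == 0 else 0
    elif mode == "-inf":                 # Method 3: only negatives move away from 0
        increment = inexact if sign == 1 else 0
    elif mode == "nearest":              # Method 4
        if g == 0:
            increment = 0                # less than halfway
        elif r | s:
            increment = 1                # more than halfway
        else:
            increment = mantissa & 1     # exactly halfway: make the lsb a zero
    else:
        raise ValueError(mode)
    return mantissa + increment

m = 0b111000000000000000000001           # 1.11000000000000000000001
print(format(round_mantissa(0, m, 1, 0, 0, "nearest"), "024b"))
```

The printed value is the second "halfway" example above: the mantissa ending in ...001 with g r s = 1 0 0 rounds to ...010.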
    
    A complete example of addition, using rounding.
    
    
        S    E          F 
        1  10000000   11000000000000000011111
      + 1  10000010   11100000000000000001001
      -------------------------------------------
    
    
      First, align the radix points by shifting the top value's mantissa
      2 places to the right (increasing the exponent by 2)
    
        S    E          mantissa (+h.b)          g  r  s
        1  10000000   1.11000000000000000011111  0  0  0 (before shifting)
        1  10000001   0.11100000000000000001111  1  0  0 (after 1 shift)
        1  10000010   0.01110000000000000000111  1  1  0 (after 2 shifts)
    
      Add mantissas
        1.11100000000000000001001  0  0  0
      + 0.01110000000000000000111  1  1  0
      --------------------------------------
       10.01010000000000000010000  1  1  0
    
    
       This must now be put back in the normalized form,
    
         E          mantissa                  g  r  s
       10000010   10.01010000000000000010000  1  1  0
    
    
       (shift mantissa right by 1 place, causing the exponent to increase by 1)
       10000011   1.00101000000000000001000   0  1  1
    
       S    E          mantissa                  g  r  s
       1  10000011   1.00101000000000000001000   0  1  1
    
    
       Now, we round.
    
       If round toward zero,
       1  10000011   1.00101000000000000001000
         giving a representation of
          1 10000011 00101000000000000001000
         in hexadecimal:
          1100 0001 1001 0100 0000 0000 0000 1000
          0x c 1 9 4 0 0 0 8
    
       If round toward +infinity,
       1  10000011   1.00101000000000000001000
         giving a representation of
          1 10000011 00101000000000000001000
         in hexadecimal:
          1100 0001 1001 0100 0000 0000 0000 1000
          0x c 1 9 4 0 0 0 8
    
       If round toward -infinity,
       1  10000011   1.00101000000000000001001
         giving a representation of
          1 10000011 00101000000000000001001
         in hexadecimal:
          1100 0001 1001 0100 0000 0000 0000 1001
          0x c 1 9 4 0 0 0 9
    
       If round to nearest,
       1  10000011   1.00101000000000000001000
         giving a representation of
          1 10000011 00101000000000000001000
         in hexadecimal:
          1100 0001 1001 0100 0000 0000 0000 1000
          0x c 1 9 4 0 0 0 8
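
The whole example (align, add, normalize, round) fits in a short Python sketch. It is a simplified illustration under several assumptions: both operands are normalized and have the same sign, the result does not overflow, and a carry produced by the final rounding increment is not renormalized. The names `shift_right_sticky` and `add_with_rounding` are hypothetical.

```python
def shift_right_sticky(value, places):
    """Shift right; OR each bit that falls off into the lowest (sticky) bit."""
    for _ in range(places):
        value = (value >> 1) | (value & 1)
    return value

def add_with_rounding(a, b, mode):
    """Add two same-sign normalized single precision patterns, then round."""
    sign = a >> 31
    ea, eb = (a >> 23) & 0xFF, (b >> 23) & 0xFF
    # Append 3 low-order bits (guard, round, sticky) to each 24-bit mantissa.
    ma = ((a & 0x7FFFFF) | 0x800000) << 3
    mb = ((b & 0x7FFFFF) | 0x800000) << 3
    if ea < eb:
        ea, eb, ma, mb = eb, ea, mb, ma
    mb = shift_right_sticky(mb, ea - eb)     # align radix points
    total = ma + mb                          # add mantissas
    if total >> 27:                          # 10.xxx: shift right, exponent + 1
        total = shift_right_sticky(total, 1)
        ea += 1
    mantissa, grs = total >> 3, total & 0b111
    g, rs = grs >> 2, grs & 0b11
    if mode == "zero":
        up = False
    elif mode == "+inf":
        up = sign == 0 and grs != 0
    elif mode == "-inf":
        up = sign == 1 and grs != 0
    else:                                    # "nearest"; ties go to even lsb
        up = g == 1 and (rs != 0 or mantissa & 1 == 1)
    if up:
        mantissa += 1
    return (sign << 31) | (ea << 23) | (mantissa & 0x7FFFFF)

a, b = 0xC060001F, 0xC1700009    # the two operands of the example
for mode in ("zero", "+inf", "-inf", "nearest"):
    print(mode, hex(add_with_rounding(a, b, mode)))
```

For these operands the sketch prints 0xc1940008 for every method except round toward -infinity, which gives 0xc1940009, matching the hand calculation.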
    
    A diagram to help with rounding when doing floating point multiplication.
    
    
                         [             ]
                       X [             ]
                     --------------------
         [             ] [ ][ ][           ]
          result in the   g  r  combined to
           given                produce
           precision            a sticky bit
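
One way to read the diagram: multiplying two 24-bit mantissas gives a product of up to 48 bits, of which the top 24 are kept and the rest are condensed into guard, round, and sticky bits. The Python sketch below shows that split; `product_with_grs` is a name made up here, and both inputs are assumed to be normalized 24-bit mantissas with the hidden bit included.

```python
def product_with_grs(ma, mb):
    """Multiply two 24-bit mantissas and split the product for rounding."""
    product = ma * mb                    # 47 or 48 significant bits
    if product >> 47:                    # product is 10.xxx or 11.xxx
        discarded, exp_adjust = 24, 1    # normalizing adds 1 to the exponent
    else:
        discarded, exp_adjust = 23, 0
    kept = product >> discarded                      # result in the given precision
    low = product & ((1 << discarded) - 1)           # everything below it
    g = (low >> (discarded - 1)) & 1
    r = (low >> (discarded - 2)) & 1
    s = 1 if low & ((1 << (discarded - 2)) - 1) else 0   # OR of all remaining bits
    return kept, g, r, s, exp_adjust

print(product_with_grs(0xC00000, 0xC00000))   # 1.5 x 1.5
print(product_with_grs(0x800001, 0x800001))   # (1 + 2^-23) squared
```

In the second call the only discarded 1 lies far below the round bit, so it shows up solely in the sticky bit.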
    
     
