### 3.1. Arithmetic and Logic Operations

**Reading Assignments and Exercises**

The ALU is the core of the computer - it performs
arithmetic and logic operations on data that not only realize the
goals of various applications (e.g., scientific and engineering
programs), but also manipulate addresses (e.g., pointer arithmetic).
In this section, we will overview algorithms used for the basic
arithmetic and logical operations. A key assumption is that twos
complement representation will be employed, unless otherwise noted.

#### 3.1.1. Boolean Addition

When adding two numbers, if the sum of the digits
in a given position equals or exceeds the modulus, then a

*carry*is propagated. For example, in Boolean addition, if two ones are added, the sum is obviously two (base 10), which exceeds the modulus of 2 for Boolean numbers (**B**=**Z**_{2}= {0,1}, the integers modulo 2). Thus, we record a zero for the sum and propagate a carry valued at one into the next more significant digit, as shown in Figure 3.1.**Figure 3.1.**Example of Boolean addition with carry propagation, adapted from [Maf01].

#### 3.1.2. Boolean Subtraction

When subtracting two numbers, two alternatives
present themselves. First, one can formulate a subtraction algorithm,
which is distinct from addition. Second, one can negate the
subtrahend (i.e., in

*a - b*, the subtrahend is*b*) then perform addition. Since we already know how to perform addition as well as twos complement negation, the second alternative is more practical. Figure 3.2 illustrates both processes, using the decimal subtraction 12 - 5 = 7 as an example.**Figure 3.2.**Example of Boolean subtraction using (a) unsigned binary representation, and (b) addition with twos complement negation - adapted from [Maf01].

Just as we have a carry in addition, the subtraction
of Boolean numbers uses a

*borrow*. For example, in Figure 3.2a, in the first (least significant) digit position, the difference 0 - 1 in the one's place is realized by borrowing a one from the two's place (next more significant digit). The borrow is propagated upward (toward the most significant digit) until it is zeroed (i.e., until we encounter a difference of 1 - 0).#### 3.1.3. Overflow

Overflow occurs when there are insufficient bits in a
binary number representation to portray the result of an arithmetic
operation. Overflow occurs because computer arithmetic is not closed
with respect to addition, subtraction, multiplication, or division.
Overflow

*cannot*occur in addition (subtraction), if the operands have different (resp. identical) signs.
To detect and compensate for overflow, one needs n+1
bits if an n-bit number representation is employed. For example, in
32-bit arithmetic, 33 bits are required to detect or compensate for
overflow. This can be implemented in addition (subtraction) by
letting a carry (borrow) occur into (from) the sign bit. To make a
pictorial example of convenient size, Figure 3.3 illustrates the
four possible sign combinations of differencing 7 and 6 using a
number representation that is four bits long (i.e., can represent
integers in the interval [-8,7]).

**Figure 3.3.**Example of overflow in Boolean arithmetic, adapted from [Maf01].

#### 3.1.4. MIPS Overflow Handling

MIPS raises an

*exception*when overflow occurs. Exceptions (or interrupts) act like procedure calls. The register`$epc`

stores the address of the instruction that
*caused*the interrupt, and the instruction`mfc `**register**, $epc

moves the contents of

`$epc`

to
*register*. For example,*register*could be`$t1`

. This is an efficient approach, since no conditional
branch is needed to test for overflow.
Two's complement arithmetic operations
(

`add`

, `addi`

, and `sub`

instructions) raise exceptions on overflow. In contrast,
unsigned arithmetic (`addu`

and `addiu`

)
instructions do not raise an exception on overflow, since they
are used for arithmetic operations on addresses (recall our
discussion of pointer arithmetic in Section 2.6). In terms
of high-level languages, C ignores overflows (always uses
`addu`

, `addiu`

, and `subu`

),
while FORTRAN uses the appropriate instruction to detect overflow.
Figure 3.4 illustrates the use of conditional branch on overflow
for signed and unsigned addition operations.
**Figure 3.4.**Example of overflow in Boolean arithmetic, adapted from [Maf01].

#### 3.1.5. Logical Operations

Logical operations apply to fields of bits within a 32-bit word, such as bytes or bit fields (in C, as discussed in the next paragraph). These operations include shift-left and shift-right operations (`sll`

and `srl`

), as well as bitwise
*and*,

*or*(

`and`

, `andi`

,
`or`

, `ori`

). As we saw in Section 2, bitwise
operations treat an operand as a vector of bits and operate on each
bit position.
C bit fields are used, for example, in programming
communications hardware, where manipulation of a bit stream is
required. In Figure 3.5 is presented C code for an example
communications routine, where a structure called

`receiver`

is formed from an 8-bit field called *receivedByte*and two one-bit fields called*ready*and*enable*. The C routine sets`receiver.ready`

to 0 and `receiver.enable`

to 1.
**Figure 3.5.**Example of C bit field use in MIPS, adapted from [Maf01].

Note how the MIPS code implements the functionality
of the C code, where the state of the registers $s0 and $s1 is
illustrated in the five lines of diagrammed register contents below
the code. In particular, the initial register state is shown in the
first two lines. The

`sll`

instruction loads the contents
of `$s1`

(the receiver) into `$s0`

(the data
register), and the result of this is shown on the second line of the
register contents. Next, the `srl`

instruction left-shifts
`$s0`

24 bits, thereby discarding the *enable*and*ready*field information, leaving just the received byte. To signal the receiver that the data transfer is completed, the`andi`

and `ori`

instructions are used to set
the enable and ready bits in `$s1`

, which corresponds to
the *receiver*. The data in`$s0`

has already been
received and put in a register, so there is no need for its further
manipulation.
### 3.2. Arithmetic Logic Units and the MIPS ALU

**Reading Assignments and Exercises**

In this section, we discuss hardware building blocks,
ALU design and implementation, as well as the design of a 1-bit ALU
and a 32-bit ALU. We then overview the implementation of the MIPS
ALU.

#### 3.2.1. Basic Concepts of ALU Design

ALUs are implemented using lower-level components
such as logic gates, including

*and*,*or*,*not*gates and multiplexers. These building blocks work with individual bits, but the actual ALU works with 32-bit registers to perform a variety of tasks such as arithmetic and shift operations.
In principle, an ALU is built from 32 separate 1-bit
ALUs. Typically, one constructs separate hardware blocks for each
task (e.g., arithmetic and logical operations), where each operation
is applied to the 32-bit registers in parallel, and the selection of
an operation is controlled by a multiplexer. The advantage of this
approach is that it is easy to add new operations to the instruction
set, simply by associating an operation with a multiplexer control
code. This can be done provided that the mux has sufficient capacity.
Otherwise, new data lines must be added to the mux(es), and the CPU
must be modified to accomodate these changes.

#### 3.2.2. 1-bit ALU Design

As a result, the ALU consists of 32 muxes (one for
each output bit) arranged in parallel to send output bits from each
operation to the ALU output.

**3.2.2.1. And/Or Operations.**As shown in Figure 3.6, a simple (1-bit) ALU operates

*in parallel*, producing all possible results that are then selected by the multiplexer (represented by an oval shape at the output of the

*and*/

*or*gates. The output C is thus selected by the multiplexer. (

*Note*: If the multiplexer were to be applied at the input(s) rather than the output, twice the amount of hardware would be required, because there are two inputs versus one output.)

**Figure 3.6.**Example of a simple 1-bit ALU, where the oval represents a multiplexer with a control code denoted by

*Op*and an output denoted by C - adapted from [Maf01].

**3.2.2.2. Full Adder.**Now let us consider the one-bit adder. Recalling the carry situation shown in Figure 3.1, we show in Figure 3.7 that there are two types of carries -

*carry in*(occurs at the input) and

*carry out*(at the output).

**Figure 3.7.**Carry-in and carry-out in Boolean addition, adapted from [Maf01].

Here, each bit of addition has three input bits
(A

_{i}, B_{i}, and CarryIn_{i}), as well as two output bits (Sum_{i}, CarryOut_{i}), where CarryIn_{i+1}= CarryOut_{i}. (Note: The "i" subscript denotes the i-th bit.) This relationship can be seen when considering the full adder's truth table, shown below:
Given the four one-valued results in the truth table,
we can use the sum-of-products method to construct a one-bit adder
circuit from four three-input

*and*gates and one four-input*or*gate, as shown in Figure 3.8a. The CarryOut calculation can be similarly implemented with three two-input*and*gates and one three-input*or*gate, as shown in Figure 3.8b. These two circuits can be combined to effect a one-bit full adder with carry, as shown in Figure 3.8c.(a) (b)

(c)

**Figure 3.7.**Full adder circuit (a) sum-of-products form from above-listed truth table, (b) CarryOut production, and (c) one-bit full adder with carry - adapted from [Maf01].

Recalling the symbol for the one-bit adder, we can add an
addition operation to the one-bit ALU shown in Figure 3.6. This is done
by putting two control lines on the output mux, and by having an
additional control line that inverts the

*b*input (shown as "Binvert") in Figure 3.9).(a) (b)

**Figure 3.9.**One-bit ALU with three operations:

*and*,

*or*, and addition: (a) Least significant bit, (b) Remaining bits - adapted from [Maf01].

#### 3.2.3. 32-bit ALU Design

The final implementation of the preceding technique
is in a 32-bit ALU that incorporates the

*and*,*or*, and addition operations. The 32-bit ALU can be simply constructed from the one-bit ALU by chaining the carry bits, such that CarryIn_{i+1}= CarryOut_{i}, as shown in Figure 3.10.**Figure 3.10.**32-bit ALU with three operations:

*and*,

*or*, and addition - adapted from [Maf01].

This yields a composite ALU with two 32-bit input
vectors

**a**and**b**, whose i-th bit is denoted by**a**_{i}and**b**_{i}, where i = 0..31. The result is also a 32-bit vector, and there are two control buses - one for Binvert, and one for selecting the operation (using the mux shown in Figure 3.9). There is one CarryOut bit (at the bottom of Figure 3.10), and no CarryIn.
We next examine the MIPS ALU and how it supports
operations such as shifting and branching.

#### 3.2.4. MIPS ALU Design

We begin by assuming that we have the generic one-bit
ALU designed in Sections 3.2.1-3.2.3, and shown below:

Here, the

*Bnegate*input is the same as the*Binvert*input in Figure 3.9, and we assume that we have three control inputs to the mux whose control line configuration is associated with an operation, as follows:**3.2.4.1. Support for the**The

`slt`

Instruction.`slt`

instruction (set on less-than)
has the following format:
**slt rd, rs, rt**

where rd = 1 if rs < rt, and rd = 0 otherwise.

Observe that the inputs rs and rt can represent high-level
language input variables A and B. Thus, we have the following implication:

A < B => A - B < 0 ,

which is implemented as follows:

**Step 1.**Perform subtraction using negation and a full adder

**Step 2.**Check most significant bit (sign bit)

**Step 3.**Sign bit tells us whether or not A < B

To implement

`slt`

, we need (a) new input
line called *Less*that goes directly to the mux, and (b) a new control code (111) to select the`slt`

operation.
Unfortunately, the result for `slt`

cannot be taken
directly as the output from the adder. Instead, we need a new output
line called *Set*that is used only for the`slt`

instruction. Overflow detection logic is also associated with this
bit. The additional logic that supports `slt`

is shown in
Figure 3.11.
**Figure 3.11.**One-bit ALU with additional logic for

*slt*operation - adapted from [Maf01].

Thus, for a 32-bit ALU, the additional cost of the

`slt`

instruction is (a) augmentation of each of 32 muxes
to have three control lines instead of two, (b) augmentation of each
of 32 one-bit ALU's control signal structure to have an additional
(*Less*) input, and (c) the addition of overflow detection circuitry, a*Set*output, and an*xor*gate on the output of the sign bit.**3.2.4.2. Support for the**Recall the branch-on-not-equal instruction

`bne`

Instruction.```
bne
r1, r2, Label
```

, where r1 and r2 denote registers and Label is a
branch target label or address. To implement `bne`

, we
observe that the following implication holds:
A - B = 0 => A = B .

then add hardware to test if the comparison between A
and B implemented as (A - B) is zero. Again, this can be done using
negation and the full adder that we have already designed as part of
the ALU. The additional step is to

*or*all 32 results from each of the one-bit ALUs, then invert the output of the*or*operation. Thus, if all 32 bits from the one-bit full adders are zero, then the output of the*or*gate will be zero (inverted, it will be one). Otherwise, the output of the*or*gate wil be one (inverted, it will be zero). We also need to consider A - B, to see if there is overflow when A = 0. A block diagram of the hardware modification is shown in Figure 3.12.**Figure 3.12.**32-bit ALU with additional logic to support

*bne*and

*slt*instructions - adapted from [Maf01].

Here, the additional hardware involves 32 separate
output lines from the 342 one-bit adders, as well as a cascade of

*or*gates to implement a 32-input*nor*gate (which doesn't exist in practice, due to excessive fan-in requirement).**3.2.4.3. Support for Shift Instructions.**Considering the

`sll`

, `srl`

, and `sra`

instructions, these are supported in the ALU under design by adding a
data line for the shifter (both left and right). However, the shifters
are much more easily implemented at the transistor level (e.g., outside
the ALU) rather than trying to fit more circuitry onto the ALU itself.
In order to implement a shifter external to the ALU,
we consider the design of a

*barrel shifter*, shown schematically in Figure 3.13. Here, the closed siwtch pattern, denoted by black filled circles, is controlled by the CPU through control lines to a mux or decoder. This allows data line x_{i}to be sent to output x_{j}, where i and j can be unequal.**Figure 3.13.**Four bit barrel shifter, where "x >> 1" denotes a shift amount greater than one - adapted from [Maf01].

This type of N-bit shifter is well understood and
easy to construct, but has space complexity of

**O**(N^{2}).**3.2.4.4. Support for Immediate Instructions.**In the MIPS immediate instruction formats, the first input to the ALU is the first register (we'll call it

*rs*) in the immediate command, while the second input is either data from a register

*rt*or a zero or sign-extended constant (immediate). To support this type of instruction, we need to add a mux at the second input of the ALU, as shown in Figure 3.14. This allows us to select whether

*rt*or the sign-extended immediate is input to the ALU.

**Figure 3.14.**Supporting immediate instructions on a MIPS ALU design, where IR denotes the instruction register, and (/

^{16}) denotes a 16-bit parallel bus - adapted from [Maf01].

#### 3.2.5. ALU Performance Issues

When estimating or measuring ALU performance, one
wonders if a 32-bit ALU is as fast as a 1-bit ALU - what is the degree
of parallelism, and do all operations execute in parallel? In
practice, some operations on N-bit operands (e.g., addition with
sequential propagation of carries) take

**O**(N) time. Other operations, such as bitwise logical operations, take**O**(1) time. Since addition can be implemented in a variety of ways, each with a certain level of parallelism, it is wise to consider the possibility of a full adder being a computational bottleneck in a simple ALU.
We previously discussed the ripple-carry adder
(Figure 3.10) that propagates the carry bit from stage i to stage i+1.
It is readily seen that, for an N-bit input,

**O**(N) time is required to propagate the carry to the most significant bit. In contrast, the fastest N-bit adder uses**O**(log_{2}N) stages in a tree-structured configuration with N-1 one-bit adders. Thus, the complexity of this technique is**O**(log_{2}N) work. In a sequential model of computation, this translates to**O**(log_{2}N) time. If one is adding smaller numbers (e.g., up to 10-bit integers with current memory technology), then a*lookup table*can be used that (1) forms a memory address A by concatenating binary representations of the two operands, and (2) produces a result stored in memory that is accessed using A. This takes**O**(1) time, that is dependent upon memory bandwidth.
An intermediate approach between these extremes is to
use a

*carry-lookahead adder*(CLA). Suppose we do not know the value of the carry-in bit (which is usually the case). We can express the generation (g) of a carry bit for the i-th position of two operands*a*and*b*, as follows:
g

_{i}= a_{i}b_{i},
where the i-th bits of

*a*and*b*are*and*-ed. Similarly, the propagated carry is expressed as:
p

_{i}= a_{i}+ b_{i},
where the i-th bits of

*a*and*b*are*or*-ed. This allows us to recursively express the carry bits in terms of the carry-in c_{0}, as follows:
Did we get rid of the ripple? (Well, sort of...) What
we did was transform the work involved in carry propagation from the
adder circuitry to a large equation for c

_{N}. However, this equation must still be computed in hardware. (*Lesson*: In computing, you don't get much for free.)
Unfortunately, it is prohibitively costly to build a
CLA circuit for operands as large as 16 bits. Instead, we can use the
CLA principle to create a two-tiered circuit, for example, at the
bottom level an array of four 4-bit full adders (economical to construct),
connected at the top level by a CLA, as shown below:

Using a two-level CLA architecture, where lower-
(upper-)case g and p denote the first (second) level generates and
carries, we have the following equations:

P

P

P

P

G

G

G

G

_{0}= p_{3}+ p_{2}+ p_{1}+ p_{0}P

_{1}= p_{7}+ p_{6}+ p_{5}+ p_{4}P

_{2}= p_{11}+ p_{10}+ p_{9}+ p_{8}P

_{3}= p_{15}+ p_{14}+ p_{13}+ p_{12}G

_{0}= g_{3}+ p_{3}g_{2}+ p_{3}p_{2}g_{1}+ p_{3}p_{2}p_{1}g_{0}G

_{1}= g_{7}+ p_{7}g_{6}+ p_{7}p_{6}g_{5}+ p_{7}p_{6}p_{5}g_{4}G

_{2}= g_{11}+ p_{11}g_{10}+ p_{11}p_{10}g_{9}+ p_{11}p_{10}p_{9}g_{8}G

_{3}= g_{15}+ p_{15}g_{14}+ p_{15}p_{14}g_{13}+ p_{15}p_{14}p_{13}g_{12}
Assuming that

*and*as well as*or*gates have the same propagation delay, comparative analysis of the ripple carry vs. carry lookahead adders reveals that the total time to compute a CLA result is the summation of all gate delays along the longest path through the CLA. In the case of the 16-bit adder exemplified above, the CarryOut signals c_{16}and C_{4}define the longest path. For the ripple carry adder, this path has length 2(16) = 32.
For the two-level CLA, we get two levels of logic in
terms of the architecture (P and G versus p and g). P

_{i}is specified in one level of logic using p_{i}. G_{i}is specified in one level of logic using p_{i}and g_{i}. Also, p_{i}and g_{i}each represent one level of logic computed in terms of inputs a_{i}and b_{i}. Thus, the CLA critical path length is 2 + 2 + 1 = 5, which means that two-level 16-bit CLA is 6.4 = 32/5 times faster than a 16-bit ripple carry adder.
It is also useful to note that the logic equation
for a one-bit adder can be expressed more simply with

*xor*logic, for example:
A + B = A

*xor*B*xor*CarryIn .
In some technologies,

*xor*is more efficient than*and*/*or*gates. Also, processors are now designed in CMOS technology, which allows fewer muxes (this also applies to the barrel shifter). However, the design principles are similar.#### 3.2.6. Summary

We have shown that it is feasible to build an ALU to
support the MIPS ISA. The key idea is to use a multiplexer to select
the output from a collection of functional units operating in
parallel. We can replicate a 1-bit ALU that uses this principle, with
appropriate connections between replicates, to produce an N-bit ALU.

Important things to remember about ALUs are: (a) all
of the gates are working in parallel, (b) the speed of a gate is
affected by the number of inputs (degree of

*fan-in*), and (c) the speed of a circuit depends on the number of gates in the longest computational path through the circuit (this can vary per operation). Finally, we have shown that changes in architectural organization can improve performance, similar to better algorithms in software.### 3.3. Boolean Multiplication and Division

**Reading Assignments and Exercises**

Multiplication is more complicated than addition,
being implemented by shifting as well as addition. Because of the
partial products involved in most multiplication algorithms, more time
and more circuit area is required to compute, allocate, and sum the
partial products to obtain the multiplication result.

#### 3.3.1. Multiplier Design

We herein discuss three versions of the multiplier
design based on the

*pencil-and-paper algorithm for multiplication*that we all learned in grade school, which operates on Boolean numbers, as follows:Multiplicand: 0010 # Stored in register r1 Multiplier: x 1101 # Stored in register r2 -------------------- Partial Prod 0010 # No shift for LSB of Multiplier " " 0000 # 1-bit shift of zeroes(can omit)" " 0010 # 2-bit shift for bit 2 of Multiplier " " 0010 # 3-bit shift for bit 3 of Multiplier -------------------- # Zero-fill the partial products and add PRODUCT 0011010 # Sum of all partial products -> r3

A flowchart of this algorithm, adapted for multiplication
of 32-bit numbers, is shown in Figure 3.15, below, together with a
schematic representation of a simple ALU circuit that implements this
version of the algorithm. Here, the multiplier and the multiplicand
are shifted relative to each other, which is more efficient than
shifting the partial products alone.

(a)

(b)

**Figure 3.15.**Pencil-and-paper multiplication of 32-bit Boolean number representations: (a) algorithm, and (b) simple ALU circuitry - adapted from [Maf01].

The second version of this algorithm is shown in
Figure 3.16. Here, the product is shifted with respect to the
multiplier, and the multiplicand is shifted after the product register
has been shifted. A 64-bit register is used to store both the
multiplicand and the product.

(a)

(b)

**Figure 3.16.**Second version of pencil-and-paper multiplication of 32-bit Boolean number representations: (a) algorithm, and (b) schematic diagram of ALU circuitry - adapted from [Maf01].

The final version puts results in the product
register if and only if the least significant bit of the product
produced on the previous iteration is one-valued. The product
register only is shifted. This reduces by approximately 50 percent
the amount of shifting that has to be done, which reduces time and
hardware requirements. The algorithm and ALU schematic diagram is
shown in Figure 3.17.

(a)

(b)

**Figure 3.17.**Third version of pencil-and-paper multiplication of 32-bit Boolean number representations: (a) algorithm, and (b) schematic diagram of ALU circuitry - adapted from [Maf01].

Thus, we have the following shift-and-add scheme for
multiplication:

The preceding algorithms and circuitry does not hold
for signed multiplication, since the bits of the multiplier no longer
correspond to shifts of the multiplicand. The following example is
illustrative:

A solution to this problem is Booth's Algorithm,
whose flowchart and corresponding schematic hardware diagram are
shown in Figure 3.18. Here, the examination of the multiplier
is performed with

*lookahead*toward the next bit. Depending on the bit configuration, the multiplicand is positively or negatively signed, and the multiplier is shifted or unshifted.(a)

(b)

**Figure 3.18.**Booth's procedure for multiplication of 32-bit Boolean number representations: (a) algorithm, and (b) schematic diagram of ALU circuitry - adapted from [Maf01].

Observe that Booth's algorithm requires only the
addition of a subtraction step and the comparison operations for
the two-bit codes, versus the one-bit comparison in the preceding
three algorithms. An example of Booth's algorithm follows:

Here N = 4 iterations of the loop are required
to produce a product from two N = 4 digit operands. Four shifts and
two subtractions are required. From the analysis of the algorithm
shown in Figure 3.18a, it is easily seen that the maximum work for
multiplying two N-bit numbers is given by

**O**(N) shift and addition operations. From this, the worst-case computation time can be computed given CPI for the shift and addition instructions, as well as cycle time of the ALU.#### 3.3.2. Design of Arithmetic Division Hardware

Division is a similar operation to multiplication,
especially when implemented using a procedure similar to the algorithm
shown in Figure 3.18a. For example, consider the pencil-and-paper
method for dividing the byte 10010011 by the nybble 1011:

Dividend = Quotient · Divisor + Remainder .

**3.3.2.1. Unsigned Division.**The

*unsigned*division algorithm that is similar to Booth's algorithm is shown in Figure 3.19a, with an example shown in Figure 3.19b. The ALU schematic diagram in given in Figure 3.19c. The analysis of the algorithm and circuit is very similar to the preceding discussion of Booth's algorithm.

(a)

(b)

(c)

**Figure 3.19.**Division of 32-bit Boolean number representations: (a) algorithm, (b) example using division of the unsigned integer 7 by the unsigned integer 3, and (c) schematic diagram of ALU circuitry - adapted from [Maf01].

**3.3.2.2. Signed Divisiion.**With signed division, we negate the quotient if the signs of the divisor and dividend disagree. The remainder and the divident must have the same signs. The governing equation is as follows:

Remainder = Divident - (Quotient · Divisor) ,

and the following four cases apply:

We present the preceding division algorithm, revised for
signed numbers, as shown in Figure 3.20a. Four examples, corresponding
to each of the four preceding sign permutations, are given in Figure 3.20b
and 3.20c.

(a)

(b)

(c)

**Figure 3.20.**Division of 32-bit Boolean number representations: (a) algorithm, and (b,c) examples using division of +7 or -7 by the integer +3 or -3; adapted from [Maf01].

Self-Exercise.Be able to trace each example shown in Figure 3.20b,c through the algorithm whose flowchart is given in Figure 3.20a. Know how each part of the algorithm works, and why it behaves that way.Hint: This exercise, or a part of it, is likely to be an exam question.

**3.3.2.3. Divisiion in MIPS.**MIPS supports multiplication and division using existing hardware, primarily the ALU and shifter. MIPS needs one extra hardware component - a 64-bit register able to support

`sll`

and `sra`

instructions. The upper (high) 32 bits of the register contains the
remainder resulting from division. This is moved into a register in
the MIPS register stack (e.g., `$t0`

) by the
`mfhi`

command. The lower 32 bits of the 64-bit register
contains the quotient resulting from division. This is moved into a
register in the MIPS register stack by the `mflo`

command.
In MIPS assembly language code, signed division is
supported by the

`div`

instruction and unsigned division,
by the `divu`

instruction. MIPS hardware does not check for
division by zero. *Thus, divide-by-zero exception must be detected and handled in system software*. A similar comment holds for overflow or underflow resulting from division.
Figure 3.21 illustrates the MIPS ALU that supports
integer arithmetic operations (+,-,x,/).

**Figure 3.21.**MIPS ALU supporting the integer arithmetic operations (+,-,x,/), adapted from [Maf01].

Self-Exercise.Show how the MIPS ALU in Figure 3.21 supports the integer arithmetic operations (+,-,x,/) using the algorithms and hardware diagrams given thus far.Hint: This exercise, or a part of it, is likely to be an exam question.

### 3.4. Floating Point Arithmetic

**Reading Assignments and Exercises**

Floating point (FP) representations of decimal numbers are
essential to scientific computation using

*scientific notation*. The standard for floating point representation is the IEEE 754 Standard. In a computer, there is a tradeoff between range and precision - given a fixed number of binary digits (bits), precision can vary inversely with range. In this section, we overview decimal to FP conversion, MIPS FP instructions, and how registers are used for FP computations.
We have seen that an n-bit register can represent
unsigned integers in the range 0 to 2

^{n}-1, as well as signed integers in the range -2^{n-1}to -2^{n-1}-1. However, there are very large numbers (e.g., 3.15576 · 10^{23}), very small numbers (e.g., 10^{-25}), rational numbers with repeated digits (e.g., 2/3 = 0.666666...), irrationals such as 2^{1/2}, and transcendental numbers such as e = 2.718..., all of which need to be represented in computers for scientific computation to be supported.
We call the manipulation of these types of numbers

*floating point arithmetic*because the decimal point is not fixed (as for integers). In C, such variables are declared as the`float`

datatype.
#### 3.4.1. Scientific Notation and FP Representation

Scientific notation has the following configuration:

and can be in

*normalized form*(mantissa has exactly one digit to the left of the decimal point, e.g., 2.3425 · 10^{-19}) or*non-normalized form*. Binary scientiic notation has the folowing configuration, which corresponds to the decimal forms:
Assume that we have the following

*normal format*for scientific notation in Boolean numbers:
+1.xxxxxxx

_{2}· w^{yyyyy2},
where "xxxxxxx" denotes the

*significand*and "yyyyy" denotes the*exponent*and we assume that the number has sign S. This implies the following 32-bit representation for FP numbers:
which can represent decimal numbers ranging from -2.0
· 10

^{-38}to 2.0 · 10^{38}.#### 3.4.2 Overflow and Underflow

In FP, overflow and underflow are slightly different
than in integer numbers. FP overflow (underflow) refers to the
positive (negative) exponent being too large for the number of bits
alloted to it. This problem can be somewhat ameliorated by the use of

*double precision*, whose format is shown as follows:
Here, two 32-bit words are combined to support an
11-bit signed exponent and a 52-bit significand. This representation
is declared in C using the

`double`

datatype, and can support
numbers with exponents ranging from -308_{10}to 308_{10}. The primary advantage is greater precision in the mantissa.
The following chart illustrates specific types of
overflow and underflow encountered in standard FP representation:

#### 3.4.3. IEEE 754 Standard

Both single- and double-precision FP representations
are supported by the IEEE 754 Standard, which is used in the vast
majority of computers since its publication in 1980. IEEE 754
facilitates the porting of FP programs, and ensures minimum standards
of quality for FP computer arithmetic. The result is a signed
representation - the sign bit is 1 if the FP number represented by
IEEE754 is negative. Otherwise, the sign is zero. A leading value of
1 in the significand is implicit for normalized numbers. Thus, the
significand, which always has a value between zero and one, occupies
23 + 1 bits in single-precision FP and 52 + 1 bits in double
precision. Zero is represented by a zero significand and a zero
exponent - there is no leading value of one in the significand. The
IEEE 754 representation is thus computed as:

FPnumber = (-1)

^{S}· (1 + Significand) · 2^{Exponent}.
As a parenthetical note, the significand can be translated
into decimal values via the following expansion:

With IEEE 754, it is possible to manipulate FP
numbers without having special-purpose FP hardware. For example,
consider the sorting of FP numbers. IEEE 754 facilitates breaking FP
numbers up into three parts (sign, significant, exponent). The
numbers to be sorted are ordered first according to sign (negative <
positive), second according to exponent (larger exponent => larger number),
and third according to significand (when one has at least two numbers with
the same exponents).

Another issue of interest in IEEE 754 is

*biased notation*for exponents. Observe that twos complement notation does not work for exponents: the largest negative (positive) exponent is 00000001_{2}(11111111_{2}). Thus, we must add a*bias term*to the exponent to center the range of exponents on the bias number, which is then equated to zero. The bias term is 127 (1023) for the IEEE 754 single-precision (double-precision) representation. This implies that
FPnumber = (-1)

^{S}· (1 + Significand) · 2^{(Exponent - Bias)}.
As a result, we have the following example of binary
to decimal floating point conversion:

Decimal-to-binary FP conversion is somewhat more
difficult. Three cases pertain: (1) the decimal number can be
expressed as a fraction n/d where d is a power of two; (2) the decimal
number has repeated digits (e.g., 0.33333); or (3) the decimal number
does not fit either Case 1 or Case 2. In Case 1, one selects the
exponent as -log

_{2}(d), and converts n to binary notation. Case 3 is more difficult, and will not be discussed here. Case 2 is exemplified in the following diagram:
Here, the significand is 101 0101 0101 0101 0101 0101,
the sign is negative (representation = 1), and the exponent is computed
as 1 + 127 = 128

_{10}= 1000 0000_{2}. This yields the following representation in IEEE 754 standard notation:
The following table summarizes special values that
can be represented using the IEEE 754 standard.

**Table 3.1.**Special values in the IEEE 754 standard.

Of particular interest in the preceding table is the

*NaN*(not a number) representation. For example, when taking the square root of a negative number, or when dividing by zero, we encounter operations that are undefined in the arithmetic operations over real numbers. These results are called NaNs and are represented with an exponent of 255 and a zero significand. NaNs can help with debugging, but they contaminate calculations (e.g., NaN +*x*= NaN). The recommended approach to NaNs, especially for software designers or engineers early in their respective careers, is not to use NaNs.
Another variant of FP representation is denormalized
numbers, also called

*denorms*. These number representations were developed to remedy the problem of a gap among representable FP numbers near zero. For example, the smallest positive number is x = 1.00... · 2^{-127}, and the second smallest positive number is y = 1.001_{2}· 2^{-127}= 2^{-127}+ 2^{-150}. This implies that the gap between zero and x is 2^{-127}and that the gap between x and y is 2^{-150}, as shown in Figure 3.22a.(a) (b)

**Figure 3.22.**Denorms: (a) Gap between zero and 2

^{-127}, and (b) Denorms close this gap - adapted from [Maf01].

This situation can be remedied by omitting the leading one
from the significand, thereby

*denormalizing*the FP representation. The smallest positive number is now the denorm 0.0...1 · 2^{-127}= 2^{-150}, and the second smallest positive number is 2^{-149}.#### 3.4.4. FP Arithmetic

Applying mathematical operations to real numbers
implies that some error will occur due to the floating point
representation. This is due to the fact that FP addition and
subtraction are not associative, because the FP representation is only
an approximation to a real number.

Example 1.Using decimal numbers for clarity, let x = -1.5 · 10^{38}, y = 1.5 · 10^{38}, and z = 1.0. With floating point representation, we have:x + (y + z) = -1.5 · 10and^{38}+ (1.5 · 10^{38}+ 1.0) = 0.0

(x + y) + z = (-1.5 · 10^{38}+ 1.5 · 10^{38}) + 1.0 = 1.0The difference occurs because the value 1.0 cannot be distinguished in the significand of 1.5 · 10^{38}due to insufficient precision (number of digits) of the significand in the FP representation of these numbers (IEEE 754 assumed).

The preceding example leads to several implementational
issues in FP arithmetic. Firstly,

*rounding*occurs when performing math on real numbers, due to lack of sufficient precision. For example, when multiplying two N-bit numbers, a 2N-bit product results. Since only the upper N bits of the 2N bit product are retained, the lower N bits are*truncated*. This is also called*rounding toward zero*.
Another type of rounding is called

*rounding to infinity*. Here, if rounding toward +infinity, then we always round up. For example, 2.001 is rounded up to 3, -2.001 is rounded up to 2. Conversely, if rounding toward -infinity, then we always round down. For example, 1.999 is rounded down to 1, -1.999 is rounded down to -2. There is a more familiar technique, for example, where 3.7 is rounded to 4, and 3.1 is rounded to 3. In this case, we resolve rounding from*n*.5 to the nearest even number, e.g., 3.5 is rounded to 4, and -2.5 is rounded to 2.
A second implementational issue in FP arithmetic is
addition and subtraction of numbers that have nonzero significands and
exponents. Unlike integer addition, we can't just add the significands.
Instead, one must:

- Denormalize the operands and shift one of the operands to make the exponents of both numbers equal (we denote the exponent by E).
- Add or subtract the significands to get the resulting significand.
- Normalize the resulting significand and change E to reflect any shifts incurred by normalization.

### 3.5. Floating Point in MIPS

**Reading Assignments and Exercises**

The MIPS FP architecture uses separate floating point
insturctions for IEEE 754 single and double precision. Single
precision uses

`add.s`

, `sub.s`

,
`mul.s`

, and `div.s`

, whereas double precision
instructions are `add.d`

, `sub.d`

,
`mul.d`

, and `div.d`

. These instructions are
much more complicated than their integer counterparts. Problems with
implementing FP arithmetic include inefficiencies in having different
instructions that take significantly different times to execute (e.g.,
division versus addition). Also, FP operations require much more
hardware than integer operations.
Thus, in the spirit of RISC design philosophy, we
note that (a) a particular datum is not likely to change its datatype
within a program, and (b) some types of programs do not require FP
computation. Thus, in 1990, the MIPS designers decided to separate
the FP computations from the remainder of the ALU operations, and use
a separate chip for FP (called the

*coprocessor*). A MIPS coprocessor contains 32 32-bit registers designated as`$f0`

, `$f1`

, ..., etc. Most of these registers
are specified in the `.s`

and `.d`

instructions.
Double precision operands are stored in *register pairs*(e.g.,`$f0,$f1`

up to `$f30,$f31`

).
The CPU thus handles all the regular computation,
while the coprocessor handles the floating point operations. Special
instructions are required to move data between the coprocessor(s) and
CPU (e.g.,

`mfc0`

, `mtc0`

, `mfc0`

,
`mtc0`

, etc.), where c*n*refers to coprocessor #*n*. Similarly, special I/O operations are required to load and store data between the coprocessor and memory (e.g.,`lwc0`

, `swc0`

, `lwc1`

,
`swc1`

, etc.)
FP coprocessors require very complex hardware, as shown
in Figure 3.23, which portrays only the hardware required for addition.

**Figure 3.23.**MIPS ALU supporting floating point addition, adapted from [Maf01].

The use of floating point operations in MIPS assembly
code is described in the following simple example, which implements a
C program designed to convert Fahrenheit temperatures to Celsius.

Here, we assume that there is a coprocessor c1
connected to the CPU. The values 5.0 and 9.0 are respectively loaded
into registers

`$f16`

and `$f18`

using the
`lwc1`

instruction with the global pointer as base address
and the variables `const5`

and `const9`

as
offsets. The single precision division operation puts the quotient of
5.0/9.0 into `$f16`

, and the remainder of the computation
is straightforward. As in all MIPS procedure calls, the
`jr`

instruction returns control to the address stored in
the `$ra`

register.
## No comments:

## Post a comment