Real Computer Science begins where we almost stop reading ...: Computer Abstractions, and Technology

Sunday, 6 October 2013

Computer Abstractions, and Technology

Overview of Computer Abstractions

As shown in Figure 1.1, computer systems span many levels of detail, which in computer science we call levels of abstraction. This hierarchy is reviewed in Appendix B.

Figure 1.1. Levels of abstraction in computer systems, adapted from [Maf01]

In this course, we are concerned with the following components:

Operating System - Provides a convenient interface between (a) the user and his/her application software, and (b) the hardware (sometimes called the bare machine).
Assembler - Translates assembly language, a primitive type of programming language, into machine code, which is a stream of ones and zeroes.
Instruction Set Architecture(ISA) - Interfaces the software (listed above) to the hardware (listed below), and provides support for programming.
Processor, Memory, and I/O System - These components support the execution of machine code instructions expressed in terms of the ISA.
Datapath and Control - Provide a convenient abstraction for connecting the processor, memory, and I/O system and controlling their function efficiently.

If you have taken Dr. Mafla's version of this course, you will be acquainted with the following five components of a computer:

Input - Provides data and program information
Datapath - Mediates I/O
Control - Implements control of calculation and communication or I/O processes
Memory - Storage and retrieval of programs or data
Output - Result of running program on processor using input

1.2. Overview of Computer Abstractions

Abstractions help us express intangible concepts in visible representations that can be manipulated. For example, a data structure is an abstraction that organizes concepts or numbers in a comprehensible, organized representation that can be manipulated by a computer program. Figure 1.2 illustrates another view of a computer system, which is comprised of different levels of language and means of translating these languages into lower-level languages. Finally, the microprogram is loaded onto the hardware.

1.2.1. Different Abstract Views

Figure 1.2. Another view of levels of abstraction in computer systems, adapted from [Maf01].

The salient concepts for Figure 1.2 are listed as follows:

Level 5 - Problem Oriented Language - Provides a convenient interface and applications engine that helps the user produce results specific to a given application area. For example, Microsoft Word is used for document creation or editing, Excel for accounting spreadsheets, etc. The language at this level is usually a sequence of keystrokes or a high-level scripting language. In software design, the language is a high-level programming language such as C, C++, or Java.
Level 4 - Assembly Language - Assembly is a very detailed language that helps the systems programmer or software designer move information around in a computer architecture in a highly specific way. For example, many compilers (programs that translate programming language into an assembly-like language) are written in assembly language. The advantage to assembly language is speed and power in accessing various features of the hardware.
Level 3 - Operating System Machine - Provides a convenient interface between assembly language and the abstraction of the hardware architecture's instruction set. Operating systems generally contain many libraries that help a software developer or programmer connect to lower-level system functions (e.g., I/O, memory allocation, etc.) in an organized way.
Level 2 - Instruction Set Architecture (ISA) - One of the most important parts of a computer is the interface between the lowest-level software and the hardware. The ISA includes anything programmers use to make a binary machine language program work correctly, including instructions, I/O, etc. The ISA facilitates design of functionality independent of the hardware.
Level 1 - Microarchitectural Level - Microinstructions are low-level control instructions that define the set of datapath control signals which apply to a given state of a computing machine. The microinstructions, together with their sequencing, comprise the microarchitecture, whose purpose is to rigorously and consistently express the control of logic circuits that comprise the computer hardware. Designing this control in terms of a program that implements machine instructions in terms of simpler microinstructions is called microprogramming.
Level 0 - Digital Logic - The circuitry that makes a digital computer run is called logic. All processes of a digital computer are expressed in terms of functions of ones and zeros, for example, and, or, and not functions. We will review these logic functions in Section 1.4.

A more general view of the abstractions shown in Figure 1.2 is given in Figure 1.3. Note the recursive levels of translation, from an actual machine (digital logic level in Figure 1.2) to various virtual machines (higher levels of abstraction).

Figure 1.3. Generalized view of levels of abstraction in computer systems, adapted from [Maf01].

This leads to the following implementational view, where the scope of this course is circled in red.

Figure 1.4. Another view of levels of abstraction in computer systems, adapted from [Maf01].

1.2.2. RISC versus CISC

When computers were first developed, they had very small instruction sets, because the algorithms and hardware for complicated tasks had not yet been developed. As computer design continued into the 1960s, high-level languages (e.g., FORTRAN, Cobol, Lisp) evolved. Hardware designers formed the idea that it might be useful to develop different computers for each type of language - a FORTRAN machine, a COBOL machine, etc. When this approach was tried, it resulted in very complicated instruction sets. Parallel to this development was the IBM philosophy of upward compatibility, which they also tried to implement in hardware. This produced a huge collection of instructions to support all their old computer programs (called legacy code). The result of this was called Complex Instruction Set Computing (CISC), whose philosophy is summarized as follows:

Bigger is better!
Make the hardware as "smart" (and, hence, as complex) as possible.
If you have a great sprawling architecture, that's ok. The hardware fabricators will figure out what to do with your design.
Don't worry about whether or not the system design is neatly partitioned into layers. (One great sea of logic gates would be ok until we figure out something else that works.)
When one little part fails, the whole system dies and we will never find out why. That's ok - just build another CPU chip from the ground up. Maybe that one will work better.

CISC has many problems. Some of the bigger problems include lack of maintainability, lack of verifiability, and brittleness. In practice, humans don't know how to verify or maintain really complicated designs. And, we don't yet have software that can perform all the verification and maintenance tasks for us. As a result, as CISC machines got more and more complex, they failed considerably more frequently. This yielded brittle (non-robust) performance in practical computing problems. As the world became more dependent on computers, the CISC design philosophy gradually became unacceptable.

In response to this problem, computer designers returned back to the primitive roots of computer science, and developed the Reduced Instruction Set Computing (RISC) philosophy. The main concept in RISC is that of a very simple Instruction Set Architecture. There is a compact microinstruction set, into which every high-level command or instruction is translated by the compiler. RISC computers tend to run faster, are smaller, and have fewer problems because they have a simple instruction set. The RISC philosophy includes the following concepts:

Small is beautiful.
Keep the hardware simple and stupid (KISS design philosophy).
Hardware and software designers should work together to make the architecture simple and modular.
Neatly partition the system design into layers, as shown in Figures 1.2 and 1.3. Then, take the vast majority of the functionality that CISC implements in hardware, and put it into software (using compiler transformations) instead.
Make the hardware and compiler robust, so the entire system can perform reliably.

By keeping the hardware small and modular, the design and fabrication, maintenance, and debugging costs are reduced. This makes sense from an economic perspective. It is also easier to make new generations of RISC computer chips, and to produce them more quickly. This implies potentially greater profits by shortening both the development and product life cycles.

Modularity and simplicity in hardware and software help designers and engineers achieve greater robustness, because the system can be maintained and understood more easily. With simple hardware and software, the functionality of the design can be verified more extensively using software-based testing and proof-of-correctness. Finally, it makes sense to put the CISC complexity into the compiler software, because the software can be modified and corrected much more easily, inexpensively, and quickly than hardware can be changed or updated. Again, this makes good economic sense, because development and re-work costs are significantly reduced.

1.2.3. Concept Summary

Thus far, we have seen that computer hardware (software) evolved in the following stepwise fashion:

Step 1. - Bare logic circuits (plugboard based programming)
Step 2. - Microarchitectural control (machine language programming)
Step 3. - Operating system (batch -> multiprogramming -> time-sharing -> parallelism)
Step 4. - Complex architectures (compilers for high-level languages, e.g., FORTRAN, COBOL -> Pascal -> C -> C++ -> Java)
Step 5. - Hardware/Software interface development (ISA -> CISC -> RISC)

We next overview example computer technologies and architectures.

1.3. Overview of Computer Technology Trends

In Section 1.1, we briefly summarized how technology developed from vacuum tube circuits to modern VLSI techniques. The ability to place more logic gates per unit area has been the primary motivation for exponentially increasing memory and processor capacity and performance. This development has followed several important trends, which we discuss in this section, as follows.

1.3.1. Memory Capacity

Memory capacity has approximately doubled every 1.5 years since the early 1970s. Prior to that, memory was not implemented in integrated circuits, but consisted of arrays of tiny magnetic cores strung together on a lattice of fine wires. Thus, the origin of the term core to describe a computer's main memory. Figure 1.5 illustrates the memory capacity trend, showing how memory gate density has increased from 1 mbegabit (Mbit) per chip in 1986 to 256 Mbit per chip in 2000. By 2004, this is expected to reach 1 Gbit per chip.

Figure 1.5. Exponential increase in memory capacity as a function of time since the early 1970s, adapted from [Maf01].

The increase in memory capacity and processor performance (Section 1.3.3) has made it possible to migrate supercomputer applications such as sophisticated image enhancement, finite-element modelling, and weather prediction to desktop computers. However, this rate of increase is probably not sustainable on an indefinite basis, as discussed in Section 1.3.4.

1.3.2. Processor Capacity

The number of transistors per chip has approximately doubled every 1.5 years since the early 1970s, when integrated circuits became available. This effect is called Moore's Law. Figure 1.6 illustrates this trend, showing how chip size has increased from fewer than 2,000 transistors in 1970 to over 15 million transistors in 2000. By 2003, this is expected to reach 60 million transistors per chip.

Figure 1.6. Exponential increase in processor capacity as a function of time since the early 1970s (Moore's Law), adapted from [Maf01].

The increase in processor capacity has made it possible to increase the speed of processor circuitry, particularly for arithmetic logic operations. For example, the chip area (also called real estate) required by an N-digit multiplier circuit varies approximately as N². Thus, if the processor area doubles every 1.5 years, then the speed increase of a multiplier possible with such technology could increase by a factor of 2^1/2 = 1.414 times every 1.5 years. We see this reflected in the performance increases shown in the following section.

1.3.3. Processor Performance

In order to measure processor performance, we use the units of SPEC benchmarks. The System Performance Evaluation Cooperative (SPEC) benchmark suite was created in 1989 to rpovide a consistent set of realistic benchmarks for evaluating CPU performance. The SPEC95 set of benchmarks includes games, a hardware simulator, file compression and decompression, applications such as quantum physics simulations, fluid dynamics problem solutions, and so forth. In Figure 1.7, we show the increase in processor performance as a function of the SPEC92 benchmark suite, which is listed in Gee et al.'s analysis of cache performance. (Cache is a type of fast memory that we will discuss later in this course.)

Figure 1.7. Exponential increase in processor performance as a function of time since the early 1970s, adapted from [Maf01].

The increase in processor performance has made it possible to port real-world applications to the PC platforms (e.g., Pentium and Apple G4 PC architectures). In the preceding section, we said that a multiplier's speed could increase 1.414 times every 1.5 years. So why does the curve in Figure 1.7 show an increase in processor capacity of 1.54 times per year? Consider that, although the previously-mentioned N-digit multiplier requires area proportional to N², there exist many operations that only need area proportional to N. The mix of these applications and the operations that they use is highly dependent on the applications being run on the computer. Hence, the use of a benchmark suite such as SPEC92 or SPEC95 (discussed in the textbook), to provide a consistent mix of operations across all platforms and test scenarios.

1.3.4. Physical Limits on Computer Capacity and Performance

In the preceding presentation of hardware technology trends, there is danger in assuming that these trends will continue forever. In fact, there are physical limits to computer chip size and speed, which we will discuss briefly in this section.

Currently, computer chips are made by a process called lithography, in which the chip design is laid out with the aid of a computer, then printed onto a master template having multiple layers. This template is called a lithograph, after the printing process of that name. Masks are made from the lithograph master and are used to allow (or mask) the penetration of high-energy radiation into the semiconductor materials that comprise the chip. This irradiation process supports chemical removal of materials that comprise the partially completed semiconductor circuit, which is called etching.

Present lithography processes support 0.18 micron resolution (size of the smallest feature that can be produced on the chip). Making these dimensions smaller, to pack more transistors on the chip, requires the lithography process to use wavelengths of electromagnetic energy that are much smaller than visible light. Thus, it is foreseen that litho processes will progress through the ultraviolet range, down into X-ray and gamma-ray lithography, as technology is developed.

However, the limiting constraint is not how small we can make features on a circuit board. Rather, the hard limit involved in using computer circuits that we have today is dictated by the size of an atom. For example, when the thickness of an insulating strip between two conductors on a circuit board becomes less than approximately 1.2 atomic diameters, then there is significant interference or cross-talk between conductors or channels in an information-carrying structure called a bus. Given today's resolution of 0.18 microns, and the limiting atomic dimension of approximately 1nm (0.001), we only have approximately a factor of 200, or less than 2.5 orders of magnitude increase in resolution left in the current technology. That translates to a capacity increase of approximately 250² = 62,500, which is not a comforting margin of safety. If Moore's Law continues to hold, then we the number of capacity-doubling periods is given by

N_p = log₂(62,500) < 16 ,

which translates to less than 16(1.5) = 24 years left in the technology cycle. Dr. Moore's current estimate is that we will run out of capacity in 2017, others say as late as 2025. What is certain is, that within your professional lifetime as computer scientists, there will be a radical shift from current circuit technology to something different - a technology that will support much faster, more highly parallel computing. Perhaps quantum computing will emerge as the dominant technology. Others say that biological computers based on DNA or protein technology will come to the rescue. IBM has recently announced the possibility of building circuits from carbon nanotubes, which presently have a width of approximately 10 atomic diameters. Whatever will happen, the paradigm shift will have profound technological and economic effects on the computer industry.

There is an interesting article about future technology trends at CNN's Web site that discusses Extreme Ultraviolet Photolithography for chip making - a useful summary of things that might occur.

If you want to read more about physical limits on computing, visit Dr. Frank's Web page about physical limits on computation. He has taught a course in this area, and the references listed therein make for interesting reading.

We next present overview of the types of computer architectures that have been developed with the technologies that have evolved since the dawn of useable computers during WWII.

1.3.5. Types of Architectures

There are three principal ways to connect components in a computer, called the von Neumann, bus, and multiprocessor architectures. We summarize each of these, as follows.

1.3.5.1. von Neumann Architecture. John von Neumann elucidated the first practical stored-program computer architecture (scheme for connecting computer components) in the mid-1940s. It is comprised of the five classical components (input, output, processor, memory, and datapath). The processor is divided into an arithmetic logic unit (ALU) and control unit, a method of organization that persists to the present. Within the processor, the ALU datapath mediates data transfer for computation. The registers are fast memory modules from/to which data can be read/written to support streaming computation, as shown in Figure 1.8. Within the ALU, an accumulator supports efficient addition or incrementation of values corresponding to variables such as loop indices.

Figure 1.8. Schematic diagram of von Neumann architecture, adapted from [Maf01].

The von Neumann architecture has a significant disadvantage - its speed is dependent on the bandiwdht or throughput of the datapath between the processor and memory. This is called the von Neumann bottleneck.

1.3.5.2. Bus Architecture. There are many animals (e.g., cat or dog) whose internal organs are hung from a backbone that conveniently happens to be horizontal. Bus-based computers are structured like that - processors and memory are connected to a backbone bus that acts as a "superhighway" for data or instructions to move between processors and memory. In practice, the bus architecture has the same components as the von Neumann architecture, but they are arranged along a bus, as shown in Figure 1.9.

Figure 1.9. Schematic diagram of a bus architecture, adapted from [Maf01].

In principle, the bus computer solves the von Neumann bottleneck problem by using a fast bus. In practice, the bus is rarely fast enough to support I/O for the common case (90 percent of practical applications), and bus throughput can be significantly reduced under large amounts of data. Figure 1.10 shows the layout of components in a modern PC, which is a useful example of a bus architecture.

Figure 1.10. Schematic diagram of PC bus architecture, adapted from [Maf01].

In Figure 1.10, note that there are multiple buses - the PCI bus mediates large-scale data transfer between external components, as does the ISA bus. The SCSI bus is used to daisy chain peripherals such as disk drives, scanner, etc. The von Neumann bottleneck is partially overcome in practice by using a very fast bus, called the memory bus, to connect the CPU, memory, and the PCI bus.

Unfortunately, there are physical limits on how many components can be packed in a given space in a von Neumann machine (or any other computer, for that matter). Bus bandwidth limits the sequential computation model inherent in the von Neumann architecture to speeds that are not feasible for compute-intensive scientific applications. As a result, multiprocessor architectures were developed to address such problems, as discussed in the following section.

1.3.5.3. Multiprocessor or Parallel Architecture. Recall the old saying, "Many hands make less work." In computers, the use of many processors together reduces the amount of time required to perform the work of solving a given problem. Due to I/O and routing overhead, this efficiency is sublinear in the number of processors. That is, if W(N) [or T(N)] denotes the work [or time to perform the work] associated with N processors, then the following relationships hold in practice:

W(N) < N · W(1)

and

T(N) > T(1)/N .

The first equation means that the work performed by N processors working on a task, where each processor performs work W(1) [the work of one processor in a sequential computation paradigm], will be slightly less than N times W(1). Note that we use "<" instead of "=" because of the overhead required to (1) divide up the problem, (2) assign the parts of the problem to N processors, (3) collect the partial results from the N processors, and (4) combine the partial results to achieve a whole, coherent result.

The second equation means essentially the same thing as the first equation, but the work is replaced by time. Here, we are saying that if one processor takes time T(1) to solve a problem, then that same problem solved on an N-processor architecture will take time slightly greater than T(1)/N, assuming all the processors work together at the same time. As in the preceding paragraph, this discrepancy is due to the previously-described overhead.

Figure 1.11. Schematic diagram of multiprocessor architecture, where each CPU shares a memory module that is connected to a common bus - adapted from [Maf01].

Figure 1.11 illustrates schematically a simple multiprocessor architecture with four CPUs, which share a memory module that is connected to the CPUs via a common bus. This is a simple architecture that is useful for solving selected types of compute-intensive problems. However, if you try to solve data-intensive problems on such an architecture, you encounter the von Neumann bottleneck trying to read and write the large amount of data from and to the shared memory.

Figure 1.12. Schematic diagram of multiprocessor architecture with shared memory, where each CPU also has its own fast, local memory - adapted from [Maf01].

To help solve the problem of bus contention inherent in shared-memory multiprocessors, computer scientists developed the mixed model of parallel processing, in which the CPUs have small, very fast local memories that communicate with the CPU via a very fast, short bus. Because the bus is short, there are fewer impedance problems and the bus bandwidth can be increased. This migrates the von Neumann bottleneck closer to the CPU, and alleviates some of the demands on shared memory. Local memory architectures are useful for problems that have data locality, where (1) each CPU can solve part of a problem with part of the problem's data, and (2) there is little need for data interchange between processors.

A special type of multiprocessor is called a distributed processor, in which the individual CPUs (called compute servers) are connected to storage servers by a network. You can make the multiprocessor diagrammed schematically in Figure 1.12 into a distributed computer by replacing with a network the bus that connects the CPUs to each other and to the shared memory.

Having taken a high-level tour through computer technology trends, we next examine the low-level technology that makes these machines work, namely, digital logic.

1.4. Digital Logic Design

Digital logic is hardware that implements the functions of Boolean algebra, which are or, and, not, as well as combinations of these functions. Boolean algebra operations are extensively employed in computing, and form the basis for the vast majority of arithmetic operations, as we shall see later in this course.

We begin this section by discussing the operands and operations that comprise Boolean algebra, then show how these logical operations are implemented in simple digital devices.

1.4.1. Theory of Logical Operations

The kind of logic that we use in computing is generally called formal logic, which is one of the three bases of mathematics. (The other two bases are measurement theory and algebraic manipulation.) The type of formal logic that we will use in this course is called Boolean logic or Boolean algebra (the terms are interchangeable). There are other types of formal logics, such as interval logic (often called fuzzy logic), but we will not consider them in this course.

Definition. An algebra is a tuple A = (F,O) that contains a set of values F and a set of operations O on those values.

Example. The algebra of real numbers contains operations such as addition, multiplication, and division, with operands selected from the real numbers R.

Definition. Boolean algebra B = ({0,1}, O) contains a set of operations O called logical-or, logical-and, logical-not, etc., that operate upon numbers in the set {0,1} (i.e., operate on zero(es) or one(s)).

Example: logical-not - The not function inverts the value of its input. Thus, 0 = not(1) and 1 = not(0). Because the not function has only one input, it is called a unary operation.

Example: logical-and - The and function takes the minimum of its two inputs. Thus, 0 = and(0,0) = and(0,1) = and(1,0). However, 1 = and(1,1). Because the and function has two inputs, it is called a binary operation. (This use of the word "binary" is similar to calling the set {0,1} the "binary numbers", because there are only two numbers in the set.)

Example: logical-or - The or function takes the maximum of its two inputs. Thus, 0 = or(0,0), with 1 = or(0,1) = or(1,0) = or(1,1).

It is important to note that all logical operations can be constructed from not and and. In your discrete mathematics course (COT3100 at UF/CISE), you will see (or have seen) how to do this. For now, it is useful to accept this assertion and continue on.

We next consider how digital logic is represented in term of truth tables and graphic symbols used in circuit diagrams. In Section 1.4.3, we will discuss how digital logic can be implemented in hardware.

1.4.2. Representations of Digital Logic

When we are working with digital logic, it is easy to describe the functionality of a logic device in terms of a truth table.

Definition. A truth table is a tabular representation of a logical operation that completely defines the operation.

Observation. A logical operation with N inputs will have 2^N entries in its truth table. If an operation has many inputs, then its truth table can be represented compactly by listing only those entries that have nonzero output.

There are also a variety of graphic symbols that are used with digital logic functions. In Figure 1.13, we show the symbols for the and as well as or gates, with their truth table representations. If we represented these truth tables compactly, then the and operation would have only one row in its truth table, namely, [1 1 1].

Self-Exercise. How many rows are there in the compact representation of the or operation's truth table, and what are these rows?

Figure 1.13. Graphic symbols and truth tables for the logic gates that implement and as well as or operations, adapted from [Maf01].

The not gate is represented graphically as shown in Figure 1.14a. By appending a circle to the output of a gate symbol, we can represent the effect of putting a not gate after the gate represented by the symbol. For example, the output of the and gate is inverted by the not gate to form the nand gate, as shown in Figure 1.14b. A similar situation is shown for the nor gate in Figure 1.14c. Note that the truth table for the nand operation has three nonzero entries, whereas the truth table for the and operation had only one nonzero entry. A symmetric condition exists for the nor operation.

Figure 1.14. Graphic symbols and truth tables for negative logic gates that implement (a) not, (b) nand, (c) nor operations - adapted from [Maf01].

In the next section, we discuss how these basic gates are implemented in electronic circuits.

1.4.3. Circuit Implementation of Digital Logic

In 1947, scientists at Bell Laboratories developed the transistor, which is a circuit element whose conductivity varies with applied voltage. The advantage of the transistor is that a relatively small input current can be used to control a much larger power-supply current. The general concept involved in the operation of the transistor is similar to that of the vacuum tube (recall our discussion of first-generation computers), but requires much less power and space. Also, the transistor can change state much more quickly than a vacuum tube - as sometimes said in electrical engineering, the transistor has a higher switching frequency.

1.4.3.1. Simple Model of A Transistor NOT Gate. Figure 1.15 shows a diagram of a transistor that implements negative logic. Note that the collector supply voltage V_cc is positive, and the current from this power source flows through a current limiting resistor, whose symbol is the jagged line below V_cc. The resistor is designed to prevent damage to the transistor junction, which is delicate. (This junction is comprised of several semiconducting materials deposited on a small wafer of silicon or germanium called substrate. As we discussed in Section 1.3, the dimensions of features etched into the substrate are presently on the order of 0.2 micron, so it doesn't take much excess current to cause electrical arcing across these features, which renders the transistor useless.)

A control voltage (shown in Figure 1.15 as the input signal A) is applied to the base of the transistor. If A = 0, the conductivity of the transistor is low, and the current from the collector is not grounded at the emitter. This causes a voltage difference to be present between the emitter (at ground) and the collector (at V_cc). Symmetrically, when A = 1 (i.e., a voltage significantly higher than zero), then the conductivity of the transistor is enhanced, and V_cc is grounded. This produces little or no voltage drop across the collector-emitter junction, and the output at the collector is effectively zero. When the output of the transistor is taken from the collector (as shown in Figure 1.15), these two conditions cause a not operation to be implemented.

Figure 1.15. Transistor implementation of a NOT gate, adapted from [Maf01].

1.4.3.2. NAND Gate Circuit. The nand gate is implemented using the operating principle of the not circuit shown in Figure 15. As shown in Figure 1.16, if both transistors have a high input (A = 1 and B = 1), then both transistors conduct, and V_cc is grounded. This produces little or no voltage drop from the voltage supply to ground, and near-zero voltage (logical zero value) occurs at the gate's output. This implements the entry for zero output in the nand gate's truth table, shown again in Figure 1.16 for purposes of verification.

If either (or both) input(s) of the nand circuit shown in Figure 1.16 is (are) low, then the two transistors taken together do not form a conducting path, and there is a significant voltage drop (logical one) at the gate output. This implements the nonzero entries of the nand gate's truth table.

Figure 1.16. Transistor implementation of a NAND gate, adapted from [Maf01].

1.4.3.3. NOR Gate Circuit. The nor gate is implemented using the operating principle of the not circuit shown in Figure 15. As shown in Figure 17, if both transistors have a low input (A = 0 and B = 0), then both transistors do not conduct, and V_cc cannot be grounded. This produces a significant voltage drop from the voltage supply to ground, and a logical one value occurs at the gate's output. This implements the sole entry for unitary output in the nor gate's truth table, shown again in Figure 17 for purposes of verification.

Conversely, if either (or both) input(s) of the nand circuit shown in Figure 1.16 is (are) high, then either (both) transistor(s) form a conducting path that grounds V_cc, yielding no significant voltage drop (logical zero) at the gate output. This implements the zero entries of the nor gate's truth table.

Figure 1.17. Transistor implementation of a NOR gate, adapted from [Maf01].

Integrated circuits often contain multiple gates in one in-line package. For example, the layout of a quad nand gate IC is shown in Figure 1.18. Note that the notch in the circuit packaging material, shown on the left-hand side of the IC, helps orient the IC user to the pin ordering scheme. Each nand gate input and output pin is unique, with a pin for the common supply voltage V_cc, and another pin for common ground.

Figure 1.18. Example of quad nand gate integrated circuit pin-out diagram, adapted from [Maf01].

We next examine how the laws of Boolean algebra can be used to generate designs for more complicated logic circuits.

1.4.4. More Complex Logic Circuits

In this section, we present several examples of how complex logic circuits can be designed from equations that represent the circuit's functionality, or from truth tables.

1.4.4.1. Laws of Boolean Algebra. Each type of formal logic has rules for manipulating operations and operands. In Boolean algebra, the following rules hold:

Assumption. Let B = {0,1}, with p,q,r

B and x,y,z

R. This assumption will hold for the subsequent discussion.

Identity. In arithmetic with real numbers, x · 1 = x and x + 0 = x. In logic, the analogous statements are

p and 1

p, and p or 0

p .

Domination. In arithmetic with real numbers, x · 0 = 0. The logical analogue is

p and 0

0 .

Symmetrically, p or 1

1 .

Idempotency. This property can be used to simplify and or or operations, i.e.,

p or p

p
p and p

p .

Double Negation. In arithmetic, -(-x) = x. In logic,

not(not(p))

Commutativity. In arithmetic, addition and multiplication are commutative, i.e., x + y = y + x and x · y = y · x. In logic, we have analogous formulations:

p or q

q or p
p and q

q and p .

Associativity. In arithmetic, addition and multiplication are associative, i.e., (x + y)+ z = x + (y + z) and (x · y) · z = x · (y · z). In logic, we have the analogous formulations:

(p or q) or r

p or (q or r)
(p and q) and r

p and (q and r) .

Distributivity. In arithmetic, multiplication distributes over addition. In logic, we have a similar situation:

p and (q or r)

(p and q) or (p and r)
p or (q and r)

(p or q) and (p or r) .

DeMorgan's Laws. Manipulation of logical statements is greatly aided by DeMorgan's laws, which describe a property that resembles distributivity:

not(p or q)

not(p) and not(q)
not(p and q)

not(p) or not(q) .

Self-Exercise. Construct truth tables for (a) distributive and (b) DeMorgan's Laws, listed above.

1.4.4.2. Circuit Equivalence. Using the preceding equivalences, especially DeMorgan's laws, it is possible to transform one circuit form (e.g., nor logic) into another form (e.g., nand logic). Here, nor logic uses nor gates, and symmetrically for nand logic.

Figure 1.19. Example of quad nand gate integrated circuit pin-out diagram, adapted from [Maf01].

For example, Figure 1.19 shows a circuit that outputs a one when inputs A and C are both one, or when A and B are both one, and outputs a zero otherwise. Note that the left-hand circuit has the same truth table as the right-hand circuit. Also note that the circuit equation for the left-hand circuit (AC + AB) can be rewritten using the distributive law (from Section 1.4.4.1) to yield A(C+B), which is the equation for the right-hand circuit. Here, the minterm AC means (A and C), and the minterm C + B means (C or B).

Logic transformations are primarily used for circuit minimization. For example, Note that the right-hand circuit in Figure 1.19 has only two gates, while the left-hand circuit uses three gates. This is because the expression AC + AB requires three operations (two and, one or), while the expression A(C + B) requires two operations (one and, one or).

1.4.4.3. Majority-Vote Circuit. Given N inputs a₁, a₂, ..., a_N, the majority vote function outputs a one when N/2 or more of the inputs are ones, and zero otherwise. (If N is odd, then N/2 is rounded up to the nearest integer.) Figure 1.20 illustrates a logic circuit that implements the majority vote function for three inputs. Observe that the inputs A, B, and C and their negations (represented by a bar above each variable) are present in buses (vertical lines), then are combined to form the four minterms listed in the equation for M. Including the not gates used for negation, eight gates are required to implement this circuit. Note, however, that the and gates each have three inputs (corresponding to the three variables in each minterm) and the or gate that produces the circuit output has four inputs (corresponding to the four minterms of the equation for M).

Figure 1.20. Example of Majority Vote Circuit, its truth table, and the governing equation, adapted from [Maf01].

Self-Exercise. How many minterms would there be in the equation for the majority-vote circuit with four inputs? five inputs? N inputs?

1.4.5. Comparators and Combinatorial Logic

Combinatorial logic circuits include multiplexers, demultiplexers, encoders, decoders, and comparators. A combinatorial circuit has many inputs and one (or, sometimes, more than one) output. The output(s) of a combinatorial circuit are uniquely determined by the inputs. Combinatorial circuits have no memory elements. We will begin our discussion of combinatorial circuits with the simple example of a comparator, then progress to the multiplexer and decoder.

1.4.5.1. Comparator. It is often useful to determine whether or not two operands are equal, which is called comparison. A digital logic circuit that determines whether or not two four-bit operands A = (A₀,A₁,A₂,A₃) and B = (B₀,B₁,B₂,B₃) are equal is shown in Figure 1.20. The algorithm that the circuit implements is based on the concept of an exclusive-or operation (also called an xor operation), which is instantiated in an xor gate.

The xor function resembles the inclusive or, except that when both inputs are high, the xor output is zero, as shown in the following truth tables:

                 A   B   X                 A   B   X   
             -+---------------         -+---------------
          xor |  0   0   0          or  |  0   0   0
              |  0   1   1              |  0   1   1
              |  1   0   1              |  1   0   1
              |  1   1   0              |  1   1   1

After the xor operation is applied to each bit of A and B (we call this bitwise processing), the xor outputs are operated on by a quad-input nor gate. Thus, if any of the bits are different, the or part of the nor gate will return a one, and the not part of the nor gate will return a zero. If A₀ = B₀ and A₁ = B₁ and A₂ = B₂ and A₃ = B₃, then the nor gate will applu the or operation to the outputs of the xor gates (which will all be one) to yield a one, which will be inverted at the output of the nor gate to yield a zero.

Figure 1.21. Example of Comparator Circuit, adapted from [Maf01].

1.4.5.2. Multiplexer. A multiplexer has n control signals and 2ⁿ inputs (sometimes called data lines). The purpose of a multiplexer is to use the control signals to decide which of the inputs will be routed to the multiplexer's output. Thus, the multiplexer is a type of switching device that chooses its output from among its inputs, based on control signals.

For example, Figure 1.22 illustrates a multiplexer with n = 2 control lines (A, B) and 2ⁿ inputs (D₀ through D₃). Note that inverters (not gates) provide the complement of A and B for constructing the minterms of the circuit's governing equation.

Figure 1.22. Example of Multiplexer Circuit, adapted from [Maf01].

Self-Exercise. Write the governing equation of the multiplexer shown in Figure 1.22. How does this compare with the governing equation of the majority-vote circuit shown in Figure 1.20? In what detailed ways are these circuits different?

1.4.5.3. Decoder. A decoder can be loosely thought of as a kind of reverse multiplexer, without the control lines. A decoder has n inputs, and 2ⁿ outputs. The input indexes the output - for example, if the input is 000, then the output D₀ will be high. If the input is 001, (or 010) then the output D₁ (resp. D₂) will be high, and so forth.

For example, Figure 1.23 illustrates a decoder with n = 3 inputs (A, B, C) and 2ⁿ = 8 outputs (D₀ through D₇). As in the multiplexer circuit, inverters provide the input complements for constructing the minterms of the circuit's governing equation.

Figure 1.23. Example of Decoder Circuit, adapted from [Maf01].

Self-Exercise. Write the governing equation of the decoder shown in Figure 1.23. How does this compare with the governing equation of the multiplexer shown in Figure 1.22? In what detailed ways are these circuits different? (Hint: Think about minterms, complementation, and DeMorgan's laws.)

Thus, decoders can be used to select one object from a large number of indexed objects, given a Boolean representation of the index of that object at the decoder's input. For example, if the decoder outputs each control a relay that switches a voltage through a different resistor value, then it is possible to construct an n-to-2ⁿ decoder that converts an n-bit input signal into an analog voltage ranging from zero to 2ⁿ units. However, the circuitry for large n can become prohibitive, so this technique is of pedagogic interest only.

1.4.6. Programmable Logic Arrays

It is readily seen that circuits such as those shown in Section 1.4.5 can become prohibitively complex, large, and power-consumptive if implemented on a circuit board with many chips containing various types of logic gates. Thus, it is reasonable to ask how we might put a complex digital circuit on a chip. There are two ways to do this. First, the circuit can be custom-fabricated, which is very expensive, especially for small quantities. Second, one can use programmable logic, which provides many different types of gates on a chip, with a programmable network of interconnects between the gates. Programmable logic has widespread software support, so it is relatively easy (using modern design tools) to design a circuit, reduce it to a logic schematic (diagram of connectivity and parameters of logic elements in a circuit), then use a computer to program a special logic device called a PLA (programmable logic array).

In Figure 1.24, we illustrate a simple PLA circuit. This type of circuit is called two-level logic because there are two levels of processing (i.e., and gates at Level 1, or gates at Level 2).

Self-Exercise. What are the two levels of processing in the circuits shown in Section 1.4.5?

In practice, programmable logic devices are made in a variety of gate configurations, connectivity networks, and gate densities. One of the more important types of PLAs is called FPGA, for Field Programmable Gate Array. FPGAs represent a new method of designing custom ICs. Unlike conventionals Gate Arrys, FPGAs can be designed and programmed by a computer user (e.g., a hardware designer), and represent an advanced stage of evolution from Programmable Logic Devices (PLD). FPGAs are advantageous because they have a much higher gate density than PLAs or PLDs, and offer greater flexibility. FPGAs are useful for rapid prototyping, especially of large or complex circuits. Designs can be described using (a) schematic layout tools, (b) synthesis from a hardware description language model, or (c) high-level language driven synthesis tools.

Figure 1.24. Example of programmable logic array, adapted from [Maf01].

The circuit shown in Figure 1.24 is a fused-link circuit. To program this circuit, a software-driven programming device merely applies a large voltage to open-circuit (or "blow") each of the fuses (small boxes in Figure 1.24) that control access to a given input of a specific gate. For example, if the fuse at the junction of and gate #1 output and or gate #5 input is blown, then the and gate output cannot is not an input to that or gate.

There is a paper on the Web on Design of High Speed, Low Powered Dynamic Programmable Logic Arrays that is useful for advanced reading in this interesting topical area.

We next discuss clocked logic circuits.

1.4.7. Synchronous Logic

Thus far, we have considered only asynchronous logic circuits, in which the input pulses are allowed to propagate through the circuit at their own speed. This is acceptable for simple logic designs, but has the disadvantage that one needs to wait a short (usually somewhat indeterminate) time for the circuit output to stabilize. In time-critical applications where timing precision is required, the asynchronous approach to circuit design and implementation does not always produce useful results, due to indeterminacies in the output settling time.

To remedy this problem, synchronous logic circuits have been developed whose transition from one state to another is mediated by a clock pulse. (This can be thought of like people in a band playing in time to the conductor.) The vast majority of logic circuits use edge triggered clocking, which means that the state change in a circuit (e.g., from State Element 1 to State Element 2 in Figure 1.25a) occurs only when the clock pulse changes from a zero value to a one.

(a)

(b)

Figure 1.25. State changes in logic circuits: (a) synchronous logic changes state on the leading clock pulse (0-1 transition), and (b) asynchronous logic does not require a clock to change state - adapted from [Maf01].

To better understand synchronous logic, which is the dominant form in digital computing, we next examine how a clock works. We then discuss latches and storage circuits such as clocked latches and flip-flops.

1.4.7.1. Principles of the Clock. When we think of a digital clock, the image of a numerical timekeeping device comes to mind. However, in digital logic, a clock is really a very precise oscillator that changes its output from zero to one, then back to zero in a regular, predictable way.

For example, in Figure 1.26, a clock is comprised of an oscillator whose output is thresholded to form a train of rectangular pulses, called a square wave. The clock period is measured from one 0-1 transition to the next 0-1 transition. The addition of a delaying device after the clock output causes phase lag between the clock signals C1 and C2.

Figure 1.26. Example of clock output, with symmetric (A) and asymmetric (B and C) clock pulses, adapted from [Maf01].

Within a clock cycle, there is the time a pulse is high (t_H) and the time it is low (t_L). If t_H does not equal t_L within a small measurement error, then we say that the clock pulse is asymmetric. This term also holds for a phase-shifted pulse, as shown in signal B of Figure 1.26.

If t_H< t_L within a small measurement error, then we say that the pulse duty cycle is less than one. This case is shown in Figure 1.26, for the clock signal labelled C. If t_H > t_L within a small measurement error, then we say that the pulse duty cycle is greater than one.

1.4.7.2. The SR Latch. Early in computing practice, it was desirable to store a bit in a small memory device. Since the only usable storage devices were relays (an electromechanical device) and delay lines that used the metal Mercury (chemical symbol Hg), it was decided to adapt asynchronous relay-based logic (very slow) to produce a storage element that could hold one bit.

Figure 1.27. The two states of the SR latch, adapted from [Maf01].

Figure 1.27 shows a modern version of this concept, called the SR Latch, which is implemented in nor logic. (Using DeMorgan's Laws, it is possible to implement this in nand logic.) When the Set input is high and the Reset input is low, State 1 results, and conversely for State 0. Thus, the Set (Reset) puts a 1 (0) value at the output Q. It is not difficult to see that placing this circuit after an oscillator could product the clock pulses shown in Figure 1.26.

Self-Exercise. Derive the truth tables for State 0 and State 1 of the SR latch shown in Figure 1.27.

1.4.7.3. Clocked Latches. The problems with the asynchronous SR Latch shown in Figure 1.27 are (a) the S and R inputs can be energized at any time, and (b) settling of the circuit to yield its output value occurs immediately after the S or R signals are provided. This situation is remedied in the Clocked SR Latch of Figure 1.28, where two and gates are used in conjunction with a clock to drive the SR Latch of Figure 1.27.

The circuit works the same as the SR Latch, with the following important exceptions. First, if S or R are one, and the clock value is zero, then the output of the and gates is zero. If S or R equals one, and the clock is high, then the S,R values are passed to the input of the SR latch, but only while the clock is high. This gives the latch time t_H to settle (see Section 1.4.7.1). If S or R are zero, then it doesn't matter what the value of the clock is - the inputs to the SR Latch remain zero.

Figure 1.28. The Clocked SR Latch, adapted from [Maf01].

Self-Exercise. (1) Derive the truth table for the clocked input to the SR Latch in Figure 1.28, using the and gate logic, as shown. (2) Prove (using truth tables) how this could or could not be implemented in nor logic.

A problem with the clocked SR latch is that it has two inputs, and they have a forbidden configuration. Namely, one cannot apply both S = 1 and R = 1 to the circuit simultaneously - it won't know which state to enter! This problem is resolved with the Clocked D Latch, which is a minor variation on the clocked SR latch.

As shown in Figure 1.29, the D latch adds a not gate or inverter to the input of the clocked SR latch, so that its "S" and "R" inputs always have opposite values. In order to avoid confusion with the SR latch, the input is called D. The clock pulse is applied to the input, as it was with the SR latch.

Figure 1.29. The Clocked D Latch, adapted from [Maf01].

As shown in Figure 1.29, the D latch has the following timing behavior (from the D,C,Q signals below the latch circuit). The D signal is the input (data), C is the clock, and Q is the output. When D goes high (0-1 transition), nothing happens to the latch until C goes high. (If D is low by the time C goes high, then the latch does not change state.)

If D is high and C goes high, then the latch changes state. There is a slight delay from the time C goes high to the time Q goes high, which is called the propagation delay (a function of the type of logic used, the semiconductor fabrication technique, wire lengths, etc.) When C goes low, whether or not D is high, the latch output (Q) stays high. This is the memory function of the SR latch that we discussed earlier. However, if D is low, then if C goes high (0-1 transition), Q goes low after the propagation delay occurs. Thus, the D latch only stores a unitary input (D = 1) for a time equal to the period of the clock. This can be verified by inspecting the timing diagram in Figure 1.29.

1.4.7.4. The D Flip-Flop. In order to make the D latch store a value throughout the high state of the clock, it is necessary to make a more complex circuit using the D latch, which we call a D flip-flop. The flip-flop is so named because it can be toggled back and forth between two states (Q = 0 and Q = 1) by varying D. The D flip-flop is comprised of two D latches and an inverter, as shown in Figure 1.30

Figure 1.30. The D flip-flop with timing diagram, adapted from [Maf01].

Per Figure 1.30 the D flip-flop has the following timing behavior (determined from the D,C,Q signals below the latch circuit). When D goes high (0-1 transition), nothing happens to the latch until C goes high. (If D is low by the time C goes high, then the latch does not change state.) At the trailing edge of the clock pulse (1-0 transition), Q goes high if it was previously low. If Q was previously high, nothing happens to Q. When D goes low, nothing happens to Q until the clock pulse goes low. Then, after a propagation delay, Q goes low.

There are two critical time intervals associated with the D flip-flop. The setup time is the time required to keep D high, in order to wait for the trailing edge of the clock pulse. The hold time is the time required to hold D high after setup, to ensure that Q stays high.

Self-Exercise. (1) Derive the truth table for each of the four input states of the D flip-flop as shown in Figure 1.30. Hint: Input states are defined by the values of D and C (e.g., State 0: D = 0, C = 0->0; State 2: D = 0, C = 0 -> 1; etc.)

1.5. Computer Performance Assessment

Reading Assignments and Exercises

When we talk about computer performance, we need to consider the following issues:

Method of Performance Evaluation - By what measures will the performance be assessed? Will the assessment be local to a specific part of the computer hardware, or global (i.e., across all hardware and software modules)? What will the mix of instructions be?
Limitations of Evaluation - What are the shortcomings of each method or measure of performance assessment? From whence do these limitations arise, and why? What can be done to improve these limitations?
Metrics - What formal measures are to be used for performance evaluation? Speed? Time used in a computation? Memory space required?
Processor Performance Equation - How will the processor's performance be described mathematically? Does the equation portray actual parameters of practical interest, or does the computational model described by the equation not match the processing architecture under test?
Performance Evaluation Reports - What methods of analysis and presentation are used to evaluate data? Is the data normalized to a particular benchmark result or parameter? Is the normalization performed in a physically meaningful way?
Ahmdahl's Law - Whatever measurements are made, as one continues to enhance processor performance, the overall improvement achieves diminishing returns.

We examine each of thise issues, as follows.

1.5.1. Performance Evaluation

Computer performance evaluation is primarily based on throughput and response time. Throughput is how many bits of data are processed, or how many operations are performed, in a given interval of time. For example, we say that a processor has a throughput of N MB/s (megabytes per second). With respect to execution time of a program on a processor X, we say that

Performance_X = 1 / Execution Time_X

and, for two processors X and Y,

Relative Performance = Performance_X / Performance_Y

For example, in the preceding equation, if processor X is faster than processor Y, then Execution Time_X < Execution Time_Y (all other things being equal), so Relative Performance < 1. This also implies increased throughput of X with respect to Y.

Key Concept. As a result, we want to improve performance by increasing performance, which means decreasing execution time. In practice, this can be done by (a) decreasing response time, or (b) adding more processors to a system (provided I/O is managed efficiently).

1.5.2. Measuring Performance

A program that is executing, or has been executed, has the following components:

Wall-Clock Time - how long it takes (typically, time in seconds) for your program to execute, from the time it is invoked to the time it completes. This time is measured with respect to a global time standard, so we name it according to a common object such as a wall clock. This is also called elapsed time.
CPU Time - comprised of user CPU time (time spent computing the program), and system CPU time (the time the operating system spends supporting the program).
I/O Time - time spend reading and writing data from/to memory.
Other Time - time spend running other programs, waiting for the operating system to be scheduled, etc.

We can measure a program's execution time using the UNIX time command. For example, let's enter the UNIX command time du at one's root directory (which can be reached via the command cd. When I tried this on my (large) directory tree, I got the following result:

0.21u 1.54s 0:24.49 7.1%

These four numbers have the following meaning:

User CPU Time = 0.21 seconds, the first number (0.21u).
System CPU Time = 1.54 seconds, the second number (1.54s).
Elapsed Time = 24.49 seconds, the third number (0:24.49).
CPU Activity as Percent of Elapsed Time = 7.1 percent, the fourth number (7.1%).

Observe that percent CPU activity is computed as follows:

Step 1.

Step 2.

Step 3.

Other Time is computed as follows:

Step 1.

Step 2.

Given a method of measuring time, and computation of different times (e.g., CPU, Other) from these measures of time, we can now discuss a simple model of computation for measuring CPU performance.

1.5.3. CPU Performance Equation

In order to measure CPU performance in a physically realistic way, we need a model of computation. In the simplest case, we start with the number of CPU cycles, defined as follows:

N_cyc = IC · CPI ,

where IC denotes the instruction count (number of instructions per program), and CPI denotes the average cycles per instruction.

Given the CPU cycle time t_cyc or the clock rate BW_clk, we can express the CPU time as:

t_cpu = N_cyc · t_cyc

t_cpu = N_cyc · BW_clk .

Patterson and Hennesey [Pat98] express this as

with units indicated for each variable.

If there are different types or classes of instructions in a given program, then the preceding equation for N_cyc is not an accurate estimate, because CPI can be affected by the instruction mix (nimber of each of type of instruction that occurs in the program). Thus, we use the following equation to more accurately determine the number of cycles incurred when a program is executed:

N_cyc =

CPI_i · IC_i ,

where we assume that there are n classes of instructions. The frequency distribution of instructions, which is comprised of (IC₁, IC₂, ..., IC_n), is obtained by a technique called execution profiling, which is supported by a variety of commercial software tools. The number of cycles per instruction type is determined from analysis of a specific processor design, validated by performance measurement.

Example 1. To illustrate the practical utility of these measurements, let us assume that a program runs in 10 seconds on a processor with a 400 MHz clock rate. We want to build a new computer that can run the program in 6 seconds by increasing the clock frequency to X MHz. An unfortunate consequence is that the average CPI will be 1.2 times higher in the new processor. Here's how we set the problem up to determine the new clock rate X:

Solving the equation yields X = 800 MHz. Thus, we see that doubling the clock rate does not necessarily double the processor speed, because there are other variables (e.g., CPI or the instruction mix) that can be affected, as shown below.

Example 2. Now we will consider the effect of instruction mix. Assume that the following data are correct:

Q: Which code sequence will be faster - #1 or #2?

A: The sequence that incurs the least number of cycles is faster. Using the preceding equation for N_cyc we see that the number of cycles for code sequences 1 and 2 are given by:

N_cyc(Seq.1) = (2 · 1) + (1 · 2) + (2 · 3) = 10 cycles
N_cyc(Seq.2) = (4 · 1) + (1 · 2) + (1 · 3) = 9 cycles .

Thus, code sequence #2 is faster, because it requires 9 cycles, versus 10 cycles for code sequence #1.

1.5.4. Performance Evaluation

In an ideal situation, one has a suite of programs whose instruction mix is known exactly. By running each program on a processor to be analyzed, it is possible to determine CPI for each type of instruction. However, life is rarely ideal. Thus, we have the following types of benchmarking programs to measure and bound processor performance:

Synthetic Benchmarks are used to exactly measure specific characteristics of a processor (e.g., memory I/O performance, register-to-register I/O, speed of arithmetic functions, etc.) Because the benchmarking program is synthetic, one can put any combination of instructions into it, and the benchmark is not necessarily realistic.
Toy Benchmarks are simple but somewhat realistic programs that help the designer get a preliminary idea of a how a processor will behave under pseudo-realistic constraints. These might include tasks such as solving a simple matrix equation, performing a near-trivial image processing operation, etc.
Kernels are more involved programs that capture the functionality of a larger program. For example, one can use an operating system kernal to get an idea of the operating system performance on a given processor, provided that the kernal is executed significantly more in practice than the other programs in the operating system.
Real programs are used (a) to measure performance at many stages of processor design, (b) to estimate the processor's performance under realistic computing constraints and (c) in assessment of fielded systems.

In practice, realistic benchmarks include engineering or scientific applications, software development tools, transaction processing, and office applications such as large spreadsheets or formatting of word processing documents.

1.5.5. Performance Reporting

After one has measured the performance of a given processor, then one formulates a test report. In order to make our report accurate and meaningful to a hardware designer, we must have at least three elements in the report:

Hardware/Software Configuration tells the designer what benchmarking programs were run on the processor under test, and how the processor was set up (e.g., how much memory, what clock rate, compiler version, etc.)
Evaluation Process Conditions provide information about special constraints on the instruction mix that can influence the reproducibility of the performance measurement.
Performance Summary is typically expressed in terms of average rates. How these averages are computed is of practical interest.

For example, consider the following types of means:

Normalized results are not easily understood or accurately compared with the arithmetic mean, because the mean of ratios between two series of measurements is not the same as the ratio of the series means. Instead, normalized results should be combined with the geometric mean, which is independent of the data series used for normalization. This is because the ratio of two series' geometric means is the same as the mean of the ratios between the series.

That is, given two data series {A_i} and {B_i}, where i = 1..n, if the arithmetic mean is used, then

AM({A_i}) / AM({B_i})

AM({A_i / B_i})

whereas, with the geometric mean:

GM({A_i}) / GM({B_i}) = GM({A_i / B_i}) .

Here, the construct {A_i / B_i} means divide the i-th element of A by the i-th element of B. For example, if {A_i} = {2,4,6} and {B_i} = {2,2,3}, then {A_i / B_i} = {2/2, 4/2, 6/3} = {1,2,2}.

Example. The practical implications of this problem are illustrated in the following table:

Here, "Normalized to A" means that the data for processors A, B, and C are divided by the data for A. In the first set, programs P1 and P2 are both normalized to A, hence, the arithmetic and geometric means for A are both 1.0 (the mean of 1.0 and 1.0 is 1.0). However, the arithmetic mean for B is 5.05 = (10.0 + 0.1)/2, while the geometric mean for B is 1.0, due to the previously-mentioned normalization property. Total execution time is likewise normalized in the first set, to the value for A, which is 1001 = 1000 + 1. Thus, the total time for B = 0.10989 = (100 + 10) / 1001, which is approximately 0.11.

Self-Exercise. (1) Derive all the entries in the preceding table. Be able to explain how each one was calculated. (A problem like this will likely be an exam question.)

1.5.6 Implications of Amdahl's Law

In the 1960s, Gene Amdahl derived an equation that expresses the diminishing returns that one obtains when trying to optimize a computer program. His equation has been adapted by Patterson and Hennesey, as follows:

This means that, the more you enhance the performance of a fraction f of a program or processor, the remaining (unenhanced) part (1 - f) gets progressively smaller. In order to keep the speedup S constant for each enhancement, you have to enhance performance of the remaining (1 - f) part to have a much greater speedup than S.

Example. Suppose a program takes 10 seconds to execute on processor P. After P's floating point (FP) unit is enhanced to be 5 times faster, then reinstalled in P, the effective speedup is graphed in Figure 1.31a. Here, the speedup is on the ordinate (vertical axis) and the fraction of the execution affected by the FP is on the abscissa. What is obvious is that in order to obtain a speedup of 2.0, the new FP has to be used approximately 0.65 (65 percent) of the time. In all but the most highly parallelized image or signal processing code, this is quite unlikely.

(a)

(b)

Figure 1.31. Amdahl's Law applied to processor performance enhancement, adapted from [Maf01].

The situation worsens in Figure 1.31b. Here, we see that if the new FP unit affects only 1/2 of the program execution, then we will approach a speedup of 2.0 asymptotically, regardless of how much the FP unit's performance is improved. As in Figure 1.31a, the speedup is graphed on the ordinate, with the enhancement factor of the FP unit on the abscissa.

Amdahl's Law has important implications for practical computer design. Namely, if a portion of the CPU does not affect program execution very much, then there is little point in drastically enhancing its performance. As a result, we try to enhance the most frequent deficiencies first. This is called enhancing the common case and it works, for the reasons shown in the preceding example.

1.6. Performance Benchmarking

Reading Assignments and Exercises

As we discussed in Section 1.5, a benchmark is a program that is used to measure a computer system's performance in a more-or-less standardized way. We aspire to standard benchmarks because of the need for realism and reproducibility in performance testing. In the ideal situation, one can test a processor using a representative sample of real workload. In practice, one first characterizes the working set of programs statistically (e.g., number of operations of each type), then computes the time each program requires, based on the sum of the product CPI_i · IC_i over all instruction types.

We next overview different types of benchmarks and how they have been, or are, used in practice.

1.6.1. Linpack Benchmark

The Linpack benchmark is an old, well-established measure of CPU and ALU performance that measures the time required to solve a dense system of linear equations. In FORTRAN, the Linpack kernal resembles the following loop:

     DO 10 I = 1,N
       DY(i) = DY(i) + DA * DX(i) 
 10  CONTINUE

Here, the "D" that precedes each variable denotes a double-precision operand. As a result, the Linpack benchmark challenges the ALU and control unit primarily. However, I/O is also challenged, due to the large data requirement associated with the vectors DX and DY.

Linpack has the following measures (also called metrics):

R_peak - peak system performance, in Gflops
N_max - matrix size that yields the highest R_peak
N_1/2 - matrix size that yields the 0.5 · R_peak
R_max - Gflops achieved for the matrix of size N_max

You can read more about the Linpack benchmark at http://www.top500.org/.

1.6.2. Intel's iCOMP Index 3.0

The iCOMP index is a geometric mean of relative performance measures (recall from Section 1.5 how geometric means work well with ratio-based metrics [also called ratiometric quantities]). The benchmark contains a mix of instructions that are assumed to characterize existing and emerging software. The iCOMP benchmark seems to be more relevant as the use of 3D graphics, multimedia, and Internet access increases.

iCOMP 3.0 has the following composition, with weighting indicated in percent:

#1,2: Multimedia and Internet application (25%)
#3: Two integer productivity applications (20% each)
#4: 3D geometry and lighting calculations (20%)
#5: Java application (10%)
#6: Floating point: engineering, finance, and game applications (5%)

The preceding six benchmark results are combined as follows. Let the ratio BM_i/Base_BM_i denote the i-th benchmark performance, where BM denotes the benchmark run on a test processor, and Base_BM denotes the benchmark run on a baseline processor, in this case a Pentium II clocked at 350 MHz. The iCOMP benchmark is computed as:

where P_i denotes a weighted exponent that incorporates the 1/6 power (root) that comprises the outer operation of the geometric mean. Thus, the iCOMP index is a weighted geometric mean of application-specific performance ratios.

1.6.3. SPEC2000 Benchmarks

The System Performance Evaluation Corporation (SPEC) benchmarks are upgraded every 2-3 years, to account for the characteristics of software as program designs evolve. For example, longer run time, larger problems, and greater diversity of applications characterize newer software. The SPEC2000 benchmark has both baseline and optimized measures, comprised of the geometric mean of normalized execution times. The reference machine for SPEC2000 is a Sun Ultra5_10 (300 MHz SPARC architecture with 256MB of memory).

The current SPEC2000 suite is comprised of the following 12 integer and 14 floating point (FP) programs, which measure response time and throughput:

164.gzip (C, integer) - Compression
175.vpr (C, integer) - FPGA circuit placement and routing
176.gcc (C, integer) - C programming language compiler
181.mcf (C, integer) - Combinatorial optimization
186.crafty (C, integer) - Chess game
197.parser (C, integer) - Word/language processing
252.eon (C++, integer) - Computer visualization
253.perlbmk (C, integer) - PERL programming language
254.gap (C, integer ) - Group theory, interpreter
255.vortex (C, integer) - Object-oriented database
256.bzip2 (C, integer ) - Another compression program
300.twolf (C, integer ) - Place and route simulator
168.wupwise (Fortran-77, FP) - Physics/quantum chromodynamics
171.swim (Fortran-77, FP) - Shallow water circulation modelling
172.mgrid (Fortran-77, FP) - Multigrid solver: 3D potential field
173.applu (Fortran-77, FP) - Parabolic/elliptical PDEs
177.mesa (C, FP) - 3D graphics library
178.galgel (Fortran-90, FP) - Computational fluid dynamics
179.art (C, FP) - Image recognition/neural networks
183.equake (C, FP) - Seismic wave propagation simulation
187.facerec (Fortran-90, FP) - Image processing: Face recognition
188.ammp (C, FP) - Computational chemistry
189.lucas (Fortran-90, FP) - Number theory/Primality testing
191.mfa3d (Fortran-90, FP) - Finite-element crash simulation
200.sixtrack (Fortran-77, FP) - High energy nuclear physics accelerator design
301.apsi (Fortran-77, FP) - Meteorology: Pollutant distribution

SPEC2000 metrics derived from the integer benchmarks (#1-12, above) include:

SPECint2000: The geometric mean of 12 normalized ratios when each benchmark is compiled in highly optimized mode (also called aggressive optimization)
SPECint_base2000: The geometric mean of 12 normalized ratios when compiled with normal optimization (also called conservative optimization)
SPECint_rate2000: The geometric mean of 12 normalized throughput ratios when compiled with aggressive optimization
SPECint_rate_base2000: The geometric mean of 12 normalized throughput ratios when compiled with conservative optimization

There is also a similar set of floating point metrics.

The results of the SPEC2000 integer benchmarks are shown in Figure 1.32, where the circle denotes the geometric mean, and the spokes of the wheel whose rim is the circle denote individual benchmarks keyed positionally to the diagram in the upper left-hand corner. Note that the Mips processor with IRIX operating system performs significantly worse than the mean on 181.mcf (combinatorial optimization), and the three processors all perform worse than the mean on 255.vortex (object-oriented database) and 252.eon (computer visualization). Since databases and visualization involve significant I/O, and visualization involves significant amounts of graphics, it would be reasonable to suggest that these areas need improvement in each of the three processors tested.

Figure 1.32. Test results for the SPEC2000 integer benchmarks run on three processors, with key in the upper left hand corner, after [Maf01].

The floating-point benchmarks shown in Figure 1.33 reveal similar discrepancies. For example, all processor perform relatively poorly on the 179.art (neural net) benchmark, which involves significant amounts of multiplication and division by small magnitudes. It is interesting to note that, in both the integer and floating point benchmark suites, the Intel Pentium III processor appears to have the most uniform performance (largest number of benchmark results closest to the mean). Since Intel is a contributor to the SPEC benchmark suite, it is not surprising that their processors could be designed to perform well on these established benchmarks.

Figure 1.33. Test results for the SPEC2000 floating point benchmarks run on three processors, with key in the upper left hand corner, after [Maf01].

When the geometric means (of the SPEC95 integer and floating-point benchmarks) are respectively graphed as a function of clock rate, we have the results shown in Figures 1.34 and 1.35. Note that (a) SPEC95 is similar to SPEC2000, and (b) the performance of the Pentium Pro exceeds that of the Pentium due to microarchitecture improvements.

Figure 1.34. Graph of SPEC95 integer benchmark performance as a function of processor clock rate in MHz, adapted from [Maf01].

It can also be seen that the improvement in performance carries through to the floating point computations, as shown in Figure 1.35. However, the slope of the floating point benchmark curve for the Pentium is not as great as that for the integer benchmark. This could imply (depending on the statistical significance of the measurements) that the Pentium processor does not yield improvements as readily for floating point computations as it does for integer computations.

Figure 1.35. Graph of SPEC95 floating point benchmark performance as a function of processor clock rate in MHz, adapted from [Maf01].

1.6.4. MIPS Benchmark

The speed of a processor in millions of instructions per second (MIPS) can be estimated as:

MIPS = IC / (10⁶ · t_cpu) ,

where t_cpu denotes CPU time. Unfortunately, due to different behavior characteristics of various processors (e.g., as shown in Figures 1.32 and 1.33) 1 MIPS on a given processor may not mean the same thing as 1 MIPS on another processor. In particular, MIPS is based heavily on instruction count (IC). This is like determining the winner of an auto race by who used fewer RPMs, or the winner of a marathon by who took fewer steps when running.

In practice, MIPS is especially misleading because floating point computations are done differently on different systems. For example, some computers do FP operations in software, while others use dedicated hardware units on- or off-chip. Also, the MIPS benchmark does not factor in the instruction mix, and thus compromises realistic characterization of workload.

The only situation where MIPS might be a useful benchmark is when the majority of variables have been fixed - for example, when comparing two processors from the same vendor that support the same ISA using the same compiler or the same benchmark suite. Apart from that, MIPS should be avoided as a measure of processor performance.

1.6.5. Fallacies and Pitfalls

When applying benchmarks to processors, one should avoid making the following mistakes:

Ignoring Amdahl's Law occurs when one tries to recursively optimize a processor, and puts much effort into the optimization process, only to achieve diminishing returns. Instead, one should optimize the common case first, then determine via a mathematical model such as Amdahl's Law whether or not further optimization is cost-effective.
Using MIPS as a performance metric is a common mistake, especially when one is in a hurry to portray a processor's performance in a favorable way. Instead, standard or well-established suites of benchmarks (such as the SPEC or iCOMP suite) should be applied to the processor, to highlight performance deficits in several categories.
Using the Arithmetic Mean of normalized CPU times is erroneous mathematically because the normalized CPU times are ratiometric quantities. As such, they cannot be combined into an ensemble statistic using the Arithmetic Mean. Instead, the Geometric Mean must be employed to combine ratiometric quantities.
Using "hardware-independent" measures such as code size defeats the purpose of hardware performance testing, namely, to highlight the specific advantages and deficits of a given processor. If the benchmark is hardware-independent, then it cannot produce information that is specific to a given architecture or processor. Instead, a suite of benchmarks such as SPEC or iCOMP should be used to identify hardware-specific performance deficits.
Assuming that synthetic benchmarks predict real performance is a major but common error. Recall from our discussion in Section 1.5, that synthetic benchmarks are programs devised by algorithm or hardware designers to measure specific types of performance on highly focused types of operation mixes. Because these benchmarks often bear little or no resemblance to real programs, they are of little use in characterizing a processor's performance in practical applications. Instead, a suite of benchmarks that is comprised of practical programs, such as SPEC or iCOMP, should be used.
Assuming that the geometric mean of CPU time ratios is proportional to the total execution time is erroneous for two reasons. First, the arithmetic mean, which is based on a sum, is proportional to total time. The geometric mean is based on a product and is thus not proportional to an accumulated (summed) total measure. Second, the CPU time is only part of the total execution time. Other factors that influence total time are I/O activity, wait time in a multiprocessor system, and network overhead in a parallel or distributed computing system.

In summary, performance measures are specific to a given program or suite of programs. CPU time is the only adequate measure of CPU performance - other metrics, such as I/O time or total time factor in effects that are not properly associated with the CPU (although they may be associated with a computer system). For a given ISA, performance increases result from (a) increases in clock rate, (b) improvements in processor architecture that reduce CPI, and (c) compiler enhancements that reduce IC (and, possibly, CPI) for a given type of instruction. The ideal benchmark is the workload that you plan to run on a given processor: all other assumptions about benchmarks can lead to unrealistic performance figures.

Real Computer Science begins where we almost stop reading ...