Overview of Computer Abstractions
Figure 1.1. Levels of abstraction in computer systems, adapted from [Maf01]
- Operating System - Provides a convenient interface between (a) the user and his/her application software, and (b) the hardware (sometimes called the bare machine).
- Assembler - Translates assembly language, a primitive type of programming language, into machine code, which is a stream of ones and zeroes.
- Instruction Set Architecture (ISA) - Interfaces the software (listed above) to the hardware (listed below), and provides support for programming.
- Processor, Memory, and I/O System - These components support the execution of machine code instructions expressed in terms of the ISA.
- Datapath and Control - Provide a convenient abstraction for connecting the processor, memory, and I/O system and controlling their function efficiently.
- Input - Provides data and program information
- Datapath - Mediates I/O
- Control - Implements control of calculation and communication or I/O processes
- Memory - Storage and retrieval of programs or data
- Output - Result of running program on processor using input
1.2. Overview of Computer Abstractions
1.2.1. Different Abstract Views
Figure 1.2. Another view of levels of abstraction in computer systems, adapted from [Maf01].
- Level 5 - Problem Oriented Language - Provides a convenient interface and applications engine that helps the user produce results specific to a given application area. For example, Microsoft Word is used for document creation or editing, Excel for accounting spreadsheets, etc. The language at this level is usually a sequence of keystrokes or a high-level scripting language. In software design, the language is a high-level programming language such as C, C++, or Java.
- Level 4 - Assembly Language - Assembly is a very detailed language that helps the systems programmer or software designer move information around in a computer architecture in a highly specific way. For example, many compilers (programs that translate a high-level programming language into an assembly-like language) are written in assembly language. The advantages of assembly language are speed and the power to access specific features of the hardware.
- Level 3 - Operating System Machine - Provides a convenient interface between assembly language and the abstraction of the hardware architecture's instruction set. Operating systems generally contain many libraries that help a software developer or programmer connect to lower-level system functions (e.g., I/O, memory allocation, etc.) in an organized way.
- Level 2 - Instruction Set Architecture (ISA) - One of the most important parts of a computer is the interface between the lowest-level software and the hardware. The ISA includes anything programmers use to make a binary machine language program work correctly, including instructions, I/O, etc. The ISA facilitates design of functionality independent of the hardware.
- Level 1 - Microarchitectural Level - Microinstructions are low-level control instructions that define the set of datapath control signals which apply to a given state of a computing machine. The microinstructions, together with their sequencing, comprise the microarchitecture, whose purpose is to rigorously and consistently express the control of logic circuits that comprise the computer hardware. Designing this control in terms of a program that implements machine instructions in terms of simpler microinstructions is called microprogramming.
- Level 0 - Digital Logic - The circuitry that makes a digital computer run is called logic. All processes of a digital computer are expressed in terms of functions of ones and zeros, for example, and, or, and not functions. We will review these logic functions in Section 1.4.
Figure 1.3. Generalized view of levels of abstraction in computer systems, adapted from [Maf01].
Figure 1.4. Another view of levels of abstraction in computer systems, adapted from [Maf01].
1.2.2. RISC versus CISC
The CISC design philosophy, caricatured:
- Bigger is better!
- Make the hardware as "smart" (and, hence, as complex) as possible.
- If you have a great sprawling architecture, that's ok. The hardware fabricators will figure out what to do with your design.
- Don't worry about whether or not the system design is neatly partitioned into layers. (One great sea of logic gates would be ok until we figure out something else that works.)
- When one little part fails, the whole system dies and we will never find out why. That's ok - just build another CPU chip from the ground up. Maybe that one will work better.
The RISC design philosophy, in contrast:
- Small is beautiful.
- Keep the hardware simple and stupid (KISS design philosophy).
- Hardware and software designers should work together to make the architecture simple and modular.
- Neatly partition the system design into layers, as shown in Figures 1.2 and 1.3. Then, take the vast majority of the functionality that CISC implements in hardware, and put it into software (using compiler transformations) instead.
- Make the hardware and compiler robust, so the entire system can perform reliably.
1.2.3. Concept Summary
- Step 1. - Bare logic circuits (plugboard-based programming)
- Step 2. - Microarchitectural control (machine language programming)
- Step 3. - Operating system (batch -> multiprogramming -> time-sharing -> parallelism)
- Step 4. - Complex architectures (compilers for high-level languages, e.g., FORTRAN, COBOL -> Pascal -> C -> C++ -> Java)
- Step 5. - Hardware/Software interface development (ISA -> CISC -> RISC)
1.3. Overview of Computer Technology Trends
1.3.1. Memory Capacity
Figure 1.5. Exponential increase in memory capacity as a function of time since the early 1970s, adapted from [Maf01].
1.3.2. Processor Capacity
Figure 1.6. Exponential increase in processor capacity as a function of time since the early 1970s (Moore's Law), adapted from [Maf01].
1.3.3. Processor Performance
Figure 1.7. Exponential increase in processor performance as a function of time since the early 1970s, adapted from [Maf01].
1.3.4. Physical Limits on Computer Capacity and Performance
1.3.5. Types of Architectures
Figure 1.8. Schematic diagram of von Neumann architecture, adapted from [Maf01].
Figure 1.9. Schematic diagram of a bus architecture, adapted from [Maf01].
Figure 1.10. Schematic diagram of PC bus architecture, adapted from [Maf01].
Figure 1.11. Schematic diagram of multiprocessor architecture, where each CPU shares a memory module that is connected to a common bus - adapted from [Maf01].
Figure 1.12. Schematic diagram of multiprocessor architecture with shared memory, where each CPU also has its own fast, local memory - adapted from [Maf01].
1.4. Digital Logic Design
1.4.1. Theory of Logical Operations
1.4.2. Representations of Digital Logic
Self-Exercise. How many rows are there in the compact representation of the or operation's truth table, and what are these rows?
Figure 1.13. Graphic symbols and truth tables for the logic gates that implement and as well as or operations, adapted from [Maf01].
Figure 1.14. Graphic symbols and truth tables for negative logic gates that implement (a) not, (b) nand, (c) nor operations - adapted from [Maf01].
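To make these representations concrete, the short C sketch below (an illustration of ours, not part of the original figures) prints the truth tables of the two-input gates from Figures 1.13 and 1.14, with C's bitwise and logical operators standing in for the gates.

    #include <stdio.h>

    /* Print truth tables for the two-input logic gates of
       Figures 1.13 and 1.14; C's operators stand in for gates. */
    int main(void) {
        printf(" A B | and or nand nor\n");
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                printf(" %d %d |  %d   %d    %d   %d\n",
                       a, b, a & b, a | b, !(a & b), !(a | b));
        printf(" not: A = 0 -> %d, A = 1 -> %d\n", !0, !1);
        return 0;
    }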
1.4.3. Circuit Implementation of Digital Logic
Figure 1.15. Transistor implementation of a NOT gate, adapted from [Maf01].
Figure 1.16. Transistor implementation of a NAND gate, adapted from [Maf01].
Figure 1.17. Transistor implementation of a NOR gate, adapted from [Maf01].
Figure 1.18. Example of quad nand gate integrated circuit pin-out diagram, adapted from [Maf01].
1.4.4. More Complex Logic Circuits
Recall the following identities of Boolean algebra:

p and p = p (idempotency).
p and q = q and p (commutativity).
(p and q) and r = p and (q and r) (associativity).
p or (q and r) = (p or q) and (p or r) (distributivity).
not(p and q) = not(p) or not(q) (DeMorgan's Law).
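These identities can also be checked mechanically. The following C sketch (our own illustration) enumerates every truth assignment of p, q, and r and verifies the distributive and DeMorgan identities:

    #include <stdio.h>

    /* Exhaustively verify the distributive law and DeMorgan's Law
       over all combinations of p, q, r in {0, 1}. */
    int main(void) {
        int ok = 1;
        for (int p = 0; p <= 1; p++)
            for (int q = 0; q <= 1; q++)
                for (int r = 0; r <= 1; r++) {
                    /* p or (q and r) == (p or q) and (p or r) */
                    if ((p | (q & r)) != ((p | q) & (p | r))) ok = 0;
                    /* not(p and q) == not(p) or not(q) */
                    if (!(p & q) != (!p | !q)) ok = 0;
                }
        printf("identities %s for all assignments\n", ok ? "hold" : "fail");
        return 0;
    }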
Self-Exercise. Construct truth tables for (a) distributive and (b) DeMorgan's Laws, listed above.
Figure 1.19. Example of quad nand gate integrated circuit pin-out diagram, adapted from [Maf01].
Figure 1.20. Example of Majority Vote Circuit, its truth table, and the governing equation, adapted from [Maf01].
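The majority-vote function is easy to reproduce in software. The sketch below uses the common sum-of-products form M = AB + AC + BC (our notation; the figure's governing equation may be expressed in minterms) to print the full truth table:

    #include <stdio.h>

    /* Three-input majority vote: M = 1 when at least two inputs are 1,
       computed in sum-of-products form M = AB + AC + BC. */
    int main(void) {
        printf(" A B C | M\n");
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                for (int c = 0; c <= 1; c++)
                    printf(" %d %d %d | %d\n", a, b, c,
                           (a & b) | (a & c) | (b & c));
        return 0;
    }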
Self-Exercise. How many minterms would there be in the equation for the majority-vote circuit with four inputs? five inputs? N inputs?
1.4.5. Comparators and Combinatorial Logic
       A B X              A B X
     -+-------          -+-------
 xor | 0 0 0        or | 0 0 0
     | 0 1 1           | 0 1 1
     | 1 0 1           | 1 0 1
     | 1 1 0           | 1 1 1

Note that xor differs from or only in the last row of the truth table: when A = B = 1, xor yields 0 while or yields 1.
Figure 1.21. Example of Comparator Circuit, adapted from [Maf01].
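In the spirit of Figure 1.21 (a sketch under our own assumptions, since the figure's exact gate arrangement is not reproduced here), a word comparator can xor corresponding bits and then nor the results: the words are equal exactly when every xor output is 0.

    #include <stdio.h>

    /* 4-bit equality comparator: xor corresponding bits, then test
       whether all xor outputs are 0 (the nor stage of the circuit). */
    int equal4(unsigned a, unsigned b) {
        unsigned diff = (a ^ b) & 0xFu;  /* bitwise xor of the two words */
        return diff == 0;                /* 1 iff the words are equal    */
    }

    int main(void) {
        printf("equal4(5,5) = %d, equal4(5,6) = %d\n",
               equal4(5u, 5u), equal4(5u, 6u));  /* prints 1, 0 */
        return 0;
    }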
Figure 1.22. Example of Multiplexer Circuit, adapted from [Maf01].
Self-Exercise. Write the governing equation of the multiplexer shown in Figure 1.22. How does this compare with the governing equation of the majority-vote circuit shown in Figure 1.20? In what detailed ways are these circuits different?
Figure 1.23. Example of Decoder Circuit, adapted from [Maf01].
Self-Exercise. Write the governing equation of the decoder shown in Figure 1.23. How does this compare with the governing equation of the multiplexer shown in Figure 1.22? In what detailed ways are these circuits different? (Hint: Think about minterms, complementation, and DeMorgan's laws.)
1.4.6. Programmable Logic Arrays
Self-Exercise. What are the two levels of processing in the circuits shown in Section 1.4.5?
Figure 1.24. Example of programmable logic array, adapted from [Maf01].
1.4.7. Synchronous Logic
Figure 1.25. State changes in logic circuits: (a) synchronous logic changes state on the leading clock pulse (0-1 transition), and (b) asynchronous logic does not require a clock to change state - adapted from [Maf01].
Figure 1.26. Example of clock output, with symmetric (A) and asymmetric (B and C) clock pulses, adapted from [Maf01].
Figure 1.27. The two states of the SR latch, adapted from [Maf01].
Self-Exercise. Derive the truth tables for State 0 and State 1 of the SR latch shown in Figure 1.27.
Figure 1.28. The Clocked SR Latch, adapted from [Maf01].
Self-Exercise. (1) Derive the truth table for the clocked input to the SR Latch in Figure 1.28, using the and gate logic, as shown. (2) Prove (using truth tables) how this could or could not be implemented in nor logic.
Figure 1.29. The Clocked D Latch, adapted from [Maf01].
Figure 1.30. The D flip-flop with timing diagram, adapted from [Maf01].
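The edge-triggered behavior of Figure 1.30 can be mimicked behaviorally; the C sketch below (a minimal model of our own, with made-up waveforms) updates Q only on the leading (0 -> 1) clock edge:

    #include <stdio.h>

    /* Behavioral model of a rising-edge-triggered D flip-flop:
       Q takes the value of D only when the clock goes 0 -> 1. */
    int main(void) {
        int clk[] = {0, 1, 1, 0, 1, 0, 1, 1};  /* sample clock waveform */
        int d[]   = {1, 1, 0, 0, 0, 1, 1, 0};  /* sample data waveform  */
        int q = 0, prev_clk = 0;
        for (int t = 0; t < 8; t++) {
            if (prev_clk == 0 && clk[t] == 1)   /* leading edge detected */
                q = d[t];
            printf("t=%d clk=%d D=%d Q=%d\n", t, clk[t], d[t], q);
            prev_clk = clk[t];
        }
        return 0;
    }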
Self-Exercise. Derive the truth table for each of the four input states of the D flip-flop as shown in Figure 1.30. Hint: Input states are defined by the values of D and C (e.g., State 0: D = 0, C = 0 -> 0; State 2: D = 0, C = 0 -> 1; etc.)
1.5. Computer Performance Assessment
- Method of Performance Evaluation - By what measures will the performance be assessed? Will the assessment be local to a specific part of the computer hardware, or global (i.e., across all hardware and software modules)? What will the mix of instructions be?
- Limitations of Evaluation - What are the shortcomings of each method or measure of performance assessment? Whence do these limitations arise, and why? What can be done to mitigate them?
- Metrics - What formal measures are to be used for performance evaluation? Speed? Time used in a computation? Memory space required?
- Processor Performance Equation - How will the processor's performance be described mathematically? Does the equation portray actual parameters of practical interest, or does the computational model described by the equation not match the processing architecture under test?
- Performance Evaluation Reports - What methods of analysis and presentation are used to evaluate data? Is the data normalized to a particular benchmark result or parameter? Is the normalization performed in a physically meaningful way?
- Amdahl's Law - Whatever measurements are made, as one continues to enhance processor performance, the overall improvement exhibits diminishing returns.
1.5.1. Performance Evaluation
1.5.2. Measuring Performance
- Wall-Clock Time - how long it takes (typically, time in seconds) for your program to execute, from the time it is invoked to the time it completes. This time is measured with respect to a global time standard, so we name it according to a common object such as a wall clock. This is also called elapsed time.
- CPU Time - the sum of user CPU time (time spent executing the program itself) and system CPU time (time the operating system spends supporting the program).
- I/O Time - time spent reading and writing data from/to storage and I/O devices.
- Other Time - time spent running other programs, waiting to be scheduled by the operating system, etc.
These times can be measured with the UNIX time command. For example, let's enter the UNIX command time du at the root of one's directory tree (which can be reached via the command cd). When I tried this on my (large) directory tree, I got the following result:
0.21u 1.54s 0:24.49 7.1%
- User CPU Time = 0.21 seconds, the first number (0.21u)
- System CPU Time = 1.54 seconds, the second number (1.54s)
- Elapsed Time = 24.49 seconds, the third number (0:24.49)
- CPU Activity as Percent of Elapsed Time = 7.1 percent, the fourth number (7.1%)
To compute CPU activity as a percentage of elapsed time:
- Step 1. Add the User and System CPU Times to get Total CPU Time (e.g., 1.75 sec = 0.21 sec + 1.54 sec).
- Step 2. Divide the Total CPU Time by Elapsed Time (e.g., 0.071 = 1.75 sec / 24.49 sec).
- Step 3. Multiply by 100 to get percent (e.g., 7.1% = 0.071 x 100 percent).

To compute the Other Time:
- Step 1. Add the User and System CPU Times to get Total CPU Time (e.g., 1.75 sec = 0.21 sec + 1.54 sec).
- Step 2. Subtract the Total CPU Time from Elapsed Time (e.g., Other Time = 22.74 sec = 24.49 sec - 1.75 sec).
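This bookkeeping is simple enough to script. A minimal C sketch, with the numbers from the example hard-coded:

    #include <stdio.h>

    /* Reproduce the computations for the sample output
       0.21u 1.54s 0:24.49 7.1% from the time command. */
    int main(void) {
        double user = 0.21, sys = 1.54, elapsed = 24.49;
        double total_cpu = user + sys;                  /* 1.75 sec  */
        double percent = 100.0 * total_cpu / elapsed;   /* 7.1%      */
        double other = elapsed - total_cpu;             /* 22.74 sec */
        printf("Total CPU = %.2f sec, CPU activity = %.1f%%, "
               "Other Time = %.2f sec\n", total_cpu, percent, other);
        return 0;
    }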
1.5.3. CPU Performance Equation
Ncyc(Seq.2) = (4 · 1) + (1 · 2) + (1 · 3) = 9 cycles .
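The computation above is an instance of the standard CPU performance equation, stated here in general form (in our notation):

\[
N_{\mathrm{cyc}} \;=\; \sum_{i} \mathrm{CPI}_i \cdot C_i ,
\qquad
\mathrm{CPU\ time} \;=\; N_{\mathrm{cyc}} \times t_{\mathrm{clock}} ,
\]

where C_i is the number of executed instructions of class i, CPI_i is the cycles per instruction of that class, and t_clock is the clock cycle time. Read this way, Seq. 2 presumably contains four instructions of a CPI-1 class, one of a CPI-2 class, and one of a CPI-3 class, for 9 cycles in all.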
1.5.4. Performance Evaluation
- Synthetic Benchmarks are used to measure specific characteristics of a processor (e.g., memory I/O performance, register-to-register I/O, speed of arithmetic functions, etc.). Because the benchmarking program is synthetic, one can put any combination of instructions into it, so the benchmark is not necessarily realistic.
- Toy Benchmarks are simple but somewhat realistic programs that help the designer get a preliminary idea of how a processor will behave under pseudo-realistic constraints. These might include tasks such as solving a simple matrix equation, performing a near-trivial image processing operation, etc.
- Kernels are more involved programs that capture the functionality of a larger program. For example, one can use an operating system kernel to get an idea of the operating system's performance on a given processor, provided that the kernel is executed significantly more in practice than the other programs in the operating system.
- Real programs are used (a) to measure performance at many stages of processor design, (b) to estimate the processor's performance under realistic computing constraints and (c) in assessment of fielded systems.
1.5.5. Performance Reporting
- Hardware/Software Configuration tells the designer what benchmarking programs were run on the processor under test, and how the processor was set up (e.g., how much memory, what clock rate, compiler version, etc.)
- Evaluation Process Conditions provide information about special constraints on the instruction mix that can influence the reproducibility of the performance measurement.
- Performance Summary is typically expressed in terms of average rates. How these averages are computed is of practical interest.
Self-Exercise. (1) Derive all the entries in the preceding table. Be able to explain how each one was calculated. (A problem like this will likely be an exam question.)
1.5.6 Implications of Amdahl's Law
Figure 1.31. Amdahl's Law applied to processor performance enhancement, adapted from [Maf01].
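For reference, Amdahl's Law in its usual form: if a fraction f of execution time benefits from an enhancement that provides speedup s, the overall speedup is

\[
\mathrm{Speedup}_{\mathrm{overall}} \;=\; \frac{1}{(1 - f) + f/s} .
\]

Even as s grows without bound, the overall speedup is limited to 1/(1 - f); for example, if only 80 percent of the execution time can be enhanced, no enhancement can speed the program up by more than a factor of 5. This bound is the source of the diminishing returns noted above.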
1.6. Performance Benchmarking
1.6.1. Linpack Benchmark
      DO 10 I = 1,N
         DY(I) = DY(I) + DA * DX(I)
   10 CONTINUE
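For readers more comfortable with C, an equivalent of the DAXPY loop above might be written as follows (0-based indexing replaces the Fortran convention):

    /* DAXPY, the inner loop of the Linpack benchmark: dy = dy + da*dx.
       Each iteration performs one multiply and one add (2 flops). */
    void daxpy(int n, double da, const double *dx, double *dy) {
        for (int i = 0; i < n; i++)
            dy[i] += da * dx[i];
    }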
The Linpack benchmark reports the following quantities:
- Rpeak - theoretical peak system performance, in Gflops
- Nmax - matrix size that yields the highest achieved performance, Rmax
- N1/2 - matrix size that yields half of Rmax
- Rmax - Gflops achieved for the matrix of size Nmax
1.6.2. Intel's iCOMP Index 3.0
- #1,2: Multimedia and Internet application (25%)
- #3: Two integer productivity applications (20% each)
- #4: 3D geometry and lighting calculations (20%)
- #5: Java application (10%)
- #6: Floating point: engineering, finance, and game applications (5%)
1.6.3. SPEC2000 Benchmarks
164.gzip (C, integer) - Compression
175.vpr (C, integer) - FPGA circuit placement and routing
176.gcc (C, integer) - C programming language compiler
181.mcf (C, integer) - Combinatorial optimization
186.crafty (C, integer) - Chess game
197.parser (C, integer) - Word/language processing
252.eon (C++, integer) - Computer visualization
253.perlbmk (C, integer) - PERL programming language
254.gap (C, integer) - Group theory, interpreter
255.vortex (C, integer) - Object-oriented database
256.bzip2 (C, integer) - Another compression program
300.twolf (C, integer) - Place and route simulator
168.wupwise (Fortran-77, FP) - Physics/quantum chromodynamics
171.swim (Fortran-77, FP) - Shallow water circulation modeling
172.mgrid (Fortran-77, FP) - Multigrid solver: 3D potential field
173.applu (Fortran-77, FP) - Parabolic/elliptic PDEs
177.mesa (C, FP) - 3D graphics library
178.galgel (Fortran-90, FP) - Computational fluid dynamics
179.art (C, FP) - Image recognition/neural networks
183.equake (C, FP) - Seismic wave propagation simulation
187.facerec (Fortran-90, FP) - Image processing: Face recognition
188.ammp (C, FP) - Computational chemistry
189.lucas (Fortran-90, FP) - Number theory/primality testing
191.fma3d (Fortran-90, FP) - Finite-element crash simulation
200.sixtrack (Fortran-77, FP) - High energy nuclear physics accelerator design
301.apsi (Fortran-77, FP) - Meteorology: pollutant distribution
SPECint2000: The geometric mean of 12 normalized ratios when each benchmark is compiled in highly optimized mode (also called aggressive optimization)
SPECint_base2000: The geometric mean of 12 normalized ratios when compiled with normal optimization (also called conservative optimization)
SPECint_rate2000: The geometric mean of 12 normalized throughput ratios when compiled with aggressive optimization
SPECint_rate_base2000: The geometric mean of 12 normalized throughput ratios when compiled with conservative optimization
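The geometric mean used in these ratings is the nth root of the product of the n normalized ratios. A C sketch (with made-up ratios, not real SPEC data) that computes it via logarithms to avoid overflow; the distinction from the arithmetic mean is revisited in Section 1.6.5:

    #include <math.h>
    #include <stdio.h>

    /* Geometric mean of n normalized ratios: exp of the mean log.
       Compile with -lm for the math library. */
    double geometric_mean(const double *ratio, int n) {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(ratio[i]);
        return exp(log_sum / n);
    }

    int main(void) {
        double ratio[] = {1.2, 0.8, 2.0, 1.5};  /* hypothetical ratios */
        printf("geometric mean = %.3f\n", geometric_mean(ratio, 4));
        return 0;
    }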
Note in particular the results for 181.mcf (combinatorial optimization); the three processors all perform worse than the mean on 255.vortex (object-oriented database) and 252.eon (computer visualization). Since databases and visualization involve significant I/O, and visualization involves significant amounts of graphics, it would be reasonable to suggest that these areas need improvement in each of the three processors tested.
Figure 1.32. Test results for the SPEC2000 integer benchmarks run on three processors, with key in the upper left hand corner, after [Maf01].
Note also the results for the 179.art (neural net) benchmark, which involves significant amounts of multiplication and division by small magnitudes. It is interesting to note that, in both the integer and floating point benchmark suites, the Intel Pentium III processor appears to have the most uniform performance (largest number of benchmark results closest to the mean). Since Intel is a contributor to the SPEC benchmark suite, it is not surprising that their processors could be designed to perform well on these established benchmarks.
Figure 1.33. Test results for the SPEC2000 floating point benchmarks run on three processors, with key in the upper left hand corner, after [Maf01].
Figure 1.34. Graph of SPEC95 integer benchmark performance as a function of processor clock rate in MHz, adapted from [Maf01].
Figure 1.35. Graph of SPEC95 floating point benchmark performance as a function of processor clock rate in MHz, adapted from [Maf01].
1.6.4. MIPS Benchmark
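The MIPS (millions of instructions per second) rating is defined by the standard relation

\[
\mathrm{MIPS} \;=\; \frac{\mathrm{Instruction\ count}}{\mathrm{Execution\ time} \times 10^{6}}
\;=\; \frac{\mathrm{Clock\ rate}}{\mathrm{CPI} \times 10^{6}} ,
\]

so a machine that executes fewer, more powerful instructions can post a lower MIPS rating yet finish sooner. This weakness is taken up among the pitfalls in Section 1.6.5.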
1.6.5. Fallacies and Pitfalls
- Ignoring Amdahl's Law occurs when one tries to recursively optimize a processor, and puts much effort into the optimization process, only to achieve diminishing returns. Instead, one should optimize the common case first, then determine via a mathematical model such as Amdahl's Law whether or not further optimization is cost-effective.
- Using MIPS as a performance metric is a common mistake, especially when one is in a hurry to portray a processor's performance in a favorable way. Instead, standard or well-established suites of benchmarks (such as the SPEC or iCOMP suite) should be applied to the processor, to highlight performance deficits in several categories.
- Using the Arithmetic Mean of normalized CPU times is erroneous mathematically because the normalized CPU times are ratiometric quantities. As such, they cannot be combined into an ensemble statistic using the Arithmetic Mean. Instead, the Geometric Mean must be employed to combine ratiometric quantities.
- Using "hardware-independent" measures such as code size defeats the purpose of hardware performance testing, namely, to highlight the specific advantages and deficits of a given processor. If the benchmark is hardware-independent, then it cannot produce information that is specific to a given architecture or processor. Instead, a suite of benchmarks such as SPEC or iCOMP should be used to identify hardware-specific performance deficits.
- Assuming that synthetic benchmarks predict real performance is a major but common error. Recall from our discussion in Section 1.5 that synthetic benchmarks are programs devised by algorithm or hardware designers to measure specific types of performance on highly focused operation mixes. Because these benchmarks often bear little or no resemblance to real programs, they are of little use in characterizing a processor's performance in practical applications. Instead, a suite of benchmarks composed of practical programs, such as SPEC or iCOMP, should be used.
- Assuming that the geometric mean of CPU time ratios is proportional to the total execution time is erroneous for two reasons. First, the arithmetic mean, which is based on a sum, is proportional to total time. The geometric mean is based on a product and is thus not proportional to an accumulated (summed) total measure. Second, the CPU time is only part of the total execution time. Other factors that influence total time are I/O activity, wait time in a multiprocessor system, and network overhead in a parallel or distributed computing system.