6.1. Overview of Memory Hierarchies
- Registers are the fastest type of memory and are located inside the processor. These elements are used primarily for temporary storage of operands, intermediate results, and other small pieces of data, and each register is assumed to be one word (32 bits) in length in the MIPS architecture.
- Cache is a very fast type of memory that can be external or internal to a given processor. Cache is used for temporary storage of blocks of data obtained from main memory (read operation) or created by the processor and eventually written to main memory (write operation).
- Main Memory is modelled as a large, linear array of storage elements that is partitioned into static and dynamic storage, as discussed in Section 2 of these notes. Main memory is used primarily for storage of data that a program produces during its execution, as well as for instruction storage. However, if main memory is too small for such data, then subsets of the data can be paged to/from disk storage.
- Disk Storage is much slower than main memory, but also has much higher capacity than the preceding three types of memory. Because of the relatively long search times, we prefer not to find data primarily in disk storage, but to page the disk data into main memory, where it can be searched much faster.
- Archival Storage is offline storage such as a CD-ROM jukebox or (in former years) rooms filled with racks containing magnetic tapes. This type of storage has a very long access time, in comparison with disk storage, and is also designed to be much less volatile than disk data.
6.2. Basics of Cache and Virtual Memory (VM)
6.2.1. Cache Misses and VM Page Faults
Figure 6.1. Schematic diagram of (a) cache and (b) virtual memory, showing the similarities between the two techniques.
- Replacement cost - When a cache miss or VM page fault occurs, the processor or memory management unit must (1) find the requested block or page in the lower-level store, (2) load the block or page into the cache or main memory (writing a displaced dirty block or page back first, if necessary), and (3) set the appropriate bits (per Section 6.2.5) to reflect the recency of the replacement. These operations incur overhead that would not have occurred if the memory operation had been successful (i.e., if a miss or page fault had not occurred). Of particular interest are the bus and memory costs (discussed below), as well as the block or page replacement strategy. It is crucial that page replacement cause the least possible number of misses or page faults, as discussed in Section 6.2.3.
- Bus cost - When a cache block is obtained from memory, it must be transmitted along the bus that connects main memory with the processor or cache (depending on how the physical architecture supports cache). This bus must have very high bandwidth, so as not to impede cache replacement. Similarly, in VM, the bus between the disk and memory must be sufficiently fast not to inhibit data transfer in support of page replacement.
- Lower-level storage cost - If a cache miss is encountered, then the requested data must be retrieved from main memory, which implies that main memory itself must be sufficiently fast not to impede overall cache performance for a given average hit frequency. Similarly, in VM, the disk read/write latency must be sufficiently small so as not to slow down the page replacement process.
6.2.2. Address Translation Mechanisms for Cache and VM
- Data - The block of data from memory that is stored in a specific entry (also called row) or collection of entries in the cache,
- Tag - A small field of length K bits, used for comparison purposes in verifying correct addressing of data, and
- Valid Bit - A one-bit field that indicates status of data written into the cache.
- Tag - A K-bit field that corresponds to the K-bit tag field in each cache entry,
- Index - An M-bit field in the middle of the virtual address that points to one cache entry, and
- Byte Offset - Two bits (in MIPS) that are not used to address data in the cache.
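The decomposition of an address into these three fields can be sketched as follows. This is an illustrative sketch only: the function name and the choice of index width are assumptions, not part of the notes, and the 2-bit byte offset follows the MIPS word alignment described above.

```python
# Sketch: splitting a 32-bit MIPS address into tag, index, and byte-offset
# fields. index_bits (the M-bit index width) is a parameter the designer
# chooses; the low 2 bits are the unused byte offset.

def split_address(addr, index_bits):
    """Decompose a 32-bit address into (tag, index, byte_offset)."""
    byte_offset = addr & 0b11                      # low 2 bits (word alignment)
    index = (addr >> 2) & ((1 << index_bits) - 1)  # next index_bits bits
    tag = addr >> (2 + index_bits)                 # remaining high-order bits
    return tag, index, byte_offset
```

For instance, with an 8-bit index, `split_address(0x1234, 8)` yields tag 4, index 141, and byte offset 0.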
- Direct Mapped - There is a one-to-one correspondence between each block of data in the cache and each memory block that we are trying to find. That is, if we are trying to find a memory block b, then there is one and only one place in the cache where b is stored.
- Set Associative - Given memory block b that we are trying to find in the cache, there are L entries in the cache that can contain b. We say that this type of cache is called L-way set associative. For example, if L = 2, then we have a two-way set associative cache.
- Fully Associative - Any block of memory that we are looking for in the cache can be found in any cache entry - this is the opposite of direct mapping.
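The three placement policies differ only in how many cache entries may hold a given memory block, which the following sketch makes concrete. The helper name is hypothetical; direct mapping is the special case of one-way associativity, and full associativity is the case where the associativity equals the number of blocks.

```python
# Sketch: which cache entries may hold memory block b, for a cache with
# n_blocks entries and associativity assoc (1 = direct mapped,
# L = L-way set associative, n_blocks = fully associative).

def candidate_entries(b, n_blocks, assoc):
    """Return the list of cache entries that may contain memory block b."""
    n_sets = n_blocks // assoc      # direct mapped: n_sets == n_blocks
    s = b % n_sets                  # the set that block b maps to
    return [s * assoc + way for way in range(assoc)]
```

With 64 cache blocks, memory block 75 maps to exactly one entry (number 11) when direct mapped, to a set of two entries when two-way set associative, and to any of the 64 entries when fully associative.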
- If a TLB miss occurs but the requested page exists in memory, then only the address translation is missing from the TLB. This is handled by having the CPU load the translation from the page table into the TLB. The page reference (Step 1) is then tried again.
- If a page fault occurs (the page is not in memory), then this condition is handled by the VM portion of the operating system through a page fault exception produced by the CPU.
- Retrieve the missing translation from the page table.
- Copy the reference and dirty bits for a given TLB entry back to the page table when the TLB entry is replaced. (These bits are the only part of a TLB entry to be modified.) Copying these bits back at miss time is efficient, since the TLB is designed to have a small miss rate.
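The distinction between a TLB hit, a TLB miss with the page resident, and a page fault can be sketched as below. This is a simplified model under stated assumptions: the TLB and page table are represented as dictionaries mapping virtual pages to physical frames, and the reference/dirty-bit bookkeeping described above is omitted.

```python
# Sketch of the decision made on each memory reference. The TLB and the
# page table are modelled as dicts (virtual page -> physical frame); a
# page absent from page_table is taken to be non-resident.

def translate(vpage, tlb, page_table):
    if vpage in tlb:
        return tlb[vpage]                  # TLB hit
    if vpage in page_table:
        tlb[vpage] = page_table[vpage]     # TLB miss: load translation,
        return tlb[vpage]                  # then retry the reference
    raise RuntimeError("page fault")       # handled by the OS exception path
```

A first reference to a resident page takes the middle path and leaves its translation in the TLB, so a second reference to the same page hits.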
Reading Assignment. The configuration of a typical TLB is discussed at the bottom of page 580 of the textbook, and the structure and functionality of a real TLB is overviewed on page 583. Know how these TLBs are configured, with justification for selection of each parameter.
6.2.3. Block/Page Replacement Strategies
6.2.4. Write-Through and Write-Back
- Write-through writes the cache (resp. page table) entry to the cache (resp. page table) as well as to the memory location (resp. disk location) pointed to by the entry being written.
- Write-back writes the cache (resp. page table) entry to the cache (resp. page table) only, until the current entry is replaced. At replacement time, the final entry is written to the memory location (resp. disk location) pointed to by the entry.
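The two policies can be contrasted with a small sketch. The function names and the dict-based cache entry are illustrative assumptions; the point is where the write to memory happens: on every store (write-through) versus only at replacement time (write-back, via a dirty bit).

```python
# Sketch: the two write policies for a single cache entry, modelled as a
# dict. "memory" is a dict mapping addresses to values.

def write_through(entry, memory, addr, value):
    entry["data"] = value
    memory[addr] = value                 # memory updated on every write

def write_back(entry, addr, value):
    entry["data"] = value
    entry["dirty"] = True                # memory NOT updated yet

def replace_entry(entry, memory, addr):
    if entry.get("dirty"):
        memory[addr] = entry["data"]     # final value written at replacement
        entry["dirty"] = False
```

Under write-back, memory remains stale until `replace_entry` runs, which is why a dirty bit must be kept per entry.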
Reading Assignment. An excellent review of the relative advantages of write-back and write-through is given on p. 607 of the textbook.
Hint: Know these advantages and disadvantages, because you might be asked to discuss them on an exam question.
6.3. Memory Systems Performance Analysis & Metrics
6.3.1. Analysis of Cache Size
Example 1. Let ND = 64 KB, L = 1, and M = 32 bits. From the previous equations, there are 2^14 words and, since L = 1, there are 2^14 blocks, implying NA = 14 bits. Each block has 32 bits of data plus a tag of length NT = M - NA - 2 = 32 - 14 - 2 = 16 bits, plus a valid bit. The total cache size is thus given by

NC = 2^14 · (32 + 32 - 14 - 2 + 1) bits = 2^14 · 49 bits = 784 Kbits .
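Example 1's arithmetic can be reproduced directly; the variable names mirror the symbols in the example.

```python
# Example 1: total size of a direct-mapped cache with 64 KB of data,
# one word per block (L = 1), and 32-bit addresses.

ND = 64 * 1024                  # data capacity in bytes
words = ND // 4                 # 2^14 words of data
NA = words.bit_length() - 1     # 14 index bits (one block per word)
NT = 32 - NA - 2                # 16 tag bits (2 bits of byte offset)
bits_per_entry = 32 + NT + 1    # data + tag + valid bit = 49 bits
NC = words * bits_per_entry     # 2^14 * 49 = 802,816 bits = 784 Kbits
```

Note that the cache stores 784 Kbits to hold 512 Kbits (64 KB) of data: the tags and valid bits add roughly 53 percent overhead at this block size.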
6.3.2. Cache Addressing
Example 2. If NB = 64 blocks and K = 16 bytes, with byte address A = 1200, then the block address is given by

A / K = 1200 / 16 = 75 ,

which maps to block number 11 = 75 mod 64.
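The two-step mapping of Example 2 (byte address to block address, then block address modulo the number of cache blocks) is just:

```python
# Example 2: mapping byte address A = 1200 to a cache block, with
# block size K = 16 bytes and NB = 64 blocks in the cache.

A, K, NB = 1200, 16, 64
block_address = A // K              # 1200 / 16 = 75
block_number = block_address % NB   # 75 mod 64 = 11
```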
6.3.3. Memory System Design for Cache Implementation
Example 3. Let the memory incur time delay tA = 1 cycle to transmit the address, tM = 15 cycles for each DRAM access, and tD = 1 cycle to transmit the data from memory. If the cache in Case 1 has L = 4 words per block and a one-word-wide bank of DRAMs, then the miss penalty is given by

Cmiss = tA + L · tM + L · tD = 1 + 4(15) + 4(1) = 65 cycles.

Suppose we double the bus width, which effectively halves the number of transfers. Then we have L = 2 and

Cmiss' = tA + L · tM + L · tD = 1 + 2(15) + 2(1) = 33 cycles.

If we use a four-word-wide memory, then L = 1 and

Cmiss" = tA + L · tM + L · tD = 1 + 15 + 1 = 17 cycles.

For Cmiss, the number of bytes transferred per cycle is given by Nb = 4(4)/Cmiss = 0.25. For Cmiss', Nb = 4(4)/Cmiss' = 0.48, and for Cmiss", Nb = 4(4)/Cmiss" = 0.94.
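The three memory organizations of Example 3 differ only in the number of transfer units L per block, so the whole comparison fits in a few lines:

```python
# Example 3: miss penalty and bytes-per-cycle for the three memory
# organizations (one-word, two-word, and four-word wide paths).

tA, tM, tD = 1, 15, 1            # address, DRAM-access, data-transfer cycles

def miss_penalty(L):
    # L transfer units, each costing one DRAM access plus one bus transfer
    return tA + L * tM + L * tD

penalties = {L: miss_penalty(L) for L in (4, 2, 1)}        # {4: 65, 2: 33, 1: 17}
bytes_per_cycle = {L: round(16 / c, 2) for L, c in penalties.items()}
```

The 16 bytes in the numerator are the 4-word block moved on every miss; widening the path shortens the penalty without changing the amount of data moved, so bytes per cycle improve from 0.25 to 0.94.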
6.3.4. Improving Cache Performance
Example 4. Let a program P have an instruction cache miss rate of 3 percent and a data cache miss rate of 5 percent. If a machine M has a CPI of 2.1 with no memory stalls and a miss penalty of 36 cycles, how much faster would M run if it had a perfect cache (one with no misses)?
Step 1. Assume that the instruction count is denoted by IC, as before. Then, we have

Instruction-miss-cycles = IC · 3 percent · 36 = 1.08 · IC .

Step 2. Let the frequency of loads and stores in Program P be 40 percent of all instructions. Thus, using the preceding equations, we compute the number of memory miss cycles as

Memory-miss-cycles = IC · 40 percent · 5 percent · 36 = 0.72 · IC .

Step 3. From the results of Step 1 and Step 2, and from our preceding discussion of memory stalls, the total number of memory stall cycles is given by 1.08 · IC + 0.72 · IC = 1.8 · IC . This represents a memory stall penalty of more than one cycle per instruction. Thus, the revised CPI that accounts for memory stalls is given by

CPI' = CPI + memory-stall-cycles = 2.1 + 1.8 = 3.9 .

Step 4. From the givens, and from the results of Steps 1-3, above, the CPI of the CPU with perfect cache is 2.1 (CPI without cache misses), while the CPI of the CPU with realistic cache is CPI' = 3.9. Taking the ratio of the CPIs, we find that the speedup available from a perfect cache can be expressed as a factor f, which is computed as

f = CPI'/CPI = 3.9 / 2.1 ≈ 1.86 .

This indicates an 86 percent improvement in performance when a perfect cache is used in this instance.
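The stall accounting above reduces to a few lines of arithmetic (everything is per instruction, so IC divides out):

```python
# Example 4: CPI with memory stalls, and speedup of a perfect cache.
# All quantities are cycles per instruction, so IC cancels throughout.

CPI, penalty = 2.1, 36
i_miss = 0.03 * penalty            # instruction-miss cycles: 1.08
d_miss = 0.40 * 0.05 * penalty     # data-miss cycles: 0.72
CPI_real = CPI + i_miss + d_miss   # 3.9
f = CPI_real / CPI                 # 3.9 / 2.1, about 1.86
```

Note that the speedup compares 3.9 against the perfect-cache CPI of 2.1, not against 2; the miss penalty nearly doubles the effective CPI.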
- A lower initial CPI will cause a greater impact due to memory stall cycles; and
- The CPU-memory performance gap will likely persist in this situation, even in the presence of hardware upgrades, because (a) main memory performance will probably not improve as fast as CPI cycle time, and (b) memory performance is measured in terms of CPU clock cycles expended by a cache miss. Thus, a higher CPU clock rate induces a higher miss penalty.
6.4. I/O Devices and Buses
6.4.1. I/O Devices
- Polled I/O, whereby the processor surveys each I/O device in turn to see if the device has output available for the processor, or is able to handle an I/O request.
- Interrupt-driven I/O, which is based on the concept of on-demand processing, whereby each I/O device generates an interrupt when an I/O event occurs (e.g., the user presses a key on the computer keyboard). The processor treats these I/O interrupts as discussed previously. For example, in MIPS, the I/O interrupt generates an entry in the EPC and Cause registers, and an interrupt handler transfers control to the appropriate software (utility routine) based on the type of interrupt. After the I/O interrupt is serviced, control returns to the program with which the interrupt was associated, which then resumes execution.
- Direct Memory Access (DMA) I/O that transfers data directly from the I/O device to main memory, but requires special hardware. This technique is particularly useful when large amounts of data (e.g., images) must be moved from high-bandwidth devices (e.g., digital video cameras) into memory.
- Dumb or wire buses simply transmit data without adjustment of bus performance, packetizing of data, etc. This type of bus comprises the majority of board-level and chip-level interconnects in modern computers, for example, CPU-cache interconnection.
- Packet buses divide the transmitted data into partitions called packets. Each packet is transmitted along the bus as a contiguous entity. When all packets have been collected at the receiver, they are re-assembled (a process called demarshalling) to reconstruct the original data. Problems with packets include collisions (when two packets try to occupy the same space at the same time), overhead due to large headers or small packet bodies, and erroneous or missing packets due to bus faults. However, packet errors that feature erroneous bits are not limited to packetized data - this effect can happen with any kind of bus.
- Reconfigurable buses have the ability to change their connectivity, transmission parameters (e.g., nominal bandwidth), and in some cases monitor data transmission along a bus. In a few very sophisticated implementations, adaptive error detection and correction as well as load balancing schemes can be employed to maintain quality of service within prespecified operational constraints.
- Simplex communication is unidirectional (e.g., component A transmits to component B via the bus, or vice versa). Transmission from A to B and from B to A does not occur at the same time.
- Duplex transmission is bidirectional (e.g., A-to-B and B-to-A transmission occurs concurrently), for example, by interleaving packets from messages that are sent in alternate directions.
Example 5. Suppose a bus is rated at a nominal bandwidth B = 1 gigabit per second (Gbps). What is the actual range of useful bus bandwidth in simplex mode, assuming f = 60 percent of packets have to be retransmitted? How many megabytes per second of throughput can this realistic bandwidth achieve, ignoring overhead?

Answer: If f percent of the packets have to be retransmitted, then the nominal bus bandwidth is decreased by a factor of (1 + f/100). Thus, the effective bandwidth is given by

B' = B / (1 + f/100) = B / (1 + 0.6) = 0.625 · B ,

which corresponds to 625 Mbps. Converting bits to bytes, we obtain 78.125 MB/sec as the actual bus bandwidth.
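Example 5's derating calculation, in the same units as the example:

```python
# Example 5: effective bus bandwidth when f percent of packets must be
# retransmitted, starting from a nominal 1 Gbps bus.

B = 1e9                        # nominal bandwidth, bits per second
f = 60                         # percent of packets retransmitted
B_eff = B / (1 + f / 100)      # 625 Mbps effective
MB_per_sec = B_eff / 8 / 1e6   # 78.125 MB/sec (8 bits per byte)
```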
- CD - cost of error detection procedure (e.g., parity checking, checksum formation and comparison);
- CC - cost of error correction (where possible); and
- CR - cost of packet retransmission.