Friday, 4 April 2014

Cache Memory Assignments

Problem 1.

The diagram above illustrates a blocked, direct-mapped cache for a computer that uses 32-bit data words and 32-bit byte addresses.
1. What is the maximum number words of data from main memory that can be stored in the cache at any one time?
Maximum number of data words from main memory = (16 lines)(4 words/line) = 64 words
1. How many bits of the address are used to select which line of the cache is accessed?
With 16 cache lines, 4 bits of the address are required to select which line of the cache is accessed.
1. How many bits wide is the tag field?
Bits in the tag field = (32 address bits) - (4 bits to select line) - (4 bits to select word/byte) = 24 bits
1. Briefly explain the purpose of the one-bit V field associated with each cache line.
The tag and data fields of the cache will always have value in them, so the V bit is used to denote whether these value are consistent (valid) with what is in memory. Typically the V bit for each line in the cache is set to "0" when the machine is reset or the cache is flushed.
1. Assume that memory location 0x2045C was present in the cache. Using the row and column labels from the figure, in what cache location(s) could we find the data from that memory location? What would the value(s) of the tag field(s) have to be for the cache row(s) in which the data appears?
The cache uses ADDR[7:4] to determine where data from a particular address will be stored in the cache. Thus, location 0x0002045C will be stored in line 5 of cache. The tag field should contain the upper 24 bits of the address, i.e., 0x000204. Note that the bottom 4 bits of the address (0xC) determine which word and byte of the cache line is being referenced.
1. Can data from locations 0x12368 and 0x322FF8 be present in the cache at the same time? What about data from locations 0x2536038 and 0x1034? Explain.
Location 0x12368 will be stored in line 6 of the cache. Location 0x322F68 will be stored in line F of the cache. Since the lines differ, both locations can be cached at the same time. However, locations 0x2536038 and 0x1034 both would be stored in line 3 of cache, so they both could not be present in the cache at the same time.
1. When an access causes a cache miss, how many words need to be fetched from memory to fill the appropriate cache location(s) and satisfy the request?
There are 4 words in each line of the cache and since we only have one valid bit for the whole line, all 4 words have to have valid values. So to fill a cache line on a cache miss all 4 words would have to be fetched from main memory.

Problem 2. Cache multiple choice:
1. If a cache access requires one clock cycle and handling cache misses stalls the processor for an additional five cycles, which of the following cache hit rates comes closest to achieving an average memory access of 2 cycles?

(A) 75%
(B) 80%
(C) 83%
(D) 86%
(E) 98%
2 cycle average access = (1 cycle for cache) + (1 - hit rate)(5 cycles stall)
=> hit rate = 80%
1. LRU is an effective cache replacement strategy primarily because programs

(A) exhibit locality of reference
(B) usually have small working sets
(C) read data much more frequently than write data
(D) can generate addresses that collide in the cache
(A). Locality implies that the probability of accessing a location decreases as the time since the last access increases. By choosing to replace locations that haven't been used for the longest time, the least-recently-used replacement strategy should, in theory, be replacing locations that have the lowest probability of being accessed in the future.
1. If increasing the associativity of a cache improves performance it is primarily because programs

(A) exhibit locality of reference
(B) usually have small working sets
(C) read data much more frequently than write data
(D) can generate addresses that collide in the cache
(D). Increasing cache associativity means that there are more cache locations in which a given memory word can reside, so replacements due to cache collisions (multiple addresses mapping to the same cache location) should be reduced.
1. If increasing the block size of a cache improves performance it is primarily because programs

(A) exhibit locality of reference
(B) usually have small working sets
(C) read data much more frequently than write data
(D) can generate addresses that collide in the cache
(A). Increased block size means that more words are fetched when filling a cache line after a miss on a particular location. If this leads to increased performance, then the nearby words in the block must have been accessed by the program later on, ie, the program is exhibiting locality.
1. A fully-associative cache using an LRU replacement policy always has a better hit rate than a direct-mapped cache with the same total data capacity.

(A) true
(B) false
False. Suppose both caches contain N words and consider a program that repeatedly accesses locations 0 through N (a total of N+1 words). The direct-mapped cache will map locations 0 and N into the same cache line and words 1 through N-1 into separate cache lines. So in the steady state, the program will miss twice (on locations 0 and N) each time through the loop.Now the fully-associative case: when the program first accesses word N, the FA cache will replace word 0 (the least-recently-used location). The next access is to location 0 and the FA cache will replace word 1, etc. So the FA cache is always choosing the replace the word the program is about to access, leading to a 0% hit rate!
1. Consider the following program:
```integer A[1000];
for i = 1 to 1000
for j = 1 to 1000
A[i] = A[i] + 1
```
When the above program is compiled with all compiler optimizations turned off and run on a processor with a 1K byte direct-mapped write-back data cache with 4-word cache blocks, what is the approximate data cache miss rate? (Assume integers are one word long and a word is 4 bytes.)

(A) 0.0125%
(B) 0.05%
(C) 0.1%
(D) 5%
(E) 12.5%
(A). Considering only the data accesses, the program performs 1,000,000 reads and 1,000,000 writes. Since the cache has 4-word blocks, each miss brings 4 words of the array into the cache. So accesses to the next 3 array locations won't cause a miss. Since the cache is write-back, writes happen directly into the cache without causing any memory accesses until the word is replaced. So altogether there are 250 misses (caused by a read of A[0], A[4], A[8], ...), for a miss rate of 250/2,000,000 = 0.0125%
1. In a non-pipelined single-cycle-per-instruction processor with an instruction cache, the average instruction cache miss rate is 5%. It takes 8 clock cycles to fetch a cache line from the main memory. Disregarding data cache misses, what is the approximate average CPI (cycles per instruction)?

(A) 0.45
(B) 0.714
(C) 1.4
(D) 1.8
(E) 2.22
(C). CPI = (1 inst-per-cycle) + (0.05)(8 cycles/miss) = 1.4
1. Consider an 8-line one-word-block direct-mapped cache initialized to all zeroes where the following sequence of word addresses are accessed:1, 4, 5, 20, 9, 19, 4, 5, 6, and 9.
Which of the following tables reflect the final tag bits of the cache?
First map the addresses to cache line numbers and tags where
line number = address mod 8
```address:   1   4   5  20   9  19   4   5   6   9
line #:    1   4   5   4   1   3   4   5   6   1
tag:       0   0   0   2   1   2   0   0   0   1
```
So, figure (E) represents the final tag bits of the cache.
1. Consider the following partitioning of a CPU's 32-bit address output which is fed to a direct-mapped write-back cache:
What is the memory requirement for data, tag and status bits of the cache?

(A) 8 K bits
(B) 42 K bits
(C) 392 K bits
(D) 1,160 K bits
(E) 3,200 K bits
The tag is 15 bits, cache index is 13 bits, and byte offset 4 bits (ie, 16 bytes/block). So there are 213= 8192 cache lines. Each cache line requires
``` 15 tag bits
1 valid bit
1 dirty bit (since this is a write-back cache)
128 data bits (16 bytes/cache line)
===
145 bits per cache line
```
Total storage required = 8192*145 bits = 1,160K bits.

Problem 3. A student has miswired the address lines going to the memory of an unpipelined BETA. The wires in question carry a 30-bit word address to the memory subsystem, and the hapless student has in fact reversed the order of all 30 address bits. Much to his surprise, the machine continues to work perfectly.
1. Explain why the miswiring doesn't affect the operation of the machine.
Since the Beta reverses the order of the 30 bit address in the same manner for each memory access, the Beta will use the same reversed address to access a particular memory location for both stores and loads. Thus, the operation of the machine will not be affected.
1. The student now replaces the memory in his miswired BETA with a supposedly higher performance unit that contains both a fast direct mapped cache and the same memory as before. The reversed wiring still exists between the BETA and this new unit. To his surprise, the new unit does not significantly improve the performance of his machine. In desperation, the student then fixes the reversal of his address lines and the machine's performance improves tremendously. Explain why this happens.
Caches take advantage of locality of reference by reading in an entire block of related data at one time, thereby reducing main memory accesses. By reversing the order of the 30 bit address, locality of the memory addresses is disrupted. The low-order bits that would normally place related data close to one another are instead the high-order bits and related data is more spread out through the main memory. This reduction in locality reduces cache performance significantly. When the student fixes the address line reversal problem, locality of the memory is restored, and the cache can perform as intended.

Problem 4. For this problem, assume that you have a processor with a cache connected to main memory via a bus. A successful cache access by the processor (a hit) takes 1 cycle. After an unsuccessful cache access (a miss), an entire cache block must be fetched from main memory over the bus. The fetch is not initiated until the cycle following the miss. A bus transaction consists of one cycle to send the address to memory, four cycles of idle time for main-memory access, and then one cycle to transfer each word in the block from main memory to the cache. Assume that the processor continues execution only after the last word of the block has arrived. In other words, if the block size is B words (at 32 bits/word), a cache miss will cost 1 + 1 + 4 + B cycles. The following table gives the average cache miss rates of a 1 Mbyte cache for various block sizes:
1. Write an expression for the average memory access time for a 1-Mbyte cache and a B-word block size (in terms of the miss ratio m and B).
Average access time = (1-m)(1 cycle) + (m)(6 + B cycles) = 1 + (m)(5+B) cycles
1. What block size yields the best average memory access time?
1. If bus contention adds three cycles to the main-memory access time, which block size yields the best average memory access time?
1. If bus width is quadrupled to 128 bits, reducing the time spent in the transfer portion of a bus transaction to 25% of its previous value, what is the optimal block size? Assume that a minimum one transfer cycle is needed and don't include the contention cycles introduced in part (C).

Problem 5. The following four cache designs C1 through C4, are proposed for the Beta. All use LRU replacement where applicable (e.g. within each set of a set associative cache).
1. Which cache would you expect to take the most chip area (hence cost) ?
Cache C4 would most likely take up the most chip area because it is fully associative, thereby requiring a comparator for each cache line, and because it has the most data word capacity.
1. Which cache is likely to perform worst in a benchmark involving repeated cycling through an array of 6K integers ? Explain.
C2 would likely have the worst performance on a benchmark involving repeated cycling through an array of 6K integers since it is the only cache listed with less than 6K data word capacity.
1. It is observed that one of the caches performs very poorly in a particular benchmark which repeatedly copies one 1000-word array to another. Moving one of the arrays seems to cure the problem. Which cache is most likely to exhibit this behavior ? Explain.
We are told that one of the caches performs poorly in a particular benchmark which repeatedly copies one 1000-word array to another and that if one of the arrays is moved, the problem seems to be cured. This behavior is most likely exhibited by cache C3 because it is a direct mapped cache which only has one location to put any particular address. If the lower bits (used to choose the cache line) for the addresses of the array overlap, poor performance could be observed. Moving the array so that the lower bits of the array addresses don't overlap could solve this problem.
1. Recall that we say cache A dominates cache B if for every input pattern, A caches every location cached by B. Identify every pair (A, B) of caches from the above list where A dominates B. Explain your reasoning.
So long as we are not using a random replacement strategy, it is always possible to come up with a benchmark that will make a particular type of cache have a miss on every data access. Thus, we cannot say that one particular type of cache always dominates another type of cache. However, we can compare two caches of the same type. Both C4 and C1 are fully associative caches with the same replacement strategy. We can say that C4 dominates C1 since C4 has a greater data word capacity.

Problem 6. The data-sheet for a particular byte-addressable 32-bit microprocessor reads as follows:
The CPU produces a 32-bit virtual address for both data and instruction fetches. There are two caches: one is used when fetching instructions; the other is used for data accesses. Both caches are virtually addressed. The instruction cache is two-way set-associative with a total of 212 bytes of data storage, with 32-byte blocks. The data cache is two-way set-associative with a total of 213 bytes of data storage, with 32-byte blocks
1. How many bits long is the tag field in each line of the instruction cache?
There are 32 = 25 bytes per block. The cache has 212 total bytes and is 2-way set associative, so each set has 211 bytes and thus 211/25 = 26 cache lines. So the address is partitioned by the cache as follows:
[4:0] = 5 address bits for selecting byte/word within a block
[10:5] = 6 address bits for selecting the cache line
[31:11] = 21 address bit of tag field
1. How many address bits are used to choose which line is accessed in the data cache?
There are 32 = 25 bytes per block. The cache has 213 total bytes and is 2-way set associative, so each set has 212 bytes and thus 212/25 = 27 cache lines. So the address is partitioned by the cache as follows:
[4:0] = 5 address bits for selecting byte/word within a block
[11:5] = 7 address bits for selecting the cache line
[31:12] = 20 address bit of tag field
1. Which of the following instruction addresses would never collide in the instruction cache with an instruction stored at location 0x0ACE6004?

(B) 0x0C81C81C    (E) 0xFACE6004
(C) 0x00000004    (F) 0x0CEDE008
Collisions happen when instruction addresses map to the same cache line. Referring to the answer for (A), address bits [10:5] are used to determine the cache line, so location 0x0ACE6004 is mapped to cache line 0.Only (D) 0x0ACE6838 maps to a different cache line and hence could never collide in the instruction cache with location 0x0ACE6004.
1. What is the number of instructions in the largest instruction loop that could be executed with a 100% instruction cache hit rate during all but the first time through the loop?
The instruction cache hold 212 bytes or 210 = 1024 instructions. So if the loop had 1024 instructions it would just fit into the cache.

Problem 7. The following questions ask you to evaluate alternative cache designs using patterns of memory references taken from running programs. Each of the caches under consideration has a total capacity of 8 (4-byte) words, with one word stored in each cache line. The cache designs under consideration are:
DM: a direct-mapped cache.S2: a 2-way set-associative cache with a least-recently-used replacement policy.
FA: a fully-associative cache with a least-recently-used replacement policy.
The questions below present a sequence of addresses for memory reads. You should assume the sequences repeat from the start whenever you see "...". Keep in mind that byte addressing is used; addresses of consecutive words in memory differ by 4. Each question asks which cache(s) give the best hit rate for the sequence. Answer by considering the steady-state hit rate, i.e., the percentage of memory references that hit in the cache after the sequence has been repeated many times.
1. Which cache(s) have the best hit rate for the sequence 0, 16, 4, 36, ...
DM: locations 4 and 36 collide, so each iteration has 2 hits, 2 misses.S2: 100% hit rate. 0 and 16 map to the same cache line, as do 4 and 36, but since the cache is 2-way associative they don't collide.FA: 100% hit rate. The cache is only half filled by this loop.
1. Which cache(s) have the best hit rate for the sequence 0, 4, 8, 12, 16, 20, 24, 28, 32, ...
DM: locations 0 and 32 collide, so each iteration has 7 hits, 2 misses.S2: locations 0, 16 and 32 all map to the same cache line. The LRU replacement strategy replaces 0 when accessing 32, 16 when accesing 0, 32 when accessing 16, etc., so each iteration has 6 hits, 3 misses.FA: has 0% hit rate in the steady state since the LRU replacement strategy throws out each location just before it's accessed by the loop!
1. Which cache(s) have the best hit rate for the sequence 0, 4, 8, 12, 16, 20, 24, 28, 32, 28, 24, 20, 16, 12, 8, 4, ...
All caches perform the same -- locations 0 and 32 trade places in the caches, so each iteration has 14 hits and 2 misses.
1. Which cache(s) have the best hit rate for the sequence 0, 4, 8, 12, 32, 36, 40, 44, 16, ..
DM: 32 collides with 0, 36 with 4, 40 with 8, 44 with 12, so each itreation has only 1 hit and 8 misses.S2: locations 0, 16 and 32 trade places in the cache, so each iteration has 6 hits and 3 misses.FA: 0 hits since LRU throws out each location just before it's accessed by the loop.
1. Assume that a cache access takes 1 cycle and a memory access takes 4 cycles. If a memory access is initiated only after the cache has missed, what is the maximum miss rate we can tolerate before use of the cache actually slows down accesses?
If accesses always go to memory, it takes 4 cycles per access. Using the cache the average number of cycles per access is
1 + (miss rate)*4
So if the miss rate is larger than 75% the average number of cycles per access is more than 4.

Problem 8. Ben Bitdiddle has been exploring various cache designs for use with the Beta processor. He is considering only caches with one word (4 bytes) per line. He is interested in the following cache designs:
C1: 2-way set associative, LRU replacement, 256 total data words (128 sets of 2 words each).C2: 2-way set associative, random replacement, 256 total data words (128 sets of 2 words each).
C3: 2-way set associative, LRU replacement, 512 total data words (256 sets of 2 words each).
C4: 4-way set associative, LRU replacement, 512 total data words (128 sets of 4 words each).
C5: Fully associative, LRU replacement, 512 total data words.
In order to help her analysis, Ben is trying to identify cases where one cache dominates another in terms of cache hits. Ben considers that cache A dominates cache B if, given identical strings of memory references, every memory reference that gives a cache hit using B is also a hit using A. Thus if A dominates B, A will give at least as high a hit rate as B for every program.
In each of the following pairs of caches, deduce whether the first dominates the second:
1. C1 dominates C2
False. C1 has a 0% hit rate for 0, 256, 512, 0, 256, 512, ..., but C2 might do slightly better because it chooses the replacement set at random.
1. C2 dominates C1
No. C1 has a 100% hit rate for 0, 256, 0, 256, ..., but C2 might have an occasional miss.
1. C3 dominates C1
Yes. C3 differs only in having a higher capacity than C1.
1. C3 dominates C2
No. As we saw in (A) there are programs where LRU gets 0% hit rate and random may do slightly better, independently of the sizes of the caches.
1. C4 dominates C3
No. C4 has 0% hit rate on 0, 128, 256, 384, 512, 0, ... since all accesses map to the same cache line and LRU throws out the location just about to be accessed. In C3, 128 and 384 map to a different cache line than 0, 256 and 512, so manages a 40% hit rate in the steady state.
1. C4 dominates C2
No, for the same reason as (A) and (D).
1. C5 dominates C1
No. Consider the following access pattern: 0, accesses to 512 uncached locations whose addresses don't map to cache line 0 for cache C1, 0, ...C5 will replace location 0 on the 513th access and hence miss when 0 is accessed in the following cycle. C1 will have location 0 still in the cache when it's accessed again by the loop.
1. Averaged over a wide range of typical application programs, which of the above caches would you expect to yield the highest hit rate?
In general larger caches are better and fully associative caches are better than set associative caches, so C5 should have the highest hit rate.

Problem 9. Adverbs Unlimited is considering a computer system based loosely on the Beta. Five designs have been proposed, each of them similar to the Beta except for a cache between the 32-bit processor data bus and the main-memory subsystem. Like the Beta, each machine deals with 32-bit main-memory addresses, for a total address space of 232 bytes. The machines' caches differ only in the parameters of associativity, size, and writeback. The block size for each cache 1 word (4 bytes).
ModelAssociativityTotal data size (bytes)Write-
DEFINATELYfour-way216back
CERTAINLYdirect-mapped216back
HOPEFULLY4-way210through
PERHAPS2-way210back
DOUBTFULLYdirect-mapped210back
1. How many bits are required for the tag portion of each cache line for each of the architectures? How bits of comparitor are needed? How many bits of SRAM altogether (including tag fields, valid and dirty bits).
DEFINIATELY: 216/4-way = 214 bytes/subcache
=> 212 cache lines/subcache => 32 - 14 = 18 tag bits
=> 18 * 4 = 76 bits of comparator
=> total SRAM bits = 4*(8*214 data bits + 212*(18 tag + 1 valid + 1 dirty))CERTAINLY: 216/1-way = 216 bytes/subcache
=> 214 cache lines => 32 - 16 = 16 tag bits
=> 16 bits of comparator
=> total SRAM bits = 8*216 data bits + 214*(16 tag + 1 valid + 1 dirty)
HOPEFULLY: 210/4-way = 28 bytes/subcache
=> 26 cache lines/subcach => 32 - 8 = 24 tag bits
=> 24 * 4 = 96 bits of comparator
=> total SRAM bits = 4*(8*28 data bits + 26*(24 tag + 1 valid))
PERHAPS: 210/2-way = 29 bytes/subcache
=> 27 cache lines/subcach => 32 - 9 = 23 tag bits
=> 23 * 2 = 46 bits of comparator
=> total SRAM bits = 2*(8*29 data bits + 27*(23 tag + 1 valid + 1 dirty))
DOUBTFULLY: 210/1-way = 210 bytes/subcache
=> 28 cache lines => 32 - 10 = 22 tag bits
=> 22 bits of comparator
=> total SRAM bits = 8*210 data bits + 28*(22 tag + 1 valid + 1 dirty)
1. Address lines from the CPU are designated A31, ..., A1, A0, where A0 is the low-order address bit. Which of these CPU address lines are used as address inputs to the SRAMs of the cache in the PERHAPS model?
PERHAPS is a 2-way set-associative cache with a total of 210 bytes, so each direct-mapped subcache contains 29 bytes. With a block size of 1 word (4 bytes), address bits [8:2] would be used as the index into the 32-bit-wide SRAM.
1. Suppose that address lines A2 and A9 were inadvertently interchanged in the cable between the DOUBTFULLY CPU and its cache. Which, if any, of the following statements best describes the effect(s) of this change, assuming that other hardware and software remain unmodified?
1. The machine would no longer work.
2. The machine would continue to work as before.
3. The machine would continue to work, but at a reduced performance level.
4. The machine would continue to work, at an improved performance level.
(B). Address bits A2 through A9 are used as the cache index, interchanging them has no effect other than to change where in SRAM each cache line is stored, i.e., all the same locations are cached, they just happen to be stored in different cache SRAM locations than one might have expected.

Problem 10. You are designing a controller for a tiny cache that is fully associative but has only three words in it. The cache has an LRU replacement policy. A reference record module (RRM) monitors references to the cache and always outputs the binary value 1, 2, or 3 on two output signals to indicate the least recently used cache entry. The RRM has two signal inputs, which can encode the number 0 (meaning no cache reference is occurring) or 1, 2, or 3 (indicating a reference to the corresponding word in the cache).
1. What hit ratio will this cache achieve if faced with a repeating string of references to the following addresses: 100, 200, 104, 204, 200?
Here's what happens:
```access 100: miss; cache contains 100, ---, ---
access 200: miss; cache contains 200, 100, ---
access 104: miss; cache contains 104, 200, 100
access 204: miss; cache contains 204, 104, 200
access 200: hit;  cache contains 200, 204, 104
access 100: miss; cache contains 100, 200, 204
access 200: hit;  cache contains 200, 100, 204
access 104: miss; cache contains 104, 200, 100
access 204: miss; cache contains 204, 104, 200
access 200: hit;  cache contains 200, 204, 104
...
```
So in the steady state, location 200 stays in the cache and all other locations get replaced. So the hit rate is 2/5 or 40%.
1. The RRM can be implemented as a finite-state machine. How many states does the RRM need to have? Why?
There are 3! = 6 ways to list the three locations in order of use. Thus RRM needs 6 states, one state for each possible order.
1. How many state bits does the RRM need to have?
We can encode six states using 3 state bits.
1. Draw a state-transition diagram for the RRM.
1. Consider building an RRM for a 15-word fully associative cache. Write a mathematical expression for the number of bits in the ROM required in a ROM-and-register implementation of this RRM. (You need not calculate the numerical answer.)
There are 15! possible states, so we would need ceiling(log2(15!)) = 41 state bits. So the ROM would have 241 locations of 41 bits each, for a total of approximately 90 trillion bits.
1. Is it feasible to build the 15-word RRM above using a ROM and register in today's technology? Explain why or why not.
90 trillion bits is a bit much even for today's technology. In a .18u technology, a single transistor pulldown in a ROM might require (.2u x .5u) = .1u2, so our ROM would require about 9 square meters of silicon!