Sunday, 23 June 2013

Cache memory - Direct mapped, Set Associative, Associative ,Performance

Direct mapped Cache


Mapping function 

Each main memory block can be mapped to only single cache line like ,
 each 0, m,2m ... block can be assigned to cache block 0
 each m+1, 2m+1, 3m+1 ... block can be assigned to cache block 1

cache line number  = (main memory block number) mod number of cache lines
                            = (main memory address / Block size) mod number of cache lines
                            = (Md/B) mod m

Problem 1

A two dimensional byte array A[50][50] is stored in main memory starting address 0x1100 in row major order. Direct mapped cache is used with 32 line.
If array is accessed in following order

for (int i=0;i< M; i++)
 for (int j= 0;j
    A[i][j] = 0

1)Find number of cache miss assuming cache is empty initially.
2)consider same code to run again, one more time after first scenario then find total cache miss.
3) If the size of array element is t Bytes  then find total cache miss for single array access.
4) What difference would it make if we had fully associative cache.

Solution 1.1

Array is accessed in order

A[0][0], A[0][1] ....  A[0][N-1] .... A[M-1][0],A[M-1][1] ...A[M-1][N-1]

First element A[0][0] is stored at address (0x1100)H = (1048576)d in decimal

Cache line to which element A[0][0] will be mapped
  = (main memory block number) mod number of cache lines
  = (main memory address / Block size) mod number of cache lines
  = (1048576 / 64) mod 32 = 0

so first element is mapped to first cache line and complete block of 64 bytes are moved i.e elements A[0][0] to A[0][49] and A[1][0] to A[1][13](64th element of two dimensional array) will be moved in cache line zero

Next reference to A[0][1] will be a hit and so till A[1][13](64rd element of array)

Fig. shows organization of main and cache memory
Each Bth(block size) element will be miss because first element in every new memory block will not be present in cache.


i.e number of miss will be ceil(MN/B) for array A[M][N] for this case

= ceil(50x50 / 64) = 40


Fig shows state of cache memory
number of blocks that will be overwritten on first array access is ceil(MN/B) - m
= 40-32
= 8


Solution 1.2

On first run of function we had 40 cache miss  and first 8 blocks were overwritten.
Now array elements are accessed again from A[0][0]

Assuming it as first run then there would be ceil(MN/B) = 40 cache miss But since we already have data from previous run in the cache memory of which we had overwritten first 8 blocks so remaining 32 - 8 = 24 are still present in cache memory and we will have hit for that may elements So number of cache miss will be 40-24 = 16

Therefore Total number of cache miss after two run of function would be 40 + 16 = 56



Solution 1.3

When transferring block to cache from main memory  we moved 64 elements earlier which lead to miss on every 65 element
Now in a block we can accommodate (B/t) elements Hence transfer would be for next B/t -1 elements along this one.  Hence every (B/t) will give result in a cache miss here

So we will have (MN/(B/t)) cache miss.


Associative Mapping
main memory block can be loaded in to any cache line address is interpreted as just tag and word instead of line because which ever line is empty we put block into that number of lines in cache is not determined by address format unlike as in Direct mapped

Set Associative Cache
cache is divided into v sets each consisting of k lines
 k-way set associative mapping that is there are k possible lines in which the same mapped blocks can go.

cache line number  = (main memory block number) mod number of sets v
                            = (main memory address / Block size) mod (number of cache line/k)
                            = (Md/B) mod m/k
memory address is interpreted as tag set and word tag+set specify a block in main memory

when memory address is presented from CPU it indexes with set bits in cache to find set where would could be It then compares tag bits with every tag field of line in set and if there is match then it addresses the word in that line number of cache lines here tag comparison is only to k possible


Problem  A 64KB cache has 16 byte blocks. If addresses are 32 bits, how many
bits are used the tag, index, and offset in this cache?

a)if cache is direct mapped
Solution
number of bits for addressing word= 4bit
number of cache lines = size of cache / size of block = 64KB/16B = 4K = 12bits
remaining 16 bits for tag so  division of 32 bits is like
16bit  12bit  4bit
b)if cache is 4-way set associative
number of bits for addressing word are same 
number of lines/blocks in each cache set is 4 
so each cache set is 4*16B = 64B
number of cache sets = size of cache/size of cache set = 64KB/64B = 1K 
so 10 bits for cache set/index
so division of 32 bits is like 
18bit tag  10bit index  4bit word
c)for fully asscociative cache
number of index bits = 0 because any block can be stored in any line

Problem Consider following
cache size 8 byte, 
2-way set associative
2 byte block size with
LRU replacement
request for addresses
0110, 0000, 0010, 0001, 0011, 0100, 1001, 0000, 1010, 1111, 0111
determine address in cache and miss/hit

Solution 
number of lines/blocks in each set = 2
size of cache set = 2*2 B = 4B
number of cache set = 8B/4B = 2 
division of 4bits address
2bit tag  1bit index 1bit word


2 bytes(block) are moved for each address so next address would be also brought in cache here index maps directly into 0 or 1 cache set if index is 1 its set 1 and if bit is 0 its set 0

1) for address 0110
index is 1 so it goes in  set 1
set 0  --------
         ---------
set 1 01------(address 0110 and 0111) 

       ----------
its compulsory miss because first reference of 0110
2)0000
index 0
set 0  00------(address 0000 0001)
         ---------
set 1 01------(address 0110 and 0111)

       ----------
its compulsory miss because first reference of 0000
3) similarly for 0010 its compulsory miss and placed in set 1 line 2 with next byte 0011
4), 0001 its hit because its in set0 line1
5)0011 hit
6)0100 compulsory miss mapped to set 0 line 2 with next byte 0101
7)1001 is first reference hence compulsory miss it will replace content of set 0 line based on LRU


set 0  (0000 0001)
         (0100 0101)
set 1  (0110 0111) 

         (0010 0011)
0000 is reference least recently so its replaced in line1
and line 1 would be now (1001 1010)
8)0000
it 0000 second reference and its miss so its conflict miss 
replaces with LRU its 0100 second line so now set 0 line 2 is 0000 0001
9)1010 compulsory miss mapped to set 1 line 1 because 0110 should be replaced 
now content of set1 line 1 is 1010 1011
10)1111 compulsory miss mapped to set 1 line 2
set 1 line 2 1111 0011
11)0111
we brought this block into cache but was replaced without use so
its capacity miss



Types of cache miss
Compulsory miss: when block is accessed for first time its called first reference miss, cold start miss or compulsory miss
Capacity miss: when blocks are discarded because cache cannot contain all blocks needed for program execution. when block is discarded before its use
Conflict miss: when several blocks are mapped to same cache set/block. Occurs in case of direct mapped and set associative cache


Cache performace
number of memory stall = number of miss  * miss penalty(cost per miss)
number of miss = IC * misses/instruction
miss/instruction =number of memory access/ instruction * miss rate


Problem Consider CPI = 1, only data access are load and store which makes 50% instructions  miss penalty =25clock cycles, miss rate 2%

Solution 
number of memory access per instruction is 1 for instruction and .5 for data 
Memory stall cycle = IC * (1+.5)* .02*25 = .75*IC
cpu execution time = (cpu clock cycle + memory stall cycle) * clock cycle
with cpu with all cache hits its =  (IC*CPI ) * clock cycle
for cache its .2 miss rate ts  (IC*CPI +IC*.75) * clock cycle =1.75*IC * clock cycle
the computer with no cache miss is 1.75 times faster

Problem Consider CPI for ALU=1, Load/Store=1.5,  Branch=1.5, Jump=1. instruction mix of 40% ALU and  logical operations, 30% load and store, 20% branch and 10% jump instructions.  4-way set associative cache with separate data and instruction cache miss rate of 0.20 for data and miss rate of  0.10 for instructions miss penalty of 50 cycles for both . What is the effective CPU time (or effective CPI with memory stalls) and the average memory access time for this application with this Cache organization?

Solution
CPI ideal =(0.4*1)+(0.3*1.5)+(0.2*1.5)+(0.1*1) = 1.25.
memory stalls = number of memory access/ instruction * miss rate *miss penalty
.3 load/store data operations
1 instruction 
memory stall = .3*.2*50 + 1*.1*50 =8
Average memory access time = percentage of data accesses (hit time + data miss rate* miss penalty) + percentage of inst. Accesses (hit time + inst.miss rate*miss penalty).
Therefore AMAT = (0.3/1.3)(1 + 0.2*50) + (1/1.3)(1+ 0.1*50).


b)Now consider a 2 level 4-way unified cache with a level l (L1) miss rate of 20% (0.20) and a level 2 (L2) local miss rate of 30% (0.30). Assume hit time in L1 is 1 cycle, assume miss penalty is 10 cycles if you miss in L1 and hit in L2 (i.e., hit time in L2 is 10 cycles), and assume miss penalty is 50 cycles if you miss in L2 (i.e., miss penalty in L2 is 50 cycles). Derive the equation for the effective CPU time (or effective CPI) and the average memory access time for the same instruction mix as part (a) for this cache organization.

First note that the miss rate for L2 is given as the local miss rate. Average memory accesses per instruction = 1.3 as noted earlier (0.3 for data and 1 for inst).



Cache miss penalty reduction
L2 cache second level cache
average memory access time = hit time(L1) + miss rate(L1) * miss penalty(L1)
miss penalty(L1) = hit time(L2) + miss rate(L2) * miss penalty(L2)
Local miss rate is total number of miss in cache divided by total number of access to cache
 Global miss rate is total number of misses/ total number of CPU accesss
for L2 global miss rate is  miss rateL1 * miss rate L2
for L1 global and local miss rate is same

Write policy
write through written to both block in cache and to lower memory
write back written only to block in cache .Modified block is written to memory only when it is replaced. write occurs at speed of cache memory, less memory bandwidth required save intermediate writes. Read miss can result in writing the 
cpu wait during write through is called write stall. 
write allocate  a block is allocated on write miss that is a block is written in memory so next time it will be hit
no write allocate until read miss no block is allocated


www.seas.gwu.edu/~bhagiweb/cs211/homeworks/hw4-sol.pdf


References:
http://www.cs.lth.se/EDA115/    -- Cache memories Problem solutions