Real Computer Science begins where we almost stop reading ...: Instruction Pipeline and CPU Performance

Sunday, June 23, 2013

Instruction Pipeline and CPU Performance

Pipeline is implementation technique where multiple instructions are overlapped in execution.
Pipeline does not reduces the completion task for single task but increases the throughput of entire workload
preformance increase from pipeline is directly proportional to number of pipe stages

Key Terms
Instruction Latency : time from initiation of instruction until its result available.
Throughput is number of operations per unit time
Pipeline overhead is (pipeline register delay + clock skew)
Lenght of all stages is same in pipeline, equal to length of longest stage + overhead
Speed up = (average instruction time non pipeline)/ (avg instruction time in pipeline architecture)
Average instruction time in pipeline architecture is equal to instruction latency because after first instruction each instruction completes in one stage time.
Speedup for n instructions is ratio of time needed to process n instruction of non pipeline to that on pipeline architecture

that is
Sn = n * T / (n+k-1) * Tk
T time to process one instruction on non pipeline
Tk clock cycle time

Example
Consider 6 execution stages of lengths 40 ns, 50 ns, 60 ns, 50 ns, 60 ns, and 40 ns.

Find the instruction latency and time execute 100 instructions on both pipelined and non pipeline machine assume overhead of 5ns in pipline machine.

Solution
non pipeline machine
instruction latency is time from initiation of instruction to its completion
each instruction consist of 6 stages of enght mentioned above so
instruction latency = 40+50+60+50+60+40 = 300
time to execute 100 instruction is 100*instruction latency = 30000ns

Pipeline machine
Since all stages have same length in pipeline each would be equal to (largest stage + overhead)
i.e 60+5ns
After first instruction is complete we get result for other instruction at interval of clock cycle
So Time to execute 100 instruction is 65*6 + 99*65=6825ns

Problem An instruction requires 4 stages of 30ns 9ns,20ns,10ns.
i)What is minimum asynchronous time for instruction to complete
Solution 30+9+20+10 = 69ns
ii)what should be the clock rate if we use pipeline for this stages
clock rate should be able to accommodate the largest stage 30sec because each instruction in pipeline can be initiated in 30ns and would give latency of 30*4 = 120ns
we can choose finer clock by looking at the shortest stage that is 9ns and find clock rate that is integrally divisible in other stages if we choose 10ns it would take 3 clock for first stage 1 for second 2 for third and 1 for last and would give latency of 70ns
ii)what is speed up
speed up = 30+9+20+10/30 =2.3

Harzard
Consider a 5 stage pipeline, to run pipeline at full performance 4 next instruction will start their execution before first complete but what if these next instructions are dependent on the result of the first instruction
I1
I2 //depends on the output of I1
I3//depends on output of I1

these kind of situations which prevent next instruction to continue are called hazards.such situations can occur because programmer may assume that in sequential program first instruction will complete and then only next instruction will execute

Data Hazard
the pipeline is delayed because the data to be operated on are delayed for some reason
when the result of instruction i is the source of instruction j (j>i)
decode stage cannot continue for instruction 2 and need to be resumed after instruction 1 has written the result
Instruction attempts to read the data before data is available in register file

Classification of Data Hazard
1)RAW read after write
2)WAW
3)WAR

Data Hazards are avoided in one of the two ways
Operand Forwarding
We allow ID to fetch the wrong content from register file but by using the bypass multiplexer we compare the content read in the decode stage with content of execute stage
Before instruction j writes data to register it is availabe at output of ALU after execute stage.
Instruction Decode fetches contents from register file

Pipeline Interlock

load addr --> r0
add r1 <--- r1 + r0
For such instructions it is not possible to foreward the result because the content of register r0 is not availble until the end of memory access for load while the add instruction needs the r0 at mem access of load
We need to insert stall in such case

Control Hazards
caused by conditional and unconditional branching

1)Stall Pipeline

2)Predict not taken

high performance than previous but more complex
predict ------------not taken branch ------------continue with instructions after branch
two scenarios occur when its actuallu taken and when not taken

not taken branch

no change

taken branch

Assume after ID stage we know if branch is taken or not and new instruction address then we need to restart IF for branch instruction and discard the fetched instruction causing a delay of one clock cycle

          1     2    3    4     5    6     7
----------------------------------------
B       F    D   E   M    W
i+1            F   -    -       -
Bk                  F    D     E ..

3) Predict branch as taken
If target address can be known before branch instruction then it can be of use

4) Delayed Branch

B
i+1
i+2   /// branch delay slot
i+3
...
Bk

Instructions in branch delay slot are executed whether or not branch is taken

For taken branch it will look like

          1     2    3    4     5    6     7
----------------------------------------
B       F    D   E   M    W
i+1            F   D    E     M W
Bk                  F    D     E ..

If branch is untaken,
    execution continues with the instruction after the branch delay instruction.
If the branch is taken,
    execution continues at the branch target.

The job of the compiler is to make the instruction that fills this slot useful and valid.

Branch scheduling schemes
1) from before the branch

we move the instruction before the branch in the delay slot but branch should not depend on moved instruction
this strategy is best, always results in performance enhancement

2)from target

In case before branch instruction decides the branch we can move the taget instruction in the branch delay slot It should not harm to execute the rescheduled instruction in case branch is not taken, when branch is taken can increase the code as two copies will be executed
improves performance when branch is taken

3)from fall through

improves performance when branch is not taken

Cancelling branch

Solving problems
1)If no forwarding is used then result is available after MEM
2)If forwarding is used result is available in ID
3)If LW instruction is used then result is not available until MEM even if forwarding is used
4)If branch not taken strategy is used and if loop runs n times then n-1 times branch will be predicted wrong and 1 time it is predicted correct
5)find number of times loop runs and then count number of steps by counting until last instruction's IF stage
multiply to get total number of steps
6)MUX is used for bypassing
7)seperate adder needs to be used when branch is to be calculated in ID stage
-------------------------------------------------------

CPU Performance
1) If x is n% faster than y
execution_time (y) = (1+nf) * execution_time (x)

2)CPU time = number of clock cycles * clock cycle time or
                   = nnumber of clock cycles / clock rate

number of clock cycle = instruction count(IC) * Clock cycles per instruction (CPI)

so CPU time = IC * CPI * (clock cycle time)
                     = (instructions/program) * (cycles/instruction) * cycle time

Problem compare the cpu time for two cases
i)condition code is set by the compare instruction and followed by branch that test condition code
ii)compare is included in branch instruction itself
Assume conditional branch instruction takes 2 clock cycles and all other instruction take 1 cycle.For case 1 number of branch instruction is 20%. clock cycle time for case 1 is 25% faster also because it has single instruction
Solution
cpu time = IC(i) * CPI * (clock cycle time)
CPI = CPI of branch + CPI for other instructions(including compare)
CPI branch = (frequency of branch instruction) * (cycles for branch instruction )
frequency of branch instruction(i) = .20*IC(i) / IC(i) = .20
CPI(i)    = .20*2 + .80*1 = 1.2
for case ii
we donot have compare instructions we have only load instructions which constitutes 20% of all instructions
total number of instructions in terms of case i is
IC(ii) = IC(i) - .2* IC(i) = .8*IC(i)
CPI branch = (frequency of branch instruction) * (cycles for branch instruction )
since 20% of total instructions in case(i) were branch instructions but in case(ii) total number of instructions has changed which is 80% of case i
frequency of branch instruction(ii) = .20*IC(i) / .80*IC(i) = .25
CPI(ii) = .25*2 + .75*1 = 1.25
clock cycle time(ii) = 1.25 * clock cycle time (i)

Unpipeline system
Average instruction execution time = clock cyle * Average CPI

Pipeline system without hazards
Average instruction executiion time = (clock cycle + overhead) * Average CPI
Average CPI for pipeline system is considered as 1 so
Average instruction executiion time = (clock cycle + overhead)

Pipeline system with hazards
A stall causes pipeline performance to degrade from the ideal
speedup = (Avg execution time on non pipeline system) / (Avg execution time on pipeline system)
CPI = Ideal CPI + pipeline stall clock cycles per instruction (because of hazards)
ideal CPI of pipeline system is 1
so CPI = 1+ pipeline stall clock cycles per instruction

Instruction pipeline performance in case of branch penalty

Avg time for instruction = Ideal CPI + (penalty) * P(branch)* P(taken)

1 + (penalty) * P(branch)* P(taken)

Problem
In an instruction pipeline of 10ns clock memeory instruction takes 2 stall cycles branch instruction takes 3 stall cycles and frequency of memory and branch instruction is 20% and 30% resp.calculate average instruction time
Solution
Average instruction time = (Ideal CPI + pipeline stall clock cycle per instruction ) * clock cycle time
                                     = (1+ (.20*2+.30*3)) * 10ns
                                    = 23ns

If we consider out of 20% branch only 10% of time its taken then
Average instruction time = (Ideal CPI + pipeline stall clock cycle per instruction ) * clock cycle time

                                      = (1+ (.20*.10*2+.30*3)) * 10ns  = 19.4ns

Without Forwarding dependent instruction Decode cannot occur until WB of the previous instruction
that is
I1 I2 where I2 is dependent on I1

        1   2     3      4         5      6
----------------------------------------
I1    if     id     ex   mem      wb
I2           if       -       -         id      ex   ...

With Forwarding
Results can be forwarded from EX, MEM and WB so immediately next as well as other instruction after that can get the result

        1   2     3      4         5      6
----------------------------------------
I1    if     id     ex   mem      wb
I2           if      id     ex        ...

here ex of I2 is forwarded result from ex of I1

Problem Consider standard RISC pipeline with 5 stages each instruction takes single cycle for each stage
loop: I1 //load instruction
        I2 // add instruction using data loaded in register in I1
        I3
        I4
        I5
       bnz (condition) loop

bnz is branch instruction and assume that loop runs for n times Draw the pipeline timing diagram
i) when no forwarding
ii) with forwarding
Solution

i) no forwarding

Result of the branch is not known until mem stage of branch instruction so the bnz instruction restarts after mem stage of I5
17 clock cycle it takes for single iteration,18th cycle we will be counting in the next iteration so for all iteration except last we have 17 clock cycles and for the last iteration we have 18 clock cycle
total clock cycle for n iterations = (n-1)*17+1*18

ii) with forwarding and branch prediction as not taken

In cycle 4 we have stall for instruction I2 because it is case of Pipeline interlock where data from I1(load instruction) is not available until after memory instruction to I2(add instruction)

m in the last instruction I1 is because of incorrect branch prediction of not taken
Total cycles for complete n iterations = 10*n-1+11*1

Note n-1 time its predicted wrong but the last time the branch is predicted correct so we need to take that case into account

http://www.cc.gatech.edu/classes/cs4760_96_fall/hw/hw2-sol.html
http://pages.cs.wisc.edu/~markhill/cs752/Fall1999/hw3-sol.html

Real Computer Science begins where we almost stop reading ...

Sunday, June 23, 2013

Instruction Pipeline and CPU Performance

No comments:

Post a Comment