





#### **Registers**

- Top of memory hierarchy
- User visible registers

ADD R1,R2,R3

- Programmer / Compiler decides how to use these
- How many? Names?

BNEQ Loop

- Control and status registers
  - Some of these used indirectly by the program
    - PC, PSW, flags, ...
  - Some used only by CPU internally
    - MAR, MBR, ...
- Internal latches (*apurekisteri*) for temporal storage during instruction execution
  - Example: Instruction register (IR) instruction interpretation; operand first to latch and only then to ALU
  - ALU output before result moved to some register

Computer Organization II, Autumn 2010, Teemu Kerola

22.11.2010



## User visible registers

- Different processor families ⇒
  - different number of registers
  - different naming conventions (nimeämistavat)
  - different purposes
- General-purpose registers (*yleisrekisterit*)
- Data registers (datarekisterit) not for addresses!
- Address registers (osoiterekisterit)
  - Segment registers (segmenttirekisterit)
  - Index registers (indeksirekisterit)
  - Stack pointer (pino-osoitin)
  - Frame pointer (ympäristöosoitin)
- Condition code registers (*tilarekisterit*)

No condition code regs.

IA-64, MIPS

Computer Organization II, Autumn 2010, Teemu Kerola





#### **PSW - Program Status Word**

- Name varies in different architectures
- State of the CPU
  - Privileged mode vs user mode
- Result of comparison (*vertailu*)
  - Greater, Equal, Less, Zero, ...
- Exceptions (poikkeus) during execution?
  - Divide-by-zero, overflow
  - Page fault, "memory violation"
- Interrupt enable/ disable
  - Each 'class' has its own bit
- Bit for interrupt request?
  - I/O device requesting guidance

Design issues:

- OS support
- memory and registers in
- control data storing
- paging
- subroutines and stacks

- etc

Computer Organization II, Autumn 2010, Teemu Kerola











# **Computer Organization II**

# Instruction pipelining (*liukuhihna*)

Computer Organization II, Autumn 2010, Teemu Kerola

22.11.2010



# Laundry example (by David A. Patterson)

- Ann, Brian, Cathy, Dave: each have one load of clothes to wash, dry and fold
- ABCD
- Washer takes 30 min



- Dryer takes 40 min
- **أ**∓
- "Folder" takes 20 min

Computer Organization II, Autumn 2010, Teemu Kerola







#### Lessons

- Pipelining does not help <u>latency of single task</u>, but it helps <u>throughput of the entire workload</u>
- Pipelining can <u>delay single task</u> compared with situation where it is alone in the system
  - Next stage occupied, must wait
- <u>Multiple tasks</u> operating simultaneously, but different phases
- Pipeline rate limited by <u>slowest</u> pipeline stage
  - Can proceed when all stages done
  - Not very efficient, if different stages have different durations, unbalanced lengths
- Potential speedup
  - = maximum possible speedup
  - = number of pipe stages

Computer Organization II, Autumn 2010, Teemu Kerola









#### Pipeline speedup (nopeutus)?

- Lets calculate (based on Fig 12.10):
  - 6- stage pipeline, 9 instr. → 14 time units total
  - Same without pipeline  $\rightarrow$  9\*6 = 54 time units
  - Speedup =  $time_{orig} / time_{pipeline} = 54/14 = 3.86 < 6!$
  - Maximum speed at times 6-14
    - one instruction per time unit finishes
    - 8 time units → 8 instruction completions
    - Maximum speedup =  $time_{orig}$  /  $time_{pipeline}$  = 48/8 = 6
- Not every instruction uses every stage
  - Will not affect the pipeline speed some stages unused
  - Speedup may be small (some stages idle, waiting for slow)
  - Unused stage → CPU idle (execution "bubble")
  - Serial execution could be faster (no wait for other stages)

Computer Organization II, Autumn 2010, Teemu Kerola

22.11.2010

40



# Pipeline performance: one cycle time



- Cycle time is the same for all stages
  - Time (in clock pulses) to execute the stage
- Each stage takes one cycle time to execute
- Slowest stage determines the pace (tahti, etenemisvauhti)
  - The longest duration becomes bottleneck

Computer Organization II, Autumn 2010, Teemu Kerola

22.11.2010

20







#### **Pipeline Features**

- Extra issues
  - CPU must store 'midresults' somewhere between stages and move data from buffer to buffer
  - From one instruction's viewpoint the pipeline takes longer time than single execution
- But still
  - Executing large set of instructions is faster
  - Better throughput (*läpimenoaste*) (instructions/sec)
- The parallel (concurrent) execution of instructions in the pipeline makes them proceed faster as whole, but slows down execution of single instruction

Computer Organization II, Autumn 2010, Teemu Kerola

22.11.2010

22



# **Pipeline Problems and Design Issues**

- Structural dependency (rakenteellinen riippuvuus)
  - Several stages may need the same HW

■ Memory: FI, FO, WO

■ ALU: CO, EI

| STORE | R1,VarX    |
|-------|------------|
| ADD   | R2,R3,VarY |
| MUL   | R3,R4,R5   |

#### ■ Control dependency (kontrolliriippuvuus)

- No knowledge on next instruction
- E.g., (conditional) branch destination may be known only after El-stage
- → Prefetched wrong instructions

#### ■ Data dependency (datariippuvuus)

Instruction needs the result of the previous non-finished instruction

| ADD  | R1,R7, R9 |
|------|-----------|
| Jump | There     |
| ADD  | R2,R3,R4  |
| MUL  | R1,R4,R5  |

MUL R1,R2,R3 LOAD R6, Arr(R1)

Computer Organization II, Autumn 2010, Teemu Kerola



#### **Pipeline Dependency Problem Solutions**

- In advance: prevent (some) dependency problems completely
- At run time: Hardware must notice and wait until all dependencies are cleared
  - Add extra waits, "bubbles", to the pipeline; Commonly used
  - Bubble (*kupla*) delayes everything behind it in all stages
- Structural dependency
  - More hardware, e.g., separate ALUs for CO and EI stages
  - Lots of registers, less operands from memory
- Control dependency
  - Clear pipeline, fetch new instructions
  - Branch prediction, prefetch these or those?
- Data dependency
  - Change execution order of instructions
  - By-pass (oikopolku) in hardware between stages: earlier instruction's result can be accessed already before its WO-stage is done

Computer Organization II, Autumn 2010, Teemu Kerola

22.11.2010



## **Data dependency**

- Read after Write (RAW) (a.k.a true or flow dependency)
  - Occurs if succeeding read takes place before the preceding write operation is complete
- Write after Read (WAR) (a.k.a antidependency)
  - Occurs if the succeeding write operation completes before the preceding read operation takes place
- Write after Write (WAW) (a.k.a output dependency)
  - Occurs when the two write operations take place in the reversed order of the intended sequence

Add r1,r5,r6 Store r1, A Add r1, r2, r3

Load rl, A

Add r3, r2, r1

Add r3, r2, r1

Load r1 A

■ The WAR and WAW are possible only in architectures where the instructions can <u>finish</u> in <u>different order</u>

Discussion?

22.11.2010

Computer Organization II, Autumn 2010, Teemu Kerola









### **Computer Organization II**

# Pipelining and Jump Optimization

Multiple streams (*Monta suorituspolkua*)

Delayed branch (*Viivästetty hyppy*)

Prefetch branch target (*Kohteen ennaltanouto*)

Loop buffer (*Silmukkapuskuri*)

Branch prediction (*Ennustuslogiikka*)

Computer Organization II, Autumn 2010, Teemu Kerola







#### Multiple instruction streams (monta suorituspolkua)

- Execute speculatively to both directions
  - Prefetch instructions that follow the branch to the pipeline
  - Prefetch instructions from <u>branch target</u> to (another) pipeline
  - After branch decision: reject the incorrect pipeline (its results)
- Problems
  - Branch target address known only after some calculations
  - Second split on one of the pipelines
    - Continue any way? Only one speculation at a time?
  - More hardware!
    - More pipelines, speculative results (registers!), control
  - Speculative instructions may delay real work
    - Bus and register contention? More ALUs?
- Capability to cancel not-taken instruction stream from pipeline
  - easier, if all changes done in WB phase

Computer Organization II, Autumn 2010, Teemu Kerola

22.11.2010

IBM 370/168,

IBM 3033

Intel IA-64



#### Prefetch branch target (kohteen ennaltanouto)

- Prefetch just branch target instruction, but do not execute it yet
  - Do only FI-stage
  - If branch taken, no need to wait for memory
- Must be able to clear the pipeline
- Prefetching branch target may cause page-fault

IBM 360/91 (1967)

Computer Organization II, Autumn 2010, Teemu Kerola



#### Loop buffer (silmukkapuskuri)

- Keep *n* most recently fetched instructions in high speed buffer inside the CPU
  - Use prefetch also
    - With good luck the branch target is in the buffer
    - F.ex. IF-THEN and IF-THEN-ELSE structures
- Works for small loops (at most *n* instructions)
  - Fetch from memory just once
- Gives better spacial locality than just cache

CRAY-1 Motorola 68010 ... Intel Core-2

Computer Organization II, Autumn 2010, Teemu Kerola

22.11.2010



#### **Static Branch Prediction**

Make an (educated?) guess which direction is more probable:

Branch or no?

Motorola 68020 VAX 11/780

Intel Pentium III

- Static prediction (*staattinen ennustus*)
  - Fixed: Always taken (aina hypätään)
  - Fixed: Never taken (ei koskaan hypätä)
    - ~ 50% correct
  - Predict by opcode (operaatiokoodin perusteella)
    - In advance decided which codes are more likely to branch
    - For example, BLE instruction is commonly used at the end of stepping loop, guess a branch
    - ~ 75% correct [LILJ88]

Computer Organization II, Autumn 2010, Teemu Kerola

22.11.2010

36



#### **Dynamic Branch Prediction**

- Dynamic prediction
  - Make a quess based on earlier history for (this) branch
  - Logic: What has happened in the recent history with this instruction
    - Improves the accuracy of the prediction
  - <u>Implementation</u>: extra internal memory = branch history table
    - Instruction address (for this branch)
    - Branch target (instruction or address) need this for quick action
    - Decision: taken / not taken
- Simple prediction based on just the previous execution
  - 1 bit memory is enough
  - Loops will always have one or two incorrect predictions

Computer Organization II, Autumn 2010, Teemu Kerola

22.11.2010

2-Bit Branch Prediction Logic for One Instruction 1st miss Not Taken ■ Improved simple Predict Predict model Taken Taken Taken ■ Don't change the prediction with one misprediction ■ Based on two previous executions of this instruction Not Taken ■ 2 bits enough Predict Predict Not Taken Not Taken Taken PowerPC 620 1st miss (Sta10 Fig 12.19) Computer Organization II, Autumn 2010, Teemu Kerola 22.11.2010





#### **Summary**

- Pipeline basics
  - Stage length, pipeline fill-up and drain times
  - Response time, throughput, speedup
- Hazards, dependencies
  - Structural, control, data (RAW, WAR, WAW)
  - How to avoid before time?
  - How to handle at run time?
- How to minimize branch costs?
  - Delayed branch, multiple pipeline streams, prefetch branch target, loop buffer, branch prediction

Computer Organization II, Autumn 2010, Teemu Kerola

22.11.2010 40



# **Review Questions**

- What information PSW needs to contain?
- Why 2-stage pipeline is not very beneficial?
- What elements effect the pipeline?
- What mechanisms can be used to handle branching?
- How does CPU move to interrupt handling?

Computer Organization II, Autumn 2010, Teemu Kerola