

























### Superscalar Execution

- Instruction fetch (käskyjen nouto)
  - Branch prediction (hyppyjen ennustus)
  - → prefetch (ennaltanouto) from memory to CPU
  - Dispatch to instruction window (valintaikkuna)
- Instruction issue (käskyn päästäminen hihnalle)
  - Check (and remove) data, control and resource dependencies
  - Reorder; issue suitable instructions to pipelines
  - Pipelines proceed without waits
- Instruction complete, retire (suoritus valmistuu)
  - Commit or abort (hyväksy tai hylkää)
    - Usually all state changes occur here
  - Check and remove write (output) dependencies and antidependencies
  - → wait / reorder (järjestä uudelleen)
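
One way to see how write dependencies and antidependencies can be removed is register renaming. The sketch below is only an illustration and does not model any particular processor: the function name `rename`, the `(dest, src1, src2)` instruction format and the physical register names `p0`, `p1`, ... are all assumptions made for this example.

```python
from itertools import count


def rename(instructions):
    """Give every register write a fresh physical register.

    instructions: list of (dest, src1, src2) architectural register names.
    Returns the same program expressed with physical register names.
    """
    fresh = count()   # source of unused physical register numbers
    mapping = {}      # architectural name -> current physical name
    renamed = []
    for dest, *srcs in instructions:
        new_srcs = []
        for src in srcs:
            # Sources read the latest mapping, so true (RAW) dependencies survive.
            if src not in mapping:
                mapping[src] = f"p{next(fresh)}"
            new_srcs.append(mapping[src])
        # A fresh destination removes write (WAW) and anti (WAR) dependencies.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], *new_srcs))
    return renamed


program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r4", "r1", "r2"),  # reads r1: true dependency on the first instruction
    ("r2", "r3", "r3"),  # writes r2: antidependency with the reads of r2 above
    ("r1", "r3", "r3"),  # writes r1: write dependency with the first instruction
]
renamed = rename(program)
```

After renaming, no two instructions write the same register, so the last two instructions can be reordered freely with respect to the first two. Only the true dependency between the first and second instruction remains.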

Computer Organization II, Autumn 2010, Teemu Kerola

29.11.2010

























### Generation of Pentium Pipeline μops



- a) Fetch IA-32 instruction from L2 cache and generate µops to L1
  - Uses Instruction Translation Lookaside Buffer (I-TLB)
  - and Branch Target Buffer (BTB)
    - four-way set-associative cache, 512 lines
  - 1-4 µops (118-bit, RISC-like) per instruction in most cases; longer sequences are stored in microcode ROM
- b) Trace Cache Next Instruction Pointer instruction selection
  - Dynamic branch prediction based on history (4-bit)
  - If no history available, Static branch prediction
    - backward: predict "taken"
    - forward: predict "not taken"
- c) Fetch instruction from L1-level trace cache
- d) Drive wait (instruction from trace cache to rename/allocator)
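
The prediction order in step b) can be sketched as follows. The static fallback (backward taken, forward not taken) is taken from the list above. The dynamic rule is an assumption made only for illustration: the slide states only that 4 bits of history are used, so the majority vote over the last four outcomes is a stand-in, not the actual Pentium 4 algorithm. The class and method names are hypothetical.

```python
class BranchPredictor:
    HISTORY_BITS = 4  # history length stated on the slide

    def __init__(self):
        # branch address -> list of the most recent outcomes (True = taken)
        self.history = {}

    def predict(self, branch_addr, target_addr):
        """Return True if the branch is predicted taken."""
        outcomes = self.history.get(branch_addr)
        if outcomes:
            # ASSUMED dynamic rule: majority of recorded outcomes, ties -> taken.
            return 2 * sum(outcomes) >= len(outcomes)
        # Static fallback: backward branch -> taken, forward branch -> not taken.
        return target_addr < branch_addr

    def update(self, branch_addr, taken):
        """Record the real outcome, keeping only the last HISTORY_BITS entries."""
        outcomes = self.history.setdefault(branch_addr, [])
        outcomes.append(taken)
        del outcomes[:-self.HISTORY_BITS]


predictor = BranchPredictor()
loop_branch = predictor.predict(branch_addr=0x100, target_addr=0x80)     # backward
skip_branch = predictor.predict(branch_addr=0x100, target_addr=0x180)    # forward
```

A backward branch typically closes a loop, which is why predicting "taken" is a reasonable default when no history exists yet.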






### Pentium Pipeline Window of Execution



- f) Micro-Op Queueing
  - 2 FIFO queues for μops
    - One for memory operations (load, store)
    - One for everything else
  - No dependencies; proceed when there is room in scheduling
- g) Micro-Op Scheduling
  - Retrieve μops from queue and dispatch (issue) for execution
  - Only when operands are ready (check from ROB entry)
- h) Dispatching
  - Check the first instructions of the FIFO queues (their ROB entries)
  - If the execution unit needed is free, dispatch to that unit
  - Two queues → out-of-order issue
  - Max 6 micro-ops dispatched in one cycle
    - ALU and FPU can handle 2 per cycle
    - Load and store each can handle 1 per cycle
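
The queueing and dispatch steps can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the real scheduler: the names `CAPACITY` and `dispatch_cycle` are hypothetical, the `ready` set stands in for the operand check against the ROB entries, and letting each queue release several head µops in one cycle is a modelling choice. The per-cycle capacities (2 + 2 + 1 + 1 = 6) come from the list above.

```python
from collections import deque

# Per-cycle capacity of each unit type, as listed on the slide.
CAPACITY = {"alu": 2, "fpu": 2, "load": 1, "store": 1}


def dispatch_cycle(mem_queue, other_queue, ready):
    """Dispatch µops for one cycle and return the names that were dispatched.

    Each queue holds (name, unit) pairs. `ready` is the set of µop names
    whose operands are available.
    """
    free = dict(CAPACITY)
    dispatched = []
    for queue in (mem_queue, other_queue):
        # FIFO: only the head may leave, and a blocked head blocks
        # its own queue only, never the other one.
        while queue:
            name, unit = queue[0]
            if name not in ready or free[unit] == 0:
                break
            queue.popleft()
            free[unit] -= 1
            dispatched.append(name)
    return dispatched


mem_queue = deque([("ld1", "load"), ("ld2", "load"), ("st1", "store")])
other_queue = deque([("add1", "alu"), ("add2", "alu"), ("add3", "alu"), ("fmul1", "fpu")])

# Cycle 1: ld1 is still waiting for an operand.
first = dispatch_cycle(mem_queue, other_queue, ready={"add1", "add2", "add3", "fmul1"})
# Cycle 2: every remaining µop has its operands.
second = dispatch_cycle(mem_queue, other_queue, ready={"ld1", "ld2", "st1", "add3", "fmul1"})
```

In the first cycle the memory queue is stuck behind `ld1`, yet `add1` and `add2` still leave the other queue, which is the out-of-order effect of having two queues. `add3` has to wait because both ALU slots are used, and in the second cycle `ld2` waits because only one load can be dispatched per cycle.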




### Pentium Pipeline Integer and FP Units



- i) Get data from register or L1 cache
- j) Execute instruction, set flags (lipuke)
  - Several pipelined execution units
    - 4 × ALU, 2 × FPU, 2 × load/store
    - E.g. fast ALU for simple ops, own ALU for multiplications
  - Result storing: in-order complete
  - Update ROB, allow next instruction to the unit
- k) Branch check
  - What happened in the jump/branch instruction?
  - Was the prediction correct?
  - Abort incorrect instructions from the pipeline (no result storing)
- l) Drive: update BTB with the branch result
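
In-order completion and the abort after a wrong prediction can be sketched with a toy reorder buffer. This is a minimal illustration under assumptions: the class `ReorderBuffer`, its methods and the entry fields are hypothetical names, and real hardware tracks far more state per entry.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest entry at the left

    def allocate(self, name, dest=None):
        """Add an entry in program order and return it."""
        entry = {"name": name, "dest": dest, "value": None,
                 "done": False, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def retire(self, registers):
        """Commit finished entries from the head, in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                registers[entry["dest"]] = entry["value"]  # result storing
            retired.append(entry["name"])
            if entry["mispredicted"]:
                # Branch check failed: abort all younger entries unstored.
                self.entries.clear()
                break
        return retired


rob = ReorderBuffer()
registers = {}
add = rob.allocate("add", dest="r1")
branch = rob.allocate("br")
mul = rob.allocate("mul", dest="r2")   # fetched along the predicted path

# mul finishes first, but it cannot retire ahead of the older entries.
mul.update(done=True, value=6)
early = rob.retire(registers)

# add finishes; the branch resolves and turns out to be mispredicted.
add.update(done=True, value=5)
branch.update(done=True, mispredicted=True)
late = rob.retire(registers)
```

Although `mul` executed out of order, its result never reaches the register file, because the branch ahead of it was mispredicted and the younger entries were discarded.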




### Pentium 4 Hyperthreading



- One physical IA-32 CPU, but 2 logical CPUs
- Instructions from 2 processes in the same pipeline
- OS sees it as a 2-CPU SMP (symmetric multiprocessing) system
  - Logical processors execute different processes or threads
  - No code-level issues
  - OS must be capable of handling multiple processors (e.g., scheduling, locks)
- Uses CPU wait cycles
  - Cache miss, dependencies, wrong branch prediction
- If one logical CPU uses FP unit, then the other one can use INT unit
  - Benefits depend on the applications
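
The idea of filling wait cycles can be shown with a toy model. It is a deliberately crude sketch with invented assumptions: a single shared issue slot per cycle, threads written as strings where `X` is a µop that needs the slot and `-` is a wait cycle (for example a cache miss), and a simple alternating priority. The function name `run_smt` is hypothetical.

```python
def run_smt(thread_a, thread_b):
    """Count the cycles needed to finish both threads on one shared issue slot."""
    threads = [thread_a, thread_b]
    pos = [0, 0]      # next position to execute in each thread
    cycles = 0
    prefer = 0        # thread that gets first pick this cycle
    while pos[0] < len(thread_a) or pos[1] < len(thread_b):
        cycles += 1
        slot_used = False
        for t in (prefer, 1 - prefer):
            if pos[t] >= len(threads[t]):
                continue                  # this thread has finished
            if threads[t][pos[t]] == "-":
                pos[t] += 1               # waiting needs no issue slot
            elif not slot_used:
                pos[t] += 1               # issue one µop
                slot_used = True
        prefer = 1 - prefer               # alternate priority between threads
    return cycles


stalling = run_smt("X--X", "XX-X")   # both threads have wait cycles
busy = run_smt("XX", "XX")           # neither thread ever waits
```

Run one after the other, the two stalling threads would need 4 + 4 = 8 cycles, while the shared pipeline in this model finishes them in 5. The two busy threads need 4 cycles either way, which matches the remark that the benefit depends on the application.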




### Pentium 4 Hyperthreading



Intel Nehalem architecture: 8 cores on one chip, 1-16 threads (820 million transistors). First launched processor: Core i7 (Nov 2008).

- Duplicated (kahdennettu)
  - IP, EFLAGS and other control registers
  - Instruction TLB
  - Register renaming logic
- Split (puolitettu)
  - No monopoly, non-even split allowed
  - Reordering buffers (ROB)
  - Micro-op queues
  - Load/store buffers
- Shared (jaettu)
  - Register files (128 GPRs, 128 FPRs)
  - Caches: trace cache, L1, L2, L3
  - Registers needed during µops execution
  - Functional units: 2 ALU, 2 FPU, 2 ld/st-units




























### Review Questions

- Differences and similarities between superscalar and traditional pipelines?
- What new problems must be solved?
- How are they solved?
- What is register renaming and why is it used?
