







### Superscalar processor

- Efficient memory usage
  - Fetch several instructions at once, prefetching (ennaltanouto)
  - Data fetch and store (read and write)
  - Concurrency
- Several instructions of the <u>same</u> process executed concurrently on different pipelines
  - Select an executable instruction from the prefetched ones according to a policy (in-order issue/out-of-order issue)
- Finish more than one instruction during each cycle
  - Instructions may complete in a different order than they started (out-of-order completion)
- When can an instruction finish before the preceding ones?

Computer Organization II, Spring 2010, Tiina Niklander



# Dependencies (riippuvuus)

- True data/flow dependency (datariippuvuus)
  - Read after write (RAW)
  - The latter instruction needs data from the former instruction
  - Example: `add r1,r2` followed by `move r3,r1`
- Procedural/control dependency (kontrolliriippuvuus)
  - The instruction after a jump is executed only when the jump is not taken
  - Example: `JNZ R2, 100` followed by `ADD R1, =1`
  - A superscalar pipeline has more instructions to waste
  - Variable-length instructions: some additional parts are known only during execution
- Resource conflict (resurssiriippuvuus)
  - Two or more pipeline stages need the same resource
  - Memory buffer, ALU, access to register file, ...






# Dependencies specific to out-of-order completion

- Output dependency (kirjoitusriippuvuus)
  - Write-after-write (WAW)
  - Two instructions alter the same register or memory location; the value written by the latter instruction in the original code must remain
  - Example: `load r1,X`, `add r2,r1,r3`, `add r1,r4,r5` (the first and the third instruction both write r1)
- Antidependency (antiriippuvuus)
  - Write-after-read (WAR)
  - The former (reading) instruction must be able to fetch the register content before the latter (writing) instruction stores a new value there
  - Example: `move r2,r1`, `add r1,r4,r5` (the first instruction reads r1, the second writes it)

store R5, 40(R1)

#### Alias?

- Do two registers hold indirect references to the same memory location?
- Different virtual address, same physical address?
- What is visible on instruction level (before MMU)?


16.2.2010



### **Dependencies**

NOTE: In the newest (8th) edition, the definitions in Ch. 14 are incorrect (see the errata); the definitions in Ch. 12 are correct.

- i: "write" R1 ... j: "read" R1: data dependency (RAW)
- i: "read" R1 ... j: "write" R1: antidependency (WAR)
- i: "write" R1 ... j: "write" R1: output dependency (WAW)

In a data dependency, instruction j cannot be executed before instruction i!

Anti- and output dependencies allow the execution order of instructions i and j to change, but afterwards it must be checked that the right value and result remain.
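
The three definitions can be checked mechanically. Below is a minimal Python sketch, assuming each instruction is reduced to the set of registers it reads and the set it writes; the names `Instr`, `instr` and `classify` are illustrative and do not come from the course material. It only compares a pair of instructions and ignores any instruction in between.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """One instruction, reduced to the registers it reads and writes."""
    text: str
    reads: frozenset
    writes: frozenset


def instr(text, reads=(), writes=()):
    """Convenience constructor (hypothetical helper)."""
    return Instr(text, frozenset(reads), frozenset(writes))


def classify(i: Instr, j: Instr) -> set:
    """Dependencies of the later instruction j on the earlier instruction i."""
    deps = set()
    if i.writes & j.reads:
        deps.add("RAW")  # true data dependency: j reads what i wrote
    if i.reads & j.writes:
        deps.add("WAR")  # antidependency: j overwrites what i reads
    if i.writes & j.writes:
        deps.add("WAW")  # output dependency: both write the same register
    return deps


# The load/add/add sequence from the output-dependency slide.
load = instr("load r1,X", writes=["r1"])
add1 = instr("add r2,r1,r3", reads=["r1", "r3"], writes=["r2"])
add2 = instr("add r1,r4,r5", reads=["r4", "r5"], writes=["r1"])

print(classify(load, add1))  # {'RAW'}
print(classify(add1, add2))  # {'WAR'}
print(classify(load, add2))  # {'WAW'}
```

Only the RAW result forces an order; the WAR and WAW results are the name-based dependencies that register renaming can remove.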




#### How to handle dependencies?

- Starting point
  - All dependencies must be handled one way or another
- Simple solution (as before)
  - Special hardware detects the dependency and forces the pipeline to wait (bubble)
- Alternative solution
  - Compiler generates instructions in such a way that there will be NO dependencies
  - No special hardware
    - simpler CPU that need not detect dependencies
  - Compiler must have very detailed and specific information about the target processor's functionality
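
The two approaches can be compared with a small sketch. It assumes a deliberately simplified model that is not taken from the slides: one instruction issues per cycle, in order, and a result becomes readable `latency` cycles after its producer issued. The function name `schedule_with_bubbles` and the tuple encoding `(dest, sources)` are illustrative.

```python
def schedule_with_bubbles(program, latency=2):
    """Return the issue cycle of each instruction in `program`.

    program: list of (dest, sources) tuples in program order.
    Assumed model: single issue, in order; a result is readable
    `latency` cycles after its producer was issued.
    """
    ready_at = {}       # register -> first cycle its new value can be read
    issue_cycles = []
    cycle = 0
    for dest, sources in program:
        # Hardware interlock: stall until every source operand is available.
        earliest = max([cycle] + [ready_at.get(r, 0) for r in sources])
        issue_cycles.append(earliest)
        ready_at[dest] = earliest + latency
        cycle = earliest + 1
    return issue_cycles


def bubbles(issue_cycles):
    """Number of issue slots left empty."""
    return issue_cycles[-1] + 1 - len(issue_cycles)


original = [
    ("r1", ["r2"]),   # add  r1,r2
    ("r3", ["r1"]),   # move r3,r1   (RAW on r1)
    ("r4", ["r5"]),   # an independent instruction (hypothetical)
]
# What a scheduling compiler could emit: the independent instruction
# fills the slot between producer and consumer.
reordered = [original[0], original[2], original[1]]

print(schedule_with_bubbles(original), bubbles(schedule_with_bubbles(original)))    # [0, 2, 3] 1
print(schedule_with_bubbles(reordered), bubbles(schedule_with_bubbles(reordered)))  # [0, 1, 2] 0
```

In the first run the hardware inserts one bubble; in the second the reordered code needs none, which is why the compiler-only approach requires exact knowledge of the target pipeline's latencies.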




# Parallelism (rinnakkaisuus)

- Instruction-level parallelism (käskytason rinnakkaisuus)
  - Independent instructions of a sequence can be executed in parallel by overlapping
  - Example: `load r1 ← r2`, `add r3 ← r3+1`, `add r4 ← r4, r2` are independent of each other
  - Theoretical upper limit for parallel execution of instructions
    - Depends on the code itself
- Machine parallelism (konetason rinnakkaisuus)
  - Ability of the processor to execute instructions in parallel
  - How many instructions can be fetched and executed at the same time?
  - ~ How many pipelines can be used
  - The parallelism actually achieved is never greater than the instruction-level parallelism
    - Cannot exceed what the instructions allow, but can limit the true parallelism
    - Dependencies, bad optimization?
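
One way to make the "theoretical upper limit" concrete is to count dataflow levels. The sketch below assumes that only true (RAW) dependencies limit parallelism, that every instruction takes one step, and that there are unlimited pipelines; the function name `ilp` and the second, dependent example sequence are illustrative.

```python
def ilp(program):
    """Return (levels, parallelism) for a list of (dest, sources) tuples.

    levels[k] is the earliest step in which instruction k can execute if
    only RAW dependencies matter; parallelism = instructions / depth.
    """
    last_writer = {}   # register -> index of the most recent writer
    levels = []
    for idx, (dest, sources) in enumerate(program):
        producers = [last_writer[r] for r in sources if r in last_writer]
        levels.append(1 + max((levels[p] for p in producers), default=0))
        last_writer[dest] = idx
    return levels, len(program) / max(levels)


# The three independent instructions from the slide.
independent = [("r1", ["r2"]), ("r3", ["r3"]), ("r4", ["r4", "r2"])]
# A hypothetical chain in which each instruction needs the previous result.
dependent = [("r3", ["r3"]), ("r4", ["r3", "r2"]), ("r5", ["r4"])]

print(ilp(independent))  # ([1, 1, 1], 3.0)
print(ilp(dependent))    # ([1, 2, 3], 1.0)
```

A processor with two pipelines could exploit at most two of the three parallel instructions in the first sequence, and gains nothing on the second.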






# **Superscalar execution**

- Instruction fetch (käskyjen nouto)
  - Branch prediction (*hyppyjen ennustus*)
    - → prefetch (ennaltanouto) from memory to CPU
  - Instruction window (valintaikkuna)
    - ~ set of fetched instructions
- Instruction dispatch/issue (käskyn päästäminen hihnalle)
  - Check (and remove) data, control and resource dependencies
  - Reorder; dispatch the suitable instructions to pipelines
  - Pipelines proceed without waits
  - If no suitable instruction, wait here
- Instruction complete, retire (suoritus valmistuu)
  - Commit or abort (hyväksy tai hylkää)
  - Check and remove write (output) dependencies and antidependencies
  - → wait / reorder (järjestä uudelleen)




# In-order issue, in-order complete

- Traditional sequential execution order
- No need for instruction window
- Instructions dispatched to pipelines in original order
  - Compiler handles most of the dependencies
  - Still need to check dependencies, if needed add bubbles
  - Can allow overlapping on multiple pipelines
- Instructions complete and commit in original order
  - An instruction cannot pass or overtake (ohittaa) earlier ones on another pipeline
  - Several instructions can complete at same time
  - Commit/Abort
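
The policy can be sketched as a small simulation. It assumes two pipelines, a one-cycle result latency, and instructions encoded as `(dest, sources)` tuples; the function name `in_order_issue` and the example program are illustrative, not taken from the course material.

```python
def in_order_issue(program, width=2, latency=1):
    """Return the issue cycle of each instruction under in-order issue.

    Up to `width` instructions issue per cycle, strictly in program order.
    If the next instruction has an operand that is not ready, issue stops
    for this cycle: later instructions may not overtake it.
    """
    ready_at = {}        # register -> first cycle its value can be read
    issue_cycle = []
    cycle = 0
    i = 0
    while i < len(program):
        issued = 0
        while i < len(program) and issued < width:
            dest, sources = program[i]
            if any(ready_at.get(r, 0) > cycle for r in sources):
                break    # bubble: wait for the operand
            issue_cycle.append(cycle)
            ready_at[dest] = cycle + latency
            issued += 1
            i += 1
        cycle += 1
    return issue_cycle


program = [
    ("r1", ["r2"]),   # I1
    ("r3", ["r1"]),   # I2: RAW on r1, must wait for I1
    ("r4", ["r5"]),   # I3: independent, but queued behind I2
    ("r6", ["r7"]),   # I4: independent
]
print(in_order_issue(program))  # [0, 1, 1, 2]
```

I3 is ready in cycle 0 but is held back because I2 blocks the head of the sequence; an out-of-order issue policy would fill that empty slot.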












# Register renaming (rekistereiden uudelleennimeäminen)

- One cause of some dependencies is the use of names
  - The same name (register) can be used for several independent values
  - Thus, instructions have unnecessary write dependencies and antidependencies
  - These cause unnecessary waits
- Solution: Register renaming
  - Hardware must have more registers (than visible to the programmer and compiler)
  - Hardware allocates new real registers during execution in order to avoid name-based dependencies (nimiriippuvuus)
- Need
  - More internal registers (register files, register set), e.g. Pentium II has 40 working registers
  - Hardware that is capable of allocating and managing registers and performing the needed mapping
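
A minimal sketch of the mapping step follows. It assumes a register alias table that maps each architectural name to its current physical register, plus a free list; the physical names `p0`, `p1`, ... and the function `rename` are illustrative. Returning physical registers to the free list at commit time is left out.

```python
def rename(program, arch_regs, n_physical):
    """Rename the (dest, sources) tuples in `program` to physical registers."""
    # Initial mapping: every architectural register owns one physical register.
    rat = {r: f"p{k}" for k, r in enumerate(arch_regs)}
    free = [f"p{k}" for k in range(len(arch_regs), n_physical)]
    renamed = []
    for dest, sources in program:
        # Sources use the mapping that is valid BEFORE this instruction writes.
        new_sources = [rat[r] for r in sources]
        if not free:
            raise RuntimeError("no free physical register: issue must wait")
        rat[dest] = free.pop(0)      # every write gets a fresh register
        renamed.append((rat[dest], new_sources))
    return renamed


# The antidependency example: move r2,r1 ; add r1,r4,r5
program = [("r2", ["r1"]), ("r1", ["r4", "r5"])]
print(rename(program, arch_regs=["r1", "r2", "r4", "r5"], n_physical=8))
# [('p4', ['p0']), ('p5', ['p2', 'p3'])]
```

After renaming, the `move` reads `p0` while the `add` writes `p5`, so the WAR dependency on `r1` is gone and the two instructions can execute in either order.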


[Register renaming example; only fragments survive: `R3 + 1`, `← R5 + 1`, `R7 ← R3 + R4`]









# **Computer Organization II**

# Pentium 4








#### Sta09 Fig 14.9a-d



#### Generation of µops

- a) Fetch IA-32 instructions from the L2 cache and generate µops to L1
  - Uses the Instruction Translation Lookaside Buffer (I-TLB)
  - and the Branch Target Buffer (BTB)
    - Four-way set-associative cache, 512 lines
  - 1-4 µops (= 118-bit RISC instructions) per instruction in most cases; if more are needed, they are stored in the microcode ROM
- b) Trace cache next instruction pointer: instruction selection
  - Dynamic branch prediction based on history (4-bit)
  - If no history is available, static branch prediction
    - backward, predict "taken"
    - forward, predict "not taken"
- c) Fetch instructions from the L1-level trace cache
- d) Drive wait (instruction from trace cache to rename/allocator)
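
The static fallback rule in step b) is simple enough to write down. The sketch below is only an illustration: the real predictor keeps 4 bits of history per branch, whereas here `history` is a hypothetical dictionary that maps a branch address to a single predicted outcome, and both function names are assumptions.

```python
def static_predict_taken(branch_addr: int, target_addr: int) -> bool:
    """Static rule: backward branches (typically loops) are predicted taken,
    forward branches are predicted not taken."""
    return target_addr <= branch_addr


def predict_taken(history: dict, branch_addr: int, target_addr: int) -> bool:
    """Use recorded history when there is any, else the static rule."""
    if branch_addr in history:
        return history[branch_addr]
    return static_predict_taken(branch_addr, target_addr)


history = {0x200: False}   # this branch was recorded as "not taken"
print(predict_taken(history, 0x120, 0x100))  # True:  backward, no history
print(predict_taken(history, 0x100, 0x120))  # False: forward, no history
print(predict_taken(history, 0x200, 0x180))  # False: history overrides the static rule
```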




#### Sta09 Fig 14.9e

#### Resource allocation



- e) Allocate resources
  - 3 micro-operations per cycle
    - Allocate an entry from Reorder Buffer (ROB) for the μops (126 entries available)
    - Allocate one of the 128 internal work registers for the result
    - And, possibly, one load (of 48) OR store (of 24) buffer
- f) Register renaming
  - Clear 'name dependencies' by remapping registers (16 architectural regs to 128 physical registers)
  - If no free resource, wait (→ out-of-order)
- A ROB entry contains bookkeeping of the instruction's progress
  - Micro-operation and the address of the original IA-32 instruction
  - State: scheduled, dispatched, completed, ready
  - Register Alias Table (RAT): which IA-32 register → which physical register
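
The bookkeeping role of the ROB can be sketched as a bounded queue. This is a simplified illustration, not the Pentium 4 design: the class name `ReorderBuffer`, the dictionary-based entries and the use of only two of the listed states are assumptions; the default capacity of 126 is the figure given above.

```python
from collections import deque


class ReorderBuffer:
    """Bounded FIFO of in-flight µops; the oldest entry is at the left."""

    def __init__(self, capacity=126):
        self.capacity = capacity
        self.entries = deque()

    def allocate(self, uop):
        """Reserve an entry in program order, or return None if the ROB is full."""
        if len(self.entries) >= self.capacity:
            return None                      # no free resource: wait
        entry = {"uop": uop, "state": "scheduled"}
        self.entries.append(entry)
        return entry

    def retire(self):
        """Remove completed µops from the head only, i.e. in program order."""
        retired = []
        while self.entries and self.entries[0]["state"] == "completed":
            retired.append(self.entries.popleft()["uop"])
        return retired


rob = ReorderBuffer(capacity=4)
a, b, c = (rob.allocate(name) for name in ("a", "b", "c"))

b["state"] = "completed"      # b finishes first (out-of-order execution)
print(rob.retire())           # []          a is still in flight, b must wait
a["state"] = "completed"
print(rob.retire())           # ['a', 'b']  retired in original order
```

Execution may finish in any order, but results leave the buffer only from the head, which is what keeps completion in order.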




#### **Window of Execution**



- g) Micro-Op Queueing
  - 2 FIFO queues for µops
    - One for memory operations (load, store)
    - One for everything else
  - No dependencies, proceed when room in scheduling
- h) Micro-Op Scheduling
  - Retrieve µops from queue and dispatch for execution
  - Only when operands ready (check from ROB-entry)
- i) Dispatching
  - Check the first instructions of FIFO-queues (their ROB-entries)
  - If execution unit needed is free, dispatch to that unit
  - Two queues → out-of-order issue
  - max 6 micro-ops dispatched in one cycle
    - ALU and FPU can handle 2 per cycle
    - Load and store each can handle 1 per cycle
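
Why two FIFO queues already give out-of-order issue can be seen in a small sketch. It is heavily simplified: only the head of each queue is examined and at most one µop per queue is dispatched per cycle (the figure above is up to 6 per cycle). The function `dispatch_cycle` and its two callback parameters are illustrative assumptions.

```python
from collections import deque


def dispatch_cycle(mem_queue, other_queue, operands_ready, unit_free):
    """Dispatch the head µop of each FIFO queue if it can execute now.

    operands_ready(uop) and unit_free(uop) are caller-supplied checks that
    stand in for the ROB lookup and the execution-unit status.
    """
    dispatched = []
    for queue in (mem_queue, other_queue):
        if queue and operands_ready(queue[0]) and unit_free(queue[0]):
            dispatched.append(queue.popleft())
    return dispatched


mem_queue = deque(["load1", "store1"])     # memory operations
other_queue = deque(["add1", "mul1"])      # everything else
ready = {"add1"}                           # load1 still waits for its address

out = dispatch_cycle(mem_queue, other_queue,
                     operands_ready=lambda u: u in ready,
                     unit_free=lambda u: True)
print(out)               # ['add1']
print(list(mem_queue))   # ['load1', 'store1']
```

Each queue is strictly first-in, first-out, yet `add1` is dispatched while `load1` still waits at the head of the other queue, so the overall issue order differs from program order.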




## **Integer and FP Units**



- j) Get data from register or L1 cache
- k) Execute instruction, set flags (lipuke)
  - Several pipelined execution units
    - 4 × ALU, 2 × FPU, 2 × load/store
    - E.g. fast ALU for simple ops, own ALU for multiplications
  - Result storing: in-order complete
  - Update ROB, allow next instruction to the unit
- l) Branch check
  - What happened in the jump/branch instruction?
  - Was the prediction correct?
  - Abort incorrect instructions from the pipeline (no result storing)
- m) Drive: update BTB with the branch result






## **Pentium 4 Hyperthreading**

- One physical IA-32 CPU, but 2 logical CPUs
- The OS sees it as a 2-CPU SMP (symmetric multiprocessing) system
  - The logical processors execute different processes or threads
  - No code-level issues
  - The OS must be capable of handling more processors (e.g. scheduling, locks)
- Uses CPU wait cycles
  - Cache miss, dependencies, wrong branch prediction
- If one logical CPU uses FP unit the other one can use INT unit
  - Benefits depend on the applications




# **Pentium 4 Hyperthreading**



Intel Nehalem architecture: 8 cores on one chip, 1-16 threads (820 million transistors). First launched processor: Core i7 (Nov 2008).

- Duplicated (kahdennettu)
  - IP, EFLAGS and other control registers
  - Instruction TLB
  - Register renaming logic
- Split (puolitettu)
  - No monopoly, non-even split allowed
  - Reordering buffers (ROB)
  - Micro-op queues
  - Load/store buffers
- Shared (jaettu)
  - Register files (128 GPRs, 128 FPRs)
  - Caches: trace cache, L1, L2, L3
  - Registers needed during µops execution
  - Functional units: 2 ALU, 2 FPU, 2 ld/st units





# **ARM Cortex-A8**




#### **ARM CORTEX-A8**

- ARM refers to the Cortex-A8 as an application processor
- Embedded processor running a complex operating system
  - Wireless, consumer and imaging applications
  - Mobile phones, set-top boxes, gaming consoles, automotive navigation/entertainment systems
- Three functional units
- Dual, in-order-issue, 13-stage pipeline
  - Keep power required to a minimum
  - Out-of-order issue needs extra logic consuming extra power
- Separate SIMD (single-instruction-multiple-data) unit called NEON
  - 10-stage pipeline














# **Integer Execution Unit**

- Two symmetric (ALU) pipelines, an address generator for load and store instructions, and a multiply pipeline
- Multiply unit instructions are routed to pipe0
  - Performed in stages E1 through E3
  - Multiply accumulate operation in E4
- E0 Access register file
  - Up to six registers for two instructions
- E1 Barrel shifter if needed.
- E2 ALU function
- E3 If needed, completes saturation arithmetic
- E4 Change in control flow prioritized and processed
- E5 Results written back to register file
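
Saturation arithmetic, completed in stage E3, clamps a result to the representable range instead of letting it wrap around. The sketch below assumes 32-bit signed integers; the function names are illustrative and it shows only the arithmetic, not the hardware.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def saturating_add32(a: int, b: int) -> int:
    """Add two 32-bit signed values and clamp the result to the int32 range."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


def wrapping_add32(a: int, b: int) -> int:
    """Ordinary two's-complement addition, for comparison."""
    return (a + b + 2**31) % 2**32 - 2**31


print(saturating_add32(INT32_MAX, 1))  # 2147483647  (stays at the maximum)
print(wrapping_add32(INT32_MAX, 1))    # -2147483648 (wraps around)
```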






# Load/store pipeline

- Parallel to integer pipeline
- E1 Memory address generated from base and index register
- E2 address applied to cache arrays
- E3 load, data returned and formatted
- E3 store, data are formatted and ready to be written to cache
- E4 Updates L2 cache, if required
- E5 Results are written to register file




## SIMD and Floating-Point Pipeline

- SIMD and floating-point instructions pass through integer pipeline
- Processed in separate 10-stage pipeline
  - NEON unit
  - Handles packed SIMD instructions
  - Provides two types of floating-point support
- If implemented, vector floating-point (VFP) coprocessor performs IEEE 754 floating-point operations
  - If not, separate multiply and add pipelines implement floating-point operations






# Review Questions / Kertauskysymyksiä

- What are the differences and similarities between a superscalar and a traditional pipeline?
- What new problems must be solved?
- How are those problems solved?
- What is register renaming and why is it used?
