#### **EPIC** (Explicit Parallel Instruction Computing)

- Parallelism is explicit in the machine instructions, not hidden inside the hardware (processor)
- New semantics for machine instructions
- The compiler resolves dependencies and decides what is executed in parallel; the processor just trusts it
- VLIW (Very Long Instruction Word)
  - Instructions are handled in bundles
- Branch predication, control speculation
  - Speculative execution of both (all) branch targets
- Speculative loading of data

Linus's comment on IA-64 (2005): http://www.realworldtech.com/forums/index.cfm?action=detail&id=60298&threadid=60123&roomid=20123

Computer Organization II, Spring 2009, Tiina Niklander, 20.4.2009

## Assembly-language format

`[qp] mnemonic[.comps] dests = srcs`

- `qp`: qualifying predicate register
  - if the predicate register value is 1 (true), the result is committed
- `mnemonic`: name of the instruction (operation)
- `comps`: completers, separated by periods
  - some instructions have extra parts that qualify them
- `dests`: destination operands, separated by commas
- `srcs`: source operands, separated by commas

#### **Assembly-language format**

- Instruction group boundaries (stops) are marked with `;;`
  - The instruction bundle template also encodes the stops (the "black line")
- Hint: the instructions of a group can be executed in parallel
- No data or output dependency within a group
  - no read after write (RaW)
  - no write after write (WaW)
- What about antidependency (WaR)?
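The group rule above (no RaW, no WaW inside a group) can be checked mechanically. The following is a minimal sketch, assuming a simplified instruction model: the `Instr` tuple and the `group_hazards` helper are made up for this illustration and are not part of any IA-64 tool. Only register dependencies are considered, and a store's address register is treated as a source. WaR pairs are deliberately not reported, since the slide lists only RaW and WaW as forbidden. The test data is the `ld8`/`sub`/`add`/`st8` example from this slide.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical, simplified instruction: the registers it writes and reads."""
    text: str
    dests: frozenset
    srcs: frozenset


def make(text, dests, srcs):
    return Instr(text, frozenset(dests), frozenset(srcs))


def group_hazards(group):
    """Return (kind, earlier, later) for every RaW or WaW pair inside one group."""
    hazards = []
    for i, first in enumerate(group):
        for second in group[i + 1:]:
            if first.dests & second.srcs:      # later instruction reads an earlier result
                hazards.append(("RaW", first.text, second.text))
            if first.dests & second.dests:     # both write the same register
                hazards.append(("WaW", first.text, second.text))
    return hazards


ld8 = make("ld8 r1 = [r5]", {"r1"}, {"r5"})
sub = make("sub r6 = r8, r9", {"r6"}, {"r8", "r9"})
add = make("add r3 = r1, r4", {"r3"}, {"r1", "r4"})
st8 = make("st8 [r6] = r12", set(), {"r6", "r12"})

legal = [[ld8, sub], [add, st8]]     # stop after sub: two groups
illegal = [[ld8, sub, add, st8]]     # no stop: everything in one group

print([group_hazards(g) for g in legal])
print(group_hazards(illegal[0]))
```

With the stop in place both groups are clean; without it the checker reports that `add` needs the result of `ld8` and `st8` needs the result of `sub`.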
Example with two instruction groups:

    ld8 r1 = [r5]        // first group
    sub r6 = r8, r9 ;;   // first group, stop
    add r3 = r1, r4      // second group
    st8 [r6] = r12       // second group, memory address in r6

## Key mechanisms

- Predicated execution
- Control speculation (= speculative loading)
- Data speculation
- Software pipelining

Intel slides: http://www.cs.helsinki.fi/u/kerola/tikra/IA64-Architecture.pdf

#### Predicated execution

#### Compiler

- Creates the bundles and sets the template
  - Describes which instructions can be executed in parallel
  - Instruction execution order within one bundle is undetermined
- Eliminates branches, e.g. if-then-else 'jumps'
  - Assigns a separate predicate register to each branch
  - Both branches could be executed (in parallel?)

#### **CPU**

Intel slide 18

- Starts executing both branches
  - Even before the condition result is known!
- Checks the predicates when the compare outcome is known
  - Discards the results of the unselected branch
  - Commits the results of the selected branch
- Is the predicate always ready at instruction commit?

#### **Speculative loading**

- Start the data load in advance
  - = speculative load (even if it is unclear whether the data is needed)
  - Data is ready at the processor when needed, no latency
- Straightforward unless there is a branch or a store between the load and the use
- Branching? Control speculation
  - A speculative load could cause an exception (or page fault) that should not have happened at all
- Store? Data speculation
  - The speculative load could read the same memory location that the store is about to change ...
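The data-speculation problem in the last bullet can be illustrated with a toy model. This is only a sketch under simplifying assumptions: memory is a Python dict, the function names are invented for the illustration, and the "check" is a plain address comparison rather than a model of the real hardware. On IA-64 the corresponding pieces are an advanced load (`ld.a`) and a later check (`ld.c` or `chk.a`).

```python
def run_in_order(mem, store_addr, store_val, load_addr):
    """Original program order: the store first, then the load."""
    mem = dict(mem)
    mem[store_addr] = store_val
    return mem[load_addr]


def run_hoisted(mem, store_addr, store_val, load_addr, check=True):
    """Load moved above the store, optionally followed by a check and reload."""
    mem = dict(mem)
    r5 = mem[load_addr]               # speculative load, done early
    mem[store_addr] = store_val       # the store that originally came first
    if check and store_addr == load_addr:
        r5 = mem[load_addr]           # conflict detected: redo the load
    return r5


mem = {100: 1, 200: 2}

# Different addresses: hoisting is harmless.
print(run_in_order(mem, 100, 9, 200), run_hoisted(mem, 100, 9, 200))

# Same address: only the checked version matches the original order.
print(run_in_order(mem, 100, 9, 100),
      run_hoisted(mem, 100, 9, 100),
      run_hoisted(mem, 100, 9, 100, check=False))
```

The unchecked variant returns the stale value when the store and the load hit the same address, which is exactly why the hoisted load has to be marked and verified.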
Examples: a store before the load (data speculation)

    Store R1, (R3)
    Load  R5, (R4)

and a branch before the load (control speculation)

    Comp  R1, =Limit
    JLE   Done
    Load  R5, Table(R1)

Intel slide 26

#### **Control Speculation**

- = Hoist the load instruction to an earlier point in the code, before the branch instruction
  - Mark it speculative (`.s`)
  - If the speculative load causes an exception, delay it (NaT bit)
  - It is possible that the exception should never happen!
- Add a `chk.s` instruction at the original location. It checks for a deferred exception and starts a recovery routine

#### Software pipeline

Why is it called a software pipeline?

- Hardware support allows parallel execution of loop instructions
- Parallel execution is achieved by executing instructions from different iterations at the same time
- Each iteration uses different registers
  - Automatic register renaming
- The prolog and epilog are special cases handled by rotating predicate registers
- The "loop jump" is replaced by a special loop-termination instruction that controls the pipeline
  - Rotates the registers, decrements the loop count

## Application register set

Sta06 Fig 15.7

- General registers (128), FP registers (128), predicates (64)
  - Some static, some rotating (automatic renaming by hardware)
  - Some general registers are used as a stack
- Branch registers (8)
  - The target address can be in a register (indirect jump!)
  - The subroutine return address is normally stored in register br0
  - If there is a new call before the return, br0 is stored in the register stack
- Instruction pointer
  - Bundle address of the current instruction, not the address of a single instruction
- User mask
  - Flags (single-bit values) for traps and performance monitoring
- Performance monitor data registers
  - Support the monitoring hardware
  - Information about the hardware, e.g. branch predictions, usage of the register stack, memory access delays, ...
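The rotating registers mentioned above are what make the software pipeline work: a value written to a logical register in one iteration shows up under the next register name after the rotation. The sketch below is a toy model under stated assumptions: the class `RotatingRegs`, its size and the two-stage loop are invented for illustration, and the `if` conditions stand in for the rotating stage predicates that switch stages off in the prolog and epilog.

```python
class RotatingRegs:
    """Toy rotating register region; logical names are offsets 0..size-1."""

    def __init__(self, size):
        self.phys = [None] * size
        self.rrb = 0                      # register rename base

    def _index(self, logical):
        return (logical + self.rrb) % len(self.phys)

    def read(self, logical):
        return self.phys[self._index(logical)]

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def rotate(self):
        """Done by the loop-closing branch: decrement the rename base."""
        self.rrb = (self.rrb - 1) % len(self.phys)


def pipelined_double(xs):
    """Two-stage pipeline: stage 1 'loads' x[i], stage 2 doubles x[i-1]."""
    regs = RotatingRegs(4)
    out = []
    for step in range(len(xs) + 1):       # one extra step drains the pipeline
        if step < len(xs):                # stage 1 is off during the epilog
            regs.write(0, xs[step])
        if step >= 1:                     # stage 2 is off during the prolog
            out.append(2 * regs.read(1))  # last step's register 0 is now register 1
        regs.rotate()
    return out


print(pipelined_double([1, 2, 3]))
```

Both stages always use the same logical register names; the renaming keeps the iterations from overwriting each other's values.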
#### Register stack, Register Stack Engine

Intel slides 15-17

- r0..r31 for global variables
- r32..r127 (96 in total) for subroutine calls
- A call reserves a frame (a set of registers in a register window)
  - parameters (inputs/outputs) + local variables
  - The size is set dynamically (`alloc` instruction)
- Registers are automatically renamed after the call
  - Subroutine parameters always start from register r32
- Allocated in a circular-buffer fashion
  - If the area is full, the hardware moves the register contents of the oldest frame to memory (= backing store). They are restored when the subroutine returns
  - The memory address is in the registers BSP and BSPSTORE (backing store pointers)

## Register stack

- Allocation and restoring use two dedicated registers
- CFM, Current Frame Marker
  - Size of the most recently allocated area
    - sof = size of frame, sol = size of locals,
    - sor = size of rotating portion (SW pipeline)
  - GR/FP/PR register rotation information
    - rrb = register rename base
- PFS, Previous Function State
  - The previous value of CFM is stored here. The older content of PFS is stored somewhere else (another register?) (`alloc` determines the destination)

# Itanium 2 (again just called Itanium!)
#### **Itanium**

- First implementation released in 2001
  - The second, at that time called Itanium 2, was released in 2002
- Simpler than a conventional superscalar CPU
  - No reservation stations
  - No reorder buffers (ROB)
  - Simpler register remapping hardware (versus register aliasing)
  - No dependency-detection logic
    - The compiler resolves the dependencies and computes explicit parallelism directives
- Large address space
  - Memory can be addressed in units of 1, 2, 4, 8, 10 or 16 bytes
  - Recommendation: use natural boundaries
  - Supports both big-endian and little-endian

#### **Itanium**

- Wide and fast bus: 128 bits, 6.4 Gbps
- Improved cache hierarchy
  - L1: split instruction and data, 16 KB + 16 KB, set-associative (4-way), 64 B line
  - L2: shared, 256 KB, set-associative (8-way), 128 B line
  - L3: shared, 3 MB, set-associative (12-way), 64 B line
  - All on-chip, smaller latencies
- TLB hierarchy
  - I-TLB
    - L1: 32 items, associative
    - L2: 128 items, associative
  - D-TLB
    - L1: 32 items, associative
    - L2: 128 items, associative

#### **Memory management**

- The memory hierarchy is visible to applications, too
  - = possibility to give hints
  - Fetch order: make sure that earlier operations have committed
  - Locality: fetch many or few lines into the cache
  - Prefetch: when to move data closer to the CPU
  - Clearing: line invalidation, write policy
- Implicit control (exclusive access)
  - Exchanging memory and register contents
  - Incrementing a memory location by a constant value
- Possibility to collect performance data
  - To improve the hints...
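The two exclusive-access operations listed above (exchange register and memory contents, add a constant to a memory location) are read-modify-write operations that must not be interrupted. The following is a minimal sketch of their semantics only, assuming a toy `Memory` class whose names are invented here; a Python lock stands in for the hardware's exclusive access. IA-64 offers these operations as the `xchg` and `fetchadd` instructions.

```python
import threading


class Memory:
    """Toy shared memory; one lock models the exclusive access of the hardware."""

    def __init__(self):
        self._cells = {}
        self._lock = threading.Lock()

    def load(self, addr):
        with self._lock:
            return self._cells.get(addr, 0)

    def exchange(self, addr, reg_value):
        """Swap a register value with memory; return the old memory value."""
        with self._lock:
            old = self._cells.get(addr, 0)
            self._cells[addr] = reg_value
            return old

    def fetch_add(self, addr, increment):
        """Add a constant to memory; return the value before the addition."""
        with self._lock:
            old = self._cells.get(addr, 0)
            self._cells[addr] = old + increment
            return old


def count_with_threads(n_threads=4, per_thread=1000):
    """Several threads bump one counter; no increment is lost."""
    mem = Memory()

    def worker():
        for _ in range(per_thread):
            mem.fetch_add(0x10, 1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mem.load(0x10)


print(count_with_threads())
```

Because each update is one indivisible step, the final count equals the number of increments regardless of how the threads interleave.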
#### **Itanium**

- 11 instruction issue ports (like a selection window)
  - At most 6 instructions are issued to execution in each cycle
  - In-order issue, out-of-order completion
- 8-stage pipeline
- More execution units (22)
  - 6 general-purpose ALUs (1 cycle)
  - 6 multimedia units (2 cycles)
  - 3 FPUs (4 cycles)
  - 3 branch units
  - 4 data cache memory ports (L1: 1/2 cycle load)
- Improved branch prediction
  - The application is allowed to give hints
  - Used to reduce cache misses

## **Current State (2006-08)**

- Intel hyper-threading and multi-core
- STI multi-core

#### Intel Pentium 4 HT (IA-32)

- HT = Hyper-threading
  - 2 logical processors in one physical processor
  - The OS sees it as a symmetric 2-processor system
  - Uses wait cycles to run the other thread
    - memory accesses (cache miss)
    - dependencies, branch mispredictions
  - Utilizes the usually idle integer unit when the floating-point unit is in use
- 3.06 GHz + 24%(?)
  - GHz numbers alone are not so important
- 20-stage pipeline
- Dual-core hyper-threaded processor
- Dual-core Itanium 2 with Hyper-threading

http://www.intel.com/multi-core/index.htm

#### **Intel Multi-Core Core-Architecture**

- 2 or more (> 100?) complete cores in one chip
  - Hyper-threading still in use
- Simpler structure, less power
- Private L1 cache
- Private or shared L2 cache?
- Intel Core 2 Duo E6700
  - 128-bit data path
  - Private 32 KB L1 data cache
  - Private 32 KB L1 instruction cache (for micro-ops)
  - Shared/private 4 MB L2 data cache

### STI Cell Broadband Engine (Sony-Toshiba-IBM)

#### STI Cell (Cell B.E.)
- Sony
  - Playstation 3 (4 cells)
- IBM
  - Roadrunner supercomputer (installed 2008)
    - \$110M, 1100 m<sup>2</sup>, Linux
    - Peak 1.6 petaflops (1.6 \* 10<sup>15</sup> flops)
    - Sustained 1 petaflops
  - Over 16000 AMD Opterons for file operations and communication (e.g.)
    - Normal servers
  - Over 16000 Cells for number crunching
    - Blade centers

#### STI Cell (Cell B.E.)

- Toshiba
  - Quad Core HD processor (or SpursEngine)
  - In multimedia laptops for HD DVDs
- Mercury Computer Systems (year 2006)
  - Cell accelerator board (CAB) for PCs
    - 180 GFlops boost, Linux
- Blade servers
  - Mercury 42U Dual Cell Based Blade 2 systems
    - 42 Dual Cell BE processors
  - IBM BladeCenter
    - 2 IBM PowerXCell 8i processors

Figures: Mercury Dual-Cell Blade; IBM Blade Server prototype w/ 2 cells (2005)

## ARM

#### **ARM** architecture family

- 32-bit embedded RISC microprocessor
  - The ARM family accounts for approximately 90% of these.
  - The most widely used 32-bit CPU architecture in the world.
  - The ARM architecture is used in about 3/4 of all 32-bit processors sold. (source: Intellitech)
  - Exists in 95% of all cell phones (source: Intellitech)
- Architecture
  - Extremely simple (ARM6 only 35,000 transistors)
  - 32-bit data bus, a 26-bit (64 Mbyte) address space and sixteen 32-bit registers.
  - Low power usage, hardwired control
  - ARM: no cache, ARM4: cache
  - ARM8: 5-stage pipeline, static branch prediction, double-bandwidth memory

#### **ARM** architecture

- Conditional execution of most instructions
  - 4-bit condition code in front of every instruction
- Arithmetic instructions alter the condition codes only when desired
- Indexed addressing modes
- 2-priority-level interrupt subsystem

```
while (i != j)
{
    if (i > j)
        i -= j;
    else
        j -= i;
}

loop    CMP    Ri, Rj        ; set condition "NE" if (i != j),
                             ;               "GT" if (i > j),
                             ;            or "LT" if (i < j)
        SUBGT  Ri, Ri, Rj    ; if "GT", i = i-j;
        SUBLT  Rj, Rj, Ri    ; if "LT", j = j-i;
        BNE    loop          ; if "NE", then loop
```

#### Things progress...

- x86 => Pentium => Core => Nehalem (Core i7) => Westmere
- Superscalar
  - More efficient use of pipelining
  - Parallel pipelines
- Branch prediction
- Out-of-order execution
  - CISC => RISC translations
- Hyper-threading
- Chip-level multiprocessing -> multi-core
- Vector instruction codes (in vector processors)
  - Parallel data processing
- Cache: more levels, larger caches
  - QX9650: 12 MB L2

#### To different directions ...

- Power consumption
  - Mobile and portable devices
  - Density => heating up
- Superscalar improvements used up?
  - Improving the prediction logic gives less and less benefit => simpler CPU
  - => software-based (Transmeta's Crusoe tried this!)
  - => the compiler produces better-ordered instructions (IA-64, Itanium 2, CELL, ...)
- More cores on one chip
  - Different tasks (like the GPU integrated in Westmere)
  - Coordination of the cores (processors) becomes an issue

#### Review Questions

- EPIC?
- Why does the instruction bundle have a template?
- What is predicated execution? How does it work?
- What does control speculation mean? Data speculation?
- How are registers used in subroutine calls?
- What is the difference between hyper-threading and multi-core?