Lesson 12

# **Current Research** Course Review

[Pat 08] [A-TKS 07] [KCHK 06]

Moore's Law Multiprocessor Challenge in 1980's Multicore Challenge Now Course Review

31.3.2011

Copyright Teemu Kerola 2011

#### Moore's Law

- "The number of transistors that can be inexpensively placed on an integrated circuit is increasing exponentially, doubling approximately every two years" (orig. 18 months)
  - Gordon E. Moore, 1965
  - Memory size in chips will double every two years
  - Increase in transistor count is also a rough measure of computer processing performance, resulting to processing speed doubling every two years
- "Heat barrier" limit for processing speed reached 2004
  - Luke Collins, IEE Review, Jan 2003

http://ieeexplore.ieee.org/iel5/2188/26844/01193720.pdf

31.3.2011

Copyright Teemu Kerola 2011

**Problem** Moore's Law will not give us faster processors (any more) - But it gives us now more processors on one chip • Multicore CPU • Chip-level multiprocessor (CMP) Herb Sutter, "A Fundamental Turr Toward Concurrency in SW", Dr. Dobb's Journal, 2005. 1975 1979 1983 1987 31.3.2011 Copyright Teemu Kerola 2011



# Moore's Law Reinterpreted

- · Number of cores per chip doubles every two years, while clock speed decreases
  - Need to utilize systems with hundreds or thousands of cores
  - Need to handle systems with millions (billions?) of concurrent threads
  - $\ Need to \ emphasize \ \underline{scalability} not \ best \ performance$ for fixed number of cores.
  - Need to be able to easily replace inter-chip parallelism with intra-chip parallelism

Marc Snir

http://www.cs.helsinki.fi/u/kerola/rio/papers/snir\_2008.pdf

31.3.2011 Copyright Teemu Kerola 2011

# Multi-core: An Inflection Point also in SW Development

- Multi-core architectures: an inflection point in mainstream SW development
- · Writing parallel SW is hard
  - Mainstream developers (currently) not used to thinking in parallel
  - Mainstream languages (currently) force the use of low-level concurrency features
  - Must have with new systems?
- Navigating through this inflection point requires better concurrency abstractions

31.3.2011 Copyright Teemu Kerola 2011

Lecture 12: Current Research, Summary

1

# The Multicore Challenge

- · Heat barrier dead-end for chip speed
- · So, try now multicore chips ...
  - Shared memory multiprocessor



- Convex, Floating Point Systems, INMOS, Thinking Machines, nCUBE, Kendall Square Research, MasPar, Encore, Sequent,...
- All failed Patterson's "Dead Parallel Computer Society"
- · John Hennessy:
  - "...when we start talking about parallelism and ease of use of truly
    parallel computers, we're talking about a problem that's as hard as any
    that computer science has faced. ... I would be panicked if I were in
    industry."
- · Challenge: How to use multicore effectively
- · Answer: Industry/government funded research?
  - Scale: Manhattan Project

31.3.2011

Copyright Teemu Kerola 2011

## MCC 1983 - (2000)

- re: Japanese 5th Generation Project
  - Ministry of Internat. Trade and Industry (MITI)
  - Japan, 1982, 10 years, \$850M
  - New type of computer to run Artificial Intelligence applications
  - US national level response to Japanese MITI Project
    - Massively Parallel Processing (MPP), how to build and use them
- Microelectronics and Computer Technology Corporation (MCC), Austin, Texas
  - Some 450 researchers in 1986
  - 12 companies in 1983: Control Data, DEC, Harris, RCA, Sperry-Univac, NCR, Honeywell, National Semiconductor, Advanced Micro Devices, Motorola, ...
  - More companies later on: Microsoft, Boeing, GE, Lockheed, Martin Marietta, Westinghouse, 3M, Rockwell, Kodak, Nokia 1997. DoD....
  - Budget \$50-100M per year, for 20 years?

 $\frac{http://www.tshaonline.org/handbook/online/articles/MM/dnm1.html}{Copyright Teemu Kerola 2011}$ 



#### MCC 1983 - (2000)

- CEO Admiral Bobby Ray Inman (ex NSA #1, ex CIA #2)
  - Well connected, new law to avoid antitrust problems 1984
- Main Research Areas (Programs)
- Software technology, semiconductor packaging, VLSI computeraided design, parallel processing, database management, human interfaces and artificial intelligence/knowledge-based systems
- Parallel Processing Program
  - "What type of MPP computer to build and how to use it?"
  - VP Peter Patton 1983-86, VP Stephen Lundstrom 1986-87
  - Some 30 researchers for 3 years, lots of resources
  - Lots of published (and proprietary) research papers
  - Reviews every 3 months, for knowledge transformation from MCC to shareholding companies
  - Could not formulate goal properly, starts to disintegrate 1987

31.3.2011

http://en.wikipedia.org/wiki/Microelectronics and Computer Technology Corporation

Copyright Teemu Kerola 2011 9

# MPP Challenge in early 1980's

- · Shared memory multiprocessors
  - With cache coherence
  - Memory bus can handle some 20 processors
- More processors have new problems
  - Interconnect networks
    - All-to-all, butterfly, cube connected cycles, hyper-cube, ...
  - Multi-level memory, varying memory access times
    - $\bullet \;\; Local, shared, node, network, \dots$
- Operating systems
  - Shared Memory OS, Distributed OS?
  - Compiler technology ouch!
- · Applications tailored for just one system?

31.3.2011

31.3.2011

Copyright Teemu Kerola 2011

10

# 31.3.2011 CopyrightTeemuKerola 2011 11

# The Multicore Challenge Challenge

- The Multicore Challenge
  - How to use multicore/shared-memory-multiprocessor effectively?
  - Answer: Industry/government funded research for many years (?)
- The Challenge Challenge
  - $\ \ Is the Multicore Challenge the \underline{right \ challenge \ to \ take}?$ 
    - It might fail (again)
  - What other challenges would better suit our need for ever faster computers?
- The Real Challenge (?):
  - How to get computational speed to double every two years or every 18 (24?) months for <u>all</u> computations? (just like before...)
    - Currently there is no answer
  - What if there is no answer ever? Is that ok?
  - Get computational speed to double every two years for <u>some</u> computations? Is this enough?
    - This is doable... but is it enough? Which "some"?

31.3.2011

Copyright Teemu Kerola 2011

12

# Needs for Multicore Challenge

- · How to synchronize processes in multiprocessor systems?
- How to make it easy to obtain parallel performance
- Scalable solutions (to many and even more processors)
- Avoid deadlocks
- Data consistency with error situations (abort support)
- Current programming languages force low level solutions
- Amdahl's Law: Proportion of serial code will give upper limit on speedup
  - With 5% serial code, max speedup is 20 (even for 100 processors)

Speedup = where P = parallel code proportion N = nr of processors Serial code proportion (1-P) should

31.3.2011

be almost zero. How? Copyright Teemu Kerola 2011

# US Multicore Challenge Projects

- - 1000-core (or more) processors
  - Parallel algorithms, development environments, and runtime systems that scale to 1000s of hardware threads
  - Need new OS & HW architectures
    - · What exactly is needed? That is the problem!
  - What type of MPP computer to build and how to use it?

31.3.2011

- FPGA-based simulators to test out work
  - FPGA = Field-Programmable Gate Array
  - · Reprogram HW to test new HW ideas
- Funding problem another challenge?
  - Defence Advanced Research Projects Agency (DARPA) funding receding in 2000-2008

· Would have been needed big-time Copyright Teemu Kerola 201

# US Projects from 2008

- · Stanford University
  - Pervasive Parallelism Lab
- · University of California at Berkeley
  - Parallel Computing Lab
- University of Illinois at Urbana-Champaign,
  - The Universal Parallel Computing Research Center
- The Multicore Association
  - Many companies share results

Copyright Teemu Kerola 2011 15 31.3.2011

# Stanford



- John Hennessy (Jan 2007):
  - "When we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science
  - "I would be panicked if I were in industry."

http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=445&page=3

31.3.2011 Copyright Teemu Kerola 2011

# Stanford

- · 2008, \$6M for 3 years,
- 9 faculty + 30 grad students
- Nvidia, Sun Microsystems, Advanced Micro Devices, Hewlett-Packard, IBM, Intel
- William Dally (chairman, Stanford CS dept) Stream computing, transactional memory
- Enable use of parallelism beyond traditional scientific computing
- New ideas for high-level concurrency abstractions
- New ideas for hardware support for new paradigms

31.3.2011

Copyright Teemu Kerola 2011



17

31.3.2011

Stanford

· Goal: the Parallel Computing Platform for 2012



- Make parallel programming practical for the masses
- Algorithms, programming models, runtime programming and sw systems
- architectures for scalable parallelism
  - 10.000s of HW threads

architecture

• Parallel computing a core component of CS education

- Build real, full system prototypes

Copyright Teemu Kerola 2011

# Universal Parallel Computer Research Centers (UPCRC's)

- · Funding: Intel & Microsoft
  - Dan Reed (Microsoft, Extreme Comp Grp)
- Univ of California at Berkeley
- Parallel Computing Lab
- David A. Patterson
- Univ of Illinois at Urbana-Champaign
  - The Universal Parallel Computing Research Center
- Marc Snir & Wen-mei Hwu





31.3.2011 Copyright Teemu Kerola 2011

# Berkeley

- David A. Patterson (Aug 2008):
  - "Knowing what we know today, if we could go back in time we would have launched a Manhattan Project to bring together the applications, software architecture, programming languages and compilers, libraries, testing and correctness, operating systems, hardware architecture, and chip design to tackle this parallel
  - "We need the US Government to return to its historic role to bring the many more minds on these important problem. To make real progress, we would need a long-term, multi-hundred million dollar per year program.

http://parlab.eecs.berkeley.edu/

http://www.cccblog.org/2008/08/26/the-multicore-challenge/

31.3.2011 Copyright Teemu Kerola 2011

# Berkeley Parallel Computing Lab David A. Patterson

- David A. Patterson, 10 faculty + 40 grad students
- \$17M for 5 years (2008-2012?)
- Berkeley Emulation Engine v3 (BEE3)
- 7+ basic programming tasks at the heart of most parallel programs?



Compositional verification and testing

Copyright Teemu Kerola 2011 31.3.2011

# Berkeley

- Applications
  - Need new 21<sup>st</sup> century applications
    - · Medicine, image, music, speech, ...
- Computational bottleneck benchmarks
  - 7+ dwarfs, use them to analyse hw/sw designs
- Parallel SW development (55% faculty)
  - Implement 13 dwarfs as libraries or frameworks
  - Efficiency layer by experts (mutex, deadlock, etc)
  - Productivity layer by "normal" programmers
  - Create Composition and Coordination (C&C) language
  - 21st century code generation
- · OS and Architecture
  - Very thin hypervisors

31.3.2011 Copyright Teemu Kerola 2011 22

20

# **Berkeley**

- HW: build own academic Manycore
  - Research Accelerator for Multiple Processors
  - 4 FGPAs/board, 21 boards (84 FGPA'a)
- 1008 Core RAMP Blue
  - 12 32-bit RISC cores / FPGA
- · Other architectures by FPGA redesign
  - RAMPants: 10 faculty
    - Berkeley, CMU, MIT, Stanford, Texas, Washington
  - · Create HW&SW for Manycore community





21





31.3.2011

Copyright Teemu Kerola 2011



- Marc Snir (Nov 2008):
  - "It is possible that parallel programming is inherently hard, in which case, indeed the sky is falling."

Illinois

The Universal Parallel Computing Research Center (UPCRC)

- "An alternative view is that, intrinsically, parallel programming is not significantly harder than sequential programming; rather, it is hampered by the lack of adequate languages, tools and architectures."

http://www.cccblog.org/2008/11/17/multi-core-and-parallel-programming-is-the-sky-falling/

24

26

# Illinois

http://www.upcrc.illinois.edu/

- Marc Snir & Wen-mei Hwu
- \$17M for 5 years (2008-...)
- Parallel programming can be (should be?) a child's play
- · Simplicity is hard
  - Simpler languages + more complex architectures
  - a feast for compiler developers
- What hooks can HW provide to facilitate programming?
  - Sync primitives, debug/performance support

http://www.cs.helsinki.fi/u/kerola/ño/papers/snir\_2008.pdf
31.3.2011 Copyright TeemuKerola 2011 2

Illinois · Moore's Law Reinterpreted - Number of cores per chip doubles every two years while clock speed decrea Need to be able to easily replace inter-chip parallelism with CPU Speed — DRAM Speed intra-chip parallelism Memory Wall Most area and energy in chip budget is spent on storing and moving bits · Reliability and Variance MTTF (mean time to fail) per chip does not decrease - hardware is used to mask errors Programmers do not have to handle faults

Copyright Teemu Kerola 2011

#### Illinois

- New emphasis on deterministic (repeatable) parallel computation models – focus on producer-consumer or barrier synchronization, not on nondeterministic mutual exclusion
  - Simpler languages + more complex architectures
     = a feast for compiler developers
- Serial semantics, parallel performance model
- Parallel algorithms are designed by <u>programmers</u>, not inferred by compilers



- Make parallel programming synonymous with programming
- What hooks can HW provide to facilitate programming?
  - Sync primitives, debug/performance support
- There is no silver bullet no one technology solution

31.3.2011 Copyright Teemu Kerola 2011 27

## The Multicore Association (2006-...)

- Intel, NSN, Texas Instruments, Plurality, Wind River, PolyCore, Samsung, ... http://www.multicore-association.org/home.php
- - capture the basic elements of communication and synchronization for closely distributed embedded systems.
- Multicore Programming Practices (MPP) wg
  - develop a <u>multicore software programming guide</u> for the industry that will aid in improving consistency and understanding of multicore programming issues
- Multicore Resource Management API (MRAPI) wg
  - specify essential application-level <u>resource management</u> <u>capabilities</u> needed by multicore applications
- Hypervisor wg

31.3.2011

support hypervisor (multiple OS's on same host) portability and multicore capabilities.

31.3.2011 Copyright Teemu Kerola 2011 2





# Intel Experimental SCC "Single-chip Cloud Computer"

http://techresearch.intel.com/ProjectDetails.aspx?ld=1

- 48 cores, no hw cache coherence
- · Message passing programming models
- · On-chip network





31.3.2011 Copyright Teemu Kerola 2011

# Likely Problems with Multicore Challenge

- Symmetric Multiprocessor (SMP) scales up to only somewhere (remember the 1980's)
- Must have <u>interconnect networks</u>
  - Between cores in chip, between chips, boards, and nodes
- · How to distribute memory and access it
  - Local core memory in cache?
  - Memory hierarchy reworked? Disks disappear?
- How to implement <u>I/O</u> and <u>database</u>
- How to distribute disks or other permanent stores
- Avoid communication & I/O bottlenecks
  - Remember Amdahl's Law
  - Minimize communication and serialization

31.3.2011 Copyright Teemu Kerola 2011

32

# More Likely Problems with Multicore Challenge

- Computation + communication + memory use
  - Optimize on overall time/space?
- · Test new processor architectures with FPGA's
- Operating system for new architectures?
- New languages for new architetures?
- · Good compilers for new architectures?
- May end up with lots of different architectures
  - Masters & slaves, control hierarchies, ...
  - Applications run well only on one system?
     E.g., Systems based on STI cell vs. Intel 80-core chip vs. Intel 48 core SCC?

31.3.2011 Copyright Teemu Kerola 2011

# Even More Likely Problems with Multicore Challenge

- Scalable processors architecture needs to be <u>at</u> <u>least</u> 3D?
  - Is 3D enough?
  - Stacked chips 2008

http://www.chinapost.com.tw/taiwan-business/2011/03/30/296633/New-stacked.htm

- · Real stress to get results fast
  - Single-core dead-end?
  - New architectures designed and built <u>now</u>
  - Important applications built on new architectures

31.3.2011 CopyrightTeemuKerola 2011

-----

## Current Research Summary

- Moore's Law and what it means now
- What type of MPP computer to build and how to use it?
- We have jumped into the multi-core train is that OK?
- · Some projects in Universities & Industry
  - Too small scale? \$10M's but not \$100M's or \$1000M's ...
- · No silver bullet?
  - Get computational speed to double every two years for <u>some</u> computations? Is this enough?
- Make parallel programming synonymous with programming?
  - Only experts program the "efficiency layer" in SW?
  - Should every student learn multicore parallel programming?
- · What happens if all these projects fail?
  - Get heterogeneous architectures that are incompatible with each other's code?



33

31.3.2011 Copyright Teemu Kerola 2011

# Concurrent Programming with New Architectures

- Minimize synchronization and communication
  - Amdahl's Law
  - Barrier synchronization often good
  - Avoid any complex synchronizations that do not scale up
  - Mutual exclusion should not be used in computational work
- · Use shared memory when possible
  - Faster than sharing data with messages
- How to partition problem so that overall solution time is minimized?
  - Distribute computing and data
  - Minimize communication and synchronization
     Trade computing time to communication time
- Prepare for faults
  - Many components, some will fail, must be prepared

31.3.2011 Copyright Teemu Kerola 2011 36

31.3.2011 Copyright TeemuKerola 2011 37

#### Course Review

- Concurrency
  - Problems, atomic statements
  - Critical sections, synchronization, communication
  - What are the problems in writing concurrent programs?
- Disabling interrupts, busy wait, or suspension?
  - When to (not) use? HW vs. SW solution?
  - Shared memory or not? One or distributed system?
- Proofs of correctness
  - Functionality
  - Mutex, no deadlock, no starvation
  - How to do the proofs with temporal logic?
  - How to determine what you really are trying to prove?

31.3.2011

Copyright Teemu Kerola 2011 3

# Course Review (contd)

- · Deadlock
  - Detection, prevention, avoidance DDA, Bankers
- Semaphores
  - Private semaphores, split semaphores, baton passing, implementation, busy-wait semaphores
- Monitors, protected objects
  - Condition variables, signal semantics
- · Messages, RPC, channels, rendezvous
  - Concurrent algorithms
- Distributed mutex
  - (token passing) Ricart-Agrawala, Neilsen-Mizuno
- · Current research

31.3.2011 Copyright Teemu Kerola 2011

# Course Review (contd)

- · Basic problems and solutions for them
  - Dining philosophers, sleeping barber, bakery
  - Readers-writers, producer-consumer
- · Distributed system
  - Concurrency control mechanisms
  - Solutions for critical section problem

31.3.2011 Copyright Teemu Kerola 2011

## What Should You Know?

- When to use what method for critical section (mutex), synchronization, or communication?
- How do you know you have a mutex problem?
- When would you use busy waits, semaphores, monitors, protected objects, RPC, channels, rendezvous?
- How do you implement XYZ with busy waits, semaphores, monitors, protected objects, RPC, channels, rendezvous?
- · When is some technology not appropriate?

31.3.2011 Copyright Teemu Kerola 2011 41

#### What Should You Know?

- When do you need concurrent/distributed algorithms?
   If serial solution is ok, use it!
- What type of OS/programming language library tools you have for CS/synchronization/communication problems?
- What do you need to study to solve your problem?
- What type of tools would you need to solve your problem?
- How does current research apply to me?
  - Do I need to study MPI (Message Passing Interface)?

31.3.2011 Copyright Teemu Kerola 2011 42



