





#### Goals

- I want my memory lightning fast
- I want my memory to be gigantic in size
- Register access viewpoint
  - data access as fast as HW register
  - data size as large as memory
- Memory access viewpoint
  - data access as fast as memory
  - data size as large as disk

cache

HW solution



HW help for SW solution

Computer Organization II, Autumn 2010, Teemu Kerola





## Principle of locality (paikallisuus)

- In any given time period, memory references occur only to a <u>small subset</u> of the whole address space
- The reason why memory hierarchies work

Prob (small data set) = 95% "Cost" (small data set) = 0.01  $\mu$ s "Cost" (the rest) = 0.1  $\mu$ s "Aver cost = 95% \* 0.01  $\mu$ s + 5% \* 0.1  $\mu$ s = 0.015  $\mu$ s

- Average cost is close to the cost of small data set
- How to determine data for that small set?
- How to keep track of it?

Computer Organization II, Autumn 2010, Teemu Kerola

5.11.2010



### **Principle of locality**

- In any given time period
  - Memory references occur only to a <u>small subset</u> of the whole address space
- Temporal locality (ajallinen)
  - It is likely that a data item referenced a <u>short time ago</u> will be referenced <u>again</u> soon
- Spatial locality (alueellinen)
  - It is likely that a data items <u>close</u> to the one referenced a short time ago will be referenced soon

MEM:



Computer Organization II, Autumn 2010, Teemu Kerola



# **Computer Organization II**

# Cache

Computer Organization II, Autumn 2010, Teemu Kerola





### Cache Memory (välimuisti)

- How to access main memory as fast as registers?
- Locality → Use (CPU) cache!
  - Keep most probably referenced data in fast cache close to processor, and rest in memory
  - Most of data accesses only to cache
    - hit ratio 0.9-0.99
  - Cache is much smaller than main memory
  - Cache is (much) more expensive (per byte) than memory
- Note: file cache is another thing...

Computer Organization II, Autumn 2010, Teemu Kerola









#### Cache Design

Cache Size

Write Policy

Mapping Function

Write through

Direct Associative Write back Write once

Set Associative

Line Size

Replacement Algorithm

Number of caches

Least recently used (LRU)

Single or two level

First in first out (FIFO) Least frequently used (LFU) Unified or split

Random

(Sta10 Table 4.1)

#### ■ Cache Size & Line Size

■ Many blocks help for temporal locality

Typical sizes: L1: 8 KB - 64 KB

- Large blocks help for spatial locality
- L2: 256 KB 8 MB
- Larger cache is slower

■ Multi-level cache

L3: 2 MB - 48 MB

Computer Organization II, Autumn 2010, Teemu Kerola

Discussion? 5.11.2010



#### **Mapping**

- Which block (if any) contains the referenced memory location?
  - Is the block in cache?
  - Where in the cache is it located?

#### Solutions

- Direct mapping (suora kuvaus)
  - One possible location
- Fully associative mapping (täysin assosiatiivinen)
  - Any possible location
- Set associative mapping (joukkoassosiatiivinen)
  - Some possible locations

Cache simulation tools:

http://www.ecs.umass.edu/ece/koren/architecture/Cache/frame0.htm

Computer Organization II, Autumn 2010, Teemu Kerola













### **Fully Associative Mapping**

- Lots of circuits
  - tag fields are long wasted space?
  - each cache line tag must be compared in parallel with the memory address tag
    - lots of wires, comparison circuits
    - large surface area on chip
- Final comparison "or" has large gate delay
  - did any of these 64 comparisons match?
    - log2(64) = 6 levels of binary OR-gates
  - how about 262144 comparisons?
    - 18 levels?
- ⇒ Can use it only for small caches

Computer Organization II, Autumn 2010, Teemu Kerola









#### **Set Associative Mapping**

- Set associative cache with set size k=2= 2-way cache (common)
- Degree of associativity = nr of blocks in a set = v
  - Large degree of associativity?
    - More data items in one set
    - Less "collisions" within set
    - Final comparison (matching tags?) gate delay?
  - Maximum (nr of cache lines)
    - ⇒ fully associative mapping Whole cache is one set!
  - Minimum (1)
    - ⇒ direct mapping Each cache line is a set!

Computer Organization II, Autumn 2010, Teemu Kerola



### **Cache Replacement Algorithm**

- Which cache block to replace to make room for new block from memory?
- Direct mapping: trivial
- First-In-First-Out (FIFO)?
- Least-Frequently-Used (LFU)?
- Random?
- Which one is best / possible?
  - Chip area?
  - Fast? Easy to implement?

Computer Organization II, Autumn 2010, Teemu Kerola

5.11.2010

25



Coherence

problems:

More users of the same data:

multiple processors

with own caches

memory valid? cache valid?

# Cache Write Policy - memory writes?

- Write through (*läpikirjoittava*)
  - Each write goes always to cache and memory
  - Each write is a cache miss!
- Write back (lopuksi/takaisin kirjoittava)
  - Each write goes only to cache
  - Write cache block back to memory
  - only when it is replaced in cache

A bit set

- Memory may have stale (old) data
- cache coherence problem (eheys, yhdenmukaisuus)
- Write once ("vain kerran kirjoittava?")
  - Write-invalidate Snoopy-cache coherence protocol for multiprocessors
  - Write invalidates data in other caches
  - Write to memory at replacement time, or when some other cache needs it (has read/write miss)

    Discussion?

Computer Organization II, Autumn 2010, Teemu Kerola

5.11.2010 26



#### **Cache Line Size**

- How big cache line?
- Optimise for temporal or spatial locality?
  - bigger cache line
     ⇒ better for spatial locality
     more cache lines
     ⇒ better for temporal locality
- Best size varies with program or program phase?
- Best size different with code and data?
- 2-8 words?
  - word = 1 float??

Computer Organization II, Autumn 2010, Teemu Kerola

5.11.2010



### **Types and Number of Caches**

- Same cache for <u>data</u> and <u>code</u>, or not?
  - Data references and code references behave differently
  - Unified vs. split cache (yhdistetty/erilliset)
  - Split cache: can optimise structure separately for data and code

Trend towards split caches: Pentium, Power PC, ARM...

- One cache too large for best results?
  - "Smaller is faster"
- Multiple levels of caches
  - L1 on same chip as CPU
  - L2 on same package or chip as CPU
    - older systems: same board
  - L3 on same board as CPU

Computer Organization II, Autumn 2010, Teemu Kerola





| Main Memory Types                      |                    |                           |                 | (katoava,<br>haihtuva) |
|----------------------------------------|--------------------|---------------------------|-----------------|------------------------|
| Memory Type                            | Category           | Erasure                   | Write Mechanism | Volatility             |
| Random-access<br>memory (RAM)          | Read-write memory  | Electrically, byte-level  | Electrically    | Volatile               |
| Read-only<br>memory (ROM)              | Read-only memory   | Not possible              | Masks           |                        |
| Programmable<br>ROM (PROM)             |                    |                           |                 | Nonvolatile            |
| Erasable PROM<br>(EPROM)               | Read-mostly memory | UV light, chip-level      | Electrically    |                        |
| Electrically Erasable<br>PROM (EEPROM) |                    | Electrically, byte-level  |                 |                        |
| Flash memory                           |                    | Electrically, block-level |                 |                        |

(Sta10 Table 5.1)

### ■ Random access semiconductor memory

■ Direct access to each memory cell

■ Access time same for all cells

Computer Organization II, Autumn 2010, Teemu Kerola

5.11.2010



#### **RAM**

- Dynamic RAM, DRAM
  - Periodic refreshing required
- Analog:
- Charge on capacitors
  - Refresh required after read
  - Simpler, slower, denser, bigger (bytes per chip)
  - Access time ~ 60 ns (?)
  - Main memory? (early systems)
- Static RAM, SRAM
  - No periodic refreshing needed
- Digital: flip-flop gates
- Data remains until power is lost
- More complex (more chip area/byte), faster, smaller
- Access time ~ 25 ns (?)
- Cache?

Computer Organization II, Autumn 2010, Teemu Kerola







### **SDRAM (Synchronous DRAM)**

- CPU clock synchronizes also the bus
  - Runs on higher clock speeds than ordinary DRAM
  - CPU knows how long it takes to make a reference,
  - can do other work while waiting
- 16 bits in parallel
  - Access 4 DRAMs (4 bits each) in parallel
  - Access time ~ 18 ns, transfer rate ~ 1.3 GB/s
- <u>DDR</u> SDRAM, double data rate
  - Current main memory technology
  - Supports transfers both on rising and falling edge of the clock cycle
  - Consumes less power
  - Access time ~ 12 ns, transfer rate ~ 3.2 GB/s

Computer Organization II, Autumn 2010, Teemu Kerola

5.11.2010

Rambus DRAM (RDRAM) (Sta10 Fig 5.14) RDRAM n RDRAM 1 RDRAM 2 RC1k [2] TClk [2] Gnd (32/18) ■ Works with fast Rambus memory bus (800Mbps) ■ Controller + RDRAM modules Intel Pentium, I tanium STI Cell processor ■ Access time ~ 12 ns, transfer rate ~ 4.8 GB/s ■ Speed slows down with many memory modules ■ Serially connected on Rambus channel Computer Organization II, Autumn 2010, Teemu Kerola 5.11.2010



#### **MRAM**

- Magnetoresistive Random Access Memory (MRAM)
  - Data stored with magnetic fields on two plates
  - Magnetic field directions determine bit value
- Non-volatile, data remains with power off
  - Fast to read/write
  - No upper limit for write counts (Flash has upper limit)
  - Access time comparable to DRAM, almost as fast as SRAM
- Future open
  - Small market share now
  - Expensive now (2006: \$25 4Mbit, 2008: \$15 4Mbit, Freescale)
  - Still under development
  - May replace flash "in a few years"
    - Has not by 2010 ...



http://www.research.ibm.com/journal/rd/501/maffitt.html

Computer Organization II, Autumn 2010, Teemu Kerola

2010



### **Summary**

- Memory hierarchy
- Cache
  - Size, Line size, Mapping, nr of levels, unified or split, write policy, replacement policy
- Memory
  - DRAM, SRAM
  - Overall features, no details required
  - Current technologies
    - Synchronous DRAM (SDRAM)
    - Double Data Rate DRAM (DDR)
    - Rambus DRAM

Computer Organization II, Autumn 2010, Teemu Kerola

5.11.2010 383838



# **Review questions**

- Memory hierarchy and principle of locality?
- Different ways to use locality in cache solutions?
- Differences of associative and set associative mappings?
- Why to have separate caches for instructions and data?

Computer Organization II, Autumn 2010, Teemu Kerola