From Computer Desktop Encyclopedia @ 1999 The Computer Language Co. Inc.



#### Tietokoneen rakenne

# Internal Memory, Cache

Stallings: Ch 4, Ch 5

- Key Characteristics
- Locality
- n Cache
- Main Memory





#### Key Characterics of Memories / Storage

Location

Processor

Internal (main)

External (secondary)

Capacity

Word size

Number of words

Unit of Transfer

Word

Block

Access Method

Sequential

Direct

Random

Associative

Performance

Access time

Cycle time

Transfer rate

Physical Type

Semiconductor

Magnetic

Optical

Magneto-Optical

Physical Characteristics

Volatile/nonvolatile

Erasable/nonerasable

Organization

(Sta06 Table 4.1)



#### Goals

- I want my memory lightning fast
- I want my memory to be gigantic in size
- Register access viewpoint
  - u data access as fast as HW register
  - u data size as large as memory



**HW** solution

- Memory access viewpoint
  - u data access as fast as memory
  - u data size as large as disk



HW help for SW solution



#### Memory Hierarchy

Most often needed data kept close

Access to small data sets

can be made fast

u simpler circuits

Faster ~ more expensive

Large can be biggerand cheaper (per B)

up: smaller, faster, more expensive, more frequent access

down: bigger, slower,

less expensive, less frequent access





#### Principle of locality (paikallisuus)

- In any given time period, memory references occur only to a small subset of the whole address space
- = The reason why memory hierarchies work

```
Prob (small data set) = 99%

Prob (the rest) = 1% "Cost" (small data set) = 2 \mus

"Cost" (the rest) = 20 \mus

Aver cost = 99% * 2 \mus + 1% * 20 \mus = 2.2 \mus
```

- Average cost is close to the cost of small data set
- How to determine data for that small set?
- How to keep track of it?

Sta06 Fig 4.2



#### Principle of locality

- In any given time period
  - u memory references occur only to a small subset of the whole address space
- n Temporal locality (ajallinen)



Spatial locality (alueellinen)



u it is likely that a data items <u>close</u> to the one referenced a short time ago will be referenced soon

MEM:





#### Tietokoneen rakenne

# Cache



#### Cache Memory (välimuisti)

- How to access main memory as fast as registers?
- Locality ž Use (CPU) cache!
  - Keep most probably referenced data in fast cache close to processor, and rest in memory
  - u Most of data accesses only to cache
    - § hit ratio 0.9-0.99
  - Much smaller than main memory
  - u (much) more expensive (per byte) than memory









# **Cache Organization**



(Sta06 Fig 4.6)



#### Cache Design

Cache Size

Mapping Function

Direct

Associative

Set Associative

Replacement Algorithm

Least recently used (LRU)

First in first out (FIFO)

Least frequently used (LFU)

Random

Write Policy

Write through

Write back

Write once

Line Size

Number of caches

Single or two level

Unified or split

n Size

- u Many blocks help for temporal locality
- u Large blocks help for spatial locality
- u Multi-level cache

(Sta06 Table 4.2)



#### Mapping

- Which block contains the memory location?
- Is the block in cache?
- Where is it located?
- Solutions
  - u direct mapping (suora kuvaus)
  - u fully associative mapping (joukkoassosiatiivinen)
  - u set associative mapping (täysin assosiatiivinen)



#### Direct Mapping (4)

- Each block has only one possible location (line) in cache
  - u determined by index field bits
- Several blocks may map into same cache line
  - u identified with tag field bits



PaHe98 Fig 7.10



## Direct Mapping Example (5)





## Direct Mapping Example 2 (5)





## Fully Associative Mapping (6)

- Each block can be in any cache line
  - u tag must be complete block number

Alpha AXP uses 34 bit memory addresses

Block number (in memory)

Block size

34 bit address
(byte address) tagOffset from the beginning of the block (in bytes)

Block size  $= 2^5 = 32 \text{ B}$ 

Unique bits that are different for each block

Each block can be anywhere Cache size can be any number of blocks

Sta06 Fig 4.9



## Fully Associative Example (4)





## Fully Associative Mapping

#### Lots of circuits

- u tag fields are long wasted space?
- each cache line tag must be compared <u>parallelly</u> with the memory address tag
  - § lots of wires, comparison circuits
  - § large surface area on chip
- Final comparison "or" has large gate delay
  - u did any of these 64 comparisons match?
    - $\log_2(64) = 6$  levels of binary OR-gates
  - u how about 262144 comparisons?
    - § 18 levels?

#### ð Can use it only for small caches



## Set Associative Mapping

- With set size k=2, each cache entry contain 2 blocks
  - u Use set (set index) field to find the cache entry
  - u Use tag to determine if the block belongs to the set
  - u Use offset to find the proper byte in the block

Block size  $= 2^{5} = 32 \text{ B}$ (byte address)

tag set offset

Unique bits that are different for each block, stored with block

Nr of sets = 
$$v = 2^7 = 128$$
 blocks = 4 KB

Total cache size = k\*v = 2\*4 KB = 8 KB(without tag bits!) Sta06 Fig 4.11

PaHe98 Fig 7.19



#### 2-way Set Associative Cache

- k=2 g Two blocks in each set (= in one cache entry)
- 4 sets g 2 bits for set index
- 2 words in a block = 8 Bytes g 3 bits for byte offset
- 3 bits for tag

| 3   | 2   | 3      |  |
|-----|-----|--------|--|
| tag | set | offset |  |

8 bit address (byte address)

|     | tag | block                   | tag | block                   |  |
|-----|-----|-------------------------|-----|-------------------------|--|
| 00: | 110 | 12 34 56 78 9A 01 23 45 | 011 | 77 55 55 66 66 22 44 22 |  |
| 01: | 110 | 87 00 32 89 65 A1 B2 00 | 101 | 65 43 21 98 76 65 43 32 |  |
| 10: | 100 | 87 54 00 89 65 A1 B2 00 | 101 | 00 11 22 33 44 55 66 77 |  |
| 11: | 101 | 54 A7 00 91 23 66 32 11 | 111 | 00 11 22 33 44 55 66 77 |  |
|     | 3   | 64                      | 3   | 64                      |  |



#### 2-way Set Assoc. Cache Example (5)





#### **Set Associative Mapping**

- Set associative cache with set size k=2
  - = 2-way cache (common)
- Degree of associativity = nbr of blocks in a set = v
  - Large degree of associativity?
    - More data items in one set
    - § Less "collisions" within set
    - § Final comparison (matching tags?) gate delay?
  - Maximum (nr of cache lines)
    - ð fully associative mapping
  - u Minimum (1)
    - ð direct mapping



## Cache Replacement Algorithm

- Which cache block to replace to make room for new block from memory?
- Direct mapping: trivial
- First-In-First-Out (FIFO)?
- Least-Frequently-Used (LFU)?
- n Random?
- Which one is best / possible?
  - u Chip area?
  - Fast? Easy to implement?



## Cache Write Policy – memory writes?

- Write through (läpikirjoittava)
  - Each write goes always to cache and memory
  - u Each write is a cache miss!
- Write back (lopuksi/takaisin kirjoittava)
  - u Each write goes only to cache
  - Write cache block back to memory only when it is replaced in cache
  - u Memory may have stale (old) data
  - o cache coherence problem (yhdenmukaisuus, yhtäpitävyys)
- Write once ("vain kerran kirjoittava?)
  - § Write-invalidate Snoopy-cache coherence protocol for multiprocessors
  - § Write invalidates data in other caches
  - § Write to memory at replacement time, or when some other cache needs it (has read/write miss)



#### Cache Line Size

- How big cache line?
- Optimise for temporal or spatial locality?
  - u bigger cache line ð better for spatial locality
  - u more cache lines ð better for temporal locality
- Best size varies with program or program phase?
- Best size different with code and data?
- n 2-8 words?
  - u word = 1 float??



#### Types and Number of Caches

- Same cache for data and code, or not?
  - Data references and code references behave differently
- unified vs. split cache (yhdistetty/erilliset)
  - u split cache: can optimise structure separately for data and code
- One cache too large for best results
- Multiple levels of caches
  - u L1 on same chip as CPU
  - u L2 on same package or chip as CPU
    - § older systems: same board
  - u L3 on same board as CPU



#### Example: Pentium 4 Block Diagram



L2, L3: unified, 8-way set-associative, line size 128 B

(Sta06 Fig 4.13)



#### Tietokoneen rakenne

# Main Memory



# Main Memory Types

| Memory Type                            | Category                             | Erasure                   | Write Mechanism | Volatility  |
|----------------------------------------|--------------------------------------|---------------------------|-----------------|-------------|
| Random-access<br>memory (RAM)          | Read-write memory                    | Electrically, byte-level  | Electrically    | Volatile    |
| Read-only<br>memory (ROM)              | Read-only memory  Read-mostly memory | Not possible              | Masks           |             |
| Programmable<br>ROM (PROM)             |                                      |                           | Electrically    | Nonvolatile |
| Erasable PROM<br>(EPROM)               |                                      | UV light, chip-level      |                 |             |
| Electrically Erasable<br>PROM (EEPROM) |                                      | Electrically, byte-level  |                 |             |
| Flash memory                           | 5                                    | Electrically, block-level |                 |             |

(Sta06 Table 5.1)

#### Random access semiconductor memory

- u Direct access to each memory cell
- u Access time same for all cells



#### Dynamic RAM, DRAM

- Periodic refreshing required
- u Refresh required after read
- □ Simpler, slower, denser, bigger (bytes per chip)
- u Access time ~ 60 ns
- Main memory? (early systems)

#### Static RAM, SRAM

- No periodic refreshing needed
- □ Data remains until power is lost
- u More complex (more chip area/byte), faster, smaller
- u Access time ~ 2-5 ns
- u Level 2 cache?



#### DRAM Access, 16 Mb DRAM (4M x 4)





## 256-KB DRAM Memory Organization





#### SDRAM (Synchronous DRAM)

- CPU clock synchronizes also the bus
  - u Runs on higher clock speeds than ordinary DRAM
  - u CPU knows how long it takes to make a reference, can do other work while waiting
- n 16 bits in parallel
  - Access 4 DRAMs (4 bits each) in parallel
  - u Access time ~ 18 ns, transfer rate ~ 1.3 GB/s
- DDR SDRAM, double data rate
  - u Current main memory technology
  - u Supports transfers both on rising and falling edge of the clock cycle
  - □ Consumes less power
  - u Access time ~ 12 ns, transfer rate ~ 3.2 GB/s



#### Rambus DRAM (RDRAM)



- Works with fast Rambus memory bus (800Mbps)
  - u Controller + RDRAM modules

STI Cell processor

- □ Access time ~ 12 ns, transfer rate ~ 4.8 GB/s
- Speed slows down with many memory modules
  - Serially connected on Rambus channel
  - □ Not good for servers with 1 GB memory (for now!)



#### Flash memory

- Based on transistors that are separated by a thin oxide layer
  - uses different charge levels to store 2 (or more) bits in each cell
- Non-volatile, data remains with power off
  - u Electrical erasing in blocks = "flash"
  - u Slow to write
  - u Access time ~ 50 ns
- Used as a solid state storage
  - u No moving parts
  - u FlashBIOS in PC's, USB-memory
  - u In phones, digital cameras, hand-held devices,....







- Magnetoresistive Random Access Memory (MRAM)
  - u Data stored with magnetic fields on two plates
  - u Magnetic field directions determine bit value
- Non-volatile, data remains with power off
  - u Fast to read/write
  - □ No upper limit for write counts (compare to Flash)
  - Access time comparable to DRAM
  - u Almost as fast as SRAM
- Future open
  - u Small market share now
  - u Expensive now (2006: \$25 4Mbit)
  - u Still under development
  - May replace flash in a few years
  - u May replace SRAM later on
  - May replace DRAM and become "universal memory"



http://www.research.ibm.com/journal/rd/501/maffitt.html



## Kertauskysymyksiä

- Muistihierarkia ja paikallisuus?
- Millä tavoin paikallisuutta huomiodaan välimuistiratkaisussa?
- Assosiatiivisen ja joukkoassosiatiivisen kuvauksen erot?
- Miksi käskyille oma välimuisti ja datalle oma?