Revision as of 16:30, 8 April 2016
Saltwell was a microarchitecture for Intel's 32 nm ultra-low-power systems on chip, first introduced in late 2011 for the Atom family. Saltwell is a shrink of Bonnell that also integrates all of the older support chips on-die. Unlike its predecessor, Saltwell was aimed directly at smartphones (as opposed to MIDs).
Codenames
| Platform | Core | Target |
|---|---|---|
| Medfield | Penwell | Smartphones |
| Cedar Trail | Cedarview | Netbooks |
| Clover Trail+ | Cloverview | Tablets |
| Bordenville | Centerton | Microservers |
| Bordenville | Briarwood | Microservers |
| | Berryville | CE (set-tops) |
Architecture
Saltwell's primary goals were:
- Improve on Bonnell by getting rid of older support chips
- Add enhancements using 32 nm process while transitioning to 22 nm
  - Improve GPU, power
  - Burst frequencies
Memory Hierarchy
- Cache
  - Hardware prefetchers
  - L1 Cache:
    - 32 KB 8-way set associative instruction
      - 1 read and 1 write port
    - 24 KB 6-way set associative data
      - 1 read and 1 write port
    - 8 transistors per cell (instead of 6) to reduce voltage
    - Per core
  - L2 Cache:
    - 512 KB 8-way set associative
    - ECC
    - Shrinkable from 512 KB to 128 KB (2-way)
    - 32 B/cycle and 32 outstanding cache requests
    - Separate voltage rail, fixed at 1.05 V
    - Per core
  - L3 Cache:
    - No level 3 cache
  - Non-Cache Shared State Memory
    - 256 KB low-power SRAM
    - Separate voltage plane
    - Always-on block that stores architectural state during the various power-saving modes
  - RAM
    - Maximum of 1 GB, 2 GB, or 4 GB
    - Dual 32-bit channels, 1 or 2 ranks per channel
Note that the L1 instruction and data caches were originally both 32 KB (8-way); however, due to power restrictions, the L1 data cache (L1d$) was later reduced to 24 KB.
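The cache geometries listed above can be checked with a little arithmetic. The following is a minimal sketch, not a description of the actual hardware: the 64-byte line size is an assumption (the text above does not state it), and `cache_sets` is a hypothetical helper introduced only for illustration.

```python
KB = 1024

# Assumption: 64-byte cache lines. The article does not give the line size.
ASSUMED_LINE_BYTES = 64


def cache_sets(size_bytes: int, ways: int, line_bytes: int = ASSUMED_LINE_BYTES) -> int:
    """Return the number of sets in a set-associative cache."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size is not a whole multiple of ways * line size")
    return sets


# (size in bytes, associativity) as listed in the Memory Hierarchy section.
CACHES = {
    "L1 instruction": (32 * KB, 8),
    "L1 data": (24 * KB, 6),
    "L2 (full)": (512 * KB, 8),
    "L2 (shrunk)": (128 * KB, 2),
}

for name, (size, ways) in CACHES.items():
    way_kb = size // ways // KB
    print(f"{name}: {cache_sets(size, ways)} sets, {way_kb} KB per way")
```

Independently of the assumed line size, both L1 caches work out to 4 KB per way and both L2 configurations to 64 KB per way. That is consistent with the reductions (32 KB to 24 KB, 512 KB to 128 KB) being made by dropping ways rather than sets, though the text does not say so explicitly.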
Functional Units
The number of functional units was kept to a minimum to cut power consumption.
- 2 Integer ALUs (1 for jumps, 1 for shifts)
- 2 FP ALUs (1 adder, 1 for others)
- No Integer multiplier & divider
Pipeline
Saltwell's pipeline is almost identical to Bonnell's: 16 stages, with a 13-stage misprediction penalty. It is still a dual-issue superscalar design with in-order execution; reordering logic was again omitted due to power and area restrictions.
The longer pipeline spreads heat more evenly across the chip, since the work is divided among more units. It also allows a higher clock rate.
- Instruction Fetch
  - 3 stages
  - 48 Bytes/Cycle (lower if SMT)
- Instruction Decode
  - 3 stages
  - Instructions with up to 3 prefixes/Cycle
- Instruction Dispatch
  - 2 stages
- Source Operand Read
  - 1 stage
    - reading register operand
- Data Cache Access
  - 3 stages
    - 1 stage for address calculation
    - 2 stages for reading cache
- Execution
  - 2 clusters
    - integers
      - quick cache access due to direct connection
    - floating point & SIMD
- Exception & MT Handling
  - 2 stages
- Commit
  - 1 stage
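The stage counts in the list can be tallied against the 16-stage total quoted for the pipeline. This is only a sketch of the arithmetic: the list gives no stage count for Execution, so the value derived here is an inference that assumes the listed phases do not overlap.

```python
TOTAL_STAGES = 16  # pipeline depth stated in the text

# Stage counts exactly as listed; Execution is omitted because the
# list does not give a count for it.
LISTED_STAGES = {
    "Instruction Fetch": 3,
    "Instruction Decode": 3,
    "Instruction Dispatch": 2,
    "Source Operand Read": 1,
    "Data Cache Access": 3,
    "Exception & MT Handling": 2,
    "Commit": 1,
}

listed_total = sum(LISTED_STAGES.values())

# Assumption: phases are sequential and non-overlapping, so whatever
# remains of the 16 stages belongs to Execution.
execution_stages = TOTAL_STAGES - listed_total

print(f"listed stages: {listed_total}")
print(f"inferred execution stages: {execution_stages}")
```

Under that assumption the listed phases account for 15 stages, leaving a single stage for Execution.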
Multithreading
Saltwell supports multithreading, with up to two threads per core. However, the threads compete for the same resources, which inherently means each runs slower than it would if it ran alone.
Branch Prediction
- Two-level adaptive predictor
- 12-bit branch history register
- Pattern history table has 8192 entries (shared between threads), twice that of Bonnell
- Branch target buffer (BTB) has 128 entries (4-way, 32 sets)
- Unconditional jumps are ignored
- Always-taken and never-taken are marked in the table
- Penalties:
  - 13 stages for a misprediction
  - 7 stages for a correct prediction that misses the branch target buffer (BTB)
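To illustrate how a two-level adaptive predictor with these table sizes behaves, here is a minimal software sketch. Only the 12-bit history register and the 8192-entry pattern history table come from the list above. The 2-bit saturating counters and the gshare-style XOR index are assumptions swapped in for illustration, since the actual indexing scheme is not described here, and the class name is hypothetical.

```python
HISTORY_BITS = 12   # from the list above
PHT_ENTRIES = 8192  # from the list above


class TwoLevelPredictorSketch:
    """Illustrative two-level predictor; not the real Saltwell design."""

    def __init__(self) -> None:
        self.history = 0                # global branch history register
        self.pht = [1] * PHT_ENTRIES    # 2-bit counters, start weakly not-taken

    def _index(self, pc: int) -> int:
        # Assumption: gshare-style hash of the branch address and history.
        return (pc ^ self.history) & (PHT_ENTRIES - 1)

    def predict(self, pc: int) -> bool:
        """Predict taken when the selected counter is in its upper half."""
        return self.pht[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Train the selected counter, then shift the outcome into history."""
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_mispredictions(outcomes, pc=0x400):
    """Run one branch through a fresh predictor; return the miss count."""
    predictor = TwoLevelPredictorSketch()
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [i % 2 == 0 for i in range(200)]
print("misses on alternating branch:", count_mispredictions(alternating))
```

Because the history register records recent outcomes, a strictly alternating branch ends up selecting two different counters, each of which trains in one direction. After a warm-up period the sketch predicts such a branch correctly, which a single per-branch counter could not do.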
Cores
- Penwell - SoCs specifically for smartphones
- Cedarview - SoCs for netbooks
- Cloverview - SoCs for tablets
- Centerton - SoCs for microservers; added support for Intel VT and ECC memory
- Briarwood - SoCs for microservers
- Berryville - SoCs for consumer electronics (e.g. set-tops)