(palm cove) |
|||
(5 intermediate revisions by the same user not shown) | |||
Line 7: | Line 7: | ||
|introduction=2018 | |introduction=2018 | ||
|process=10 nm | |process=10 nm | ||
+ | |cores=2 | ||
+ | |type=Superscalar | ||
+ | |oooe=Yes | ||
+ | |speculative=Yes | ||
+ | |renaming=Yes | ||
+ | |stages min=14 | ||
+ | |stages max=19 | ||
|isa=x86-64 | |isa=x86-64 | ||
+ | |extension=MOVBE | ||
+ | |extension 2=MMX | ||
+ | |extension 3=SSE | ||
+ | |extension 4=SSE2 | ||
+ | |extension 5=SSE3 | ||
+ | |extension 6=SSSE3 | ||
+ | |extension 7=SSE4.1 | ||
+ | |extension 8=SSE4.2 | ||
+ | |extension 9=POPCNT | ||
+ | |extension 10=AVX | ||
+ | |extension 11=AVX2 | ||
+ | |extension 12=AES | ||
+ | |extension 13=PCLMUL | ||
+ | |extension 14=FSGSBASE | ||
+ | |extension 15=RDRND | ||
+ | |extension 16=FMA3 | ||
+ | |extension 17=F16C | ||
+ | |extension 18=BMI | ||
+ | |extension 19=BMI2 | ||
+ | |extension 20=VT-x | ||
+ | |extension 21=VT-d | ||
+ | |extension 22=TXT | ||
+ | |extension 23=TSX | ||
+ | |extension 24=RDSEED | ||
+ | |extension 25=ADCX | ||
+ | |extension 26=PREFETCHW | ||
+ | |extension 27=CLFLUSHOPT | ||
+ | |extension 28=XSAVE | ||
+ | |extension 29=SGX | ||
+ | |extension 30=MPX | ||
+ | |extension 31=AVX-512 | ||
+ | |l1i=32 KiB | ||
+ | |l1i per=core | ||
+ | |l1i desc=8-way set associative | ||
+ | |l1d=32 KiB | ||
+ | |l1d per=core | ||
+ | |l1d desc=8-way set associative | ||
+ | |l2=256 KiB | ||
+ | |l2 per=core | ||
+ | |l2 desc=4-way set associative | ||
+ | |l3=2 MiB | ||
+ | |l3 per=core | ||
+ | |l3 desc=16-way set associative | ||
|predecessor=Skylake | |predecessor=Skylake | ||
|predecessor link=intel/microarchitectures/skylake | |predecessor link=intel/microarchitectures/skylake | ||
Line 19: | Line 69: | ||
== Architecture == | == Architecture == | ||
− | === Key changes from {{\\|Skylake ( | + | === Key changes from {{\\|Skylake (Client)}}=== |
− | {{ | + | * [[10 nm process]] (From [[14 nm]]) |
+ | * Front End | ||
+ | ** LSD is re-enabled (See {{\\|skylake_(server)#Front-end|Skylake § Front-end}} for details) | ||
+ | ** 50% smaller L1 instruction cache 4K page TLB (64-entry, down from 128) | ||
+ | * Back-end | ||
+ | ** Execution units | ||
+ | *** Port 4 now performs 512b stores (from 256b) | ||
+ | *** New 512b FMA unit on Port 0 | ||
+ | *** New iDIV unit | ||
+ | * Memory subsystem | ||
+ | ** Store is now 64B/cycle (from 32B/cycle) | ||
+ | ** Load is now 2x64B/cycle (from 2x32B/cycle) | ||
+ | |||
+ | {{expand list}} | ||
+ | |||
+ | ==== New instructions ==== | ||
+ | Cannon Lake introduced a number of {{x86|extensions|new instructions}}: | ||
+ | |||
+ | * {{x86|AVX-512|<code>AVX-512</code>}}, specifically: | ||
+ | ** {{x86|AVX512F|<code>AVX512F</code>}} - AVX-512 Foundation | ||
+ | ** {{x86|AVX512CD|<code>AVX512CD</code>}} - AVX-512 Conflict Detection | ||
+ | ** {{x86|AVX512BW|<code>AVX512BW</code>}} - AVX-512 Byte and Word | ||
+ | ** {{x86|AVX512DQ|<code>AVX512DQ</code>}} - AVX-512 Doubleword and Quadword | ||
+ | ** {{x86|AVX512VL|<code>AVX512VL</code>}} - AVX-512 Vector Length | ||
+ | ** {{x86|AVX512IFMA|<code>AVX512IFMA</code>}} - AVX-512 Integer Fused Multiply-Add | ||
+ | ** {{x86|AVX512VBMI|<code>AVX512VBMI</code>}} - AVX-512 Vector Bit Manipulation | ||
+ | * {{x86|SHA|<code>SHA</code>}} - [[Hardware acceleration]] for SHA hashing operations | ||
+ | * {{x86|UMIP|<code>UMIP</code>}} - User-Mode Instruction Prevention extension | ||
+ | |||
+ | === Memory Hierarchy === | ||
+ | Other than a few organizational changes (e.g. L2$ went from 8-way to 4-way set associative), the overall memory structure is identical to {{\\|Broadwell}}/{{\\|Haswell}}. | ||
+ | |||
+ | * Cache | ||
+ | ** L0 µOP cache: | ||
+ | *** 1,536 µOPs, 8-way set associative | ||
+ | **** 32 sets, 6-µOP line size | ||
+ | **** statically divided between threads, per core, inclusive with L1I | ||
+ | ** L1I Cache: | ||
+ | *** 32 [[KiB]], 8-way set associative | ||
+ | **** 64 sets, 64 B line size | ||
+ | **** shared by the two threads, per core | ||
+ | ** L1D Cache: | ||
+ | *** 32 KiB, 8-way set associative | ||
+ | *** 64 sets, 64 B line size | ||
+ | *** shared by the two threads, per core | ||
+ | *** 4 cycles for fastest load-to-use (simple pointer accesses) | ||
+ | **** 5 cycles for complex addresses | ||
+ | *** 128 B/cycle load bandwidth | ||
+ | *** 64 B/cycle store bandwidth | ||
+ | *** Write-back policy | ||
+ | ** L2 Cache: | ||
+ | *** Unified, 256 KiB, 4-way set associative | ||
+ | *** 1024 sets, 64 B line size | ||
+ | *** Non-inclusive | ||
+ | *** 12 cycles for fastest load-to-use | ||
+ | *** 64 B/cycle bandwidth to L1$ | ||
+ | *** Write-back policy | ||
+ | ** L3 Cache/LLC: | ||
+ | *** Up to 2 MiB Per core, shared across all cores | ||
+ | *** Up to 16-way set associative | ||
+ | *** Inclusive | ||
+ | *** 64 B line size | ||
+ | *** Write-back policy | ||
+ | *** Per each core: | ||
+ | **** Read: 32 B/cycle (@ ring [[clock]]) | ||
+ | **** Write: 32 B/cycle (@ ring clock) | ||
+ | *** 42 cycles for fastest load-to-use | ||
+ | ** System [[DRAM]]: | ||
+ | *** 2 Channels | ||
+ | *** 8 B/cycle/channel (@ memory clock) | ||
+ | |||
+ | The Palm Cove TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB). | ||
+ | * TLBs: | ||
+ | ** ITLB | ||
+ | *** 4 KiB page translations: | ||
+ | **** 64 entries; 8-way set associative | ||
+ | **** dynamic partitioning | ||
+ | *** 2 MiB / 4 MiB page translations: | ||
+ | **** 8 entries per thread; fully associative | ||
+ | **** Duplicated for each thread | ||
+ | ** DTLB | ||
+ | *** 4 KiB page translations: | ||
+ | **** 64 entries; 4-way set associative | ||
+ | **** fixed partition | ||
+ | *** 2 MiB / 4 MiB page translations: | ||
+ | **** 32 entries; 4-way set associative | ||
+ | **** fixed partition | ||
+ | *** 1 GiB page translations: | ||
+ | **** 4 entries; 4-way set associative | ||
+ | **** fixed partition | ||
+ | ** STLB | ||
+ | *** 4 KiB + 2 MiB page translations: | ||
+ | **** 1536 entries; 12-way set associative | ||
+ | **** fixed partition | ||
+ | *** 1 GiB page translations: | ||
+ | **** 16 entries; 4-way set associative | ||
+ | **** fixed partition | ||
+ | |||
+ | == Overview == | ||
+ | Palm Cove is the core microarchitecture found in Intel's {{\\|Cannon Lake}} SoCs. Although it was originally intended to be mass-manufactured for all client and server markets, Intel's prolonged [[10 nm process]] problems mean that Palm Cove is being skipped, with the exception of a single chip. | ||
+ | |||
+ | == See also == | ||
+ | * {{intel|Cannon Lake|l=arch}} |
Latest revision as of 12:56, 9 May 2019
Palm Cove µarch | |
General Info | |
Arch Type | CPU |
Designer | Intel |
Manufacturer | Intel |
Introduction | 2018 |
Process | 10 nm |
Core Configs | 2 |
Pipeline | |
Type | Superscalar |
OoOE | Yes |
Speculative | Yes |
Reg Renaming | Yes |
Stages | 14-19 |
Instructions | |
ISA | x86-64 |
Extensions | MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA3, F16C, BMI, BMI2, VT-x, VT-d, TXT, TSX, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVE, SGX, MPX, AVX-512 |
Cache | |
L1I Cache | 32 KiB/core 8-way set associative |
L1D Cache | 32 KiB/core 8-way set associative |
L2 Cache | 256 KiB/core 4-way set associative |
L3 Cache | 2 MiB/core 16-way set associative |
Succession | |
Palm Cove is a high-performance 10 nm x86 core microarchitecture designed by Intel for an array of server and client products.
Process Technology
Palm Cove is designed to take advantage of Intel's 10 nm process.
Architecture
Key changes from Skylake (Client)
- 10 nm process (from 14 nm)
- Front End
  - LSD is re-enabled (see Skylake § Front-end for details)
  - 50% smaller L1 instruction cache 4 KiB page TLB (64 entries, down from 128)
- Back-end
  - Execution units
    - Port 4 now performs 512-bit stores (from 256-bit)
    - New 512-bit FMA unit on Port 0
    - New iDIV unit
- Memory subsystem
  - Store is now 64 B/cycle (from 32 B/cycle)
  - Load is now 2x64 B/cycle (from 2x32 B/cycle)
This list is incomplete; you can help by expanding it.
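To put the wider load and store paths in perspective, the per-cycle widths listed above can be converted into peak bandwidth at a given core clock. The following is a minimal sketch: the 2.0 GHz clock is a hypothetical value chosen for illustration, not a Palm Cove specification, and `peak_bandwidth_gbps` is an illustrative helper name.

```python
def peak_bandwidth_gbps(bytes_per_cycle: int, clock_ghz: float) -> float:
    """Peak bandwidth in GB/s (10^9 bytes per second) for a per-cycle width."""
    return bytes_per_cycle * clock_ghz


CLOCK_GHZ = 2.0  # hypothetical core clock, not a documented Palm Cove frequency

# Per-cycle widths from the list above: (Skylake client, Palm Cove)
paths = {
    "load": (2 * 32, 2 * 64),
    "store": (32, 64),
}

for name, (old, new) in paths.items():
    print(
        f"{name}: {peak_bandwidth_gbps(old, CLOCK_GHZ):.1f} -> "
        f"{peak_bandwidth_gbps(new, CLOCK_GHZ):.1f} GB/s"
    )
```

These are theoretical ceilings only; sustained bandwidth depends on the access pattern and on the rest of the memory hierarchy.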
New instructions
Cannon Lake introduced a number of new instructions:
- AVX-512, specifically:
  - AVX512F - AVX-512 Foundation
  - AVX512CD - AVX-512 Conflict Detection
  - AVX512BW - AVX-512 Byte and Word
  - AVX512DQ - AVX-512 Doubleword and Quadword
  - AVX512VL - AVX-512 Vector Length
  - AVX512IFMA - AVX-512 Integer Fused Multiply-Add
  - AVX512VBMI - AVX-512 Vector Bit Manipulation
- SHA - Hardware acceleration for SHA hashing operations
- UMIP - User-Mode Instruction Prevention extension
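Software normally checks for these extensions at run time before using them. As a rough sketch, the code below looks for the extensions listed above in the `flags` line that Linux exposes in `/proc/cpuinfo`. The mapping from the names on this page to Linux flag spellings (for example `sha_ni` for SHA) is an assumption about the kernel's naming, and the helper names are illustrative.

```python
# Names from the list above mapped to Linux /proc/cpuinfo flag spellings
# (assumed; verify against the kernel version in use).
PALM_COVE_NEW_FLAGS = {
    "AVX512F": "avx512f",
    "AVX512CD": "avx512cd",
    "AVX512BW": "avx512bw",
    "AVX512DQ": "avx512dq",
    "AVX512VL": "avx512vl",
    "AVX512IFMA": "avx512ifma",
    "AVX512VBMI": "avx512vbmi",
    "SHA": "sha_ni",
    "UMIP": "umip",
}


def missing_extensions(flags_line):
    """Return the listed extensions that do not appear in a flags string."""
    present = set(flags_line.split())
    return [name for name, flag in PALM_COVE_NEW_FLAGS.items()
            if flag not in present]


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' value from a cpuinfo-style file, or ''."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""


if __name__ == "__main__":
    try:
        missing = missing_extensions(read_cpu_flags())
        print("missing:", ", ".join(missing) if missing else "none")
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

On other operating systems, or in native code, the same information comes from the `CPUID` instruction rather than from a text file.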
Memory Hierarchy
Other than a few organizational changes (e.g. L2$ went from 8-way to 4-way set associative), the overall memory structure is identical to Broadwell/Haswell.
- Cache
  - L0 µOP cache:
    - 1,536 µOPs, 8-way set associative
      - 32 sets, 6-µOP line size
      - statically divided between threads, per core, inclusive with L1I
  - L1I Cache:
    - 32 KiB, 8-way set associative
      - 64 sets, 64 B line size
      - shared by the two threads, per core
  - L1D Cache:
    - 32 KiB, 8-way set associative
    - 64 sets, 64 B line size
    - shared by the two threads, per core
    - 4 cycles for fastest load-to-use (simple pointer accesses)
      - 5 cycles for complex addresses
    - 128 B/cycle load bandwidth
    - 64 B/cycle store bandwidth
    - Write-back policy
  - L2 Cache:
    - Unified, 256 KiB, 4-way set associative
    - 1024 sets, 64 B line size
    - Non-inclusive
    - 12 cycles for fastest load-to-use
    - 64 B/cycle bandwidth to L1$
    - Write-back policy
  - L3 Cache/LLC:
    - Up to 2 MiB per core, shared across all cores
    - Up to 16-way set associative
    - Inclusive
    - 64 B line size
    - Write-back policy
    - Per each core:
      - Read: 32 B/cycle (@ ring clock)
      - Write: 32 B/cycle (@ ring clock)
    - 42 cycles for fastest load-to-use
  - System DRAM:
    - 2 Channels
    - 8 B/cycle/channel (@ memory clock)
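The set counts in the list above follow from the standard relation sets = capacity / (ways × line size). A small sketch that rechecks the figures given on this page (the helper name `num_sets` is illustrative):

```python
KIB = 1024


def num_sets(capacity, ways, line_size):
    """Number of sets in a set-associative cache.

    capacity and line_size must use the same unit (bytes, or µOPs for
    the µOP cache).
    """
    sets, remainder = divmod(capacity, ways * line_size)
    if remainder:
        raise ValueError("capacity is not a multiple of ways * line_size")
    return sets


# (capacity, ways, line size) taken from the list above
caches = {
    "L0 µOP cache": (1536, 8, 6),   # counted in µOPs, 6-µOP lines
    "L1I": (32 * KIB, 8, 64),
    "L1D": (32 * KIB, 8, 64),
    "L2": (256 * KIB, 4, 64),
}

for name, (capacity, ways, line_size) in caches.items():
    print(f"{name}: {num_sets(capacity, ways, line_size)} sets")
# L0 µOP cache: 32 sets
# L1I: 64 sets
# L1D: 64 sets
# L2: 1024 sets
```

The L3 is left out because the page gives only upper bounds ("up to") for its size and associativity.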
The Palm Cove TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).
- TLBs:
  - ITLB
    - 4 KiB page translations:
      - 64 entries; 8-way set associative
      - dynamic partitioning
    - 2 MiB / 4 MiB page translations:
      - 8 entries per thread; fully associative
      - Duplicated for each thread
  - DTLB
    - 4 KiB page translations:
      - 64 entries; 4-way set associative
      - fixed partition
    - 2 MiB / 4 MiB page translations:
      - 32 entries; 4-way set associative
      - fixed partition
    - 1 GiB page translations:
      - 4 entries; 4-way set associative
      - fixed partition
  - STLB
    - 4 KiB + 2 MiB page translations:
      - 1536 entries; 12-way set associative
      - fixed partition
    - 1 GiB page translations:
      - 16 entries; 4-way set associative
      - fixed partition
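A useful way to read these entry counts is TLB reach: the amount of memory that can be mapped without a miss, which is simply entries × page size. The sketch below computes it for a few of the structures listed above; the helper name `tlb_reach` is illustrative, and the figures are best-case values that assume every entry holds a translation of the stated page size.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach(entries, page_size):
    """Bytes of address space covered when every entry maps one page."""
    return entries * page_size


# (entries, page size) taken from the list above
dtlb_4k = tlb_reach(64, 4 * KIB)
dtlb_2m = tlb_reach(32, 2 * MIB)
stlb_4k = tlb_reach(1536, 4 * KIB)
stlb_2m = tlb_reach(1536, 2 * MIB)
stlb_1g = tlb_reach(16, 1 * GIB)

print(f"DTLB, 4 KiB pages: {dtlb_4k // KIB} KiB")   # 256 KiB
print(f"DTLB, 2 MiB pages: {dtlb_2m // MIB} MiB")   # 64 MiB
print(f"STLB, 4 KiB pages: {stlb_4k // MIB} MiB")   # 6 MiB
print(f"STLB, 2 MiB pages: {stlb_2m // GIB} GiB")   # 3 GiB
print(f"STLB, 1 GiB pages: {stlb_1g // GIB} GiB")   # 16 GiB
```

Because the 4 KiB and 2 MiB translations share the 1536 STLB entries, real workloads with mixed page sizes fall somewhere between the 6 MiB and 3 GiB figures.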
Overview
Palm Cove is the core microarchitecture found in Intel's Cannon Lake SoCs. Although it was originally intended to be mass-manufactured for all client and server markets, Intel's prolonged 10 nm process problems mean that Palm Cove is being skipped, with the exception of a single chip.
See also
codename | Palm Cove |
core count | 2 |
designer | Intel |
first launched | 2018 |
full page name | intel/microarchitectures/palm cove |
instance of | microarchitecture |
instruction set architecture | x86-64 |
manufacturer | Intel |
microarchitecture type | CPU |
name | Palm Cove |
pipeline stages (max) | 19 |
pipeline stages (min) | 14 |
process | 10 nm (0.01 μm, 1.0e-5 mm) |