Difference between revisions of "intel/microarchitectures/palm cove"

	Palm Cove µarch
	General Info
Arch Type	CPU
Designer	Intel
Manufacturer	Intel
Introduction	2018
Process	10 nm
Core Configs	2
	Pipeline
Type	Superscalar
OoOE	Yes
Speculative	Yes
Reg Renaming	Yes
Stages	14-19
	Instructions
ISA	x86-64
Extensions	MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA3, F16C, BMI, BMI2, VT-x, VT-d, TXT, TSX, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVE, SGX, MPX, AVX-512
	Cache
L1I Cache	32 KiB/core; 8-way set associative
L1D Cache	32 KiB/core; 8-way set associative
L2 Cache	256 KiB/core; 4-way set associative
L3 Cache	2 MiB/core; 16-way set associative
	Succession
Predecessor	Skylake
Successor	Sunny Cove

Latest revision as of 13:56, 9 May 2019

Palm Cove is a high-performance 10 nm x86 core microarchitecture designed by Intel for an array of server and client products.

Process Technology

Palm Cove is designed to take advantage of Intel's 10 nm process.

Architecture

Key changes from Skylake (Client)

10 nm process (from 14 nm)
Front End
- LSD is re-enabled (See Skylake § Front-end for details)
- L1 instruction TLB for 4 KiB pages is 50% smaller (64 entries, down from 128)
Back-end
- Execution units
  - Port 4 now performs 512b stores (from 256b)
  - New 512b FMA unit on Port 0
  - New iDIV unit
Memory subsystem
- Store is now 64B/cycle (from 32B/cycle)
- Load is now 2x64B/cycle (from 2x32B/cycle)
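
The per-cycle load and store widths listed above translate into peak L1D bandwidth once a core clock is chosen. The following is a minimal sketch of that arithmetic; the function name `peak_l1d_bandwidth_gbs` and the 2.0 GHz clock are assumptions made for illustration only and do not come from any Intel specification.

```python
def peak_l1d_bandwidth_gbs(core_clock_ghz: float,
                           load_ports: int = 2,
                           access_bytes: int = 64) -> dict:
    """Theoretical peak L1D bandwidth for one core, in GB/s (1e9 bytes/s).

    Defaults follow the list above: two 64 B loads and one 64 B store per cycle.
    Pass access_bytes=32 to model the 2x32 B / 32 B figures quoted for Skylake.
    """
    load_bytes_per_cycle = load_ports * access_bytes
    store_bytes_per_cycle = access_bytes
    return {
        "load": load_bytes_per_cycle * core_clock_ghz,
        "store": store_bytes_per_cycle * core_clock_ghz,
    }


# Hypothetical 2.0 GHz core clock, chosen only to make the numbers concrete.
print(peak_l1d_bandwidth_gbs(2.0))                   # 512-bit wide accesses
print(peak_l1d_bandwidth_gbs(2.0, access_bytes=32))  # 256-bit wide accesses
```

These are upper bounds: sustained throughput depends on the instruction mix, address alignment, and any frequency reduction under wide-vector load.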

New instructions

Cannon Lake introduced a number of new instructions:

AVX-512, specifically:
- AVX512F - AVX-512 Foundation
- AVX512CD - AVX-512 Conflict Detection
- AVX512BW - AVX-512 Byte and Word
- AVX512DQ - AVX-512 Doubleword and Quadword
- AVX512VL - AVX-512 Vector Length
- AVX512IFMA - AVX-512 Integer Fused Multiply-Add
- AVX512VBMI - AVX-512 Vector Bit Manipulation
SHA - Hardware acceleration for SHA hashing operations
UMIP - User-Mode Instruction Prevention extension
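
On Linux, one way to check whether a processor reports these extensions is to look at the `flags` line of `/proc/cpuinfo`. The sketch below assumes the Linux kernel's lowercase flag spellings (for example `sha_ni` for SHA); the helper name `missing_flags` is hypothetical, and the check says nothing about other operating systems.

```python
# Flag names as spelled by the Linux kernel in /proc/cpuinfo (an assumption
# of this sketch; other tools may use different spellings).
REQUIRED_FLAGS = {
    "avx512f", "avx512cd", "avx512bw", "avx512dq", "avx512vl",
    "avx512ifma", "avx512vbmi", "sha_ni", "umip",
}


def missing_flags(cpuinfo_text: str) -> set:
    """Return the required flags that the first 'flags' line does not list."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            present = set(line.split(":", 1)[1].split())
            return REQUIRED_FLAGS - present
    # No flags line found: treat every required flag as missing.
    return set(REQUIRED_FLAGS)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or the file is unreadable
    print(sorted(missing_flags(text)))
```

An empty result means every extension in the list above is reported by the kernel.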

Memory Hierarchy

Other than a few organizational changes (e.g. L2$ went from 8-way to 4-way set associative), the overall memory structure is identical to Broadwell/Haswell.

Cache
- L0 µOP cache:
  - 1,536 µOPs, 8-way set associative
    - 32 sets, 6-µOP line size
    - statically divided between threads, per core, inclusive with L1I
- L1I Cache:
  - 32 KiB, 8-way set associative
    - 64 sets, 64 B line size
    - shared by the two threads, per core
- L1D Cache:
  - 32 KiB, 8-way set associative
  - 64 sets, 64 B line size
  - shared by the two threads, per core
  - 4 cycles for fastest load-to-use (simple pointer accesses)
    - 5 cycles for complex addresses
  - 128 B/cycle load bandwidth
  - 64 B/cycle store bandwidth
  - Write-back policy
- L2 Cache:
  - Unified, 256 KiB, 4-way set associative
  - 1024 sets, 64 B line size
  - Non-inclusive
  - 12 cycles for fastest load-to-use
  - 64 B/cycle bandwidth to L1$
  - Write-back policy
- L3 Cache/LLC:
  - Up to 2 MiB per core, shared across all cores
  - Up to 16-way set associative
  - Inclusive
  - 64 B line size
  - Write-back policy
  - Per core:
    - Read: 32 B/cycle (@ ring clock)
    - Write: 32 B/cycle (@ ring clock)
  - 42 cycles for fastest load-to-use
- System DRAM:
  - 2 Channels
  - 8 B/cycle/channel (@ memory clock)
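
The set counts in the cache list follow from capacity, associativity, and line size: sets = capacity / (ways × line size). The sketch below re-derives the listed values as a consistency check; the helper name `num_sets` is an illustrative assumption, and for the µOP cache the capacity and line size are counted in µOPs rather than bytes.

```python
KIB = 1024


def num_sets(capacity: int, ways: int, line_size: int) -> int:
    """Number of sets in a set-associative cache.

    capacity and line_size must use the same unit (bytes, or uops for the
    uop cache). Raises ValueError if the geometry does not divide evenly.
    """
    sets, remainder = divmod(capacity, ways * line_size)
    if remainder:
        raise ValueError("capacity is not a multiple of ways * line_size")
    return sets


# (capacity, ways, line size) taken from the list above.
caches = {
    "L0 uop cache": (1536, 8, 6),       # units: uops
    "L1I": (32 * KIB, 8, 64),           # units: bytes
    "L1D": (32 * KIB, 8, 64),
    "L2": (256 * KIB, 4, 64),
}

for name, (capacity, ways, line_size) in caches.items():
    print(f"{name}: {num_sets(capacity, ways, line_size)} sets")
```

The results (32, 64, 64, and 1024 sets) match the figures given in the list.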

The Palm Cove TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

TLBs:
- ITLB
  - 4 KiB page translations:
    - 64 entries; 8-way set associative
    - dynamic partitioning
  - 2 MiB / 4 MiB page translations:
    - 8 entries per thread; fully associative
    - Duplicated for each thread
- DTLB
  - 4 KiB page translations:
    - 64 entries; 4-way set associative
    - fixed partition
  - 2 MiB / 4 MiB page translations:
    - 32 entries; 4-way set associative
    - fixed partition
  - 1 GiB page translations:
    - 4 entries; 4-way set associative
    - fixed partition
- STLB
  - 4 KiB + 2 MiB page translations:
    - 1536 entries; 12-way set associative
    - fixed partition
  - 1 GiB page translations:
    - 16 entries; 4-way set associative
    - fixed partition
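
A useful way to read the entry counts above is as TLB reach, the amount of memory that can be mapped without a miss: reach = entries × page size. The following is a rough sketch of that calculation; the function name `tlb_reach` is an assumption, and because the STLB's 1536 entries are shared between 4 KiB and 2 MiB pages, the two STLB figures are upper bounds that cannot both be reached at once.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every entry maps a distinct page."""
    return entries * page_size


# (entries, page size) taken from the TLB list above.
structures = {
    "ITLB, 4 KiB pages": (64, 4 * KIB),
    "DTLB, 4 KiB pages": (64, 4 * KIB),
    "DTLB, 2 MiB pages": (32, 2 * MIB),
    "DTLB, 1 GiB pages": (4, 1 * GIB),
    "STLB, 4 KiB pages": (1536, 4 * KIB),
    "STLB, 2 MiB pages": (1536, 2 * MIB),
    "STLB, 1 GiB pages": (16, 1 * GIB),
}

for name, (entries, page_size) in structures.items():
    print(f"{name}: {tlb_reach(entries, page_size) / MIB:g} MiB")
```

For example, the 64-entry DTLB covers only 256 KiB with 4 KiB pages, while the STLB covers up to 6 MiB with 4 KiB pages or up to 3 GiB with 2 MiB pages.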

Overview

Palm Cove is the core microarchitecture found in Intel's Cannon Lake SoCs. Although originally intended for mass production across all client and server markets, Palm Cove is being skipped, with the exception of a single chip, because of Intel's prolonged 10 nm process problems.

@@ Line 7: / Line 7: @@
 |introduction=2018
 |process=10 nm
+|cores=2
+|type=Superscalar
+|oooe=Yes
+|speculative=Yes
+|renaming=Yes
+|stages min=14
+|stages max=19
 |isa=x86-64
+|extension=MOVBE
+|extension 2=MMX
+|extension 3=SSE
+|extension 4=SSE2
+|extension 5=SSE3
+|extension 6=SSSE3
+|extension 7=SSE4.1
+|extension 8=SSE4.2
+|extension 9=POPCNT
+|extension 10=AVX
+|extension 11=AVX2
+|extension 12=AES
+|extension 13=PCLMUL
+|extension 14=FSGSBASE
+|extension 15=RDRND
+|extension 16=FMA3
+|extension 17=F16C
+|extension 18=BMI
+|extension 19=BMI2
+|extension 20=VT-x
+|extension 21=VT-d
+|extension 22=TXT
+|extension 23=TSX
+|extension 24=RDSEED
+|extension 25=ADCX
+|extension 26=PREFETCHW
+|extension 27=CLFLUSHOPT
+|extension 28=XSAVE
+|extension 29=SGX
+|extension 30=MPX
+|extension 31=AVX-512
+|l1i=32 KiB
+|l1i per=core
+|l1i desc=8-way set associative
+|l1d=32 KiB
+|l1d per=core
+|l1d desc=8-way set associative
+|l2=256 KiB
+|l2 per=core
+|l2 desc=4-way set associative
+|l3=2 MiB
+|l3 per=core
+|l3 desc=16-way set associative
 |predecessor=Skylake
 |predecessor link=intel/microarchitectures/skylake
@@ Line 19: / Line 69: @@
 == Architecture ==
-=== Key changes from {{\\|Skylake (Server)}}===
+=== Key changes from {{\\|Skylake (Client)}}===
-{{future information}}
+* [[10 nm process]] (From [[14 nm]])
+* Front End
+** LSD is re-enabled (See {{\\|skylake_(server)#Front-end|Skylake § Front-end}} for details)
+** 50% smaller L1 instruction cache 4K page TLB (64-entry, down from 128)
+* Back-end
+** Execution units
+*** Port 4 now performs 512b stores (from 256b)
+*** New 512b FMA unit on Port 0
+*** New iDIV unit
+* Memory subsystem
+** Store is now 64B/cycle (from 32B/cycle)
+** Load is now 2x64B/cycle (from 2x32B/cycle)
+{{expand list}}
+==== New instructions ====
+Cannon Lake introduced a number of {{x86|extensions|new instructions}}:
+* {{x86|AVX-512|<code>AVX-512</code>}}, specifically:
+** {{x86|AVX512F|<code>AVX512F</code>}} - AVX-512 Foundation
+** {{x86|AVX512CD|<code>AVX512CD</code>}} - AVX-512 Conflict Detection
+** {{x86|AVX512BW|<code>AVX512BW</code>}} - AVX-512 Byte and Word
+** {{x86|AVX512DQ|<code>AVX512DQ</code>}} - AVX-512 Doubleword and Quadword
+** {{x86|AVX512VL|<code>AVX512VL</code>}} - AVX-512 Vector Length
+** {{x86|AVX512IFMA|<code>AVX512IFMA</code>}} - AVX-512 Integer Fused Multiply-Add
+** {{x86|AVX512VBMI|<code>AVX512VBMI</code>}} - AVX-512 Vector Bit Manipulation
+* {{x86|SHA|<code>SHA</code>}} - [[Hardware acceleration]] for SHA hashing operations
+* {{x86|UMIP|<code>UMIP</code>}} - User-Mode Instruction Prevention extension
+=== Memory Hierarchy ===
+Other than a few organizational changes (e.g. L2$ went from 8-way to 4-way set associative), the overall memory structure is identical to {{\\|Broadwell}}/{{\\|Haswell}}.
+* Cache
+** L0 µOP cache:
+*** 1,536 µOPs, 8-way set associative
+**** 32 sets, 6-µOP line size
+**** statically divided between threads, per core, inclusive with L1I
+** L1I Cache:
+*** 32 [[KiB]], 8-way set associative
+**** 64 sets, 64 B line size
+**** shared by the two threads, per core
+** L1D Cache:
+*** 32 KiB, 8-way set associative
+*** 64 sets, 64 B line size
+*** shared by the two threads, per core
+*** 4 cycles for fastest load-to-use (simple pointer accesses)
+**** 5 cycles for complex addresses
+*** 128 B/cycle load bandwidth
+*** 64 B/cycle store bandwidth
+*** Write-back policy
+** L2 Cache:
+*** Unified, 256 KiB, 4-way set associative
+*** 1024 sets, 64 B line size
+*** Non-inclusive
+*** 12 cycles for fastest load-to-use
+*** 64 B/cycle bandwidth to L1$
+*** Write-back policy
+** L3 Cache/LLC:
+*** Up to 2 MiB Per core, shared across all cores
+*** Up to 16-way set associative
+*** Inclusive
+*** 64 B line size
+*** Write-back policy
+*** Per each core:
+**** Read: 32 B/cycle (@ ring [[clock]])
+**** Write: 32 B/cycle (@ ring clock)
+*** 42 cycles for fastest load-to-use
+** System [[DRAM]]:
+*** 2 Channels
+*** 8 B/cycle/channel (@ memory clock)
+Palm Cove TLB consists of dedicated L1 TLB for instruction cache (ITLB) and another one for data cache (DTLB). Additionally there is a unified L2 TLB (STLB).
+* TLBs:
+** ITLB
+*** 4 KiB page translations:
+**** 64 entries; 8-way set associative
+**** dynamic partitioning
+*** 2 MiB / 4 MiB page translations:
+**** 8 entries per thread; fully associative
+**** Duplicated for each thread
+** DTLB
+*** 4 KiB page translations:
+**** 64 entries; 4-way set associative
+**** fixed partition
+*** 2 MiB / 4 MiB page translations:
+**** 32 entries; 4-way set associative
+**** fixed partition
+*** 1G page translations:
+**** 4 entries; 4-way set associative
+**** fixed partition
+** STLB
+*** 4 KiB + 2 MiB page translations:
+**** 1536 entries; 12-way set associative
+**** fixed partition
+*** 1 GiB page translations:
+**** 16 entries; 4-way set associative
+**** fixed partition
+== Overview ==
+Palm Cove is the core microarchitecture that is found in Intel's {{\\|Cannon Lake}} SoCs. Although originally intended to be mass manufactured for all client and server markets, due to Intel's prolonged [[10 nm process]] problems, Palm Cove is getting skipped with the exception of a single chip.
+== See also ==
+* {{intel|Cannon Lake|l=arch}}

codename	Palm Cove +
core count	2 +
designer	Intel +
first launched	2018 +
full page name	intel/microarchitectures/palm cove +
instance of	microarchitecture +
instruction set architecture	x86-64 +
manufacturer	Intel +
microarchitecture type	CPU +
name	Palm Cove +
pipeline stages (max)	19 +
pipeline stages (min)	14 +
process	10 nm (0.01 μm, 1.0e-5 mm) +
