From WikiChip
Difference between revisions of "intel/microarchitectures/palm cove"

Latest revision as of 13:56, 9 May 2019

Palm Cove µarch
General Info
  Arch Type: CPU
  Designer: Intel
  Manufacturer: Intel
  Introduction: 2018
  Process: 10 nm
  Core Configs: 2
Pipeline
  Type: Superscalar
  OoOE: Yes
  Speculative: Yes
  Reg Renaming: Yes
  Stages: 14-19
Instructions
  ISA: x86-64
  Extensions: MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA3, F16C, BMI, BMI2, VT-x, VT-d, TXT, TSX, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVE, SGX, MPX, AVX-512
Cache
  L1I Cache: 32 KiB/core, 8-way set associative
  L1D Cache: 32 KiB/core, 8-way set associative
  L2 Cache: 256 KiB/core, 4-way set associative
  L3 Cache: 2 MiB/core, 16-way set associative
Succession
  Predecessor: Skylake

Palm Cove is a high-performance 10 nm x86 core microarchitecture designed by Intel for an array of server and client products.

Process Technology

Palm Cove is designed to take advantage of Intel's 10 nm process.

Architecture

Key changes from Skylake (Client)

  • 10 nm process (From 14 nm)
  • Front End
    • LSD is re-enabled (See Skylake § Front-end for details)
    • L1 instruction TLB for 4 KiB pages is 50% smaller (64 entries, down from 128)
  • Back-end
    • Execution units
      • Port 4 now performs 512-bit stores (up from 256-bit)
      • New 512-bit FMA unit on Port 0
      • New iDIV unit
  • Memory subsystem
    • Store is now 64B/cycle (from 32B/cycle)
    • Load is now 2x64B/cycle (from 2x32B/cycle)
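
The per-cycle figures above translate into peak L1 bandwidth once a core clock is fixed. The following is a minimal Python sketch of that arithmetic. The 2 GHz clock is a hypothetical value chosen only for illustration, not a Palm Cove specification, and the function name is likewise an assumption.

```python
# Peak bandwidth = bytes moved per cycle * cycles per second.
# ASSUMED_CLOCK_HZ is a hypothetical clock, not a documented frequency.
ASSUMED_CLOCK_HZ = 2_000_000_000


def peak_bytes_per_second(bytes_per_cycle, clock_hz):
    """Return the theoretical peak throughput in bytes per second."""
    return bytes_per_cycle * clock_hz


# Per-cycle widths taken from the list above.
palm_cove_load = peak_bytes_per_second(2 * 64, ASSUMED_CLOCK_HZ)
palm_cove_store = peak_bytes_per_second(64, ASSUMED_CLOCK_HZ)
skylake_load = peak_bytes_per_second(2 * 32, ASSUMED_CLOCK_HZ)
skylake_store = peak_bytes_per_second(32, ASSUMED_CLOCK_HZ)

if __name__ == "__main__":
    print("load  GB/s:", palm_cove_load / 1e9, "vs", skylake_load / 1e9)
    print("store GB/s:", palm_cove_store / 1e9, "vs", skylake_store / 1e9)
```

At any fixed clock the ratio is the same: both load and store peak bandwidth double relative to the 32 B/cycle paths.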

New instructions

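Cannon Lake introduced a number of new instructions, including:

  • SHA - Hardware acceleration for SHA hashing operations
  • UMIP - User-Mode Instruction Prevention extension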
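
Software normally checks for such extensions at run time rather than assuming them from the processor model. As a rough sketch, assuming a Linux system where `/proc/cpuinfo` reports the SHA extensions as `sha_ni` and User-Mode Instruction Prevention as `umip`, the flags line can be parsed as follows. The helper names are hypothetical and not part of any library.

```python
def parse_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_new_instructions(cpuinfo_text):
    """Report whether the SHA and UMIP flags are present."""
    flags = parse_flags(cpuinfo_text)
    return {"SHA": "sha_ni" in flags, "UMIP": "umip" in flags}


# Hand-written sample text for illustration, not output from real hardware.
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse2 avx2 sha_ni umip\n"

if __name__ == "__main__":
    print(has_new_instructions(SAMPLE))
```

On a real Linux machine the text would come from reading `/proc/cpuinfo` instead of the sample string.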

Memory Hierarchy

Other than a few organizational changes (e.g. L2$ went from 8-way to 4-way set associative), the overall memory structure is identical to Broadwell/Haswell.

  • Cache
    • L0 µOP cache:
      • 1,536 µOPs, 8-way set associative
        • 32 sets, 6-µOP line size
        • statically divided between threads, per core, inclusive with L1I
    • L1I Cache:
      • 32 KiB, 8-way set associative
        • 64 sets, 64 B line size
        • shared by the two threads, per core
    • L1D Cache:
      • 32 KiB, 8-way set associative
      • 64 sets, 64 B line size
      • shared by the two threads, per core
      • 4 cycles for fastest load-to-use (simple pointer accesses)
        • 5 cycles for complex addresses
      • 128 B/cycle load bandwidth
      • 64 B/cycle store bandwidth
      • Write-back policy
    • L2 Cache:
      • Unified, 256 KiB, 4-way set associative
      • 1024 sets, 64 B line size
      • Non-inclusive
      • 12 cycles for fastest load-to-use
      • 64 B/cycle bandwidth to L1$
      • Write-back policy
    • L3 Cache/LLC:
      • Up to 2 MiB per core, shared across all cores
      • Up to 16-way set associative
      • Inclusive
      • 64 B line size
      • Write-back policy
      • Per core:
        • Read: 32 B/cycle (@ ring clock)
        • Write: 32 B/cycle (@ ring clock)
      • 42 cycles for fastest load-to-use
    • System DRAM:
      • 2 Channels
      • 8 B/cycle/channel (@ memory clock)
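
The cache geometries listed above are internally consistent: capacity equals sets times ways times line size. The Python sketch below checks this and shows how an address might select a set. It assumes plain modulo indexing on the line address, which is a simplification (real hardware, the L3 in particular, may hash address bits), and the function names are illustrative only.

```python
KIB = 1024


def cache_capacity(sets, ways, line_size):
    """Capacity of a set-associative structure: sets * ways * line size."""
    return sets * ways * line_size


def set_index(address, sets, line_bytes=64):
    """Set selected by an address, assuming simple modulo indexing."""
    return (address // line_bytes) % sets


# Geometries taken from the list above.
l1i_bytes = cache_capacity(sets=64, ways=8, line_size=64)
l1d_bytes = cache_capacity(sets=64, ways=8, line_size=64)
l2_bytes = cache_capacity(sets=1024, ways=4, line_size=64)
# The µOP cache is measured in µOPs, with 6 µOPs per line.
uop_cache_uops = cache_capacity(sets=32, ways=8, line_size=6)

if __name__ == "__main__":
    print(l1d_bytes // KIB, "KiB L1D,", l2_bytes // KIB, "KiB L2,",
          uop_cache_uops, "µOPs")
```

With 64 sets and 64 B lines, the L1 set index covers address bits 6 through 11, which lie within the 4 KiB page offset.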

Palm Cove's TLB hierarchy consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

  • TLBs:
    • ITLB
      • 4 KiB page translations:
        • 64 entries; 8-way set associative
        • dynamic partitioning
      • 2 MiB / 4 MiB page translations:
        • 8 entries per thread; fully associative
        • Duplicated for each thread
    • DTLB
      • 4 KiB page translations:
        • 64 entries; 4-way set associative
        • fixed partition
      • 2 MiB / 4 MiB page translations:
        • 32 entries; 4-way set associative
        • fixed partition
      • 1 GiB page translations:
        • 4 entries; 4-way set associative
        • fixed partition
    • STLB
      • 4 KiB + 2 MiB page translations:
        • 1536 entries; 12-way set associative
        • fixed partition
      • 1 GiB page translations:
        • 16 entries; 4-way set associative
        • fixed partition
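
A common way to read these figures is TLB reach, the amount of memory that can be mapped without a miss (entries times page size), together with the number of sets (entries divided by ways). The following Python sketch works through the numbers listed above. The function names are hypothetical, and the calculation ignores how entries are partitioned between threads.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_sets(entries, ways):
    """Number of sets in a set-associative TLB."""
    if entries % ways:
        raise ValueError("entries must be a multiple of ways")
    return entries // ways


def tlb_reach(entries, page_bytes):
    """Memory covered when every entry maps a distinct page."""
    return entries * page_bytes


itlb_4k_sets = tlb_sets(64, 8)
dtlb_4k_sets = tlb_sets(64, 4)
stlb_sets = tlb_sets(1536, 12)

dtlb_4k_reach = tlb_reach(64, 4 * KIB)
stlb_4k_reach = tlb_reach(1536, 4 * KIB)
stlb_2m_reach = tlb_reach(1536, 2 * MIB)
stlb_1g_reach = tlb_reach(16, GIB)

if __name__ == "__main__":
    print("DTLB 4 KiB reach:", dtlb_4k_reach // KIB, "KiB")
    print("STLB 4 KiB reach:", stlb_4k_reach // MIB, "MiB")
    print("STLB 2 MiB reach:", stlb_2m_reach // GIB, "GiB")
    print("STLB 1 GiB reach:", stlb_1g_reach // GIB, "GiB")
```

The gap between the 4 KiB and 2 MiB reach of the STLB illustrates why large pages help workloads with big working sets.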

Overview

Palm Cove is the core microarchitecture found in Intel's Cannon Lake SoCs. Although it was originally intended to be mass-manufactured for all client and server markets, Intel's prolonged 10 nm process problems mean that Palm Cove is being skipped, with the exception of a single chip.

See also