Latest revision |
Your text |
Line 7: |
Line 7: |
| |introduction=2018 | | |introduction=2018 |
| |process=10 nm | | |process=10 nm |
− | |cores=2
| |
− | |type=Superscalar
| |
− | |oooe=Yes
| |
− | |speculative=Yes
| |
− | |renaming=Yes
| |
− | |stages min=14
| |
− | |stages max=19
| |
| |isa=x86-64 | | |isa=x86-64 |
− | |extension=MOVBE
| |
− | |extension 2=MMX
| |
− | |extension 3=SSE
| |
− | |extension 4=SSE2
| |
− | |extension 5=SSE3
| |
− | |extension 6=SSSE3
| |
− | |extension 7=SSE4.1
| |
− | |extension 8=SSE4.2
| |
− | |extension 9=POPCNT
| |
− | |extension 10=AVX
| |
− | |extension 11=AVX2
| |
− | |extension 12=AES
| |
− | |extension 13=PCLMUL
| |
− | |extension 14=FSGSBASE
| |
− | |extension 15=RDRND
| |
− | |extension 16=FMA3
| |
− | |extension 17=F16C
| |
− | |extension 18=BMI
| |
− | |extension 19=BMI2
| |
− | |extension 20=VT-x
| |
− | |extension 21=VT-d
| |
− | |extension 22=TXT
| |
− | |extension 23=TSX
| |
− | |extension 24=RDSEED
| |
− | |extension 25=ADCX
| |
− | |extension 26=PREFETCHW
| |
− | |extension 27=CLFLUSHOPT
| |
− | |extension 28=XSAVE
| |
− | |extension 29=SGX
| |
− | |extension 30=MPX
| |
− | |extension 31=AVX-512
| |
− | |l1i=32 KiB
| |
− | |l1i per=core
| |
− | |l1i desc=8-way set associative
| |
− | |l1d=32 KiB
| |
− | |l1d per=core
| |
− | |l1d desc=8-way set associative
| |
− | |l2=256 KiB
| |
− | |l2 per=core
| |
− | |l2 desc=4-way set associative
| |
− | |l3=2 MiB
| |
− | |l3 per=core
| |
− | |l3 desc=16-way set associative
| |
| |predecessor=Skylake | | |predecessor=Skylake |
| |predecessor link=intel/microarchitectures/skylake | | |predecessor link=intel/microarchitectures/skylake |
Line 69: |
Line 19: |
| | | |
| == Architecture == | | == Architecture == |
− | === Key changes from {{\\|Skylake (Client)}}=== | + | === Key changes from {{\\|Skylake (Server)}}=== |
| * [[10 nm process]] (From [[14 nm]]) | | * [[10 nm process]] (From [[14 nm]]) |
− | * Front End
| |
− | ** LSD is re-enabled (See {{\\|skylake_(server)#Front-end|Skylake § Front-end}} for details)
| |
− | ** L1 ITLB for 4 KiB pages is 50% smaller (64 entries, down from 128)
| |
− | * Back-end
| |
− | ** Execution units
| |
− | *** Port 4 now performs 512b stores (from 256b)
| |
− | *** New 512b FMA unit on Port 0
| |
− | *** New iDIV unit
| |
− | * Memory subsystem
| |
− | ** Store is now 64B/cycle (from 32B/cycle)
| |
− | ** Load is now 2x64B/cycle (from 2x32B/cycle)
| |
− |
| |
| {{expand list}} | | {{expand list}} |
− |
| |
− | ==== New instructions ====
| |
− | Cannon Lake introduced a number of {{x86|extensions|new instructions}}:
| |
− |
| |
− | * {{x86|AVX-512|<code>AVX-512</code>}}, specifically:
| |
− | ** {{x86|AVX512F|<code>AVX512F</code>}} - AVX-512 Foundation
| |
− | ** {{x86|AVX512CD|<code>AVX512CD</code>}} - AVX-512 Conflict Detection
| |
− | ** {{x86|AVX512BW|<code>AVX512BW</code>}} - AVX-512 Byte and Word
| |
− | ** {{x86|AVX512DQ|<code>AVX512DQ</code>}} - AVX-512 Doubleword and Quadword
| |
− | ** {{x86|AVX512VL|<code>AVX512VL</code>}} - AVX-512 Vector Length
| |
− | ** {{x86|AVX512IFMA|<code>AVX512IFMA</code>}} - AVX-512 Integer Fused Multiply-Add
| |
− | ** {{x86|AVX512VBMI|<code>AVX512VBMI</code>}} - AVX-512 Vector Bit Manipulation
| |
− | * {{x86|SHA|<code>SHA</code>}} - [[Hardware acceleration]] for SHA hashing operations
| |
− | * {{x86|UMIP|<code>UMIP</code>}} - User-Mode Instruction Prevention extension
| |
− |
| |
− | === Memory Hierarchy ===
| |
− | Other than a few organizational changes (e.g. L2$ went from 8-way to 4-way set associative), the overall memory structure is identical to {{\\|Broadwell}}/{{\\|Haswell}}.
| |
− |
| |
− | * Cache
| |
− | ** L0 µOP cache:
| |
− | *** 1,536 µOPs, 8-way set associative
| |
− | **** 32 sets, 6-µOP line size
| |
− | **** statically divided between threads, per core, inclusive with L1I
| |
− | ** L1I Cache:
| |
− | *** 32 [[KiB]], 8-way set associative
| |
− | **** 64 sets, 64 B line size
| |
− | **** shared by the two threads, per core
| |
− | ** L1D Cache:
| |
− | *** 32 KiB, 8-way set associative
| |
− | *** 64 sets, 64 B line size
| |
− | *** shared by the two threads, per core
| |
− | *** 4 cycles for fastest load-to-use (simple pointer accesses)
| |
− | **** 5 cycles for complex addresses
| |
− | *** 128 B/cycle load bandwidth
| |
− | *** 64 B/cycle store bandwidth
| |
− | *** Write-back policy
| |
− | ** L2 Cache:
| |
− | *** Unified, 256 KiB, 4-way set associative
| |
− | *** 1024 sets, 64 B line size
| |
− | *** Non-inclusive
| |
− | *** 12 cycles for fastest load-to-use
| |
− | *** 64 B/cycle bandwidth to L1$
| |
− | *** Write-back policy
| |
− | ** L3 Cache/LLC:
| |
− | *** Up to 2 MiB per core, shared across all cores
| |
− | *** Up to 16-way set associative
| |
− | *** Inclusive
| |
− | *** 64 B line size
| |
− | *** Write-back policy
| |
− | *** Per each core:
| |
− | **** Read: 32 B/cycle (@ ring [[clock]])
| |
− | **** Write: 32 B/cycle (@ ring clock)
| |
− | *** 42 cycles for fastest load-to-use
| |
− | ** System [[DRAM]]:
| |
− | *** 2 Channels
| |
− | *** 8 B/cycle/channel (@ memory clock)
| |
− |
| |
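The set counts in the cache list above follow from capacity = sets × ways × line size. The short Python sketch below only re-derives those listed figures as a consistency check; the helper name <code>num_sets</code> is hypothetical and is not part of any Intel tool or of the original article.

```python
# Consistency check for the cache geometry listed above.
# Assumption: capacity = sets * ways * line size (standard set-associative layout).
KIB = 1024


def num_sets(capacity_bytes, ways, line_bytes):
    """Return the number of sets implied by capacity, associativity, and line size."""
    sets, remainder = divmod(capacity_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


# L1I and L1D: 32 KiB, 8-way, 64 B lines
l1_sets = num_sets(32 * KIB, ways=8, line_bytes=64)

# Unified L2: 256 KiB, 4-way, 64 B lines
l2_sets = num_sets(256 * KIB, ways=4, line_bytes=64)

# L0 µOP cache: 32 sets, 8 ways, 6 µOPs per line
uop_capacity = 32 * 8 * 6

print(l1_sets, l2_sets, uop_capacity)
```

The computed values agree with the 64 sets, 1024 sets, and 1,536 µOPs given in the list.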
− | The Palm Cove TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).
| |
− | * TLBs:
| |
− | ** ITLB
| |
− | *** 4 KiB page translations:
| |
− | **** 64 entries; 8-way set associative
| |
− | **** dynamic partitioning
| |
− | *** 2 MiB / 4 MiB page translations:
| |
− | **** 8 entries per thread; fully associative
| |
− | **** Duplicated for each thread
| |
− | ** DTLB
| |
− | *** 4 KiB page translations:
| |
− | **** 64 entries; 4-way set associative
| |
− | **** fixed partition
| |
− | *** 2 MiB / 4 MiB page translations:
| |
− | **** 32 entries; 4-way set associative
| |
− | **** fixed partition
| |
− | *** 1 GiB page translations:
| |
− | **** 4 entries; 4-way set associative
| |
− | **** fixed partition
| |
− | ** STLB
| |
− | *** 4 KiB + 2 MiB page translations:
| |
− | **** 1536 entries; 12-way set associative
| |
− | **** fixed partition
| |
− | *** 1 GiB page translations:
| |
− | **** 16 entries; 4-way set associative
| |
− | **** fixed partition
| |
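One way to read the TLB entry counts above is in terms of "reach", meaning the amount of memory that can be mapped without a miss (entries × page size). The term and the helper <code>tlb_reach</code> are not from the original article; this is a minimal Python sketch that assumes every entry holds a page of the stated size.

```python
# Rough TLB reach from the entry counts listed above.
# Assumption: reach = entries * page size, with every entry in use
# and all entries mapping pages of the same size.
KIB, MIB, GIB = 1024, 1024**2, 1024**3


def tlb_reach(entries, page_bytes):
    """Return the bytes of address space covered by a fully populated TLB."""
    return entries * page_bytes


dtlb_4k = tlb_reach(64, 4 * KIB)      # L1 DTLB, 4 KiB pages
stlb_4k = tlb_reach(1536, 4 * KIB)    # STLB, all entries holding 4 KiB pages
stlb_2m = tlb_reach(1536, 2 * MIB)    # STLB, all entries holding 2 MiB pages
stlb_1g = tlb_reach(16, 1 * GIB)      # STLB, 1 GiB pages

# 1536 entries at 12-way associativity implies this many sets
stlb_sets = 1536 // 12

print(dtlb_4k // KIB, "KiB;", stlb_4k // MIB, "MiB;",
      stlb_2m // GIB, "GiB;", stlb_1g // GIB, "GiB;", stlb_sets, "sets")
```

Because the STLB shares its 1536 entries between 4 KiB and 2 MiB pages, the real reach falls somewhere between the two STLB figures, depending on the page mix.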
| | | |
| == Overview == | | == Overview == |
− | Palm Cove is the core microarchitecture that is found in Intel's {{\\|Cannon Lake}} SoCs. Although originally intended to be mass manufactured for all client and server markets, due to Intel's prolonged [[10 nm process]] problems, Palm Cove is getting skipped with the exception of a single chip. | + | Palm Cove is the code microarchitecture that is found in Intel's {{\\|Cannon Lake}} SoCs. Although originally intended to be mass manufactured for all client and server markets, due to Intel's prolonged [[10 nm process]] problems, Palm Cove is getting skipped with the exception of a single chip. |
| | | |
| == See also == | | == See also == |
| * {{intel|Cannon Lake|l=arch}} | | * {{intel|Cannon Lake|l=arch}} |