 
{| class="wikitable"
|-
! Core !! Abbrev !! Platform !! Target
|-
| {{intel|Skylake SP|l=core}} || SKL-SP || {{intel|Purley|l=platform}} || Server Scalable Processors
|-
| {{intel|Skylake X|l=core}} || SKL-X || {{intel|Basin Falls|l=platform}} || High-end desktops & enthusiasts market
|-
| {{intel|Skylake W|l=core}} || SKL-W || {{intel|Basin Falls|l=platform}} || Enterprise/Business workstations
|-
| {{intel|Skylake DE|l=core}} || SKL-DE || || Dense server/edge computing
|}
== Process Technology ==
{{main|14 nm lithography process}}

Unlike mainstream Skylake models, all Skylake server configuration models are fabricated on Intel's [[14 nm process#Intel|enhanced 14+ nm process]], which is also used by {{\\|Kaby Lake}}.
  
! Core !! Extended<br>Family !! Family !! Extended<br>Model !! Model
|-
| rowspan="2" | {{intel|Skylake X|X|l=core}}, {{intel|Skylake SP|SP|l=core}}, {{intel|Skylake DE|DE|l=core}}, {{intel|Skylake W|W|l=core}} || 0 || 0x6 || 0x5 || 0x5
|-
| colspan="4" | Family 6 Model 85
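
The family/model signature above can be read from CPUID leaf 1. A minimal C sketch (GCC/Clang, using the <code><cpuid.h></code> helper; the program structure is illustrative) that decodes the display family and model:

<syntaxhighlight lang="c">
#include <cpuid.h>   // GCC/Clang CPUID helper
#include <stdio.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;

    unsigned base_family = (eax >> 8) & 0xF;
    unsigned base_model  = (eax >> 4) & 0xF;
    unsigned ext_family  = (eax >> 20) & 0xFF;
    unsigned ext_model   = (eax >> 16) & 0xF;

    // Per Intel's convention, the extended family only participates for
    // family 0xF, and the extended model for families 0x6 and 0xF.
    unsigned family = (base_family == 0xF) ? base_family + ext_family
                                           : base_family;
    unsigned model  = (base_family == 0x6 || base_family == 0xF)
                          ? (ext_model << 4) | base_model
                          : base_model;

    printf("Family %u Model %u\n", family, model);
    if (family == 6 && model == 85)   // 0x55: Skylake server parts
        printf("Skylake (server) detected\n");
    return 0;
}
</syntaxhighlight>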

== Architecture ==
The Skylake server configuration introduces a number of significant changes from both Intel's previous microarchitecture, {{\\|Broadwell}}, and the {{\\|Skylake (client)}} architecture. Unlike the client models, Skylake server and HEDT models still incorporate the fully integrated voltage regulator (FIVR) on-die. Those chips also have an entirely new multi-core system architecture built around a new {{intel|mesh interconnect}} network (replacing the [[ring topology]]).
  
 
=== Key changes from {{\\|Broadwell}} ===
 
* Improved "14 nm+" process (see {{\\|kaby_lake#Process_Technology|Kaby Lake § Process Technology}})
* {{intel|Omni-Path Architecture}} (OPA)
* {{intel|Mesh architecture}} (from {{intel|Ring architecture|ring}})
** {{intel|Sub-NUMA Clustering}} (SNC) support (replaces the {{intel|Cluster-on-Die}} (COD) implementation)
* Chipset
** DMI upgraded to Gen3
* Core
** All the changes from Skylake Client (For full list, see {{\\|Skylake (Client)#Key changes from Broadwell|Skylake (Client) § Key changes from Broadwell}})
** Front End
*** LSD is disabled (likely due to a bug; see [[#Front-end|§ Front-end]] for details)
*** Larger legacy pipeline delivery (5 µOPs, up from 4)
**** Another simple decoder has been added
*** Allocation Queue (IDQ)
**** Larger delivery (6 µOPs, up from 4)
**** 2.28x larger buffer (64 µOP entries/thread, up from 28/thread)
**** Partitioned for each active thread (from unified)
*** Improved [[branch prediction unit]]
**** Reduced penalty for a wrong direct jump target
**** No other specifics were disclosed
*** µOP Cache
**** Instruction window is now 64 bytes (from 32)
**** 1.5x bandwidth (6 µOPs/cycle, up from 4)
** Execution Engine
*** Larger [[re-order buffer]] (224 entries, up from 192)
*** Larger scheduler (97 entries, up from 64)
*** Larger Integer Register File (180 entries, up from 168)
 
** Back-end
*** Port 4 now performs 512b stores (from 256b)
** L2$
*** Increased to 1 MiB/core (from 256 KiB/core)
*** Latency increased to 14 cycles (from 12; a pointer-chase sketch for observing cache latency is shown after this list)
** L3$
*** Reduced to 1.375 MiB/core (from 2.5 MiB/core)
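
The cache latency deltas above are directly observable in software. A minimal pointer-chasing sketch (x86-64, GCC/Clang; a hypothetical harness, and <code>__rdtsc()</code> counts reference cycles, so the result only approximates core cycles):

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   // __rdtsc()

enum { LINE = 64, SIZE = 512 * 1024, N = SIZE / LINE };  // fits in the 1 MiB L2

int main(void) {
    char *buf = aligned_alloc(LINE, SIZE);
    size_t *order = malloc(N * sizeof *order);
    for (size_t i = 0; i < N; i++) order[i] = i;
    // Fisher-Yates shuffle: a random cyclic chain defeats the prefetchers.
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    // Store in each cache line a pointer to the next line of the chain.
    for (size_t i = 0; i < N; i++)
        *(void **)(buf + order[i] * LINE) = buf + order[(i + 1) % N] * LINE;

    void *p = buf + order[0] * LINE;
    for (size_t i = 0; i < N; i++) p = *(void **)p;        // warm up: touch all lines

    uint64_t t0 = __rdtsc();
    for (size_t i = 0; i < 100 * N; i++) p = *(void **)p;  // dependent loads
    uint64_t t1 = __rdtsc();

    printf("%.1f cycles/load (%p)\n", (double)(t1 - t0) / (100.0 * N), p);
    return 0;
}
</syntaxhighlight>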

==== CPU changes ====
For the full list, see {{\\|Skylake (Client)#CPU changes|Skylake (Client) § CPU changes}}. Highlights:
* Most ALU operations have 4 op/cycle throughput for 8 and 32-bit registers. 64-bit ops are still limited to 3 op/cycle. (16-bit throughput varies per op; it can be 4, 3.5 or 2 op/cycle.)
* MOVSX and MOVZX have 4 op/cycle throughput for the 16→32 and 32→64 forms, in addition to Haswell's 8→32, 8→64 and 16→64 bit forms.
* ADC and SBB have a throughput of 1 op/cycle, same as Haswell.
* Vector moves have a throughput of 4 op/cycle (move elimination).
* Not only the zeroing vector vpXORxx and vpSUBxx ops, but also vPCMPxxx on the same register, have a throughput of 4 op/cycle.
* Vector ALU ops are often "standardized" to a latency of 4. For example, vADDPS and vMULPS used to have latencies of 3 and 5, respectively; now both are 4.
* Fused multiply-add ops have a latency of 4 and a throughput of 0.5 cycles/op (see the unrolling sketch below).
* Throughput of vADDps, vSUBps, vCMPps, vMAXps and their scalar and double analogs is increased to 2 op/cycle.
* Throughput of vPSLxx and vPSRxx with immediate (i.e. fixed vector shifts) is increased to 2 op/cycle.
* Throughput of vANDps, vANDNps, vORps, vXORps, their scalar and double analogs, and vPADDx and vPSUBx is increased to 3 op/cycle.
* vDIVPD and vSQRTPD have approximately twice the throughput: improved from 8 to 4 and from 28 to 12 cycles/op, respectively.
* Throughput of some MMX ALU ops (such as PAND mm1, mm2) is decreased to 2 or 1 op/cycle (users are expected to use wider SSE/AVX registers instead).
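
With a 4-cycle FMA latency and 0.5 cycles/op throughput, 4 / 0.5 = 8 independent FMAs must be in flight to keep both FMA ports busy. A minimal sketch (compile with e.g. <code>-O2 -mfma</code>; the function name and loop bounds are illustrative):

<syntaxhighlight lang="c">
#include <immintrin.h>
#include <stddef.h>

// Sums a[i]*b[i] using eight independent accumulator chains so that
// successive FMAs within one chain are 8 ops (4 cycles) apart and the
// two FMA units can each start an FMA every cycle.
float dot_fma(const float *a, const float *b, size_t n) {
    __m256 acc[8];
    for (int k = 0; k < 8; k++) acc[k] = _mm256_setzero_ps();

    for (size_t i = 0; i + 64 <= n; i += 64)
        for (int k = 0; k < 8; k++)
            acc[k] = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8 * k),
                                     _mm256_loadu_ps(b + i + 8 * k), acc[k]);

    // Reduce the eight partial sums to a single scalar.
    for (int k = 1; k < 8; k++) acc[0] = _mm256_add_ps(acc[0], acc[k]);
    __m128 v = _mm_add_ps(_mm256_castps256_ps128(acc[0]),
                          _mm256_extractf128_ps(acc[0], 1));
    v = _mm_hadd_ps(v, v);
    v = _mm_hadd_ps(v, v);
    return _mm_cvtss_f32(v);
}
</syntaxhighlight>

A single-accumulator version of the same loop would be limited by its dependency chain to one FMA every 4 cycles, i.e. an eighth of peak.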
  
 
==== New instructions ====
Skylake server introduced a number of {{x86|extensions|new instructions}}:

* {{x86|MPX|<code>MPX</code>}} - Memory Protection Extensions
* {{x86|XSAVEC|<code>XSAVEC</code>}} - Save processor extended states with compaction to memory
* {{x86|XSAVES|<code>XSAVES</code>}} - Save processor supervisor-mode extended states to memory
** {{x86|AVX512CD|<code>AVX512CD</code>}} - AVX-512 Conflict Detection
** {{x86|AVX512BW|<code>AVX512BW</code>}} - AVX-512 Byte and Word
** {{x86|AVX512DQ|<code>AVX512DQ</code>}} - AVX-512 Doubleword and Quadword
** {{x86|AVX512VL|<code>AVX512VL</code>}} - AVX-512 Vector Length
* {{x86|PKU|<code>PKU</code>}} - Memory Protection Keys for Userspace
* {{x86|PCOMMIT|<code>PCOMMIT</code>}} - Persistent commit of stores to persistent memory
* {{x86|CLWB|<code>CLWB</code>}} - Force cache line write-back without flush
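
For example, <code>CLWB</code> lets software write a cache line back to (persistent) memory while, unlike <code>CLFLUSH</code>/<code>CLFLUSHOPT</code>, keeping the line valid in the cache. A minimal sketch (GCC/Clang with <code>-mclwb</code>; the pointer and its persistent-memory mapping are assumed):

<syntaxhighlight lang="c">
#include <immintrin.h>
#include <stdint.h>

// Durably publish a counter value living in a persistent-memory mapping.
void persist_counter(uint64_t *pmem_counter, uint64_t value) {
    *pmem_counter = value;    // 1. store the new value
    _mm_clwb(pmem_counter);   // 2. write the line back, keep it cached
    _mm_sfence();             // 3. order the write-back before later stores
}
</syntaxhighlight>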
  
 
=== Block Diagram ===
==== Entire SoC Overview ====
* '''CHA''' - Caching and Home Agent
* '''SF''' - Snoop Filter

===== LCC SoC =====
:[[File:skylake sp lcc block diagram.svg|500px]]
===== HCC SoC =====
:[[File:skylake sp hcc block diagram.svg|600px]]
===== XCC SoC =====
:[[File:skylake sp xcc block diagram.svg|800px]]
===== Individual Core =====
:[[File:skylake server block diagram.svg|850px]]
  
 
=== Memory Hierarchy ===
*** 1.375 MiB/core, 11-way set associative, shared across all cores
**** Note that a few models have non-default cache sizes due to disabled cores
*** 2,048 sets, 64 B line size
*** Non-inclusive victim cache
*** Write-back policy
*** 50-70 cycles latency
** Snoop Filter (SF):
*** 2,048 sets, 12-way set associative
* DRAM
** 6 channels of DDR4, up to 2666 MT/s
*** RDIMM and LRDIMM
*** 21.33 GB/s bandwidth per channel
*** 128 GB/s aggregate bandwidth
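
The per-channel and aggregate figures follow from the transfer rate and the 8-byte (64-bit) DDR4 channel width:

:<math>2666\,\tfrac{\text{MT}}{\text{s}} \times 8\,\tfrac{\text{B}}{\text{transfer}} \approx 21.33\,\tfrac{\text{GB}}{\text{s}}\ \text{per channel}, \qquad 6 \times 21.33\,\tfrac{\text{GB}}{\text{s}} \approx 128\,\tfrac{\text{GB}}{\text{s}}</math>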
 
  
 
The Skylake TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).
**** fixed partition
*** 1G page translations:
**** 4 entries; 4-way set associative
**** fixed partition
** STLB
*** 4 KiB + 2 MiB page translations:
**** 1536 entries; 12-way set associative. (Note: the STLB is incorrectly reported as "6-way" by CPUID leaf 2 (EAX=02H); Skylake erratum SKL148 recommends that software simply ignore that value.)
**** fixed partition
*** 1 GiB page translations:
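
Large pages let a handful of these TLB entries cover a big working set. A minimal Linux sketch (hypothetical sizes; 1 GiB pages must be reserved via hugetlbfs beforehand) that backs a buffer with a single 1 GiB page, and hence a single 1G TLB entry:

<syntaxhighlight lang="c">
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << 26)   // log2(1 GiB) << MAP_HUGE_SHIFT
#endif

int main(void) {
    size_t size = 1UL << 30;      // 1 GiB
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                     -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    // ... the whole buffer is now reachable through one 1G TLB entry ...
    munmap(buf, size);
    return 0;
}
</syntaxhighlight>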
 
Intel has been experiencing a growing divergence in functionality over the last several iterations of [[intel/microarchitectures|their microarchitecture]] between their mainstream consumer products and their high-end HPC/server models. Traditionally, Intel has used the same exact core design for everything from their lowest-end value models (e.g. {{intel|Celeron}}) all the way up to the highest-performance enterprise models (e.g. {{intel|Xeon E7}}). While the two have fundamentally different chip architectures, they use the same exact CPU core architecture as the building block.

This design philosophy has changed with Skylake. In order to better accommodate the different functionalities of each segment without sacrificing features or making unnecessary compromises, Intel went with a configurable core. The Skylake core is a single development project, making up a master superset core. The project results in two derivatives: one for servers (the substance of this article) and {{\\|skylake (client)|one for clients}}. All mainstream models (from {{intel|Celeron}}/{{intel|Pentium (2009)|Pentium}} all the way up to {{intel|Core i7}}/{{intel|Xeon E3}}) use {{\\|skylake (client)|the client core configuration}}. Server models (e.g. {{intel|Xeon Gold}}/{{intel|Xeon Platinum}}) use the new server configuration instead.

The server core is considerably larger than the client one, featuring [[Advanced Vector Extensions 512]] (AVX-512). Skylake servers support what was formerly called AVX3.2 (AVX512F + AVX512CD + AVX512BW + AVX512DQ + AVX512VL). The server core also incorporates a number of new technologies not found in the client configuration. In addition to the execution units that were added, the cache hierarchy has changed as well, incorporating a larger L2, a portion of the LLC, and the caching and home agent (CHA) and snoop filter (SF) that accommodate the new cache organization.

Below is a visual that helps show how the server core evolved from the client core.

[[File:skylake sp added cach and vpu.png|left|300px]]
This is the first implementation to incorporate {{x86|AVX-512}}, a 512-bit [[SIMD]] [[x86]] instruction set extension. For 512-bit wide FMA SIMD operations, Intel introduced two different mechanisms:

In the simple implementation, used in the {{intel|Xeon Bronze|entry-level}} and {{intel|Xeon Silver|mid-range}} Xeon servers, AVX-512 fuses Port 0 and Port 1 to form a 512-bit FMA unit. Since those two ports are 256 bits wide, an AVX-512 operation that is dispatched by the scheduler to port 0 executes on both ports. Note that unrelated operations can still execute in parallel. For example, an AVX-512 operation and an integer ALU operation may execute in parallel - the AVX-512 operation is dispatched on port 0 (making use of the SIMD unit on port 1 as well) while the integer ALU operation executes independently in parallel on port 1.

In the {{intel|Xeon Gold|high-end}} and {{intel|Xeon Platinum|highest}}-performance Xeons, Intel added a second, dedicated 512-bit wide AVX-512 FMA unit in addition to the fused Port 0-1 operation described above. The dedicated unit is situated on Port 5.

Physically, Intel added 768 KiB of L2 cache and the second AVX-512 VPU externally to the core.
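
A minimal AVX-512 sketch (compile with e.g. <code>-O2 -mavx512f</code>; the function name and unroll factor are illustrative). Because the two FMAs issued per iteration are independent, a dual-FMA part (Gold/Platinum) can dispatch one to the fused port 0-1 pair and one to port 5 each cycle, while a single-FMA part simply serializes them:

<syntaxhighlight lang="c">
#include <immintrin.h>
#include <stddef.h>

// y[i] += a * x[i], two 512-bit FMAs (32 floats) per iteration.
void saxpy512(float *y, const float *x, float a, size_t n) {
    __m512 va = _mm512_set1_ps(a);
    for (size_t i = 0; i + 32 <= n; i += 32) {
        __m512 y0 = _mm512_fmadd_ps(va, _mm512_loadu_ps(x + i),
                                    _mm512_loadu_ps(y + i));
        __m512 y1 = _mm512_fmadd_ps(va, _mm512_loadu_ps(x + i + 16),
                                    _mm512_loadu_ps(y + i + 16));
        _mm512_storeu_ps(y + i, y0);
        _mm512_storeu_ps(y + i + 16, y1);
    }
}
</syntaxhighlight>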

=== Mode-Based Execute (MBE) Control ===
'''Mode-Based Execute''' ('''MBE''') is an enhancement to the Extended Page Tables (EPT) that provides a finer level of control over execute permissions. With MBE, the previous Execute Enable (''X'') bit is turned into Execute Userspace page (XU) and Execute Supervisor page (XS). The processor selects the mode based on the guest page permission. With proper software support, hypervisors can take advantage of this to ensure the integrity of kernel-level code.
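
A minimal sketch of the EPT permission bits involved (bit positions per the Intel SDM; the macro names and example policies are illustrative):

<syntaxhighlight lang="c">
#include <stdint.h>

#define EPT_READ    (1ULL << 0)
#define EPT_WRITE   (1ULL << 1)
#define EPT_EXEC_XS (1ULL << 2)   // supervisor-mode execute (plain X when MBE is off)
#define EPT_EXEC_XU (1ULL << 10)  // user-mode execute, honored only with MBE enabled

// Guest user code: executable from user mode only.
static const uint64_t user_code_perms   = EPT_READ | EPT_EXEC_XU;
// Guest kernel code: executable from supervisor mode only.
static const uint64_t kernel_code_perms = EPT_READ | EPT_EXEC_XS;
</syntaxhighlight>

With such a split, a hypervisor can keep verified kernel pages XS-only, so that injected user-mode pages cannot be executed at supervisor privilege.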
  
 
== Mesh Architecture ==
  
 
=== Organization ===
[[File:skylake (server) half rings.png|right|400px]]
Each die has a grid of converged mesh stops (CMS). For example, the XCC die has 36 CMSs. As the name implies, the CMS is a block that effectively interfaces between the various subsystems and the mesh interconnect. The locations of the CMSes for the large core count die are shown on the diagram below. It should be pointed out that although the CMS appears to be inside the core tiles, most of the mesh is likely routed above the cores, in a similar fashion to how Intel wired the ring interconnect above the caches in order to reduce die area.

  
 
==== Sub-NUMA Clustering ====
In previous generations, Intel had a feature called {{intel|cluster-on-die}} (COD), which was introduced with {{intel|Haswell|l=arch}}. With Skylake, there's a similar feature called {{intel|sub-NUMA cluster}} (SNC). With a memory controller physically located on each side of the die, SNC allows for the creation of two localized domains, with each memory controller belonging to one domain. The processor can then map the addresses from the controller to the distributed home agents and LLC banks in its domain. This allows executing code to experience lower LLC and memory latency within its domain compared to accesses outside of the domain.

It should be pointed out that in contrast to COD, SNC has a unique location for every address in the LLC and is never duplicated across LLC banks (previously, COD cache lines could have copies). Additionally, on multiprocessor systems, addresses mapped to memory on remote sockets are still uniformly distributed across all LLC banks irrespective of the localized SNC domain.
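
With SNC enabled, each domain is exposed to the operating system as a NUMA node, so standard NUMA-aware allocation applies. A minimal Linux sketch (libnuma, link with <code>-lnuma</code>; sizes are illustrative) that keeps a working set in the caller's own SNC domain:

<syntaxhighlight lang="c">
#define _GNU_SOURCE
#include <numa.h>      // libnuma allocation helpers
#include <sched.h>     // sched_getcpu()
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

    int node = numa_node_of_cpu(sched_getcpu());  // the local SNC domain
    size_t size = 64UL << 20;                     // 64 MiB working set
    void *buf = numa_alloc_onnode(size, node);    // backed by local memory
    if (!buf) return 1;

    printf("allocated on node %d\n", node);
    // ... access buf from threads pinned to this domain for the lower
    //     LLC/memory latency described above ...
    numa_free(buf, size);
    return 0;
}
</syntaxhighlight>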
  
 
== Scalability ==
{{see also|intel/quickpath interconnect|intel/ultra path interconnect|l1=QuickPath Interconnect|l2=Ultra Path Interconnect}}
In the last couple of generations, Intel has been utilizing the {{intel|QuickPath Interconnect}} (QPI), which served as a high-speed point-to-point interconnect. QPI has been replaced by the {{intel|Ultra Path Interconnect}} (UPI), a higher-efficiency coherent interconnect for scalable systems that allows multiple processors to share a single address space. Depending on the exact model, each processor can have either two or three UPI links connecting it to the other processors.

UPI links eliminate some of the scalability limitations that surfaced in QPI over the past few microarchitecture iterations. They use a directory-based home snoop coherency protocol and operate at either 10.4 GT/s or 9.6 GT/s. This is quite a bit different from previous generations. In addition to the various improvements made to the protocol layer, {{intel|Skylake SP|l=core}} now implements a distributed CHA that is situated along with the LLC bank on each core. It is in charge of tracking the various requests from the core as well as responding to snoop requests from both local and remote agents. The ease of distributing the home agent is a result of Intel getting rid of the requirement for preallocation of resources at the home agent. This also means that future architectures should be able to scale up well.

Depending on the exact model, Skylake processors can scale from 2-way all the way up to 8-way multiprocessing. Note that the high-end models that support 8-way multiprocessing only come with three UPI links, while the lower-end processors can have either two or three UPI links. Below are the typical configurations for those processors.
 
==== Layout ====
:[[File:skylake (server) die area layout.svg|600px]]

==== Evolution ====
The original Skylake large die started out as a 5 by 5 grid of core tiles (25 tiles, 25 cores), as shown in the image from Intel on the left side. The memory controllers were next to the PHYs on the east and west sides. An additional row was inserted to get to a 5 by 6 grid. Two core tiles, one from each side, were then replaced by the new memory controller modules, which interface with the mesh just like any other core tile. The final die is shown on the right side of the image below.

:[[File:skylaake server layout evoluation.png|800px]]

== Die ==
  
 
:[[File:skylake sp memory phys (annotated).png|700px]]

=== Core Tile ===
* ~4.8375 mm × 3.7163 mm
* ~17.978 mm² die area

:[[File:skylake sp core.png|500px]]

:[[File:skylake sp mesh core tile zoom.png|700px]]

=== Low Core Count (LCC) ===
 
* Intel Unveils Powerful Intel Xeon Scalable Processors, Live Event, July 11, 2017
* [[:File:intel xeon scalable processor architecture deep dive.pdf|Intel Xeon Scalable Processor Architecture Deep Dive]], Akhilesh Kumar & Malay Trivedi, Skylake-SP CPU & Lewisburg PCH Architects, June 12, 2017
* IEEE Hot Chips 29 (HC29), 2017
* IEEE ISSCC 2018

== Documents ==
