{| class="wikitable"
|-
! Core !! Abbrev !! Target
|-
| {{intel|Skylake SP|l=core}} || SKL-SP || Server Scalable Processors
|-
| {{intel|Skylake X|l=core}} || SKL-X || High-end desktop & enthusiast market
|-
| {{intel|Skylake W|l=core}} || SKL-W || Enterprise/business workstations
|-
| {{intel|Skylake DE|l=core}} || SKL-DE || Dense server/edge computing
|}
== Process Technology ==
{{main|intel/microarchitectures/kaby lake#Process_Technology|l1=Kaby Lake § Process Technology}}
Unlike mainstream Skylake models, all Skylake server-configuration models are fabricated on Intel's [[14 nm process#Intel|enhanced 14+ nm process]], which is also used by {{\\|Kaby Lake}}.
{| class="wikitable"
|-
! Core !! Extended<br>Family !! Family !! Extended<br>Model !! Model
|-
| rowspan="2" | {{intel|Skylake X|X|l=core}}/{{intel|Skylake SP|SP|l=core}} || 0 || 0x6 || 0x5 || 0x5
|-
| colspan="4" | Family 6 Model 85
|}
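Software can identify this part by decoding the family/model fields that <code>CPUID</code> returns. Below is a minimal C sketch (assuming GCC/Clang's <code>&lt;cpuid.h&gt;</code> helper; the display family/model composition follows the Intel SDM):

<syntaxhighlight lang="c">
/* Minimal sketch: decode the CPUID display family/model and test for
 * Family 6 Model 85 (0x55), i.e. Skylake (server). Assumes GCC/Clang. */
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;

    unsigned family = (eax >> 8) & 0xf;
    unsigned model  = (eax >> 4) & 0xf;
    if (family == 0xf)
        family += (eax >> 20) & 0xff;       /* add extended family */
    if (family == 0x6 || family == 0xf)
        model |= ((eax >> 16) & 0xf) << 4;  /* prepend extended model */

    printf("Family %u Model %u%s\n", family, model,
           (family == 6 && model == 85) ? " (Skylake server)" : "");
    return 0;
}
</syntaxhighlight>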
== Architecture ==
Skylake server configuration introduces a number of significant changes from both Intel's previous microarchitecture, {{\\|Broadwell}}, and the {{\\|Skylake (client)}} architecture. Unlike client models, Skylake server and HEDT models still incorporate the fully integrated voltage regulator (FIVR) on-die. Those chips also have an entirely new multi-core architecture along with a new [[mesh topology]] interconnect network (replacing the previous [[ring topology]]).
=== Key changes from {{\\|Broadwell}} ===
* Improved "14 nm+" process (see {{\\|kaby_lake#Process_Technology|Kaby Lake § Process Technology}})
* {{intel|Omni-Path Architecture}} (OPA)
* Mesh architecture (replacing the ring interconnect)
** {{intel|Sub-NUMA Clustering}} (SNC) support (replaces the {{intel|Cluster-on-Die}} (COD) implementation)
* Chipset
** DMI upgraded to Gen3
* Core
** Front End
*** LSD is disabled (likely due to a bug; see [[#Front-end|§ Front-end]] for details)
*** Larger legacy pipeline delivery (5 µOPs, up from 4)
**** Another simple decoder has been added.
*** Allocation Queue (IDQ)
**** Larger delivery (6 µOPs, up from 4)
**** 2.28x larger buffer (64 µOPs/thread, up from 28/thread)
**** Partitioned for each active thread (from unified)
*** Improved [[branch prediction unit]]
**** Reduced penalty for wrong direct jump target
**** No specifics were disclosed
*** µOP Cache
**** Instruction window is now 64 bytes (from 32)
**** 1.5x bandwidth (6 µOPs/cycle, up from 4)
** Execution Engine
*** Larger [[re-order buffer]] (224 entries, up from 192)
*** Larger scheduler (97 entries, up from 64)
*** Larger Integer Register File (180 entries, up from 168)
** Back-end
*** Port 4 now performs 512b stores (from 256b)
** L2$
*** Increased to 1 MiB/core (from 256 KiB/core)
** L3$
*** Reduced to 1.375 MiB/core (from 2.5 MiB/core)
==== CPU changes ====
* Most ALU operations have a throughput of 4 ops/cycle for 8- and 32-bit registers. 64-bit ops are still limited to 3 ops/cycle. (16-bit throughput varies per op: 4, 3.5, or 2 ops/cycle.)
* MOVSX and MOVZX have 4 ops/cycle throughput for the 16->32 and 32->64 forms, in addition to Haswell's 8->32, 8->64, and 16->64 bit forms.
* ADC and SBB have a throughput of 1 op/cycle, same as Haswell.
* Vector moves have a throughput of 4 ops/cycle (move elimination).
* Not only the zeroing vector vpXORxx and vpSUBxx ops, but also vPCMPxxx on the same register, have a throughput of 4 ops/cycle.
* Vector ALU ops are often "standardized" to a latency of 4 cycles. For example, vADDPS and vMULPS used to have latencies of 3 and 5; now both are 4.
* Fused multiply-add ops have a latency of 4 cycles and a reciprocal throughput of 0.5 cycles/op, i.e. two per cycle (see the sketch after this list).
* Throughput of vADDps, vSUBps, vCMPps, vMAXps, and their scalar and double analogs is increased to 2 ops/cycle.
* Throughput of vPSLxx and vPSRxx with immediate (i.e. fixed vector shifts) is increased to 2 ops/cycle.
* Throughput of vANDps, vANDNps, vORps, vXORps, their scalar and double analogs, and vPADDx, vPSUBx is increased to 3 ops/cycle.
* vDIVPD and vSQRTPD have approximately twice the throughput: from 8 to 4 cycles/op and from 28 to 12 cycles/op, respectively.
* Throughput of some MMX ALU ops (such as PAND mm1, mm2) is decreased to 2 or 1 op/cycle (users are expected to use the wider SSE/AVX registers instead).
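With a 4-cycle FMA latency and one FMA issued per cycle on each of two ports, it takes 4 × 2 = 8 independent dependency chains to keep both units busy. A minimal, hypothetical C sketch of this idea (AVX2/FMA intrinsics; compile with <code>-mavx2 -mfma</code>; the function name and the tail handling are illustrative, not from the source):

<syntaxhighlight lang="c">
/* Sketch: 8 independent accumulators hide the 4-cycle FMA latency
 * across the two FMA ports (latency x ports = 8 chains). */
#include <immintrin.h>
#include <stddef.h>

float dot8(const float *a, const float *b, size_t n) {
    __m256 acc[8];
    for (int j = 0; j < 8; j++)
        acc[j] = _mm256_setzero_ps();

    /* Main loop: 64 floats per iteration, one FMA per chain. */
    size_t i = 0;
    for (; i + 64 <= n; i += 64)
        for (int j = 0; j < 8; j++)
            acc[j] = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8 * j),
                                     _mm256_loadu_ps(b + i + 8 * j),
                                     acc[j]);

    /* Pairwise-reduce the 8 chains into one vector... */
    for (int j = 0; j < 4; j++)
        acc[j] = _mm256_add_ps(acc[j], acc[j + 4]);
    acc[0] = _mm256_add_ps(_mm256_add_ps(acc[0], acc[1]),
                           _mm256_add_ps(acc[2], acc[3]));

    /* ...then horizontally sum it (remaining tail elements are
     * ignored here for brevity). */
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc[0]),
                          _mm256_extractf128_ps(acc[0], 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
</syntaxhighlight>

With a single accumulator the same loop would be bound by the 4-cycle latency (one FMA every 4 cycles); the 8 chains are what allow the quoted 0.5 cycles/op to be reached.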
==== New instructions ====
Skylake server introduced a number of {{x86|extensions|new instructions}}:
* {{x86|MPX|<code>MPX</code>}} - Memory Protection Extensions
* {{x86|XSAVEC|<code>XSAVEC</code>}} - Save processor extended states with compaction to memory
* {{x86|XSAVES|<code>XSAVES</code>}} - Save processor supervisor-mode extended states to memory
* {{x86|AVX-512|<code>AVX-512</code>}} - Advanced Vector Extensions 512:
** {{x86|AVX512F|<code>AVX512F</code>}} - AVX-512 Foundation
** {{x86|AVX512CD|<code>AVX512CD</code>}} - AVX-512 Conflict Detection
** {{x86|AVX512BW|<code>AVX512BW</code>}} - AVX-512 Byte and Word
** {{x86|AVX512DQ|<code>AVX512DQ</code>}} - AVX-512 Doubleword and Quadword
** {{x86|AVX512VL|<code>AVX512VL</code>}} - AVX-512 Vector Length
* {{x86|PKU|<code>PKU</code>}} - Memory Protection Keys for Userspace
* {{x86|PCOMMIT|<code>PCOMMIT</code>}} - Persistent commit
* {{x86|CLWB|<code>CLWB</code>}} - Cache line write back
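To illustrate, a minimal sketch of AVX-512F usage from C via compiler intrinsics (assuming GCC/Clang and <code>-mavx512f</code>; per-lane predication uses the new k-mask registers):

<syntaxhighlight lang="c">
/* Sketch: 512-bit masked add with AVX-512F intrinsics.
 * Even lanes get a[i] + b[i]; odd lanes pass a[i] through. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 100.0f; }

    __m512    va   = _mm512_loadu_ps(a);
    __m512    vb   = _mm512_loadu_ps(b);
    __mmask16 even = 0x5555;  /* k-mask: one bit per 32-bit lane */

    __m512 vc = _mm512_mask_add_ps(va, even, va, vb);  /* masked add */
    _mm512_storeu_ps(c, vc);

    for (int i = 0; i < 16; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
</syntaxhighlight>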
=== Block Diagram ===
==== Entire SoC Overview ====
The diagram below shows the HCC die. The LCC die is identical but without the two bottom rows of cores; the XCC (28-core) die has one additional row and two additional columns of cores but is otherwise identical.

[[File:skylake sp hcc block diagram.svg|650px]]

* '''CHA''' - Caching and Home Agent
* '''SF''' - Snoop Filter

===== Individual Core =====
[[File:skylake server block diagram.svg|950px]]
=== Memory Hierarchy ===
*** 1.375 MiB/core, 11-way set associative, shared across all cores
**** Note that a few models have non-default cache sizes due to disabled cores
*** 64 B line size
*** Non-inclusive victim cache
*** Write-back policy
*** 50-70 cycles latency (see the sketch below)
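The quoted 50-70 cycle L3 latency is the kind of figure one can reproduce with a dependent-load (pointer-chasing) microbenchmark, since each load must complete before the next can begin. A minimal, hypothetical sketch (assuming Linux's <code>clock_gettime</code>; the buffer is sized to spill out of the 1 MiB L2 but mostly stay in the L3, per the sizes above):

<syntaxhighlight lang="c">
/* Sketch: pointer-chasing load-latency microbenchmark. Each load
 * depends on the previous one, so time/iteration ~ cache latency. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64u  /* cache line size, per the hierarchy above */

int main(void) {
    size_t bytes = 8u << 20;          /* 8 MiB: past L2, largely in L3 */
    size_t n = bytes / LINE;
    char *buf = malloc(bytes);
    size_t *order = malloc(n * sizeof *order);

    /* Shuffle the line indices, then link each line to the next so the
     * walk is one big cycle the hardware prefetcher cannot follow. */
    for (size_t i = 0; i < n; i++) order[i] = i;
    srand(42);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        *(void **)(buf + order[i] * LINE) = buf + order[(i + 1) % n] * LINE;

    void *p = buf + order[0] * LINE;
    size_t iters = 50u * 1000 * 1000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = *(void **)p;              /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns/load (end=%p)\n", ns / (double)iters, p);
    free(order);
    free(buf);
    return 0;
}
</syntaxhighlight>

Dividing the measured ns/load by the core's cycle time gives an estimate in cycles; results will vary with frequency, prefetching, and which mesh stop the line's LLC slice lives on.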
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
The Skylake TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).
**** fixed partition
*** 1 GiB page translations:
**** 4 entries; fully associative
**** fixed partition
** STLB
*** 4 KiB + 2 MiB page translations:
**** 1536 entries; 12-way set associative
**** fixed partition
*** 1 GiB page translations:
Intel has been experiencing a growing divergence in functionality over the last several iterations of [[intel/microarchitectures|their microarchitecture]] between their mainstream consumer products and their high-end HPC/server models. Traditionally, Intel has used the exact same core design for everything from their lowest-end value models (e.g. {{intel|Celeron}}) all the way up to the highest-performance enterprise models (e.g. {{intel|Xeon E7}}). While the two have fundamentally different chip architectures, they use the exact same CPU core as the building block.

This design philosophy has changed with Skylake. In order to better accommodate the different functionalities of each segment without sacrificing features or making unnecessary compromises, Intel went with a configurable core. The Skylake core is a single development project, making up a master superset core. The project resulted in two derivatives: one for servers (the subject of this article) and {{\\|skylake (client)|one for clients}}. All mainstream models (from {{intel|Celeron}}/{{intel|Pentium (2009)|Pentium}} all the way up to {{intel|Core i7}}/{{intel|Xeon E3}}) use {{\\|skylake (client)|the client core configuration}}. Server models (e.g. {{intel|Xeon Gold}}/{{intel|Xeon Platinum}}) use the new server configuration instead.
+ | |||
+ | The server core is considerably larger than the client one, featuring [[Advanced Vector Extensions 512]] (AVX-512). Skylake servers support what was formerly called AVX3.2 (AVX512F + AVX512CD + AVX512BW + AVX512DQ + AVX512VL). Additionally, those processors Memory Protection Keys for Userspace (PKU), {{x86|PCOMMIT}}, and {{x86|CLWB}}. | ||
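As a small illustration of the persistence-oriented additions, here is a hypothetical flush routine built on the <code>CLWB</code> intrinsic (assuming GCC/Clang with <code>-mclwb</code>; the <code>flush_range</code> name is illustrative). Unlike <code>CLFLUSH</code>, <code>CLWB</code> writes a dirty line back to memory without necessarily evicting it, so a later read can still hit in the cache:

<syntaxhighlight lang="c">
/* Sketch: write a buffer's cache lines back to memory with CLWB,
 * then fence so the write-backs are ordered before what follows. */
#include <immintrin.h>
#include <stddef.h>

void flush_range(const void *addr, size_t len) {
    const char *p = addr;
    for (size_t i = 0; i < len; i += 64)   /* one CLWB per 64 B line */
        _mm_clwb(p + i);
    _mm_sfence();
}
</syntaxhighlight>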

In addition to the new execution units, the cache hierarchy of the server core has changed as well: it incorporates a larger L2 and a slice of the LLC, along with the caching and home agent (CHA) and the snoop filter (SF) needed to accommodate those cache changes.

Below is a visual that helps show how the server core evolved from the client core.
[[File:skylake sp added cach and vpu.png|left|300px]]
This is the first implementation to incorporate {{x86|AVX-512}}, a 512-bit [[SIMD]] [[x86]] instruction set extension. Intel introduced AVX-512 in two different ways:

In the simple implementation, used in the {{intel|Xeon Bronze|entry-level}} and {{intel|Xeon Silver|mid-range}} Xeon servers, AVX-512 fuses Port 0 and Port 1 to form a single 512-bit unit. Since those two ports are 256 bits wide, an AVX-512 operation that is dispatched by the scheduler to port 0 will execute on both ports. Note that unrelated operations can still execute in parallel: for example, an AVX-512 operation and an Int ALU operation may execute at the same time - the AVX-512 operation is dispatched on port 0 and uses the 256-bit unit on port 1 as well, while the Int ALU operation executes independently in parallel on port 1.

In the {{intel|Xeon Gold|high-end}} and {{intel|Xeon Platinum|highest}}-performance Xeons, Intel added a second, dedicated AVX-512 unit in addition to the fused Port 0-1 operation described above. The dedicated unit is situated on Port 5.

Physically, Intel added the 768 KiB of additional L2 cache and the second AVX-512 VPU externally to the core.
=== Mode-Based Execute (MBE) Control ===
'''Mode-Based Execute''' ('''MBE''') is an enhancement to the Extended Page Tables (EPT) that provides a finer level of control over execute permissions. With MBE, the previous Execute Enable (''X'') bit is turned into two bits: Execute Userspace page (''XU'') and Execute Supervisor page (''XS''). The processor selects the mode based on the guest page permission. With proper software support, hypervisors can take advantage of this to ensure the integrity of kernel-level code.
== Mesh Architecture ==

=== Organization ===
Each die has a grid of converged mesh stops (CMSes); the XCC die, for example, has 36 of them. As the name implies, the CMS is a block that effectively interfaces between all the various subsystems and the mesh interconnect. The locations of the CMSes for the large core count die are shown in the diagram below. It should be pointed out that although the CMS appears to be inside the core tiles, most of the mesh is likely routed above the cores, in a similar fashion to how Intel wired the ring interconnect above the caches in order to reduce die area.
==== Sub-NUMA Clustering ====
In previous generations Intel had a feature called {{intel|cluster-on-die}} (COD), which was introduced with {{intel|Haswell|l=arch}}. With Skylake, there's a similar feature called {{intel|sub-NUMA cluster}} (SNC). With a memory controller physically located on each side of the die, SNC allows for the creation of two localized domains, with each memory controller belonging to one domain. The processor can then map the addresses from the controller to the distributed home agents and LLC in its domain. This allows executing code to experience lower LLC and memory latency within its domain compared to accesses outside of the domain.

It should be pointed out that, in contrast to COD, SNC has a unique location for every address in the LLC, and a line is never duplicated across LLC banks (previously, with COD, cache lines could have copies). Additionally, on multiprocessor systems, addresses mapped to memory on remote sockets are still uniformly distributed across all LLC banks, irrespective of the localized SNC domain.
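When SNC is enabled, each domain is exposed to the operating system as a separate NUMA node. A minimal sketch of how software can observe this (assuming Linux with libnuma; link with <code>-lnuma</code>):

<syntaxhighlight lang="c">
/* Sketch: enumerate NUMA nodes (SNC domains appear as extra nodes)
 * and report which node each configured CPU belongs to. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();  /* 2 per socket w/ SNC */
    int cpus  = numa_num_configured_cpus();
    printf("%d NUMA node(s)\n", nodes);

    for (int cpu = 0; cpu < cpus; cpu++)
        printf("cpu %d -> node %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}
</syntaxhighlight>

NUMA-aware software can then pin threads and allocate memory within one domain to benefit from the lower local LLC and memory latency described above.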
== Scalability ==
{{see also|intel/quickpath interconnect|intel/ultra path interconnect|l1=QuickPath Interconnect|l2=Ultra Path Interconnect}}
In the last couple of generations, Intel has been utilizing the {{intel|QuickPath Interconnect}} (QPI), which served as a high-speed point-to-point interconnect. QPI has been replaced by the {{intel|Ultra Path Interconnect}} (UPI), a higher-efficiency coherent interconnect for scalable systems that allows multiple processors to share a single address space. Depending on the exact model, each processor can have either two or three UPI links connecting to the other processors.

UPI links eliminate some of the scalability limitations that surfaced in QPI over the past few microarchitecture iterations. They use a directory-based home-snoop coherency protocol and operate at either 10.4 GT/s or 9.6 GT/s. This is quite a bit different from previous generations. In addition to the various improvements made to the protocol layer, {{intel|Skylake SP|l=core}} now implements a distributed CHA that is situated along with the LLC bank on each core. It is in charge of tracking the various requests from the core as well as responding to snoop requests from both local and remote agents. The ease of distributing the home agent is a result of Intel getting rid of the requirement for preallocation of resources at the home agent. This also means that future architectures should be able to scale up well.

Depending on the exact model, Skylake processors can scale from 2-way all the way up to 8-way multiprocessing. Note that the high-end models that support 8-way multiprocessing come with three UPI links for this purpose, while the lower-end processors can have either two or three UPI links. Below are the typical configurations for those processors.
==== Layout ====
:[[File:skylake (server) die area layout.svg|600px]]
− | |||
− | |||
− | |||
− | |||
− | |||
== Die ==
:[[File:skylake sp memory phys (annotated).png|700px]]
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
=== Low Core Count (LCC) ===
* Intel Unveils Powerful Intel Xeon Scalable Processors, Live Event, July 11, 2017
* [[:File:intel xeon scalable processor architecture deep dive.pdf|Intel Xeon Scalable Processor Architecture Deep Dive]], Akhilesh Kumar & Malay Trivedi, Skylake-SP CPU & Lewisburg PCH Architects, June 12, 2017.
− | |||
− | |||
== Documents == | == Documents == |