{| class="wikitable" | {| class="wikitable" | ||
|- | |- | ||
− | ! Core !! Abbrev | + | ! Core !! Abbrev !! Target |
|- | |- | ||
− | | {{intel|Skylake SP|l=core}} || SKL-SP | + | | {{intel|Skylake SP|l=core}} || SKL-SP || Server Scalable Processors |
|- | |- | ||
− | | {{intel|Skylake X|l=core}} || SKL-X | + | | {{intel|Skylake X|l=core}} || SKL-X || High-end desktops & enthusiasts market |
|- | |- | ||
− | | {{intel|Skylake W|l=core}} || SKL-W | + | | {{intel|Skylake W|l=core}} || SKL-W || Enterprise/Business workstations |
|- | |- | ||
− | | {{intel|Skylake DE|l=core}} || SKL-DE | + | | {{intel|Skylake DE|l=core}} || SKL-DE || Dense server/edge computing |
|} | |} | ||
== Process Technology ==
{{main|intel/microarchitectures/kaby lake#Process_Technology|l1=Kaby Lake § Process Technology}}
Unlike mainstream Skylake models, all Skylake server configuration models are fabricated on Intel's [[14 nm process#Intel|enhanced 14+ nm process]], the same process used by {{\\|Kaby Lake}}.
** DMI upgraded to Gen3
* Core
** Front End
*** LSD is disabled (likely due to a bug; see [[#Front-end|§ Front-end]] for details)
*** Larger legacy pipeline delivery (5 µOPs, up from 4)
**** Another simple decoder has been added
*** Allocation Queue (IDQ)
**** Larger delivery (6 µOPs, up from 4)
**** Larger buffer (64 µOPs/thread, up from 56 unified; 2.28x more per thread when both threads are active)
**** Partitioned for each active thread (from unified)
*** Improved [[branch prediction unit]]
**** Reduced penalty for wrong direct jump target
**** No specifics were disclosed
*** µOP Cache
**** Instruction window is now 64 bytes (from 32)
**** 1.5x bandwidth (6 µOPs/cycle, up from 4)
** Execution Engine
*** Larger [[re-order buffer]] (224 entries, up from 192)
*** Larger scheduler (97 entries, up from 64)
*** Larger Integer Register File (180 entries, up from 168)
** Back-end
*** Port 4 now performs 512b stores (from 256b)
** L2$
*** Increased to 1 MiB/core (from 256 KiB/core)
** L3$
*** Reduced to 1.375 MiB/core (from 2.5 MiB/core)
==== CPU changes ====
* Most ALU operations have a throughput of 4 op/cycle for 8- and 32-bit registers. 64-bit ops are still limited to 3 op/cycle. (16-bit throughput varies per op; it can be 4, 3.5, or 2 op/cycle.)
* MOVSX and MOVZX have 4 op/cycle throughput for the 16->32 and 32->64 forms, in addition to Haswell's 8->32, 8->64 and 16->64 bit forms.
* ADC and SBB have a throughput of 1 op/cycle, same as Haswell.
* Vector moves have a throughput of 4 op/cycle (move elimination).
* Not only the zeroing idioms vpXORxx and vpSUBxx on the same register, but also vPCMPxxx on the same register, have a throughput of 4 op/cycle.
* Vector ALU ops are often "standardized" to a latency of 4. For example, vADDPS and vMULPS used to have latencies of 3 and 5; now both are 4.
* Fused multiply-add ops have a latency of 4 and a reciprocal throughput of 0.5 cycles (two FMAs per cycle); see the sketch after this list.
* Throughput of vADDps, vSUBps, vCMPps, vMAXps and their scalar and double analogs is increased to 2 op/cycle.
* Throughput of vPSLxx and vPSRxx with immediate (i.e. fixed vector shifts) is increased to 2 op/cycle.
* Throughput of vANDps, vANDNps, vORps, vXORps and their scalar and double analogs, as well as vPADDx and vPSUBx, is increased to 3 op/cycle.
* vDIVPD and vSQRTPD have approximately twice the throughput: from 8 to 4 and from 28 to 12 cycles/op, respectively.
* Throughput of some MMX ALU ops (such as PAND mm1, mm2) is decreased to 2 or 1 op/cycle (users are expected to use the wider SSE/AVX registers instead).
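The latency/throughput distinction for FMA above is easiest to see with a dependency chain. Below is a minimal C sketch (illustrative, not from the original text; function names are ours, the timing harness is omitted, and it assumes a compile along the lines of <code>gcc -O2 -mfma</code>): a serial chain of FMAs is bound by the 4-cycle latency, while eight independent accumulators can keep both FMA ports busy at roughly 0.5 cycles per op.

<syntaxhighlight lang="c">
#include <immintrin.h>
#include <stdio.h>

/* Dependent chain: each FMA needs the previous result, so the loop
 * runs at the 4-cycle FMA latency per iteration. */
__m256 dependent_chain(__m256 a, __m256 b, __m256 acc, int n) {
    for (int i = 0; i < n; i++)
        acc = _mm256_fmadd_ps(a, b, acc);
    return acc;
}

/* Independent chains: 8 accumulators cover latency x ports
 * (4 cycles x 2 FMA ports), approaching 2 FMAs per cycle. */
__m256 independent_chains(__m256 a, __m256 b, int n) {
    __m256 acc[8];
    for (int k = 0; k < 8; k++)
        acc[k] = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8)
        for (int k = 0; k < 8; k++)
            acc[k] = _mm256_fmadd_ps(a, b, acc[k]);
    __m256 sum = acc[0];                   /* reduce partial sums */
    for (int k = 1; k < 8; k++)
        sum = _mm256_add_ps(sum, acc[k]);
    return sum;
}

int main(void) {
    __m256 a = _mm256_set1_ps(1.0f), b = _mm256_set1_ps(2.0f);
    float out[8];
    _mm256_storeu_ps(out, independent_chains(a, b, 1 << 20));
    printf("lane 0: %f\n", out[0]);        /* keep the result live */
    return 0;
}
</syntaxhighlight>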
==== New instructions ====
=== Block Diagram ===
==== Entire SoC Overview ====
Note that the LCC die is identical to the HCC die shown below but without the two bottom rows of cores. The XCC (28-core) die has one additional row and two additional columns of cores; otherwise the die is identical.

[[File:skylake sp hcc block diagram.svg|650px]]

* '''CHA''' - Caching and Home Agent
* '''SF''' - Snoop Filter

==== Individual Core ====
[[File:skylake server block diagram.svg|950px]]
=== Memory Hierarchy ===
**** fixed partition
*** 1 GiB page translations:
**** 4 entries; fully associative
**** fixed partition
** STLB
*** 4 KiB + 2 MiB page translations:
**** 1536 entries; 12-way set associative
**** fixed partition
*** 1 GiB page translations:
[[File:skylake sp added cach and vpu.png|left|300px]]
This is the first implementation to incorporate {{x86|AVX-512}}, a 512-bit [[SIMD]] [[x86]] instruction set extension. Intel introduced AVX-512 in two different ways:

In the simple implementation, used in the {{intel|Xeon Bronze|entry-level}} and {{intel|Xeon Silver|mid-range}} Xeon servers, AVX-512 fuses Port 0 and Port 1 to form a 512-bit unit. Since those two ports are 256-bit wide, an AVX-512 operation that is dispatched by the scheduler to port 0 will execute on both ports. Note that unrelated operations can still execute in parallel. For example, an AVX-512 operation and an Int ALU operation may execute in parallel - the AVX-512 operation is dispatched on port 0 and uses the AVX unit on port 1 as well, while the Int ALU operation executes independently in parallel on port 1.

In the {{intel|Xeon Gold|high-end}} and {{intel|Xeon Platinum|highest}} performance Xeons, Intel added a second dedicated AVX-512 unit in addition to the fused port 0-1 operation described above. The dedicated unit is situated on port 5.

Physically, Intel added 768 KiB of L2 cache and the second AVX-512 VPU externally to the core.
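As an illustrative sketch (ours, not from the article; it assumes an AVX-512F-capable part and a build with <code>-mavx512f</code>, with the runtime CPUID check omitted), a kernel like the following exercises these units - on Bronze/Silver it is dispatched to the fused port 0+1 pair, while Gold/Platinum parts can also issue it to the dedicated port 5 unit:

<syntaxhighlight lang="c">
#include <immintrin.h>
#include <stddef.h>

/* y[i] = a*x[i] + y[i], 16 floats per 512-bit FMA. */
void saxpy512(float a, const float *x, float *y, size_t n) {
    __m512 va = _mm512_set1_ps(a);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
    }
    for (; i < n; i++)   /* scalar tail */
        y[i] = a * x[i] + y[i];
}
</syntaxhighlight>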
=== Mode-Based Execute (MBE) Control ===
'''Mode-Based Execute''' ('''MBE''') is an enhancement to the Extended Page Tables (EPT) that provides a finer level of control over execute permissions. With MBE, the previous Execute Enable (''X'') bit is split into an Execute Userspace page (''XU'') bit and an Execute Supervisor page (''XS'') bit. The processor selects the mode based on the guest page permission. With proper software support, hypervisors can take advantage of this to ensure the integrity of kernel-level code.
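For illustration only, the split permission might show up in a hypervisor's EPT entries as sketched below. The bit positions (execute-supervisor at bit 2, execute-user at bit 10 when MBE is enabled) follow the Intel SDM's EPT layout as we understand it and should be verified against the current manual; the helper names are hypothetical, and real entries carry memory-type and other fields omitted here.

<syntaxhighlight lang="c">
#include <stdint.h>

#define EPT_READ    (1ULL << 0)
#define EPT_WRITE   (1ULL << 1)
#define EPT_EXEC_S  (1ULL << 2)   /* XS: supervisor-mode execute (was X) */
#define EPT_EXEC_U  (1ULL << 10)  /* XU: user-mode execute, MBE only     */

/* Guest kernel text: executable only while the guest runs in
 * supervisor mode. */
static inline uint64_t ept_kernel_text(uint64_t pfn) {
    return (pfn << 12) | EPT_READ | EPT_EXEC_S;
}

/* Guest user text: executable only from user mode, so that even a
 * compromised guest kernel cannot run user-controlled pages. */
static inline uint64_t ept_user_text(uint64_t pfn) {
    return (pfn << 12) | EPT_READ | EPT_EXEC_U;
}
</syntaxhighlight>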
== Mesh Architecture ==
=== Organization ===
[[File:skylake (server) half rings.png|right|400px]]
Each die has a grid of converged mesh stops (CMSes); the XCC die, for example, has 36 of them. As the name implies, the CMS is a block that effectively interfaces between all the various subsystems and the mesh interconnect. The locations of the CMSes for the large core count die are shown in the diagram below. It should be pointed out that although the CMS appears to be inside the core tile, most of the mesh is likely routed above the cores, in a similar fashion to how Intel wired the ring interconnect above the caches in order to reduce die area.
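As a toy model (ours, not Intel's), latency between two tiles grows with their Manhattan distance on the CMS grid. Assuming simple dimension-ordered routing, a common mesh design choice that is not publicly confirmed for Skylake, a hop count looks like:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdlib.h>

/* Tiles addressed by (row, col) on the CMS grid, e.g. 6x6 for XCC. */
static int mesh_hops(int r0, int c0, int r1, int c1) {
    return abs(r0 - r1) + abs(c0 - c1);   /* vertical + horizontal hops */
}

int main(void) {
    /* Worst case on a 6x6 grid: opposite corners. */
    printf("corner-to-corner hops: %d\n", mesh_hops(0, 0, 5, 5));
    return 0;
}
</syntaxhighlight>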
==== Sub-NUMA Clustering ====
In previous generations Intel had a feature called {{intel|cluster-on-die}} (COD), which was introduced with {{intel|Haswell|l=arch}}. With Skylake, there's a similar feature called {{intel|sub-NUMA cluster}} (SNC). With a memory controller physically located on each side of the die, SNC allows for the creation of two localized domains, with each memory controller belonging to one domain. The processor can then map the addresses from the controller to the distributed home agents and LLC in its domain. This allows executing code to experience lower LLC and memory latency within its domain compared to accesses outside of the domain.

It should be pointed out that, in contrast to COD, SNC has a unique location for every address in the LLC and lines are never duplicated across LLC banks (previously, COD cache lines could have copies). Additionally, on multiprocessor systems, addresses mapped to memory on remote sockets are still uniformly distributed across all LLC banks irrespective of the localized SNC domain.
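As a usage sketch (ours; it assumes Linux with libnuma installed and SNC enabled in firmware, in which case each socket shows up as two NUMA nodes), keeping a worker and its buffer on the same node keeps its traffic inside one SNC domain:

<syntaxhighlight lang="c">
#include <numa.h>      /* link with -lnuma */
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    /* With SNC on, a 2-socket system reports 4 nodes. */
    printf("NUMA nodes visible: %d\n", numa_max_node() + 1);

    size_t sz = (size_t)64 << 20;
    void *buf = numa_alloc_onnode(sz, 0);  /* memory in SNC domain 0 */
    numa_run_on_node(0);                   /* execute there as well  */
    /* ... touch and process buf ... */
    numa_free(buf, sz);
    return 0;
}
</syntaxhighlight>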
== Scalability ==
{{see also|intel/quickpath interconnect|intel/ultra path interconnect|l1=QuickPath Interconnect|l2=Ultra Path Interconnect}}
In the last couple of generations, Intel has been utilizing the {{intel|QuickPath Interconnect}} (QPI), which served as a high-speed point-to-point interconnect. QPI has been replaced by the {{intel|Ultra Path Interconnect}} (UPI), a higher-efficiency coherent interconnect for scalable systems that allows multiple processors to share a single address space. Depending on the exact model, each processor can have either two or three UPI links connecting to the other processors.

UPI links eliminate some of the scalability limitations that surfaced in QPI over the past few microarchitecture iterations. They use a directory-based home snoop coherency protocol and operate at either 10.4 GT/s or 9.6 GT/s. This is quite a bit different from previous generations. In addition to the various improvements done to the protocol layer, {{intel|Skylake SP|l=core}} now implements a distributed CHA that is situated along with the LLC bank on each core. It's in charge of tracking the various requests from the core as well as responding to snoop requests from both local and remote agents. Distributing the home agent was made easier by Intel eliminating the requirement to preallocate resources at the home agent. This also means that future architectures should be able to scale up well.

Depending on the exact model, Skylake processors can scale from 2-way all the way up to 8-way multiprocessing. Note that the high-end models that support 8-way multiprocessing come with three UPI links for this purpose, while the lower-end processors can have either two or three UPI links. Below are the typical configurations for those processors.
=== Core Tile ===
:[[File:skylake sp core.png|500px]]
Facts about "Skylake (server) - Microarchitectures - Intel"
{| class="wikitable"
|-
! codename || Skylake (server)
|-
! core count || 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28
|-
! designer || Intel
|-
! first launched || May 4, 2017
|-
! full page name || intel/microarchitectures/skylake (server)
|-
! instance of || microarchitecture
|-
! instruction set architecture || x86-64
|-
! manufacturer || Intel
|-
! microarchitecture type || CPU
|-
! name || Skylake (server)
|-
! pipeline stages (max) || 19
|-
! pipeline stages (min) || 14
|-
! process || 14 nm
|}