From WikiChip
Editing intel/microarchitectures/skylake (server)

Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.

This page supports semantic in-text annotations (e.g. "[[Is specified as::World Heritage Site]]") to build structured and queryable content provided by Semantic MediaWiki. For a comprehensive description on how to use annotations or the #ask parser function, please have a look at the getting started, in-text annotation, or inline queries help pages.

Latest revision Your text
Line 183: Line 183:
 
! Core !! Extended<br>Family !! Family !! Extended<br>Model !! Model
 
! Core !! Extended<br>Family !! Family !! Extended<br>Model !! Model
 
|-
 
|-
| rowspan="2" | {{intel|Skylake X|X|l=core}}, {{intel|Skylake SP|SP|l=core}}, {{intel|Skylake DE|DE|l=core}}, {{intel|Skylake W|W|l=core}} || 0 || 0x6 || 0x5 || 0x5
+
| rowspan="2" | {{intel|Skylake X|X|l=core}}, {{intel|Skylake SP|SP|l=core}}, {{intel|Skylake DE|DE|l=core}}, {{intel|Skylake W|W|l=core}} || 0 || 0x6 || 0x5 || 0x55
 
|-
 
|-
 
| colspan="4" | Family 6 Model 85
 
| colspan="4" | Family 6 Model 85
Line 205: Line 205:
 
** DMI upgraded to Gen3
 
** DMI upgraded to Gen3
 
* Core
 
* Core
** All the changes from Skylake Client (For full list, see {{\\|Skylake (Client)#Key changes from Broadwell|Skylake (Client) § Key changes from Broadwell}})
 
 
** Front End
 
** Front End
 
*** LSD is disabled (Likely due to a bug; see [[#Front-end|§ Front-end]] for details)
 
*** LSD is disabled (Likely due to a bug; see [[#Front-end|§ Front-end]] for details)
 +
*** Larger legacy pipeline delivery (5 µOPs, up from 4)
 +
**** Another simple decoder has been added.
 +
*** Allocation Queue (IDQ)
 +
**** Larger delivery (6 µOPs, up from 4)
 +
**** 2.28x larger buffer (64/thread, up from 56)
 +
**** Partitioned for each active thread (from unified)
 +
*** Improved [[branch prediction unit]]
 +
**** reduced penalty for wrong direct jump target
 +
**** No specifics were disclosed
 +
*** µOP Cache
 +
**** instruction window is now 64 Bytes (from 32)
 +
**** 1.5x bandwidth (6 µOPs/cycle, up from 4)
 +
** Execution Engine
 +
*** Larger [[re-order buffer]] (224 entries, up from 192)
 +
*** Larger scheduler (97 entries, up from 64)
 +
**** Larger Integer Register File (180 entries, up from 168)
 
** Back-end
 
** Back-end
 
*** Port 4 now performs 512b stores (from 256b)
 
*** Port 4 now performs 512b stores (from 256b)
Line 224: Line 239:
 
** L2$
 
** L2$
 
*** Increased to 1 MiB/core (from 256 KiB/core)
 
*** Increased to 1 MiB/core (from 256 KiB/core)
*** Latency increased from 12 to 14
 
 
** L3$
 
** L3$
 
*** Reduced to 1.375 MiB/core (from 2.5 MiB/core)
 
*** Reduced to 1.375 MiB/core (from 2.5 MiB/core)
Line 241: Line 255:
  
 
==== CPU changes ====
 
==== CPU changes ====
See {{\\|Skylake (Client)#CPU changes|Skylake (Client) § CPU changes}}
+
* Most ALU operations have 4 op/cycle 1 for 8 and 32-bit registers. 64-bit ops are still limited to 3 op/cycle. (16-bit throughput varies per op, can be 4, 3.5 or 2 op/cycle).
 +
* MOVSX and MOVZX have 4 op/cycle throughput for 16->32 and 32->64 forms, in addition to Haswell's 8->32, 8->64 and 16->64 bit forms.
 +
* ADC and SBB have throughput of 1 op/cycle, same as Haswell.
 +
* Vector moves have throughput of 4 op/cycle (move elimination).
 +
* Not only zeroing vector vpXORxx and vpSUBxx ops, but also vPCMPxxx on the same register, have throughput of 4 op/cycle.
 +
* Vector ALU ops are often "standardized" to latency of 4. for example, vADDPS and vMULPS used to have L of 3 and 5, now both are 4.
 +
* Fused multiply-add ops have latency of 4 and throughput of 0.5 op/cycle.
 +
* Throughput of vADDps, vSUBps, vCMPps, vMAXps, their scalar and double analogs is increased to 2 op/cycle.
 +
* Throughput of vPSLxx and vPSRxx with immediate (i.e. fixed vector shifts) is increased to 2 op/cycle.
 +
* Throughput of vANDps, vANDNps, vORps, vXORps, their scalar and double analogs, vPADDx, vPSUBx is increased to 3 op/cycle.
 +
* vDIVPD, vSQRTPD have approximately twice as good throughput: from 8 to 4 and from 28 to 12 cycles/op.
 +
* Throughput of some MMX ALU ops (such as PAND mm1, mm2) is decreased to 2 or 1 op/cycle (users are expected to use wider SSE/AVX registers instead).
  
 
====New instructions ====
 
====New instructions ====
Line 333: Line 358:
 
**** fixed partition
 
**** fixed partition
 
*** 1G page translations:
 
*** 1G page translations:
**** 4 entries; 4-way set associative
+
**** 4 entries; fully associative
 
**** fixed partition
 
**** fixed partition
 
** STLB
 
** STLB
 
*** 4 KiB + 2 MiB page translations:
 
*** 4 KiB + 2 MiB page translations:
**** 1536 entries; 12-way set associative. (Note: STLB is incorrectly reported as "6-way" by CPUID leaf 2 (EAX=02H). Skylake erratum SKL148 recommends software to simply ignore that value.)
+
**** 1536 entries; 12-way set associative
 
**** fixed partition
 
**** fixed partition
 
*** 1 GiB page translations:
 
*** 1 GiB page translations:
Line 388: Line 413:
  
 
[[File:skylake sp added cach and vpu.png|left|300px]]
 
[[File:skylake sp added cach and vpu.png|left|300px]]
This is the first implementation to incorporate {{x86|AVX-512}}, a 512-bit [[SIMD]] [[x86]] instruction set extension. AVX-512 operations can take place on every port. For 512-bit wide FMA SIMD operations, Intel introduced two different mechanisms ways:
+
This is the first implementation to incorporate {{x86|AVX-512}}, a 512-bit [[SIMD]] [[x86]] instruction set extension. Intel introduced AVX-512 in two different ways:
  
In the simple implementation, the variants used in the {{intel|Xeon Bronze|entry-level}} and {{intel|Xeon Silver|mid-range}} Xeon servers, AVX-512 fuses Port 0 and Port 1 to form a 512-bit FMA unit. Since those two ports are 256-wide, an AVX-512 option that is dispatched by the scheduler to port 0 will execute on both ports. Note that unrelated operations can still execute in parallel. For example, an AVX-512 operation and an Int ALU operation may execute in parallel - the AVX-512 is dispatched on port 0 and use the AVX unit on port 1 as well and the Int ALU operation will execute independently in parallel on port 1.
+
In the simple implementation, the variants used in the {{intel|Xeon Bronze|entry-level}} and {{intel|Xeon Silver|mid-range}} Xeon servers, AVX-512 fuses Port 0 and Port 1 to form a 512-bit unit. Since those two ports are 256-wide, an AVX-512 option that is dispatched by the scheduler to port 0 will execute on both ports. Note that unrelated operations can still execute in parallel. For example, an AVX-512 operation and an Int ALU operation may execute in parallel - the AVX-512 is dispatched on port 0 and use the AVX unit on port 1 as well and the Int ALU operation will execute independently in parallel on port 1.
  
In the {{intel|Xeon Gold|high-end}} and {{intel|Xeon Platinum|highest}} performance Xeons, Intel added a second dedicated 512-bit wide AVX-512 FMA unit in addition to the fused Port0-1 operations described above. The dedicated unit is situated on Port 5.
+
In the {{intel|Xeon Gold|high-end}} and {{intel|Xeon Platinum|highest}} performance Xeons, Intel added a second dedicated AVX-512 unit in addition to the fused Port0-1 operations described above. The dedicated unit is situated on Port 5.
  
 
Physically, Intel added 768 KiB L2 cache and the second AVX-512 VPU externally to the core.  
 
Physically, Intel added 768 KiB L2 cache and the second AVX-512 VPU externally to the core.  
Line 477: Line 502:
  
 
=== Mode-Based Execute (MBE) Control ===
 
=== Mode-Based Execute (MBE) Control ===
'''Mode-Based Execute''' ('''MBE''') is an enhancement to the Extended Page Tables (EPT) that provides finer level of control of execute permissions. With MBE the previous Execute Enable (''X'') bit is turned into Execute Userspace page (XU) and Execute Supervisor page (XS). The processor selects the mode based on the guest page permission. With proper software support, hypervisors can take advantage of this as well to ensure integrity of kernel-level code.
+
'''Mode-Based Execute''' ('''MBE''') is an enhancement to the Extended Page Tables (EPT) that provides finer level of control of execute permissions. With MBE the previous Execute Enable (''X'') bit is turned into Excuse Userspace page (XU) and Execute Supervisor page (XS). The processor selects the mode based on the guest page permission. With proper software support, hypervisors can take advantage of this as well to ensure integrity of kernel-level code.
  
 
== Mesh Architecture ==
 
== Mesh Architecture ==
Line 517: Line 542:
  
 
==== Sub-NUMA Clustering ====
 
==== Sub-NUMA Clustering ====
In previous generations Intel had a feature called {{intel|cluster-on-die}} (COD) which was introduced with {{intel|Haswell|l=arch}}. With Skylake, there's a similar feature called {{intel|sub-NUMA cluster}} (SNC). With a memory controller physically located on each side of the die, SNC allows for the creation of two localized domains with each memory controller belonging to each domain. The processor can then map the addresses from the controller to the distributed home agents and LLC in its domain. This allows executing code to experience lower LLC and memory latency within its domain compared to accesses outside of the domain.
+
In previous generations Intel had a feature called {{intel|cluster-on-die}} (COD) which was introduced with {{intel|Haswell|l=arch}}. With Skylake, there's a similar feature called {{intel|sub-NUMA cluster}} (SNC). With a memory controller physically located on each side of the die, SNC allows for the creation of two localized domains with each memory controller belonging to each domain. The processor can then map the addresses from the controller to the distributed home ages and LLC in its domain. This allows executing code to experience lower LLC and memory latency within its domain compared to accesses outside of the domain.
  
It should be pointed out that in contrast to COD, SNC has a unique location for every address in the LLC and is never duplicated across LLC banks (previously, COD cache lines could have copies). Additionally, on multiprocessor systems, addresses mapped to memory on remote sockets are still uniformly distributed across all LLC banks irrespective of the localized SNC domain.
+
It should be pointed out that in contrast to COD, SNC has a unique location for every adddress in the LCC and is never duplicated across LLC banks (previously, COD cache lines could have copies). Additionally, on multiprocessor system, address mapped to memory on remote sockets are still uniformally distributed across all LLC banks irrespective of the localized SNC domain.
  
 
== Scalability ==
 
== Scalability ==
 
{{see also|intel/quickpath interconnect|intel/ultra path interconnect|l1=QuickPath Interconnect|l2=Ultra Path Interconnect}}
 
{{see also|intel/quickpath interconnect|intel/ultra path interconnect|l1=QuickPath Interconnect|l2=Ultra Path Interconnect}}
In the last couple of generations, Intel has been utilizing {{intel|QuickPath Interconnect}} (QPI) which served as a high-speed point-to-point interconnect. QPI has been replaced by the {{intel|Ultra Path Interconnect}} (UPI) which is higher-efficiency coherent interconnect for scalable systems, allowing multiple processors to share a single shared address space. Depending on the exact model, each processor can have either two or three UPI links connecting to the other processors.
+
In the last couple of generations, Intel has been utilizing {{intel|QuickPath Interconnect}} (QPI) which served as a high-speed point-to-point interconnect. QPI has been replaced by the {{intel|Ultra Path Interconnect}} (UPI) which is higher-efficiency coherent interconnect for scalable systems, allowing multiple processors to share a single shared address space. Depending on the exact model, each processor can have either either two or three UPI links connecting to the other processors.
  
UPI links eliminate some of the scalability limitations that surfaced in QPI over the past few microarchitecture iterations. They use directory-based home snoop coherency protocol and operate at up either 10.4 GT/s or 9.6 GT/s. This is quite a bit different from previous generations. In addition to the various improvements done to the protocol layer, {{intel|Skylake SP|l=core}} now implements a distributed CHA that is situated along with the LLC bank on each core. It's in charge of tracking the various requests from the core as well as responding to snoop requests from both local and remote agents. The ease of distributing the home agent is a result of Intel getting rid of the requirement on preallocation of resources at the home agent. This also means that future architectures should be able to scale up well.
+
UPI links eliminate some of the scalability limitations that surfaced in QPI over the past few microarchitecture iterations. They use directory-based home snoop coherency protocol and operate at up either 10.4 GT/s or 9.6 GT/s. This is quite a bit different form previous generations. In addition to the various improvements done to the protocol layer, {{intel|Skylake SP|l=core}} now implements a distributed CHA that is situated along with the LLC bank on each core. It's in charge of tracking the various requests form the core as well as responding to snoop requests from both local and remote agents. The ease of distributing the home agent is a result of Intel getting rid of the requirement on preallocation of resources at the home agent. This also means that future architectures should be able to scale up well.
  
 
Depending on the exact model, Skylake processors can scale from 2-way all the way up to 8-way multiprocessing. Note that the high-end models that support 8-way multiprocessing also only come with three UPI links for this purpose while the lower end processors can have either two or three UPI links. Below are the typical configurations for those processors.
 
Depending on the exact model, Skylake processors can scale from 2-way all the way up to 8-way multiprocessing. Note that the high-end models that support 8-way multiprocessing also only come with three UPI links for this purpose while the lower end processors can have either two or three UPI links. Below are the typical configurations for those processors.
Line 655: Line 680:
  
 
=== Core Tile ===
 
=== Core Tile ===
* ~4.8375 x 3.7163
 
* ~ 17.978 mm² die area
 
 
 
:[[File:skylake sp core.png|500px]]
 
:[[File:skylake sp core.png|500px]]
  

Please note that all contributions to WikiChip may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see WikiChip:Copyrights for details). Do not submit copyrighted work without permission!

Cancel | Editing help (opens in new window)
codenameSkylake (server) +
core count4 +, 6 +, 8 +, 10 +, 12 +, 14 +, 16 +, 18 +, 20 +, 22 +, 24 +, 26 + and 28 +
designerIntel +
first launchedMay 4, 2017 +
full page nameintel/microarchitectures/skylake (server) +
instance ofmicroarchitecture +
instruction set architecturex86-64 +
manufacturerIntel +
microarchitecture typeCPU +
nameSkylake (server) +
pipeline stages (max)19 +
pipeline stages (min)14 +
process14 nm (0.014 μm, 1.4e-5 mm) +