From WikiChip
Editing intel/microarchitectures/skylake (client)

Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.

This page supports semantic in-text annotations (e.g. "[[Is specified as::World Heritage Site]]") to build structured and queryable content provided by Semantic MediaWiki. For a comprehensive description on how to use annotations or the #ask parser function, please have a look at the getting started, in-text annotation, or inline queries help pages.

Latest revision Your text
Line 438: Line 438:
  
 
==== Front-end ====
 
==== Front-end ====
The front-end is tasked with the challenge of fetching the complex [[x86]] instructions from memory, decoding them, and delivering them to the execution units. In other words, the front end needs to be able to consistently deliver enough [[µOPs]] from the instruction code stream to keep the back-end busy. When the back-end is not being fully utilized, the core is not reaching its full performance. A poorly or under-performing front-end will translate directly to a poorly performing core. This challenge is further complicated by various redirection such as branches and the complex nature of the [[x86]] instructions themselves.
+
The front-end is is tasked with the challenge of fetching the complex [[x86]] instructions from memory, decoding them, and delivering them to the execution units. In other words, the front end needs to be able to consistently deliver enough [[µOPs]] from the instruction code stream to keep the back-end busy. When the back-end is not being fully utilized, the core is not reaching its full performance. A poorly or under-performing front-end will translate directly to a poorly performing core. This challenge is further complicated by various redirection such as branches and the complex nature of the [[x86]] instructions themselves.
  
 
===== Fetch & pre-decoding =====  
 
===== Fetch & pre-decoding =====  
On their first pass, instructions should have already been prefetched from the [[L2 cache]] and into the [[L1 cache]]. The L1 is a 32 [[KiB]], 8-way set associative cache, identical in size and organization to {{intel|microarchitectures|previous generations}}. Skylake fetching is done on a 16-byte fetch window. A window size that has not changed in a number of generations. Up to 16 bytes of code can be fetched each cycle. Note that fetcher is shared evenly between the two threads so that each thread gets every other cycle. At this point they are still [[macro-ops]] (i.e. variable-length [[x86]] architectural instruction). Instructions are brought into the pre-decode buffer for initial preparation.
+
On their first pass, instructions should have already been prefetched from the [[L2 cache]] and into the [[L1 cache]]. The L1 is a 32 [[KiB]], 8-way set associative cache, identical in size and organization to {{intel|microarchitectures|previous generations}}. Skylake fetching is done on a 16-byte fetch window. A window size that has not changed in a number of generations. Up to 16 bytes of code can be fetched each cycle. Note that fetcher is shared evenly between two thread, so that each thread gets every other cycle. At this point they are still [[macro-ops]] (i.e. variable-length [[x86]] architectural instruction). Instructions are brought into the pre-decode buffer for initial preparation.
  
 
[[File:skylake fetch.svg|left|300px]]
 
[[File:skylake fetch.svg|left|300px]]
  
[[x86]] instructions are complex, variable length, have inconsistent encoding, and may contain multiple operations. At the pre-decode buffer, the instructions boundaries get detected and marked. This is a fairly difficult task because each instruction can vary from a single byte all the way up to fifteen. Moreover, determining the length requires inspecting a couple of bytes of the instruction. In addition to boundary marking, prefixes are also decoded and checked for various properties such as branches. As with previous microarchitectures, the pre-decoder has a [[throughput]] of 6 [[macro-ops]] per cycle or until all 16 bytes are consumed, whichever happens first. Note that the predecoder will not load a new 16-byte block until the previous block has been fully exhausted. For example, suppose a new chunk was loaded, resulting in 7 instructions. In the first cycle, 6 instructions will be processed and a whole second cycle will be wasted for that last instruction. This will produce the much lower throughput of 3.5 instructions per cycle which is considerably less than optimal. Likewise, if the 16-byte block resulted in just 4 instructions with 1 byte of the 5th instruction received, the first 4 instructions will be processed in the first cycle and a second cycle will be required for the last instruction. This will produce an average throughput of 2.5 instructions per cycle. Note that there is a special case for {{x86|length-changing prefix}} (LCPs) which will incur additional pre-decoding costs. Real code is often less than 4 bytes which usually results in a good rate.  
+
[[x86]] instructions are complex, variable length, have inconsistent encoding, and may contain multiple operations. At the pre-decode buffer the instructions boundaries get detected and marked. This is a fairly difficult task because each instruction can vary from a single byte all the way up to fifteen. Moreover, determining the length requires inspecting a couple of bytes of the instruction. In addition boundary marking, prefixes are also decoded and checked for various properties such as branches. As with previous microarchitectures, the pre-decoder has a [[throughput]] of 6 [[macro-ops]] per cycle or until all 16 bytes are consumed, whichever happens first. Note that the predecoder will not load a new 16-byte block until the previous block has been fully exhausted. For example, suppose a new chunk was loaded, resulting in 7 instructions. In the first cycle, 6 instructions will be processed and a whole second cycle will be wasted for that last instruction. This will produce the much lower throughput of 3.5 instructions per cycle which is considerably less than optimal. Likewise, if the 16-byte block resulted in just 4 instructions with 1 byte of the 5th instruction received, the first 4 instructions will be processed in the first cycle and a second cycle will be required for the last instruction. This will produce an average throughput of 2.5 instructions per cycle. Note that there is a special case for {{x86|length-changing prefix}} (LCPs) which will incur additional pre-decoding costs. Real code is often less than 4 bytes which usually results in a good rate.  
  
 
All of this works along with the branch prediction unit which attempts to guess the flow of instructions. In Skylake, the [[branch predictor]] has also been improved. The branch predictor now has reduced penalty (i.e. lower latency) for wrong direct jump target prediction. Additionally, the predictor in Skylake can inspect further in the byte stream than in previous architectures. The intimate improvements done in the branch predictor were not further disclosed by Intel.
 
All of this works along with the branch prediction unit which attempts to guess the flow of instructions. In Skylake, the [[branch predictor]] has also been improved. The branch predictor now has reduced penalty (i.e. lower latency) for wrong direct jump target prediction. Additionally, the predictor in Skylake can inspect further in the byte stream than in previous architectures. The intimate improvements done in the branch predictor were not further disclosed by Intel.
Line 459: Line 459:
 
|}
 
|}
 
{{see also|macro-operation fusion|l1=Macro-Operation Fusion}}
 
{{see also|macro-operation fusion|l1=Macro-Operation Fusion}}
The pre-decoded instructions are delivered to the Instruction Queue (IQ). In {{\\|Broadwell}}, the Instruction Queue has been increased to 25 entries duplicated over for each thread (i.e. 50 total entries). It's unclear if that has changed with Skylake. One key optimization the instruction queue does is [[macro-op fusion]]. Skylake can fuse two [[macro-ops]] into a single complex one in a number of cases. In cases where a {{x86|test}} or {{x86|compare}} instruction with a subsequent conditional jump is detected, it will be converted into a single compare-and-branch instruction. Those fused instructions remain fused throughout the entire pipeline and get executed as a single operation by the branch unit thereby saving bandwidth everywhere. Only one such fusion can be performed during each cycle.
+
The pre-decoded instructions are delivered to the Instruction Queue (IQ). In {{\\|Broadwell}}, the Instruction Queue has been increased to 25 entries duplicated over for each thread (i.e. 50 total entries). It's unclear if that has changed with Skylake. One key optimization the instruction queue does is [[macro-op fusion]]. Skylake can fuse two [[macro-ops]] into a single complex one in a number of cases. In cases where a {{x86|test}} or {{x86|compare}} instruction with a subsequent conditional jump is detected, it will be converted into a single compare-and-branch instruction. Those fused instructions remain fused throughout the entire pipeline and get executed as a single operation by the branch unit thereby saving bandwidth everywhere. Only one such fusion can be performed each cycle.
  
 
===== Decoding =====
 
===== Decoding =====

Please note that all contributions to WikiChip may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see WikiChip:Copyrights for details). Do not submit copyrighted work without permission!

Cancel | Editing help (opens in new window)
codenameSkylake (client) +
core count2 + and 4 +
designerIntel +
first launchedAugust 5, 2015 +
full page nameintel/microarchitectures/skylake (client) +
instance ofmicroarchitecture +
instruction set architecturex86-64 +
manufacturerIntel +
microarchitecture typeCPU +
nameSkylake (client) +
pipeline stages (max)19 +
pipeline stages (min)14 +
process14 nm (0.014 μm, 1.4e-5 mm) +