Within each cache slice is a cache box. The cache box is the controller and agent serving as the interface between the LLC and the rest of the system (i.e., cores, graphics, and system agent). The cache box implements the ring logic as well as the address hashing and arbitration. It is also tasked with communicating with the {{intel|System Agent}} on cache misses, non-cacheable accesses (such as in the case of I/O pulls), as well as external I/O [[snoop]] requests. The cache box is fully pipelined and is capable of working on multiple requests at the same time. Agent requests (e.g., ones from the GPU) are handled by the individual cache boxes via the ring. This is quite different from {{\\|Westmere}}, where a single unified cache handled everything. The distribution of slices allows Sandy Bridge to have higher associativity and bandwidth while reducing traffic.
  
The entire physical address space is mapped distributively across all the slices using a [[hash function]]. On a [[cache miss|miss]], the core needs to decode the address to figure out which slice ID to request the data from. Physical addresses are hashed at the source in order to prevent hot spots. The cache box is responsible for maintaining coherency and ordering between requests. Because the LLC slices are fully inclusive, the LLC can make efficient use of an on-die snoop filter. Each slice makes use of {{intel|Core Valid Bits}} (CVB) which are used to eliminate unnecessary snoops to the cores. A single bit per cache line is needed to indicate if the line may be in a core. Snoops are therefore only needed if the line is in the LLC and a CVB is asserted on that line. This mechanism confines external snoops to the cache box most of the time, without resorting to going to the cores.
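
The exact hash Intel uses is undocumented; the sketch below is a hypothetical illustration of source-side slice selection, folding all of the cache-line address bits into a slice ID so that no contiguous region of memory lands on a single slice.

<source lang="c">
#include <stdint.h>

/* Hypothetical sketch of source-side slice selection (Intel's actual
 * Sandy Bridge hash function is not public). All cache-line address
 * bits are XOR-folded into a 2-bit slice ID for a 4-slice LLC, so
 * sequential lines spread across the slices and hot spots are avoided. */
static unsigned slice_id(uint64_t phys_addr)
{
    uint64_t line = phys_addr >> 6;   /* 64-byte cache lines */
    unsigned h = 0;
    while (line) {
        h ^= (unsigned)(line & 0x3);  /* 4 slices -> 2 index bits */
        line >>= 2;
    }
    return h;                         /* ring stop to send the request to */
}
</source>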
  
 
Note that both the cache slices and the cache boxes reside within the same [[clock domain]] as the cores themselves - sharing the same voltage and frequency and scaling along with the cores when needed.
 
[[File:sandy bridge ring flow.svg|right|275px]]
 
The four rings consist of a considerable amount of wiring and routing. Because the routing runs in the upper metal layers over the LLC, the rings have no real impact on die area. As with the LLC slices, the ring is fully pipelined and operates within the core's [[clock domain]], scaling with the frequency of the cores. The bandwidth of the ring also scales with each additional core/$slice pair that is added onto the ring; however, with more cores the ring becomes more congested, adding latency as the average hop count increases. Intel expected the ring to support a fairly large number of cores before facing real performance issues.
  
 
It's important to note that the term ring refers to its structure and not necessarily how the data flows. The ring is not round-robin; requests may travel up or down as needed. The use of address hashing allows the source agent to know exactly where the destination is. In order to reduce latency, the ring is designed such that all accesses on the ring always pick the shortest path. Because of this aspect of the ring, and the fact that some requests can take longer than others to complete, requests may be handled out of order. It is the responsibility of the source agents to handle the ordering requirements. The ring [[cache coherency]] protocol is largely an enhancement of Intel's {{intel|QuickPath Interconnect|QPI}} protocols with a [[MESI]]-based source snooping protocol. On each cycle, the agents receive an indication of whether there is an available slot on the ring for communication in the next cycle. When asserted, the agent can send any type of communication (e.g. data or snoop) on the ring the following cycle.
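
Shortest-path selection on a bidirectional ring reduces to a hop-count comparison; the sketch below is a toy model of that idea (not Intel's logic). The hop counts in the two directions always sum to the number of ring stops.

<source lang="c">
/* Toy model of shortest-path direction choice on a bidirectional ring.
 * With n_stops ring stops, the hop counts of the two directions sum to
 * n_stops, so the source agent just picks the smaller one. */
typedef enum { RING_UP, RING_DOWN } ring_dir_t;

static ring_dir_t shortest_path(int src, int dst, int n_stops)
{
    int up_hops   = (dst - src + n_stops) % n_stops;  /* going "up"   */
    int down_hops = n_stops - up_hops;                /* going "down" */
    return (up_hops <= down_hops) ? RING_UP : RING_DOWN;
}
</source>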
 
The System Agent (SA) is a centralized peripheral device integration unit. It contains what was previously the traditional [[Memory Controller Hub]] (MCH), which includes all the I/O such as [[PCIe]], {{intel|DMI}}, and others. Additionally, the SA incorporates the [[memory controller]] and the display engine which works in tandem with the integrated graphics. The major enabler for the new System Agent is in fact the [[32 nm process]] which allowed for considerably higher integration: over a dozen [[clock domain]]s and [[PHY]]s.
  
The System Agent interfaces with the rest of the system via the ring in a similar manner to the cache boxes in the LLC slices. It is also in charge of handling I/O cache coherency. The System Agent enables [[direct memory access]] (DMA), which allows devices to snoop the cache hierarchy. Address conflicts resulting from multiple concurrent requests associated with the same cache line are also handled by the System Agent.
  
 
[[File:sandy bridge power overview.svg|275px|left]]
 
 
==== <span style="float: right;">Power</span> ====
 
With Sandy Bridge, Intel introduced a large number of power features to save power depending on the workload, the temperature, and what I/O is being utilized. The new power features are handled at the System Agent. Sandy Bridge introduced a fully-fledged Power Control Unit (PCU). The root concept for this unit started with {{\\|Nehalem}}, which embedded a discrete microcontroller on-die that handled all the power management related tasks in firmware. The new PCU extends this functionality with numerous state machines and firmware algorithms. The firmware, which Intel Fellow Opher Kahn described at IDF 2010, contains thousands of lines of code to handle very complex power management tasks. Intel uses firmware in order to be able to fine-tune chips post-silicon as well as fix problems to extract the best performance.
  
 
Sandy Bridge has three separate voltage domains: System Agent, Cores, and Graphics. Both the Cores and Graphics are variable voltage domains; both are able to scale up and down in voltage and frequency based on controls from the Power Control Unit. The cores and graphics scale up and down depending on the workload in order to deliver an instantaneous performance boost through higher frequency whenever possible. The System Agent sits on a third power plane which is kept at low voltage (especially on mobile devices) in order to consume as little power as necessary.
 
=== Pipeline ===
 
The Sandy Bridge core focuses on extracting performance and reducing power in a great number of ways. Intel placed heavy emphasis in the cores on performance-enhancing features that can provide a more-than-linear performance-to-power ratio, as well as features that provide more performance while reducing power. The various enhancements can be found in both the front-end and the back-end of the core.
  
 
==== Broad Overview ====
 
 
==== Front-end ====
 
The front-end is tasked with the challenge of fetching the complex [[x86]] instructions from memory, decoding them, and delivering them to the execution units. In other words, the front-end needs to be able to consistently deliver enough [[µOPs]] from the [[instruction code stream]] to keep the back-end busy. When the back-end is not being fully utilized, the core is not reaching its full performance. A weak or under-performing front-end will directly affect the back-end, resulting in a poorly performing core. In the case of Sandy Bridge, this challenge is further complicated by various redirections such as branches and by the complex nature of the [[x86]] instructions themselves.
  
The entire front-end was redesigned from the ground up in Sandy Bridge. The four major changes in the front-end of Sandy Bridge are the entirely new [[µOP cache]], the overhauled branch predictor, the further decoupling of the front-end, and the improved macro-op fusion capabilities. All of those features not only improve performance but also reduce power draw at the same time.
  
 
===== Fetch & pre-decoding =====  
 
Blocks of memory arrive at the core either from the local cache slice, from one of the other cache slices further down [[#Ring_Interconnect|the ring]] or, on occasion and far less desirably, from main memory. On their first pass, instructions should have already been prefetched from the [[L2 cache]] into the [[L1 cache]]. The L1 is a 32 [[KiB]], 8-way set associative cache with 64 B lines. The [[instruction cache]] is identical in size to that of {{\\|Nehalem}}'s, but its associativity was increased to 8-way. Fetching in Sandy Bridge is done on a 16-byte fetch window, a window size that has not changed in a number of generations; up to 16 bytes of code can be fetched each cycle. Note that the fetcher is shared evenly between two threads, so that each thread gets every other cycle. At this point the instructions are still [[macro-ops]] (i.e., variable-length [[x86]] architectural instructions). Instructions are brought into the pre-decode buffer for initial preparation.
  
 
[[File:sandy bridge fetch.svg|left|300px]]
 
 
===== Decoding =====
 
 
[[File:sandy bridge decode.svg|right|350px]]
 
Up to four pre-decoded instructions (or five, in cases where one of the instructions was macro-fused) are sent to the decoders each cycle. Like the fetchers, the decoders alternate between the two threads each cycle. Decoders read in [[macro-operations]] and emit regular, fixed-length [[µOPs]]. The decoder organization in Sandy Bridge has been kept more or less the same as in {{\\|Nehalem}}. As with its predecessor, Sandy Bridge features four decoders. The decoders are asymmetric; the first one, Decoder 0, is a [[complex decoder]] while the other three are [[simple decoders]]. A simple decoder is capable of translating instructions that emit a single fused-[[µOP]]. By contrast, a [[complex decoder]] can decode anywhere from one to four fused-µOPs. Overall, up to four simple instructions can be decoded each cycle, with fewer when the complex decoder needs to emit additional µOPs; for each additional µOP the complex decoder emits, one less simple decoder can operate.
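
A minimal sketch of this asymmetric-decode constraint (a model of the rule described above, not Intel's actual logic): the total stays at four µOPs per cycle, so every extra µOP out of Decoder 0 idles one simple decoder.

<source lang="c">
/* Model of the 4-wide asymmetric decode: Decoder 0 emits 1-4 fused
 * µOPs; each µOP beyond the first takes the slot of a simple decoder. */
static int instructions_decoded(int complex_uops /* 1..4, from Decoder 0 */)
{
    int simple_left = 3 - (complex_uops - 1);  /* simple decoders still usable */
    if (simple_left < 0)
        simple_left = 0;
    return 1 + simple_left;  /* e.g., 1 -> 4 instructions, 4 -> 1 instruction */
}
</source>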
  
Sandy Bridge brought about the first 256-bit [[SIMD]] set of instructions called {{x86|AVX}}. This extension expanded the sixteen pre-existing 128-bit {{x86|XMM}} registers to 256-bit {{x86|YMM}} registers for floating point vector operations (note that {{\\|Haswell}} expanded this further to [[Integer]] operations as well). Most of the new AVX instructions have been designed as simple instructions that can be decoded by the simple decoders.
  
 
====== MSROM & Stack Engine ======
 
 
There are more complex instructions that are not trivial to decode even by the complex decoder. Instructions that transform into more than four µOPs detour through the [[microcode sequencer]] (MS) ROM. When that happens, up to 4 µOPs per cycle are emitted until the microcode sequencer is done. During that time, the decoders are disabled.
  
[[x86]] has dedicated [[stack machine]] operations. Instructions such as <code>{{x86|PUSH}}</code>, <code>{{x86|POP}}</code>, as well as <code>{{x86|CALL}}</code> and <code>{{x86|RET}}</code>, all operate on the [[stack pointer]] (<code>{{x86|ESP}}</code>). Without any specialized hardware, such operations would need to be sent to the back-end for execution using the general-purpose ALUs, using up some of the bandwidth and utilizing scheduler and execution unit resources. Since {{\\|Pentium M}}, Intel has been making use of a [[Stack Engine]]. The Stack Engine has a set of three dedicated adders it uses to perform and eliminate the stack-updating µOPs (i.e., it is capable of handling three additions per cycle). Instructions such as <code>{{x86|PUSH}}</code> are translated into a store and a subtraction of 4 from <code>{{x86|ESP}}</code>. The subtraction in this case will be done by the Stack Engine. The Stack Engine sits after the [[instruction decode|decoders]] and monitors the µOP stream as it passes by. Incoming stack-modifying operations are caught by the Stack Engine. This alleviates the burden of stack pointer-modifying µOPs on the pipeline. In other words, it's cheaper and faster to calculate stack pointer targets at the Stack Engine than it is to send those operations down the pipeline to be done by the execution units (i.e., general-purpose ALUs).
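
Conceptually, the Stack Engine tracks a running offset against <code>ESP</code> and folds it into each stack access, so the adjustment never consumes an ALU. A minimal sketch of the idea, under the simplifying assumption of 32-bit pushes and pops (the real hardware also inserts a synchronization µOP when <code>ESP</code> is read directly or the offset risks overflowing):

<source lang="c">
#include <stdint.h>

/* Pending ESP offset tracked beside the decoders instead of in the ALUs. */
typedef struct {
    int32_t esp_delta;   /* adjustments not yet applied to the real ESP */
} stack_engine_t;

static uint32_t push_addr(stack_engine_t *se, uint32_t esp)
{
    se->esp_delta -= 4;          /* PUSH: ESP -= 4, absorbed by the engine */
    return esp + se->esp_delta;  /* address handed to the store µOP */
}

static uint32_t pop_addr(stack_engine_t *se, uint32_t esp)
{
    uint32_t addr = esp + se->esp_delta;  /* address for the load µOP */
    se->esp_delta += 4;          /* POP: ESP += 4, absorbed by the engine */
    return addr;
}
</source>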
  
 
===== New µOP cache & x86 tax =====
 
 
[[File:sandy bridge ucache.svg|right|400px]]
 
Decoding the variable-length, inconsistent, and complex [[x86]] instructions is a nontrivial task. It's also expensive in terms of performance and power. Therefore, the best way for the pipeline to avoid those costs is to simply not decode the instructions. This is exactly what Intel has done with Sandy Bridge, and it's perhaps the single biggest feature that has been added to the core. With Sandy Bridge, Intel introduced a new [[µOP cache]] unit, perhaps more appropriately called the Decoded Stream Buffer (DSB). The micro-op cache is unique in that not only does it substantially improve performance, but it does so while significantly reducing power.
  
 
On the surface, the µOP cache can be conceptualized as a second instruction cache that is a subset of the [[level one instruction cache]]. What's unique about it is that it stores actual decoded instructions (i.e., µOPs). While it shares many of the goals of {{\\|NetBurst}}'s [[trace cache]], the two implementations are entirely different; this is especially true as it pertains to how it augments the rest of the front-end. The idea behind both mechanisms is to increase the front-end bandwidth by reducing reliance on the decoders.
 
The µOP cache has an average hit rate of around 80%. During the instruction fetch, the branch predictor will probe the µOPs cache tags. A hit in the cache allows for up to 4 µOPs (possibly fused macro-ops) per cycle to be sent directly to the Instruction Decode Queue (IDQ), bypassing all the pre-decoding and decoding that would otherwise have to be done. During those cycles, the rest of the front-end is entirely clock-gated which is how the substantial power saving is gained. Whereas the legacy decode path works in 16-byte instruction fetch windows, the µOP cache has no such restriction and can deliver 4 µOPs/cycle corresponding to the much bigger 32-byte window. Since a single window can be made of up to 18 µOPs, up to 5 whole cycles may be required to read out the entire decoded stream. Nonetheless, the µOPs cache can deliver consistently higher bandwidth than the legacy pipeline which is limited to the 16-byte fetch window and can be a serious [[bottleneck]] if the average instruction length is more than four bytes per window.
 
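As a worked example of the readout rate described above: a hit delivers at most 4 µOPs per cycle, and one 32-byte window may hold up to 18 cached µOPs, so draining a full window can take up to ceil(18/4) = 5 cycles.

<source lang="c">
/* Cycles needed to stream one cached 32-byte window out of the DSB,
 * at 4 µOPs per cycle (ceiling division). */
static int readout_cycles(int uops_in_window)   /* 1..18 */
{
    return (uops_in_window + 3) / 4;            /* 18 -> 5 cycles */
}
</source>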
  
It's interesting to note that the µOP cache only operates on full windows, that is, a full 32 B window that has all of its µOPs cached. Any partial hits are required to go through the legacy decode pipeline as if nothing was cached. The choice not to handle partial cache hits is rooted in this feature's efficiency. On partial window hits, the core would end up emitting some µOPs from the micro-op cache while also having the legacy decode pipeline decode the remaining missed µOPs; effectively, multiple components would end up emitting µOPs. Not only would such a mechanism increase complexity, but it's also unclear how much benefit, if any, would be gained from it.
  
 
As noted earlier, the µOP cache has its roots in the original {{\\|NetBurst}} trace cache, particularly as far as goals are concerned, but that's where the similarities end. The micro-op cache in Sandy Bridge can be seen as a light-weight, efficient µOP delivery mechanism that can surpass the legacy pipeline whenever possible. This is different from the trace cache, which attempted to effectively replace the entire front-end, falling back to a vastly inferior and slow fetch/decode path for workloads it could not handle. It's worth noting that the trace cache was costly and complicated (having dedicated components such as a trace BTB unit) and had various side-effects, such as needing to flush on context switches - deficiencies the µOP cache doesn't have. The trace cache resulted in a lot of duplication as well; for example, {{\\|NetBurst}}'s 12K µOP trace cache had a hit rate comparable to an 8 KiB-16 KiB L1I$, whereas this 1.5K µOP cache is comparable to a 6 KiB instruction cache, implying a storage efficiency of four-fold or greater.
 
===== Renaming & Allocation =====
 
On each cycle, up to 4 µOPs can be delivered here from the front-end from one of the two threads. As stated earlier, the Re-Order Buffer is now a light-weight component that tracks the in-flight µOPs. The ROB in Sandy Bridge is 168 entries, allowing for up to 40 additional µOPs in flight over {{\\|Nehalem}}. At this point of the pipeline the µOPs are still handled sequentially (i.e., in order), with each µOP occupying the next entry in the ROB. This entry is used to track the correct execution order and statuses. In order for the ROB to rename an integer µOP, there needs to be an available Integer PRF entry. Likewise, for FP and SIMD µOPs there needs to be an available FP PRF entry. Following renaming, all bets are off and the µOPs are free to execute as soon as their dependencies are resolved.
  
It is at this stage that [[architectural registers]] are mapped onto the underlying [[physical registers]]. Other additional bookkeeping tasks are also done at this point, such as allocating resources for stores and loads and determining all possible scheduler ports. Register renaming is controlled by the [[Register Alias Table]] (RAT), which is used to mark where the data we depend on is coming from (after that value, too, came from an instruction that has previously been renamed). Sandy Bridge's move to PRF-based renaming has a fairly substantial impact on power, too. With {{x86|AVX|the new}} {{x86|extensions|instruction set extension}}, which allows for 256-bit operations, retirement would have meant large amounts of 256-bit values being needlessly moved to the Retirement Register File each time. This is entirely eliminated in Sandy Bridge. The decoupling of the PRFs from the RAT/ROB likely means some added latency is at play here, but the overall benefits are more than worth it.
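
A minimal sketch of PRF-style renaming with a RAT, under stated assumptions (the free-list handling is illustrative, not Intel's implementation; the 160-entry Integer PRF size follows the article): destinations claim a fresh physical register, sources read the current mapping, and no values are ever copied at retirement.

<source lang="c">
#include <stdint.h>

#define ARCH_REGS 16   /* x86-64 integer architectural registers */
#define PHYS_REGS 160  /* Sandy Bridge Integer PRF entries */

typedef struct {
    uint8_t rat[ARCH_REGS];        /* architectural -> physical mapping */
    uint8_t free_list[PHYS_REGS];  /* currently unused physical registers */
    int     free_top;
} renamer_t;

/* Rename one integer µOP. Returns the destination's new physical
 * register, or -1 when the PRF is exhausted and renaming must stall. */
static int rename_uop(renamer_t *r, int src1, int src2, int dst,
                      int *psrc1, int *psrc2)
{
    *psrc1 = r->rat[src1];         /* sources read the current mapping */
    *psrc2 = r->rat[src2];
    if (r->free_top == 0)
        return -1;                 /* no Integer PRF entry available */
    int pdst = r->free_list[--r->free_top];
    r->rat[dst] = pdst;            /* destination gets a fresh register */
    return pdst;
}
</source>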
  
 
There are no special costs involved in splitting up fused µOPs before execution or [[retirement]], and the two fused µOPs only occupy a single entry in the ROB; however, the rename registers have more than doubled over {{\\|Nehalem}} in order to accommodate those fused operations. The RAT is capable of handling 4 µOPs each cycle. Note that the ROB still operates on fused µOPs, therefore 4 µOPs can effectively be as many as 8 µOPs.
 
| Not only does this instruction get eliminated at the ROB, but it's actually encoded as just 2 bytes <code>31 C0</code> vs the 5 bytes for <code>{{x86|mov}} {{x86|eax}}, 0x0</code> which is encoded as <code>b8 00 00 00 00</code>.
|}
Sandy Bridge introduced a number of new optimizations it performs prior to entering the out-of-order and renaming part. Two of those optimizations are [[Zeroing Idioms]] and [[Ones Idioms]]. The first common optimization performed in Sandy Bridge is [[Zeroing Idioms]] elimination, a dependency-breaking idiom. A number of common zeroing idioms are recognized and consequently eliminated. This is done prior to bookkeeping at the ROB, saving resources by eliminating those µOPs entirely and breaking any dependency. Eliminated zeroing idioms have zero latency and are entirely removed from the pipeline (i.e., retired).
  
 
Sandy Bridge recognizes instructions such as <code>{{x86|XOR}}</code>, <code>{{x86|PXOR}}</code>, and <code>{{x86|XORPS}}</code> as zeroing idioms when the [[source operand|source]] and [[destination operand|destination]] operands are the same. Those optimizations are performed during renaming at the full rename rate (4 µOPs per cycle), and the architectural register is simply set to zero (no actual physical register is used).
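
A minimal sketch of the recognition rule described above (the opcode tags are illustrative enum values, not Intel's internal encodings): the result of XOR-ing a register with itself is architecturally zero regardless of the register's contents, so the renamer can resolve the destination to zero without executing anything.

<source lang="c">
#include <stdbool.h>

typedef enum { OP_XOR, OP_PXOR, OP_XORPS, OP_OTHER } opcode_t;

/* A µOP is a zeroing idiom when it is one of the recognized opcodes
 * and its source and destination operands name the same register. */
static bool is_zeroing_idiom(opcode_t op, int src, int dst)
{
    return (op == OP_XOR || op == OP_PXOR || op == OP_XORPS) && (src == dst);
}
</source>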
  
The [[ones idioms|ones idiom]] is another dependency-breaking idiom that can be optimized. All of the various {{x86|PCMPEQ|PCMPEQx}} instructions that perform a packed comparison of a register with itself always set all bits to one. In those cases, while the µOP still has to be executed, the instruction may be scheduled as soon as possible because the current state of the register need not be known.
  
 
===== Scheduler =====
 
 
[[File:sandy bridge scheduler.svg|right|500px]]
 
Following bookkeeping at the ROB, µOPs are sent to the scheduler. Everything here can be done out of order; whenever its dependencies are cleared up, a µOP can be executed. Sandy Bridge features a very large unified scheduler that is dynamically shared between the two threads. The scheduler is exactly one and a half times bigger than the reservation station found in {{\\|Nehalem}} (a total of 54 entries). The various internal reordering buffers have been significantly increased as well. The use of a unified scheduler has the advantage of dipping into a more flexible mix of µOPs, resulting in a more efficient and higher-throughput design.
  
Sandy Bridge has two distinct [[physical register files]] (PRF). 64-bit data values are stored in the 160-entry Integer PRF, while [[Floating Point]] and vector data values are stored in the Vector PRF, which has been widened to 256 bits in order to accommodate the new {{x86|AVX}} {{x86|YMM}} registers. The Vector PRF is 144 entries deep, slightly smaller than the Integer one. It's worth pointing out that prior to Sandy Bridge, code that relied on constant register reading was bottlenecked by a limitation in the register file, which was limited to three reads. This restriction has been eliminated in Sandy Bridge.
  
 
At this point, µOPs are no longer fused and will be dispatched to the execution units independently. The scheduler holds the µOPs while they wait to be executed. A µOP could be waiting on an operand that has not yet arrived (e.g., one being fetched from memory or currently being calculated by another µOP) or because the execution unit it needs is busy. Once a µOP is ready, it is dispatched through its designated port.
 
=== Physical layout ===
 
Below are the [[die]] area breakdowns for the [[DDR]] [[PHY]], I/O PHY, SA, GPU, L3$, and the core. Note that the areas of the DDR and I/O PHYs for the two smaller dies have been extrapolated from the sizes of the components of the quad-core die and may therefore be off by a bit.
  
 
<gallery widths=500px heights=275px caption="Physical Layout Breakdown" style="float:right">
 
 
The Power Control Unit (PCU) is located at the {{intel|System Agent}}, which incorporates the various power management hardware logic as well as a dedicated microcontroller that runs the firmware controlling the various power features of the device. Communication with the [[physical cores]] and the graphics is done via a dedicated power management interface over the ring. The unit constantly reads, in real time, the physical parameters of the various parts of the chip, allowing it to optimize the power efficiency of the die. The power unit is exposed to the world via a set of external outputs which allow it to interact with the rest of the system to control the voltage regulator and an external power management controller.
  
Sandy Bridge has two variable power planes and a single fixed power plane for the {{intel|System Agent}}. The first one covers the ring, cache, and the physical cores. Note that this is a single power plane shared by all those components, which means they all move up or down in frequency and voltage together. Each of the individual cores is capable of being entirely power gated when needed, such as when the core goes into a higher [[C state]]. When this happens, the core state is saved into one of the ways of the cache and the core is entirely shut off. As with the cores, the caches can also be power-gated per way. With each deeper idle state, additional ways are flushed, invalidated, and turned off.
  
 
The [[integrated graphics]] has its own variable power plane which can run at an entirely different voltage and frequency than the cores. The graphics are not power gated, but the voltage is cut off when the graphics needs to go into a sleep state. As mentioned earlier, the System Agent has a fixed power plane which carries many different voltages for the various I/Os and logic (e.g., Display, [[PCIe]], [[DDR]], etc.). The System Agent incorporates a programmable power plane which has a set of predefined voltages that the hardware signals can select from.
  
 
=== Active power optimization ===
 
Optimizing for performance means trying to deliver as much power as possible to demanding components, all while meeting stringent constraints. The power algorithms take into account various constraints when considering what [[P-State]] (i.e., voltage and frequency) to operate in: the CPU capabilities, the platform specification (e.g., platform cooling capabilities), power delivery, graphics driver and operating system inputs, actual user controls (e.g., system preferences), and the type of workload (e.g., [[I/O bound]] workloads will not enjoy a performance increase through increased frequency). Improvements in this area come from throughput improvement and responsiveness (branded under "Turbo Boost 2.0").
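
One way to picture the P-state decision is as a clamp: the requested frequency is limited by every constraint, and the lowest limit wins. The sketch below is only an illustration of that framing (the parameter names are made up, and the real PCU firmware weighs far more inputs).

<source lang="c">
/* Illustrative P-state clamp: the OS request is bounded by thermal,
 * power-delivery, and silicon limits; the most restrictive one wins. */
static int select_frequency_mhz(int os_request_mhz,
                                int thermal_limit_mhz,
                                int power_delivery_limit_mhz,
                                int silicon_max_mhz)
{
    int f = os_request_mhz;
    if (f > thermal_limit_mhz)        f = thermal_limit_mhz;
    if (f > power_delivery_limit_mhz) f = power_delivery_limit_mhz;
    if (f > silicon_max_mhz)          f = silicon_max_mhz;
    return f;
}
</source>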
  
 
In order to optimize the active power, the chip needs to be able to determine its power draw in real time. Sandy Bridge features Intel's 3rd generation power metering. Power metering is event-based: many different counters track the main activity blocks of the die. An energy cost is then applied to each of the hundreds of different event counters, which are summed up in order to obtain the active power. The die also incorporates fuses in different areas in order to be able to obtain the leakage and idle static power of the system, which is used along with the active power to get an estimate of the entire chip's power. Most of this functionality is exposed to software as well via {{x86|MSR}}s.
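
A minimal sketch of such event-based metering, under stated assumptions (the event names and per-event energy costs below are invented for illustration; the real meter weighs hundreds of events and reads leakage from fused calibration data):

<source lang="c">
#include <stdint.h>

#define N_EVENTS 4  /* illustrative; the real meter tracks hundreds */

/* Hypothetical per-event energy costs, in nanojoules per event. */
static const double energy_cost_nj[N_EVENTS] = {
    0.35,   /* e.g., µOP executed */
    1.20,   /* e.g., L2 access */
    4.00,   /* e.g., DRAM access */
    0.08,   /* e.g., result forwarded on the bypass network */
};

/* Active power = weighted event counts over the interval; total power
 * adds the static leakage obtained from on-die fuses. */
static double estimate_power_w(const uint64_t events[N_EVENTS],
                               double interval_s, double leakage_w)
{
    double energy_nj = 0.0;
    for (int i = 0; i < N_EVENTS; i++)
        energy_nj += (double)events[i] * energy_cost_nj[i];
    return energy_nj * 1e-9 / interval_s + leakage_w;
}
</source>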
 
==== New thermal capacitance model ====
 
 
[[File:sandy bridge dynamic thermal capacitance.png|right|350px]]
 
Prior to Sandy Bridge, Intel used a static model for thermal capacitance. That is, traditionally, if a model was certified for a specific TDP wattage, under no circumstance would the chip be allowed to run any hotter than that rating. This has the implication of treating temperature changes as instant. In reality, temperature changes are not instant: there is a short period of time early on when the heat sink is relatively cool and can absorb heat considerably faster. With Sandy Bridge, Intel moved to a dynamic model which allows the chip to take advantage of the period of time when the heat spreader is still cool and can dissipate more heat quicker. For the desktop parts this can be a period of over a minute in which Sandy Bridge can operate at considerably higher frequencies and run much hotter.
  
 
Sandy Bridge uses an exponential moving average filter model to estimate the energy budget.
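
A minimal sketch of such a filter, assuming a simple first-order exponential moving average (the smoothing factor and its relation to the real thermal time constant are illustrative): while the filtered power stays below the sustained budget, the chip accumulates headroom it can spend on short bursts above TDP.

<source lang="c">
/* First-order exponential moving average over periodic power samples.
 * alpha (0 < alpha <= 1) sets the effective averaging time constant. */
typedef struct {
    double avg_w;   /* filtered power estimate, watts */
    double alpha;   /* smoothing factor */
} ewma_t;

static double ewma_update(ewma_t *f, double sample_w)
{
    f->avg_w += f->alpha * (sample_w - f->avg_w);
    return f->avg_w;   /* compared against the TDP budget */
}
</source>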
