From WikiChip
Editing intel/microarchitectures/sandy bridge (client)

 
===== Decoding =====
 
 
[[File:sandy bridge decode.svg|right|350px]]
 
Up to four pre-decoded instructions (or five in cases where one of the instructions was macro-fused) are sent to the decoders each cycle. Like the fetchers, the decoders alternate between the two threads each cycle. Decoders read in [[macro-operations]] and emit regular, fixed-length [[µOPs]]. The decoder organization in Sandy Bridge has been kept more or less the same as in {{\\|Nehalem}}. As with its predecessor, Sandy Bridge features four decoders. The decoders are asymmetric; the first one, Decoder 0, is a [[complex decoder]] while the other three are [[simple decoders]]. A simple decoder is capable of translating instructions that emit a single fused-[[µOP]]. By contrast, a [[complex decoder]] can decode anywhere from one to four fused-µOPs. Overall, up to four simple instructions can be decoded each cycle, with fewer if the complex decoder needs to emit additional µOPs; i.e., for each additional µOP the complex decoder emits, one fewer simple decoder can operate.
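The decoder-sharing rule described above can be illustrated with a toy model. This is only a sketch and not Intel's actual logic: the <code>decode_cycle</code> helper is hypothetical, it assumes a total budget of four fused-µOPs per cycle with only the first slot (Decoder 0) able to take a multi-µOP instruction, and it ignores macro-fusion and microcoded instructions.

```python
def decode_cycle(uop_counts, width=4):
    """Return how many instructions decode in one cycle (toy model).

    uop_counts: fused-µOP count of each pending instruction, in program order.
    Slot 0 models the complex decoder (1 to 4 µOPs); later slots model the
    simple decoders (exactly 1 µOP each). Every extra µOP emitted by the
    complex decoder leaves one fewer simple decoder usable this cycle.
    """
    decoded = 0
    budget = width  # total µOPs the decoders may emit this cycle (assumption)
    for slot, uops in enumerate(uop_counts[:width]):
        if uops > width:
            break  # would go to the microcode sequencer; not modelled here
        if slot > 0 and uops != 1:
            break  # simple decoders only handle single-µOP instructions
        if uops > budget:
            break  # no µOP budget left in this cycle
        budget -= uops
        decoded += 1
    return decoded


print(decode_cycle([1, 1, 1, 1]))  # four simple instructions
print(decode_cycle([2, 1, 1, 1]))  # one extra µOP costs one simple decoder
print(decode_cycle([4, 1, 1, 1]))  # complex decoder uses the whole budget
```

Under these assumptions the three calls report 4, 3, and 1 decoded instructions respectively.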
  
 
Sandy Bridge introduced {{x86|AVX}}, the first 256-bit [[SIMD]] instruction set extension. This extension expanded the sixteen pre-existing 128-bit {{x86|XMM}} registers to 256-bit {{x86|YMM}} registers for floating point vector operations (note that {{\\|Haswell}} later extended this to [[Integer]] operations as well). Most of the new AVX instructions have been designed as simple instructions that can be decoded by the simple decoders.
 
Some instructions are too complex to be handled even by the complex decoder. Instructions that translate into more than four µOPs detour through the [[microcode sequencer]] (MS) ROM. When that happens, up to 4 µOPs/cycle are emitted until the microcode sequencer is done. During that time, the decoders are disabled.
  
[[x86]] has dedicated [[stack machine]] operations. Instructions such as <code>{{x86|PUSH}}</code>, <code>{{x86|POP}}</code>, <code>{{x86|CALL}}</code>, and <code>{{x86|RET}}</code> all operate on the [[stack pointer]] (<code>{{x86|ESP}}</code>). Without any specialized hardware, such operations would need to be sent to the back-end for execution on the general purpose ALUs, using up some of the bandwidth and consuming scheduler and execution unit resources. Since {{\\|Pentium M}}, Intel has been making use of a [[Stack Engine]]. The Stack Engine has a set of three dedicated adders it uses to perform and eliminate the stack-updating µOPs (i.e., it is capable of handling three additions per cycle). An instruction such as <code>{{x86|PUSH}}</code> is translated into a store and a subtraction of 4 from <code>{{x86|ESP}}</code>. The subtraction in this case is done by the Stack Engine. The Stack Engine sits after the [[instruction decode|decoders]] and monitors the µOP stream as it passes by, catching incoming stack-modifying operations. This relieves the rest of the pipeline of stack pointer-modifying µOPs. In other words, it is cheaper and faster to calculate stack pointer targets at the Stack Engine than it is to send those operations down the pipeline to be done by the execution units (i.e., general purpose ALUs).
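As a rough illustration of the idea, the sketch below tracks the stack pointer adjustment in the front-end so that only the memory µOP reaches the back-end. The <code>StackEngineSketch</code> class and its running-offset scheme are assumptions made for illustration; the text above only states that dedicated adders absorb the stack pointer updates.

```python
class StackEngineSketch:
    """Toy model: keep a running ESP offset instead of issuing add/sub µOPs."""

    def __init__(self):
        self.delta = 0  # pending offset relative to the architectural ESP

    def handle(self, op, word=4):
        """Return the µOPs forwarded to the back-end for one stack instruction."""
        if op == "PUSH":
            self.delta -= word  # the subtraction is absorbed here
            return [f"store [ESP{self.delta:+d}]"]
        if op == "POP":
            uop = f"load [ESP{self.delta:+d}]"
            self.delta += word  # the addition is absorbed here
            return [uop]
        raise ValueError(f"not a modelled stack operation: {op}")


engine = StackEngineSketch()
for op in ["PUSH", "PUSH", "POP"]:
    print(op, "->", engine.handle(op))
```

In this model each <code>PUSH</code> or <code>POP</code> forwards a single memory µOP; no separate add or subtract µOP is ever sent to the ALUs.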
  
 
===== New µOP cache & x86 tax =====
 
  
 
===== Renaming & Allocation =====
 
On each cycle, up to 4 µOPs can be delivered here from the front-end from one of the two threads. As stated earlier, the Re-Order Buffer is now a light-weight component that tracks the in-flight µOPs. The ROB in Sandy Bridge has 168 entries, allowing for up to 40 more µOPs in flight than in {{\\|Nehalem}}. At this point of the pipeline the µOPs are still handled sequentially (i.e., in order), with each µOP occupying the next entry in the ROB. This entry is used to track the correct execution order and status. In order for an integer µOP to be renamed, there needs to be an available Integer PRF entry. Likewise, for FP and SIMD µOPs there needs to be an available FP PRF entry. Following renaming, the µOPs are free to execute as soon as their dependencies are resolved.
  
 
It is at this stage that [[architectural registers]] are mapped onto the underlying [[physical registers]]. Additional bookkeeping tasks are also done at this point, such as allocating resources for stores and loads and determining all possible scheduler ports. Register renaming is controlled by the [[Register Alias Table]] (RAT), which is used to mark where the data a µOP depends on is coming from (that value, too, came from an instruction that has previously been renamed). Sandy Bridge's move to PRF-based renaming has a fairly substantial impact on power too. With {{x86|AVX|the new}} {{x86|extensions|instruction set extension}}, which allows for 256-bit operations, retirement would have meant a large number of 256-bit values being needlessly moved to the Retirement Register File each time. This is entirely eliminated in Sandy Bridge. The decoupling of the PRFs from the RAT/ROB likely means some added latency is at play here, but the overall benefits are more than worth it.
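The mapping step can be sketched with a small alias table and a free list of physical registers. This is a generic illustration of PRF-based renaming, not Sandy Bridge's actual structures: the <code>RenamerSketch</code> class, its method names, and the register counts used in the example are all hypothetical, and freeing of physical registers at retirement is left out.

```python
class RenamerSketch:
    """Toy register renamer: a RAT plus a free list of physical registers."""

    def __init__(self, num_physical, arch_regs):
        # Initially each architectural register maps to its own physical register.
        self.rat = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, sources):
        """Rename one µOP; return (dest_phys, source_phys) or None on a stall."""
        if not self.free:
            return None  # no PRF entry available, so allocation must stall
        # Sources read the current mapping, i.e. the most recent producer.
        source_phys = [self.rat[s] for s in sources]
        dest_phys = self.free.pop(0)
        self.rat[dest] = dest_phys  # later readers of dest see the new entry
        return dest_phys, source_phys


renamer = RenamerSketch(num_physical=6, arch_regs=["eax", "ebx"])
print(renamer.rename("eax", ["eax", "ebx"]))  # eax = eax + ebx
print(renamer.rename("ebx", ["eax"]))         # reads the renamed eax
```

The second µOP picks up the physical register allocated by the first, which is how the dependency between the two is recorded.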
 
===== Scheduler =====
 
 
[[File:sandy bridge scheduler.svg|right|500px]]
 
Following bookkeeping at the ROB, µOPs are sent to the scheduler. Everything here can be done out-of-order: a µOP can be executed as soon as its dependencies are cleared. Sandy Bridge features a very large unified scheduler that is dynamically shared between the two threads. With a total of 54 entries, the scheduler is exactly one and a half times the size of the reservation station found in {{\\|Nehalem}}. The various internal reordering buffers have been significantly enlarged as well. A unified scheduler has the advantage of being able to draw from a more flexible mix of µOPs, resulting in a more efficient and higher throughput design.
  
 
Sandy Bridge has two distinct [[physical register files]] (PRF). 64-bit data values are stored in the 160-entry Integer PRF, while [[Floating Point]] and vector data values are stored in the Vector PRF which has been extended to 256 bits in order to accommodate the new {{x86|AVX}} {{x86|YMM}} registers. The Vector PRF is 144-entry deep which is slightly smaller than the Integer one. It's worth pointing out that prior to Sandy Bridge, code that relied on constant register reading was bottlenecked by a limitation in the register file which was limited to three reads. This restriction has been eliminated in Sandy Bridge.
 
 
Sandy Bridge reworked the way clock generation is done. There are now 13 [[phase-locked loop|PLLs]] driving independent clock domains for the individual cores, the cache slices, the integrated graphics, the {{intel|System Agent}}, and the four independent I/O regions. The goal was ensuring uniformity and consistency across all clock domains.
 
  
A single external reference clock is provided by the {{intel|Platform Control Hub}} (PCH) chip. The {{intel|BCLK}}, the System Bus Clock which dates back to the {{intel|FSB}}, is now the reference clock and has been set to 100 MHz. Note that this has changed from 133 MHz in previous architectures. The BCLK is the reference edge for all the clock domains. Because the core slices and the integrated graphics have variable frequency which {{intel|turbo boost|scales with workloads}} and voltage requirements, the Slice PLLs and GPU PLL sit behind their own 100 MHz Reference Spine. This was done to ensure clock skew is minimized as much as possible over the different power planes. Intel used low [[jitter]] PLLs (long term jitter σ < 2 ps is reported) in addition to the vertical clock spines and embedded [[clock compensator]]s to achieve good [[clock skew]] performance, which was measured at 16 [[picoseconds|ps]].
  
 
The System Agent PLL generates a variety of frequencies for the different zones, such as the PCU, SA, and Display Engine. Additionally, a separate 133 MHz reference clock is generated for the main memory system.
 
The Power Control Unit (PCU), located in the {{intel|System Agent}}, incorporates the various power management hardware logic as well as a dedicated microcontroller that runs firmware controlling the various power features of the device. The PCU communicates with the [[physical cores]] and the graphics over a dedicated power management path on the ring. The unit constantly reads the physical parameters of the various parts of the chip in real time, allowing it to optimize the power efficiency of the die. The power unit is exposed to the outside world via a set of external outputs which allow it to interact with the rest of the system to control the voltage regulator and an external power management controller.
  
Sandy Bridge has two variable power planes and a single fixed power plane for the {{intel|System Agent}}. The first variable plane covers the ring, cache, and the physical cores. Note that this is a single power plane shared by all those components, which means they all move up or down together in frequency and voltage. Each of the individual cores can be entirely power gated when needed, such as when the core goes into a higher [[C state]]. When this happens, the core state is saved into one of the ways of the cache and the core is entirely shut off. As with the cores, the caches can also be power-gated per way. With each deeper idle state, additional ways are invalidated, flushed, and turned off.
  
 
The [[integrated graphics]] has its own variable power plane, which can run at an entirely different voltage and frequency than the cores. The graphics are not power gated, but the voltage is cut off when the graphics need to go into a sleep state. As mentioned earlier, the System Agent has a fixed power plane with many different voltages for the various I/Os and logic (e.g., Display, [[PCIe]], [[DDR]], etc.). The System Agent incorporates a programmable power plane which has a set of predefined voltages that the hardware signals can select from.
  
 
=== Active power optimization ===
 
Optimizing for performance means trying to deliver as much power as possible to demanding components, all while meeting stringent constraints. Power algorithms take into account various constraints when considering what [[P-State]] (i.e., voltage and frequency) to operate in. These include the CPU capabilities, the platform specification (e.g., platform cooling capabilities), power delivery, graphics driver and operating system inputs, actual user controls (e.g., system preferences), and the type of workload (e.g., [[I/O bound]] workloads will not enjoy a performance increase from increased frequency). Improvements in that area come from throughput improvement and responsiveness (branded under "Turbo Boost 2.0").
  
 
In order to optimize the active power, the chip needs to be able to determine its real-time power. Sandy Bridge features Intel's 3rd generation power metering. The power meter is event-based and incorporates many different counters that track the main activity blocks of the die. An energy cost is applied to each of the hundreds of different event counters, and the results are summed up in order to obtain the active power. The die also incorporates fuses in different areas in order to be able to obtain the leakage and idle static power of the system, which is used along with the active power to get an estimate of the entire chip's power. Most of this functionality is exposed to software as well via {{x86|MSR}}s.
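The event-based estimate described above amounts to a weighted sum of counters plus a static term. The sketch below shows that arithmetic only; the <code>estimate_power</code> function, the event names, and the per-event energy costs are made-up placeholders and do not correspond to real Sandy Bridge counters or MSRs.

```python
def estimate_power(event_counts, energy_per_event_nj, static_power_w, interval_s):
    """Estimate total power (watts) over one sampling interval.

    event_counts:        events observed in the interval, keyed by event name
    energy_per_event_nj: assumed energy cost of each event, in nanojoules
    static_power_w:      leakage and idle power, e.g. derived from fused values
    interval_s:          length of the sampling interval, in seconds
    """
    active_energy_j = sum(
        count * energy_per_event_nj[name] * 1e-9
        for name, count in event_counts.items()
    )
    active_power_w = active_energy_j / interval_s
    return active_power_w + static_power_w


# Hypothetical counters sampled over 1 ms.
counts = {"alu_op": 1_000_000, "l2_access": 500_000}
costs_nj = {"alu_op": 1.0, "l2_access": 2.0}
print(estimate_power(counts, costs_nj, static_power_w=0.5, interval_s=1e-3))
```

With these placeholder numbers the active energy is 2 mJ over 1 ms, i.e. about 2 W of active power on top of the 0.5 W static term.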
 
==== New thermal capacitance model ====
 
 
[[File:sandy bridge dynamic thermal capacitance.png|right|350px]]
 
Prior to Sandy Bridge, Intel used a static model for thermal capacitance. That is, traditionally, if a model was certified for a specific TDP wattage, under no circumstances would the chip be allowed to run any hotter than that rating. This has the implication of treating temperature changes as instant. In reality, temperature changes are not instant, and there is a short period early on when the heat sink is relatively cool and can absorb heat considerably faster. With Sandy Bridge, Intel moved to a dynamic model which allows the chip to take advantage of the period of time when the heat spreader is still cool and can dissipate more heat more quickly. For the desktop parts this can be a period of over a minute in which Sandy Bridge can operate at considerably higher frequencies and run much hotter.
  
 
Sandy Bridge uses an exponentially weighted moving average filter to estimate the energy budget.
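A moving-average filter of this kind can be sketched in a few lines. The code below is a generic exponentially weighted moving average and not Intel's firmware algorithm: the smoothing factor <code>alpha</code>, the power samples, and the <code>has_headroom</code> helper are assumptions chosen only to show how a short burst above the rated power can be tolerated while the average stays below it.

```python
def ewma_power(samples_w, alpha):
    """Return the running exponentially weighted average of power samples."""
    average = samples_w[0]
    history = [average]
    for sample in samples_w[1:]:
        # New average = weighted mix of the latest sample and the old average.
        average = alpha * sample + (1 - alpha) * average
        history.append(average)
    return history


def has_headroom(samples_w, tdp_w, alpha):
    """True while the averaged power is still below the rated TDP."""
    return ewma_power(samples_w, alpha)[-1] < tdp_w


# Hypothetical trace: idle at 10 W, then one sample at 20 W.
print(ewma_power([10, 20], alpha=0.5))
print(has_headroom([10, 20], tdp_w=18, alpha=0.5))
```

In this example the average only rises to 15 W after the 20 W sample, so a chip rated at a hypothetical 18 W would still be judged to have budget left.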


codename: Sandy Bridge (client)
core count: 2 and 4
designer: Intel
first launched: September 13, 2010
full page name: intel/microarchitectures/sandy bridge (client)
instance of: microarchitecture
instruction set architecture: x86-64
manufacturer: Intel
microarchitecture type: CPU
name: Sandy Bridge (client)
phase-out: November 2012
pipeline stages (max): 19
pipeline stages (min): 14
process: 32 nm (0.032 μm, 3.2e-5 mm)