|stages max=19
|decode=4-way
|isa=x86-64
|predecessor=Westmere
|predecessor link=intel/microarchitectures/westmere
 
| {{intel|Sandy Bridge M|l=core}} || SNB-M || Mobile processors
|-
| Sandy Bridge || SNB || Desktop performance to value, AiOs, and minis
|-
| {{intel|Sandy Bridge E|l=core}} || SNB-E || Workstations & entry-level servers
|}

== Process Technology ==
[[File:sandy bridge wafer.jpg|right|thumb|300px|Sandy Bridge Wafer]]
{{further|intel/microarchitectures/westmere#Process_Technology|l1=Westmere § Process Technology}}
Sandy Bridge uses the same [[32 nm process]] used for the Westmere microarchitecture for all mainstream consumer parts.

! Vendor !! OS  !! Version !! Notes
|-
| rowspan="3" | [[Microsoft]] || rowspan="3" | Windows || {{tchk|yes|Windows XP}}
|-
| {{tchk|yes|Windows Vista}}

== Architecture ==
Sandy Bridge features an entirely new microarchitecture with a brand new core design that is both higher-performing and more power-efficient. The front-end has been entirely redesigned to incorporate a new decoded pipeline using a new µOP cache. The back-end is an entirely new PRF-based renaming architecture with a considerably larger parallelism window. Sandy Bridge also provides considerably higher integration versus its {{intel|microarchitectures|predecessors}}, resulting in a full [[system on a chip]] design.
 
=== Key changes from {{\\|Westmere}} ===
[[File:sandy bridge buffer window.png|right|350px]]
 
** Now integrated on the same die (previously on a second die)
** Dropped {{intel|QPI}} controller which linked the two dies
* Testability
** Integrates {{intel|Generic Debug eXternal Connection}} (GDXC)

==== New instructions ====
 
=== Block Diagram ===
==== Entire SoC Overview (dual) ====
[[File:sandy bridge soc block diagram (dual).svg|800px]]

==== Entire SoC Overview (quad) ====
[[File:sandy bridge soc block diagram.svg|900px]]

==== Individual Core ====
  
 
=== Cache Architecture ===
As part of the entire system overhaul, the cache architecture has been streamlined and made more scalable. Sandy Bridge features a high-bandwidth [[last level cache]] which is shared by all the [[physical core|cores]] as well as the [[integrated graphics]] and the [[system agent]]. The LLC is an inclusive multi-bank cache architecture that is tightly associated with the individual cores. Each core is paired with a "slice" of LLC which is 2 [[MiB]] in size (less on lower-end models). This pairing of cores and cache slices scales with the number of cores, providing a significant performance boost while saving power and bandwidth. Partitioning the data also helps simplify coherency as well as reduce localized contentions and hot spots.
  
Sandy Bridge is Intel's first microarchitecture to integrate the graphics on-die. One of the key enablers for this feature is the new cache architecture. Conceptually, the integrated graphics is treated just like another core. The graphics itself can decide which of its buffers (e.g., display or textures) will sit in the LLC and be coherent. Because it's possible for the graphics to use large amounts of memory, the possibility of [[cache thrashing|thrashing]] had to be addressed. A special mechanism has been implemented inside the cache in order to prevent thrashing. In order to prevent the graphics from flushing out all the cores' data, that mechanism is capable of capping how much of the cache will be dedicated to the cores and how much will be dedicated to the graphics.
  
 
The last level cache is an inclusive cache with a 64-byte cache line organized as 16-way set associative. Each LLC slice is accessible to all cores. With up to 2 MiB per slice per core, a four-core model will sport a total of 8 MiB. Lower-end/budget models feature a smaller cache slice. This is done by disabling ways of cache in 4-way increments (for a granularity of 512 KiB). The LLC load-to-use latency in Sandy Bridge has been greatly improved, from 35-40+ cycles in {{\\|Nehalem}} to 26-31 cycles (depending on ring hops).
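
To make the way-disabling arithmetic concrete, the following C sketch (an illustration built only from the figures above; the constants and layout are otherwise made up) computes slice capacity from the number of enabled ways:

<syntaxhighlight lang="c">
#include <stdio.h>

/* Illustrative model of the 16-way, 2 MiB LLC slice arithmetic
   described above; not Intel's logic or any real configuration API. */
#define SLICE_WAYS  16
#define SLICE_BYTES (2u << 20)                    /* 2 MiB full slice  */
#define WAY_BYTES   (SLICE_BYTES / SLICE_WAYS)    /* 128 KiB per way   */

int main(void) {
    /* Ways are disabled in 4-way increments -> 512 KiB granularity. */
    for (unsigned ways = SLICE_WAYS; ways > 0; ways -= 4)
        printf("%2u ways -> %4u KiB per slice\n", ways,
               (ways * WAY_BYTES) >> 10);
    /* Four full slices on a quad-core die: 4 x 2 MiB = 8 MiB LLC. */
    printf("4 cores x 2 MiB = %u MiB total LLC\n", 4 * (SLICE_BYTES >> 20));
    return 0;
}
</syntaxhighlight>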
 
Within each cache slice is a cache box. The cache box is the controller and agent serving as the interface between the LLC and the rest of the system (i.e., cores, graphics, and system agent). The cache box implements the ring logic as well as the address hashing and arbitration. It is also tasked with communicating with the {{intel|System Agent}} on cache misses, non-cacheable accesses (such as in the case of I/O pulls), as well as external I/O [[snoop]] requests. The cache box is fully pipelined and is capable of working on multiple requests at the same time. Agent requests (e.g., ones from the GPU) are handled by the individual cache boxes via the ring. This is much different from how it was previously done in {{\\|Westmere}}, where a single unified cache handled everything. The distribution of slices allows Sandy Bridge to achieve higher cache bandwidth while reducing traffic.
  
The entire physical address space is mapped distributively across all the slices using a [[hash function]]. On a [[cache miss|miss]], the core needs to decode the address to figure out which slice ID to request the data from. Physical addresses are hashed at the source in order to prevent hot spots. The cache box is responsible for maintaining coherency and ordering between requests. Because the LLC slices are fully inclusive, they can make efficient use of an on-die snoop filter. Each slice makes use of {{intel|Core Valid Bits}} (CVB) which are used to eliminate unnecessary snoops to the cores. A single bit per cache line is needed to indicate if the line may be in the core. Snoops are therefore only needed if the line is in the LLC and a CVB is asserted on that line. This mechanism helps limit external snoops to the cache box most of the time without resorting to going to the cores.
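
Intel has never published the actual hash, so the sketch below is purely an assumption for illustration: it only demonstrates the idea of reducing the physical-address bits above the 64-byte line offset to a slice ID so that consecutive lines spread evenly and hot spots are avoided.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

/* Hypothetical slice-selection hash; the folding constants and the
   modulo are illustrative, not Sandy Bridge's undisclosed function. */
static unsigned slice_id(uint64_t paddr, unsigned n_slices) {
    uint64_t line = paddr >> 6;         /* drop the 64 B line offset  */
    line ^= line >> 12;                 /* fold higher bits together  */
    line ^= line >> 24;
    return (unsigned)(line % n_slices); /* hardware avoids a divide   */
}

int main(void) {
    for (uint64_t a = 0; a < 4 * 64; a += 64)  /* 4 consecutive lines */
        printf("paddr 0x%03llx -> slice %u\n",
               (unsigned long long)a, slice_id(a, 4));
    return 0;
}
</syntaxhighlight>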
  
 
Note that both the cache slices and the cache boxes reside within the same [[clock domain]] as the cores themselves - sharing the same voltage and frequency and scaling along with the cores when needed.
 
===== Logic design techniques =====
 
[[File:sandy bridge vcc min improvements.png|right|300px|thumb|The graph, which was provided by Intel during [[ISSCC]], demonstrates the [[Vcc]] min reduction after these techniques have been applied to the circuits.]]
 
One of the new challenges that Sandy Bridge presented is a consequence of the shared [[power plane]] that spans the [[physical core|cores]] and the [[L3 cache]]. Using a standard design, the minimum voltage required to keep the L3 cache data alive may limit the minimum operating voltage of the cores. This would consequently adversely affect the average power of the system. In previous designs, this issue could be solved by moving the L3 to its own dedicated power plane that operated at a higher voltage. However, this solution would've substantially increased the power dissipation of the L3 cache, and since Sandy Bridge incorporated as much as 8 MiB of L3 for the higher-end quad-core models, this impact would have accounted for a big fraction of the die's overall power consumption. Intel used several logic design techniques to minimize the minimum voltage of the L3 cache and the [[register file]] to allow it to operate at a lower level than the core logic.
 
 
[[File:sandy bridge register file shared strength and post silicon tuning.png|left|300px]]
 
Shown here is a schematic of one of the techniques that were used in the [[register file]] in order to lower the minimum Vcc voltage. Intel noted that fabrication variations can cause write-ability degradation at low voltages - such as in the case where TP comes out stronger than TN. This method addresses the issue caused by a strong TP by weakening the memory cell pull-up device's effective strength. Note the three parallel [[transistors]] on the very right side of the schematic (inside the square box) labeled T1, T2, and T3. The effective size of the shared [[PMOS]] is set during [[post-silicon]] production testing by enabling and disabling any combination of the three parallel transistors to achieve the desired result.
 
  
 
=== Ring Interconnect ===
  
 
[[File:sandy bridge ring flow.svg|right|275px]]
The four rings consist of a considerable amount of wiring and routing. Because the routing runs in the upper metal layers over the LLC, the rings have no real impact on die area. As with the LLC slices, the ring is fully pipelined and operates within the core's [[clock domain]], scaling with the frequency of the core. The bandwidth of the ring also scales with each additional core/$slice pair that is added onto the ring; however, with more cores the ring becomes more congested, adding latency as the average hop count increases. Intel expected the ring to support a fairly large number of cores before facing real performance issues.
  
 
It's important to note that the term ring refers to its structure and not necessarily how the data flows. The ring is not a round-robin and requests may travel up or down as needed. The use of address hashing allows the source agent to know exactly where the destination is. In order to reduce latency, the ring is designed such that all accesses on the ring always pick the shortest path. Because of this aspect of the ring and the fact that some requests can take longer than others to complete, the ring might have requests being handled out of order. It is the responsibility of the source agents to handle the ordering requirements. The ring [[cache coherency]] protocol is largely an enhancement of Intel's {{intel|QuickPath Interconnect|QPI}} protocols with a [[MESI]]-based source snooping protocol. On each cycle, the agents receive an indication of whether there is an available slot on the ring for communication in the next cycle. When asserted, the agent can send any type of communication (e.g., data or snoop) on the ring the following cycle.
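
The shortest-path decision can be made from the stop IDs alone. Here is a toy C model (the stop numbering and direction encoding are assumptions for illustration, not the actual ring protocol) of picking the cheaper of the two directions on an N-stop ring:

<syntaxhighlight lang="c">
#include <stdio.h>

/* Toy shortest-path routing on a bidirectional ring; not Sandy
   Bridge's actual arbitration logic. */
static int ring_hops(int src, int dst, int n_stops, int *up) {
    int cw  = (dst - src + n_stops) % n_stops;   /* hops going "up"   */
    int ccw = (src - dst + n_stops) % n_stops;   /* hops going "down" */
    *up = (cw <= ccw);
    return cw <= ccw ? cw : ccw;
}

int main(void) {
    int up;
    /* e.g., six stops: four core/$-slice pairs, graphics, system agent */
    int hops = ring_hops(1, 5, 6, &up);
    printf("%d hops, going %s\n", hops, up ? "up" : "down");
    return 0;
}
</syntaxhighlight>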
 
The System Agent (SA) is a centralized peripheral device integration unit. It contains what was previously the traditional [[Memory Controller Hub]] (MCH), which includes all the I/O such as [[PCIe]], {{intel|DMI}}, and others. Additionally, the SA incorporates the [[memory controller]] and the display engine, which works in tandem with the integrated graphics. The major enabler for the new System Agent is in fact the [[32 nm process]], which allowed for considerably higher integration, with over a dozen [[clock domain]]s and [[PHY]]s.

The System Agent interfaces with the rest of the system via the ring in a similar manner to the cache boxes in the LLC slices. It is also in charge of handling I/O-to-cache coherency. The System Agent enables [[direct memory access]] (DMA), which allows devices to snoop the cache hierarchy. Address conflicts resulting from multiple concurrent requests associated with the same cache line are also handled by the System Agent.
  
 
[[File:sandy bridge power overview.svg|275px|left]]
  
 
==== <span style="float: right;">Power</span> ====
With Sandy Bridge, Intel introduced a large number of power features to save power depending on the workload, temperature, and what I/O is being utilized. The new power features are handled at the System Agent. Sandy Bridge introduced a fully-fledged Power Control Unit (PCU). The root concept for this unit started with {{\\|Nehalem}}, which embedded a discrete microcontroller on-die that handled all the power management related tasks in firmware. The new PCU extends this functionality with numerous state machines and firmware algorithms. The firmware, which Intel Fellow Opher Kahn described at IDF 2010, contains thousands of lines of code to handle very complex power management tasks. Intel uses firmware to be able to fine-tune chips post-silicon as well as fix problems to extract the best performance.
  
 
Sandy Bridge has three separate voltage domains: System Agent, Cores, and Graphics. Both the Cores and Graphics are variable voltage domains; both are able to scale up and down in voltage and frequency based on controls from the Power Control Unit. The cores and graphics scale up and down depending on the workloads in order to deliver an instantaneous performance boost through higher frequency whenever possible. The System Agent sits on a third power plane which is low voltage (especially on mobile devices) in order to consume as little power as necessary.
  
 
=== Pipeline ===
The Sandy Bridge core focuses on extracting performance and reducing power in a great number of ways. Intel placed heavy emphasis in the cores on performance-enhancing features that can provide a more-than-linear performance-to-power ratio as well as features that provide more performance while reducing power. The various enhancements can be found in both the front-end and the back-end of the core.
  
 
==== Broad Overview ====
  
 
==== Front-end ====
The front-end is tasked with the challenge of fetching the complex [[x86]] instructions from memory, decoding them, and delivering them to the execution units. In other words, the front-end needs to be able to consistently deliver enough [[µOPs]] from the [[instruction code stream]] to keep the back-end busy. When the back-end is not being fully utilized, the core is not reaching its full performance. A weak or under-performing front-end will directly affect the back-end, resulting in a poorly performing core. In the case of Sandy Bridge, this challenge is further complicated by various redirections such as branches and the complex nature of the [[x86]] instructions themselves.
  
The entire front-end was redesigned from the ground up in Sandy Bridge. The four major changes in the front-end of Sandy Bridge are the entirely new [[µOP cache]], the overhauled branch predictor, the further decoupling of the front-end, and the improved macro-op fusion capabilities. All of those features not only improve performance but also reduce power draw at the same time.
  
 
===== Fetch & pre-decoding =====
Blocks of memory arrive at the core from either the local cache slice or further down [[#Ring_Interconnect|the ring]] from one of the other cache slices - or, on occasion and far less desirably, from main memory. On their first pass, instructions should have already been prefetched from the [[L2 cache]] into the [[L1 cache]]. The L1 is a 32 [[KiB]], 64B line, 8-way set associative cache. The [[instruction cache]] is identical in size to that of {{\\|Nehalem}}'s but its associativity was increased to 8-way. Sandy Bridge fetching is done on a 16-byte fetch window, a window size that has not changed in a number of generations. Up to 16 bytes of code can be fetched each cycle. Note that the fetcher is shared evenly between two threads, so that each thread gets every other cycle. At this point the instructions are still [[macro-ops]] (i.e., variable-length [[x86]] architectural instructions). Instructions are brought into the pre-decode buffer for initial preparation.
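
The per-thread fetch bandwidth follows directly from the alternation rule. A toy C model (only the strict round-robin policy comes from the text; the rest is illustrative):

<syntaxhighlight lang="c">
#include <stdio.h>

/* Toy model of the shared fetcher: the 16-byte fetch window is granted
   to the two SMT threads in strict alternation, one thread per cycle. */
#define FETCH_BYTES 16

int main(void) {
    int fetched[2] = {0, 0};
    for (int cycle = 0; cycle < 8; cycle++)
        fetched[cycle & 1] += FETCH_BYTES;  /* threads alternate cycles */
    /* With both threads active, each averages 8 bytes per cycle. */
    printf("thread 0: %d B, thread 1: %d B over 8 cycles\n",
           fetched[0], fetched[1]);
    return 0;
}
</syntaxhighlight>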
  
 
[[File:sandy bridge fetch.svg|left|300px]]
 
===== Decoding =====
[[File:sandy bridge decode.svg|right|350px]]
Up to four pre-decoded instructions (or five in cases where one of the instructions was macro-fused) are sent to the decoders each cycle. Like the fetchers, the decoders alternate between the two threads each cycle. Decoders read in [[macro-operations]] and emit regular, fixed-length [[µOPs]]. The decoder organization in Sandy Bridge has been kept more or less the same as in {{\\|Nehalem}}. As with its predecessor, Sandy Bridge features four decoders. The decoders are asymmetric; the first one, Decoder 0, is a [[complex decoder]] while the other three are [[simple decoders]]. A simple decoder is capable of translating instructions that emit a single fused-[[µOP]]. By contrast, a [[complex decoder]] can decode anywhere from one to four fused-µOPs. Overall up to 4 simple instructions can be decoded each cycle, with lesser amounts if the complex decoder needs to emit additional µOPs; i.e., for each additional µOP the complex decoder emits, one less simple decoder can operate.
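
The decode tradeoff can be expressed numerically. The sketch below models only the rule just described (it is not a cycle-accurate decoder model): each µOP beyond the first from Decoder 0 deactivates one simple decoder, keeping total µOP output capped at four per cycle.

<syntaxhighlight lang="c">
#include <stdio.h>

/* Models the decode-slot rule described above, for illustration. */
static void decode_cycle(int complex_uops, int *insts, int *uops) {
    int simple_active = 3 - (complex_uops - 1);  /* 3..0 simple decoders */
    if (simple_active < 0) simple_active = 0;
    *insts = 1 + simple_active;            /* instructions this cycle */
    *uops  = complex_uops + simple_active; /* µOPs emitted (max 4)    */
}

int main(void) {
    for (int k = 1; k <= 4; k++) {
        int insts, uops;
        decode_cycle(k, &insts, &uops);
        printf("complex emits %d -> %d instructions, %d µOPs total\n",
               k, insts, uops);
    }
    return 0;
}
</syntaxhighlight>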
  
Sandy Bridge brought about the first 256-bit [[SIMD]] set of instructions called {{x86|AVX}}. This extension expanded the sixteen pre-existing 128-bit {{x86|XMM}} registers to 256-bit {{x86|YMM}} registers for floating point vector operations (note that {{\\|Haswell}} expanded this further to [[Integer]] operations as well). Most of the new AVX instructions have been designed as simple instructions that can be decoded by the simple decoders.
  
 
====== MSROM & Stack Engine ======
There are more complex instructions that are not trivial to decode even by the complex decoder. For instructions that transform into more than four µOPs, the instruction detours through the [[microcode sequencer]] (MS) ROM. When that happens, up to 4 µOPs/cycle are emitted until the microcode sequencer is done. During that time, the decoders are disabled.
  
[[x86]] has dedicated [[stack machine]] operations. Instructions such as <code>{{x86|PUSH}}</code>, <code>{{x86|POP}}</code>, as well as <code>{{x86|CALL}}</code> and <code>{{x86|RET}}</code> all operate on the [[stack pointer]] (<code>{{x86|ESP}}</code>). Without any specialized hardware, such operations would need to be sent to the back-end for execution using the general purpose ALUs, using up some of the bandwidth and utilizing scheduler and execution unit resources. Since {{\\|Pentium M}}, Intel has been making use of a [[Stack Engine]]. The Stack Engine has a set of three dedicated adders it uses to perform and eliminate the stack-updating µOPs (i.e., it is capable of handling three additions per cycle). An instruction such as <code>{{x86|PUSH}}</code> is translated into a store and a subtraction of 4 from <code>{{x86|ESP}}</code>. The subtraction in this case will be done by the Stack Engine. The Stack Engine sits after the [[instruction decode|decoders]] and monitors the µOPs stream as it passes by. Incoming stack-modifying operations are caught by the Stack Engine. This alleviates the burden on the pipeline from stack pointer-modifying µOPs. In other words, it's cheaper and faster to calculate stack pointer targets at the Stack Engine than it is to send those operations down the pipeline to be done by the execution units (i.e., general purpose ALUs).
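
A minimal sketch of the idea (the structure and names are assumed for illustration, not Intel's design): the engine accumulates a local delta to ESP for each stack operation instead of sending an add/sub µOP to the back-end ALUs.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

/* Toy stack-engine model: PUSH/POP adjust a locally tracked ESP delta
   so no ALU µOP is issued for the pointer update itself. */
struct stack_engine { int32_t esp_delta; };

static void on_push(struct stack_engine *se) { se->esp_delta -= 4; }
static void on_pop (struct stack_engine *se) { se->esp_delta += 4; }

int main(void) {
    struct stack_engine se = {0};
    on_push(&se);   /* PUSH: store + ESP -= 4, handled locally */
    on_push(&se);
    on_pop(&se);    /* POP: load + ESP += 4, handled locally   */
    /* The architectural ESP is only patched with the accumulated
       delta when a later µOP actually needs its value. */
    printf("pending ESP delta: %d\n", se.esp_delta);   /* -4 */
    return 0;
}
</syntaxhighlight>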
  
 
===== New µOP cache & x86 tax =====
[[File:sandy bridge ucache.svg|right|400px]]
Decoding the variable-length, inconsistent, and complex [[x86]] instructions is a nontrivial task. It's also expensive in terms of performance and power. Therefore, the best way for the pipeline to avoid those costs is to simply not decode the instructions. This is exactly what Intel has done with Sandy Bridge, and it's perhaps the single biggest feature that has been added to the core. With Sandy Bridge, Intel introduced a new [[µOP cache]] unit, perhaps more appropriately called the Decoded Stream Buffer (DSB). The micro-op cache is unique in that not only does it substantially improve performance, but it does so while significantly reducing power.
  
 
On the surface, the µOP cache can be conceptualized as a second instruction cache unit that is a subset of the [[level one instruction cache]]. What's unique about it is that it stores actual decoded instructions (i.e., µOPs). While it shares many of the goals of {{\\|NetBurst}}'s [[trace cache]], the two implementations are entirely different; this is especially true as it pertains to how it augments the rest of the front-end. The idea behind both mechanisms is to increase the front-end bandwidth by reducing reliance on the decoders.
 
The µOP cache has an average hit rate of around 80%. During the instruction fetch, the branch predictor will probe the µOP cache tags. A hit in the cache allows for up to 4 µOPs (possibly fused macro-ops) per cycle to be sent directly to the Instruction Decode Queue (IDQ), bypassing all the pre-decoding and decoding that would otherwise have to be done. During those cycles, the rest of the front-end is entirely clock-gated, which is how the substantial power saving is gained. Whereas the legacy decode path works in 16-byte instruction fetch windows, the µOP cache has no such restriction and can deliver 4 µOPs/cycle corresponding to the much bigger 32-byte window. Since a single window can be made of up to 18 µOPs, up to 5 whole cycles may be required to read out the entire decoded stream. Nonetheless, the µOP cache can deliver consistently higher bandwidth than the legacy pipeline, which is limited to the 16-byte fetch window and can be a serious [[bottleneck]] if the average instruction length is more than four bytes per window.
  
It's interesting to note that the µOP cache only operates on full windows, that is, a full 32B window that has all of its µOPs cached. Any partial hits are required to go through the legacy decode pipeline as if nothing was cached. The choice not to handle partial cache hits is rooted in this feature's efficiency. On partial window hits, the core would end up emitting some µOPs from the micro-op cache while the legacy decode pipeline decodes the remaining missed µOPs; effectively, multiple components would end up emitting µOPs. Not only would such a mechanism increase complexity, but it's also unclear how much benefit, if any, would be gained from it.
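
The read-out arithmetic is easy to verify: at 4 µOPs per cycle, an n-µOP window takes ceil(n/4) cycles, so a full 18-µOP window needs 5 cycles.

<syntaxhighlight lang="c">
#include <stdio.h>

/* Illustrates the µOP-cache read-out arithmetic described above. */
static int cycles_to_stream(int window_uops) {
    return (window_uops + 3) / 4;    /* ceil(n / 4) at 4 µOPs/cycle */
}

int main(void) {
    printf("18-µOP window: %d cycles\n", cycles_to_stream(18)); /* 5 */
    printf(" 4-µOP window: %d cycles\n", cycles_to_stream(4));  /* 1 */
    return 0;
}
</syntaxhighlight>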
  
 
As noted earlier, the µOP cache has its roots in the original {{\\|NetBurst}} trace cache, particularly as far as goals are concerned. But that's where the similarities end. The micro-op cache in Sandy Bridge can be seen as a light-weight, efficient µOP delivery mechanism that can surpass the legacy pipeline whenever possible. This is different from the trace cache, which attempted to effectively replace the entire front-end and used a vastly inferior and slow fetch/decode path for workloads that the trace cache could not handle. It's worth noting that the trace cache was costly and complicated (having dedicated components such as a trace BTB unit) and had various side-effects such as needing to flush on context switches - deficiencies the µOP cache doesn't have. The trace cache resulted in a lot of duplication as well; for example, {{\\|NetBurst}}'s 12k µOP trace cache had a hit rate comparable to an 8 KiB-16 KiB L1I$, whereas this 1.5K µOP cache is comparable to a 6 KiB instruction cache. This implies a storage efficiency of four-fold or greater.
  
 
===== Renaming & Allocation =====
On each cycle, up to 4 µOPs can be delivered here from the front-end from one of the two threads. As stated earlier, the Re-Order Buffer is now a light-weight component that tracks the in-flight µOPs. The ROB in Sandy Bridge is 168 entries, allowing for up to 40 additional µOPs in-flight over {{\\|Nehalem}}. At this point of the pipeline the µOPs are still handled sequentially (i.e., in order), with each µOP occupying the next entry in the ROB. This entry is used to track the correct execution order and statuses. In order for the ROB to rename an integer µOP, there needs to be an available Integer PRF entry. Likewise, for FP and SIMD µOPs there needs to be an available FP PRF entry. Following renaming, all bets are off and the µOPs are free to execute as soon as their dependencies are resolved.
  
It is at this stage that [[architectural registers]] are mapped onto the underlying [[physical registers]]. Other additional bookkeeping tasks are also done at this point, such as allocating resources for stores and loads, and determining all possible scheduler ports. Register renaming is also controlled by the [[Register Alias Table]] (RAT), which is used to mark where the data we depend on is coming from (after that value, too, came from an instruction that has previously been renamed). Sandy Bridge's move to PRF-based renaming has a fairly substantial impact on power too. With {{x86|AVX|the new}} {{x86|extensions|instruction set extension}}, which allows for 256-bit operations, a retirement would've meant large amounts of 256-bit values being needlessly moved to the Retirement Register File each time. This is entirely eliminated in Sandy Bridge. The decoupling of the PRFs from the RAT/ROB likely means some added latency is at play here, but the overall benefits are more than worth it.
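
A highly simplified sketch of PRF-style renaming (the free-list structure is an assumption for illustration; only the 160-entry integer PRF depth comes from the text): each write to an architectural register is pointed at a fresh physical register, so back-to-back writes to the same register carry no false dependency.

<syntaxhighlight lang="c">
#include <stdio.h>

/* Minimal alias-table sketch; not Sandy Bridge's actual RAT. */
#define ARCH_REGS 16
#define PHYS_REGS 160            /* Sandy Bridge integer PRF depth */

static int rat[ARCH_REGS];       /* architectural -> physical      */
static int next_free = ARCH_REGS;

static int rename_dst(int arch) {
    if (next_free == PHYS_REGS) return -1;  /* stall: PRF exhausted */
    rat[arch] = next_free++;     /* a real core recycles freed regs */
    return rat[arch];
}

int main(void) {
    for (int r = 0; r < ARCH_REGS; r++) rat[r] = r;
    /* Two back-to-back writes to "eax" get independent phys regs. */
    printf("eax -> p%d\n", rename_dst(0));
    printf("eax -> p%d\n", rename_dst(0));
    return 0;
}
</syntaxhighlight>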
  
 
There are no special costs involved in splitting up fused µOPs before execution or [[retirement]], and the two fused µOPs only occupy a single entry in the ROB; however, the rename registers have more than doubled over {{\\|Nehalem}} in order to accommodate those fused operations. The RAT is capable of handling 4 µOPs each cycle. Note that the ROB still operates on fused µOPs, therefore 4 µOPs can effectively be as high as 8 µOPs.
| Not only does this instruction get eliminated at the ROB, but it's actually encoded as just 2 bytes <code>31 C0</code> vs the 5 bytes for <code>{{x86|mov}} {{x86|eax}}, 0x0</code> which is encoded as <code>b8 00 00 00 00</code>.
|}
Sandy Bridge introduced a number of new optimizations it performs prior to entering the out-of-order and renaming part. Two of those optimizations are [[Zeroing Idioms]] and [[Ones Idioms]]. The first common optimization performed in Sandy Bridge is [[Zeroing Idioms]] elimination, a dependency-breaking optimization. A number of common zeroing idioms are recognized and consequently eliminated. This is done prior to bookkeeping at the ROB, eliminating those µOPs entirely, saving resources and breaking any dependency. Eliminated zeroing idioms are zero latency and are entirely removed from the pipeline (i.e., retired).
  
 
Sandy Bridge recognizes instructions such as <code>{{x86|XOR}}</code>, <code>{{x86|PXOR}}</code>, and <code>{{x86|XORPS}}</code> as zeroing idioms when the [[source operand|source]] and [[destination operand|destination]] operands are the same. Those optimizations are done during renaming at the same rate as renaming itself (at 4 µOPs per cycle), and the architectural register is simply set to zero (no actual physical register is used).
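
A sketch of the recognition rule (the opcode set comes from the text; the decoder structures are assumptions for illustration):

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdio.h>

/* Illustrative zeroing-idiom check: XOR/PXOR/XORPS with identical
   source and destination always produces zero, so the renamer can
   point the architectural register at zero and retire the µOP
   without ever executing it. */
enum opcode { OP_XOR, OP_PXOR, OP_XORPS, OP_ADD };

struct uop { enum opcode op; int dst, src; };

static bool is_zeroing_idiom(const struct uop *u) {
    return (u->op == OP_XOR || u->op == OP_PXOR || u->op == OP_XORPS)
        && u->dst == u->src;
}

int main(void) {
    struct uop a = { OP_XOR, /*eax*/ 0, /*eax*/ 0 };
    struct uop b = { OP_ADD, /*eax*/ 0, /*ecx*/ 1 };
    printf("xor eax, eax: %s\n", is_zeroing_idiom(&a) ? "eliminated" : "executed");
    printf("add eax, ecx: %s\n", is_zeroing_idiom(&b) ? "eliminated" : "executed");
    return 0;
}
</syntaxhighlight>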
  
The [[ones idioms|ones idiom]] is another dependency-breaking idiom that can be optimized. All of the various {{x86|PCMPEQ|PCMPEQx}} instructions that perform a packed comparison of a register with itself always set all bits to one. In those cases, while the µOP still has to be executed, the instruction may be scheduled as soon as possible because the current state of the register need not be known.
  
 
===== Scheduler =====
 
===== Scheduler =====
 
[[File:sandy bridge scheduler.svg|right|500px]]
 
[[File:sandy bridge scheduler.svg|right|500px]]
Following bookkeeping at the ROB, µOPs are sent to the scheduler. Everything here can be done out-of-order whenever dependencies are cleared up and the µOP can be executed. Sandy Bridge features a very large unified scheduler that is dynamically shared between the two threads. The scheduler is exactly one and half times bigger than the reservation station found in {{\\|Nehalem}} (a total of 54 entries). The various internal reordering buffers have been significantly increased as well. The use of a unified scheduler has the advantage of dipping into a more flexible mix of µOPs, resulting in a more efficient and higher throughput design.
+
Following bookkeeping at the ROB, µOP are sent to the schedule. Everything here can be done out-of-order whenever dependencies are cleared up and the µOP can be executed. Sandy Bridge features a very large unified scheduler that is dynamically shared between the two threads. The scheduler is exactly one and half times bigger than the reservation station found in {{\\|Nehalem}} (a total of 54 entries). The various internal reordering buffers have been significantly increased as well. The use of a unified scheduler has the advantage of dipping into a more flexible mix of µOPs, resulting in a more efficient and higher throughput design.
  
Sandy Bridge has two distinct [[physical register files]] (PRF). 64-bit data values are stored in the 160-entry Integer PRF, while [[Floating Point]] and vector data values are stored in the Vector PRF which has been extended to 256 bits in order to accommodate the new {{x86|AVX}} {{x86|YMM}} registers. The Vector PRF is 144-entry deep which is slightly smaller than the Integer one. It's worth pointing out that prior to Sandy Bridge, code that relied on constant register reading was bottlenecked by a limitation in the register file which was limited to three reads. This restriction has been eliminated in Sandy Bridge.
+
Sandy Bridge has two distinct [[physical register files]] (PRF). 64-bit data values are stored in the 160-entry Integer PRF, while [[Floating Point]] and vector data values are store in the Vector PRF which has been extended to 256 bits in order to accommodate the new {{x86|AVX}} {{x86|YMM}} registers. The Vector PRF is 144-entry deep which is slightly smaller than the Integer one. It's worth pointing out that prior to Sandy Bridge, code that relied on constant register reading was bottlenecked by a limitation in the register file which was limited to three reads. This restriction has been eliminated in Sandy Bridge.
  
 
At this point, µOPs are no longer fused and will be dispatched to the execution units independently. The scheduler holds the µOPs while they wait to be executed. A µOP could be waiting on an operand that has not yet arrived (e.g., fetched from memory or currently being calculated by another µOP) or because the execution unit it needs is busy. Once a µOP is ready, it is dispatched through its designated port.
  
 
== Configurability ==
Configurability was a major design goal for Sandy Bridge and is something that Intel spent considerable effort on. With a highly-configurable design, using the same [[macro cells]], Intel can meet the different market segment requirements. Sandy Bridge configurability was presented during the [[2011]] [[International Solid-State Circuits Conference]]. A copy of the paper can be [http://ieeexplore.ieee.org/abstract/document/5746311/ found here].

[[File:sandy bridge chop layout.png|right]]
  
 
In addition to the cache, both {{intel|Hyper-Threading|multi-threading}} and the number of PCIe lanes offered can be disabled or enabled depending on the exact model offered.
 
=== Physical layout ===
 
Below is the [[die]] area breakdown for the [[DDR]] [[PHY]], I/O PHY, SA, GPU, L3$, and the core. Note that the areas of the DDR and I/O PHYs for the two smaller dies have been extrapolated from the sizes of the components of the quad-core die and may therefore be slightly off.
 
 
<gallery widths=500px heights=275px caption="Physical Layout Breakdown" style="float:right">
File:sandy bridge die area (quad).svg|Quad-core die, GT2 GPU
File:sandy bridge die area (dual, gt2).svg|Dual-core die, GT2 GPU
File:sandy bridge die area (dual, gt1).svg|Dual-core die, GT1 GPU
</gallery>
 
 
Some [[macrocells]] remain the same regardless of the die configuration. Below is the percentage breakdown by component for each of the dies. (Values are rounded and may not add up to exactly 100%.)
 
 
{| class="wikitable"
! colspan="4" | Area Percentage
|-
! Die Configuration !! 4+2 !! 2+2 !! 2+1
|-
| {{intel|System Agent}} || 8.3% || 12.1% || 13.7%
|-
| [[L3$]] + {{intel|Ring}} || 19.9% || 14.4% || 16.4%
|-
| [[physical cores|Cores]] || 34.3% || 24.8% || 28.2%
|-
| [[Integrated Graphics]] || 17.6% || 25.5% || 16.0%
|-
| [[DDR]]/[[I/O]] PHYs || 19.9% || 23.2% || 25.6%
|}
 
 
== Testability ==
 
With the higher level of integration, testing also increases in complexity because internal data signals are no longer exposed externally. Those internal [[signals]] are quite valuable because they can provide the designers with insight into the flow of threads and data among the cores and caches. Sandy Bridge introduced the {{intel|Generic Debug eXternal Connection}} (GDXC), a debug bus that allows snooping the ring bus in order to monitor the traffic that flows between the cores, the GPU, the caches, and the {{intel|System Agent}}. GDXC allows chip, system, or software debuggers to sample the traffic on the ring bus, including the ring protocol control signals themselves, and dump the data to an external analyzer via a dedicated on-package probe array. GDXC is protected by an Intel patent (US20100332927, filed June 30, 2009), which should expire on June 30, 2029.
 
 
It's worth pointing out that GDXC is inherently vulnerable to a [[physical attack|physical]] [[port attack]] if it is left accessible after factory testing.
 
 
== Clock Domains ==
 
[[File:sandy bridge clock domains.png|right|500px]]
 
Sandy Bridge reworked the way clock generation is done. There are now 13 [[phase-locked loop|PLLs]] driving independent clock domains for the individual cores, the cache slices, the integrated graphics, the {{intel|System Agent}}, and the four independent I/O regions. The goal was to ensure uniformity and consistency across all clock domains.
 
 
A single external reference clock is provided by the {{intel|Platform Control Hub|Platform Controller Hub}} (PCH) chip. The {{intel|BCLK}}, the system bus clock which dates back to the {{intel|FSB}}, is now the reference clock and has been set to 100 MHz; note that this has changed from 133 MHz in previous architectures. The BCLK provides the reference edge for all clock domains. Because the core slices and the integrated graphics run at variable frequencies which {{intel|turbo boost|scale with workload}} and voltage requirements, the slice PLLs and the GPU PLL sit behind their own 100 MHz Reference Spine. This was done to minimize clock skew across the different power planes. Intel used low-[[jitter]] PLLs (long-term jitter σ < 2 ps is reported) in addition to the vertical clock spines and embedded [[clock compensator]]s to achieve good [[clock skew]] performance, measured at 16 [[picoseconds|ps]].
 
 
The System Agent PLL generates a variety of frequencies for the different zones such as the PCU, SA, and Display Engine. Additionally, a separate 133 MHz reference clock is generated for the main memory system.
 
 
=== Overclocking ===
 
{{oc warning}}
 
Because of the way clock generation is done in Sandy Bridge, some overclocking options are no longer possible. Overclocking is generally done on [[unlocked]] parts such as the [[Core i5-2500K]] and [[Core i7-2600K]] processors.
 
 
[[File:sandy bridge bclk.png|left|300px]]
 
 
Sandy Bridge initially received some bad press over its overclocking capabilities when it was revealed that the BCLK could effectively no longer be raised. Since the {{intel|BCLK}} is used as the reference clock for all clock domains - including PCIe, the display link, and the System Agent - increasing it will almost certainly introduce system instability (most likely involving I/O). Note that in the image on the left, the "DMICLK" is the same as the "BCLK". Intel reported very limited headroom of roughly 2 to 3%. It is therefore recommended to leave the BCLK at its 100 MHz value.
 
 
The overclockable models can be overclocked using their [[clock multiplier]]s; this requires the desktop "P" PCH. In the diagram on the left, '''(xC)''' refers to the core frequency multiplier (Core Frequency = BCLK × Core Frequency Multiplier, up to x57). Likewise, '''(xM)''' refers to the memory ratio (up to 2133 MT/s).
 
 
Note that '''(xP)''' and '''(xD)''' refer to the PEG (PCIe & Graphics) and {{intel|DMI}} ratios. Those ratios are not adjustable and scale with the BCLK if it is overclocked, which is shown arithmetically below.
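Since every derived clock is just a ratio of the BCLK, the difference between the two overclocking approaches reduces to simple arithmetic. The values below are illustrative, not a recommendation:

<syntaxhighlight lang="c">
#include <stdio.h>

int main(void) {
    const double bclk = 100.0;  /* MHz; reference for every clock domain */
    const int xc = 45;          /* (xC) core multiplier, up to x57 */

    /* Multiplier overclocking: only the core domain is affected. */
    printf("core: %.0f MHz\n", bclk * xc);  /* 4500 MHz */

    /* BCLK overclocking: every domain scales with it, including the
     * non-adjustable PEG (xP) and DMI (xD) ratios, which is why even
     * a few percent quickly causes I/O instability. */
    const double bclk_oc = 103.0;
    printf("core: %.0f MHz, I/O clocks off-spec by %.1f%%\n",
           bclk_oc * xc, (bclk_oc / bclk - 1.0) * 100.0);
    return 0;
}
</syntaxhighlight>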
 
 
{{clear}}
 
  
 
== Power ==
 
 
The Power Control Unit (PCU) is located in the {{intel|System Agent}}; it incorporates the various power management hardware logic as well as a dedicated microcontroller running firmware that controls the various power features of the device. Communication with the [[physical cores]] and the graphics is done via a dedicated power management channel over the ring. The unit constantly reads the physical parameters of the various parts of the chip in real time, allowing it to optimize the power efficiency of the die. The power unit is exposed to the outside world via a set of external outputs which allow it to interact with the rest of the system, controlling the voltage regulator and an external power management controller.
  
Sandy Bridge has two variable power planes and a single fixed power plane for the {{intel|System Agent}}. The first variable plane covers the ring, the cache, and the physical cores. Note that this is a single power plane shared by all those components, which means they all move up or down in frequency and voltage together. Each of the individual cores can be entirely power gated when needed, such as when the core goes into a deeper [[C state]]. When this happens, the core state is saved into one of the ways of the cache and the core is entirely shut off. As with the cores, the caches can also be power-gated per way; with each deeper idle state, additional ways are invalidated, flushed, and turned off.
  
 
The [[integrated graphics]] has its own variable power plane which can run at an entirely different voltage and frequency than the cores. The graphics are not power gated, but the voltage is cut off when the graphics need to go into a sleep state. As mentioned earlier, the System Agent has a fixed power plane with many different voltages for the various I/Os and logic (e.g., display, [[PCIe]], [[DDR]], etc.). The System Agent incorporates a programmable power plane with a set of predefined voltages which the hardware signals can select from.
  
 
=== Active power optimization ===
 
Optimizing for performance means trying to deliver as much power as possible to demanding components while meeting stringent constraints. The power algorithms take into account various constraints when deciding which [[P-State]] (i.e., voltage and frequency) to operate in: the CPU capabilities, the platform specification (e.g., platform cooling capabilities), power delivery, graphics driver and operating system inputs, actual user controls (e.g., system preferences), and the type of workload (e.g., [[I/O bound]] workloads will not see a performance increase from higher frequency). Improvements in this area come from throughput improvement and responsiveness (branded as "Turbo Boost 2.0").
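Conceptually, the selection reduces to running at the highest frequency that satisfies every constraint simultaneously. The sketch below is a hypothetical simplification - the constraint names and values are made up, and the real PCU firmware weighs many more inputs:

<syntaxhighlight lang="c">
#include <stdio.h>

static double min2(double a, double b) { return a < b ? a : b; }

int main(void) {
    /* Hypothetical constraint inputs, each expressed as a core ratio. */
    double max_ratio     = 38.0;  /* CPU capability (fused maximum turbo) */
    double thermal_limit = 36.0;  /* from platform cooling headroom */
    double power_limit   = 34.0;  /* from the power-delivery budget */
    double os_request    = 57.0;  /* OS asks for maximum performance */

    /* Pick the highest ratio permitted by all constraints at once. */
    double ratio = min2(min2(max_ratio, thermal_limit),
                        min2(power_limit, os_request));
    printf("selected P-state: x%.0f (%.1f GHz at 100 MHz BCLK)\n",
           ratio, ratio * 0.1);
    return 0;
}
</syntaxhighlight>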
  
 
In order to optimize the active power, the chip needs to be able to determine its power draw in real time. Sandy Bridge features Intel's 3rd-generation power metering: an event-based power meter incorporating many different counters that track the main activity blocks of the die. An energy cost is applied to each of the hundreds of event counters, and the results are summed to obtain the active power. The die also incorporates fuses in different areas to obtain the leakage and idle static power of the system, which is combined with the active power to estimate the entire chip's power. Most of this functionality is exposed to software via {{x86|MSR}}s.
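On Linux, the energy counters backed by this power meter can be read through the RAPL (Running Average Power Limit) MSRs that debuted with Sandy Bridge: MSR 0x606 (RAPL_POWER_UNIT) and MSR 0x611 (PKG_ENERGY_STATUS). The sketch below assumes the msr kernel module is loaded and the program runs as root:

<syntaxhighlight lang="c">
/* Read the package energy counter via the Linux msr driver.
 * Build: gcc -O2 rapl.c -o rapl ; run as root after `modprobe msr`. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_RAPL_POWER_UNIT   0x606
#define MSR_PKG_ENERGY_STATUS 0x611

static uint64_t rdmsr(int fd, uint64_t msr) {
    uint64_t v = 0;
    /* The msr device maps the file offset to the MSR address. */
    if (pread(fd, &v, sizeof v, (off_t)msr) != sizeof v)
        perror("rdmsr");
    return v;
}

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Bits 12:8 give the energy unit as 1/2^ESU joules
     * (default 16 -> about 15.3 microjoules per count). */
    unsigned esu = (rdmsr(fd, MSR_RAPL_POWER_UNIT) >> 8) & 0x1f;
    double joules_per_count = 1.0 / (double)(1u << esu);

    uint64_t before = rdmsr(fd, MSR_PKG_ENERGY_STATUS) & 0xffffffff;
    sleep(1);
    uint64_t after  = rdmsr(fd, MSR_PKG_ENERGY_STATUS) & 0xffffffff;

    /* The 32-bit counter wraps; one second is far below the wrap period. */
    printf("package power: %.2f W\n",
           (double)(uint32_t)(after - before) * joules_per_count);
    close(fd);
    return 0;
}
</syntaxhighlight>

Dividing the counter delta by the elapsed time gives the average package power; this is the same mechanism higher-level tools such as turbostat build upon.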
 
==== New thermal capacitance model ====
[[File:sandy bridge dynamic thermal capacitance.png|right|350px]]
Prior to Sandy Bridge, Intel used a static model for thermal capacitance: if a model was certified for a specific TDP wattage, under no circumstance would the chip be allowed to run any hotter than that rating. This effectively treats temperature changes as instant. In reality, temperature changes are not instant; there is a short period early on when the heat sink is relatively cool and can absorb heat considerably faster. With Sandy Bridge, Intel moved to a dynamic model which allows the chip to take advantage of the period of time when the heat spreader is still cool and can dissipate more heat more quickly. For desktop parts this can be a period of over a minute in which Sandy Bridge can operate at considerably higher frequencies and run much hotter.
  
 
Sandy Bridge uses an exponentially weighted moving average filter to estimate the remaining energy budget.
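In such a filter, past power samples decay exponentially, so a package that has been cool for a while accumulates budget for a burst above the sustained limit. A minimal sketch follows; the time constant, limits, and starting point are illustrative values, not Intel's:

<syntaxhighlight lang="c">
/* Build: gcc ewma.c -o ewma -lm */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double tau   = 28.0;   /* s; averaging window (illustrative) */
    const double pl1   = 95.0;   /* W; sustained power limit */
    const double burst = 120.0;  /* W; short-term draw while budget lasts */
    const double dt    = 1.0;    /* s; sample period */
    double avg = 40.0;           /* W; start with a cool, lightly loaded package */

    /* Run the burst until the moving average reaches the sustained limit. */
    int t = 0;
    while (avg < pl1) {
        double k = 1.0 - exp(-dt / tau);
        avg += k * (burst - avg);   /* EWMA update: old samples decay away */
        t++;
    }
    printf("burst sustained for %d s before throttling to PL1\n", t);
    return 0;
}
</syntaxhighlight>

With these example numbers the burst lasts roughly half a minute before the average catches up, which matches the scale of the "over a minute" window described above for cooler starting conditions.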
 
=== Idle power optimization ===
 
 
For mobile-specific power improvements, Intel improved battery life with Sandy Bridge as well as overall power efficiency, and reduced the [[form factor]]. Key mobile features such as video playback have been substantially improved. The core, graphics, and package [[C-States]] have been enhanced to enable low-power idle and lower average power usage. Many new algorithms have been implemented which can also take platform preferences from the BIOS; this enables OEMs to more finely tune the performance characteristics of the processor by indicating what kind of device it is and how power consumption and saving should be handled.
 
== Graphics ==
 
{{main|intel/microarchitectures/gen6|l1=Gen6}}
 
Sandy Bridge supports up to two independent displays.
 
 
{| class="wikitable tc2 tc3"
|-
! colspan="4" | Gen6 [[IGP]] Models !! colspan="5" | Standards
|-
! rowspan="2" | Name !! rowspan="2" | Execution Units !! rowspan="2" | Tier !! rowspan="2" | Series !! colspan="3" | [[Direct3D]] !! colspan="2" | [[OpenGL]]
|-
| Windows || Linux || [[High Level Shading Language|HLSL]] || Windows || Linux
|-
| {{intel|HD Graphics (Sandy Bridge)|HD Graphics}} || 6 || GT1 || {{intel|Sandy Bridge M|M|l=core}}/{{intel|Sandy Bridge H|H|l=core}} || rowspan="4" style="text-align: center;" | '''10.1''' || rowspan="4" style="text-align: center;" | '''N/A''' || rowspan="4" style="text-align: center;" | '''4.1''' || rowspan="4" style="text-align: center;" | '''3.1''' || rowspan="4" style="text-align: center;" | '''3.3'''
|-
| {{intel|HD Graphics 2000}} || 6 || GT1 || {{intel|Sandy Bridge H|H|l=core}}
|-
| {{intel|HD Graphics 3000}} || 12 || GT2 || {{intel|Sandy Bridge M|M|l=core}}/{{intel|Sandy Bridge H|H|l=core}}
|-
| {{intel|HD Graphics P3000}} || 12 || GT2 || {{intel|Sandy Bridge DT|DT|l=core}}
|}
 
=== Hardware Accelerated Video ===
 
{{sandy bridge hardware accelerated video table}}
 
 
== Sockets/Platform ==
 
Sandy Bridge comes in three different [[C4]] packages: [[PGA]] for mobile computers, [[LGA]] for desktop systems, and [[BGA]] for small form factor systems. For each of the three dies (see [[#Configurability|§ Configurability]]), a different [[package]] [[stack-up]] was developed, optimized for that particular design point: 10 layers for the highest-power die, and 8 or 6 layers for the smaller dies.
 
 
{| class="wikitable" style="text-align: center;"
|-
! Package !! Soldered !! Platform !! Chipset !! Bus
|-
| {{intel|FCBGA-1023}} || Yes || rowspan="4" | 2-chip solution || rowspan="4" | Cougar Point || rowspan="4" | [[DMI 2.0]]
|-
| {{intel|FCBGA-1224}} || Yes
|-
| {{intel|rPGA988B}} || No
|-
| {{intel|LGA-1155}} || No
|}
 
  
 
== Die ==
 
  
 
<gallery mode=slideshow>
File:sandy bridge die on a wafer.png|A partial wafer shot showing a complete quad-core die along with eight other dies around it.
File:sandy bridge whole wafer.png|An entire Sandy Bridge wafer.
File:sandy bridge wafer angled 1.png|Partial wafer shot, angled (shot 1).
</gallery>
 
== Bibliography ==
* Lempel, Oded. "2nd Generation Intel® Core Processor Family: Intel® Core i7, i5 and i3." Hot Chips 23 Symposium (HCS), 2011 IEEE. IEEE, 2011.
* Rotem, Efi, et al. "Power management architecture of the 2nd generation Intel® Core microarchitecture, formerly codenamed Sandy Bridge." Hot Chips 23 Symposium (HCS), 2011 IEEE. IEEE, 2011.
* Yuffe, Marcelo, et al. "A fully integrated multi-CPU, GPU and memory controller 32nm processor." Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011 IEEE International. IEEE, 2011.
 
 
== Documents ==
* Valentine, Bob. [[:File:introducing sandy bridge.pdf|"Introducing sandy bridge."]] Intel Developer Forum, San Francisco. 2010.
* [[:File:core-overview-of-2nd-generation-intel-core-processor-family-brief.pdf|2nd Generation Intel® Core Processor Family]]
* [[:File:performance-2nd-gen-core-product-brief.pdf|2nd Generation Intel® Core vPro Processor Family]]
* [[:File:performance-2nd-generation-core-vpro-family-paper.pdf|2nd Generation Intel® Core vPro Processor Family; New levels of security, manageability, and responsiveness]]
 
