From WikiChip
Editing intel/microarchitectures/haswell (client)
Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.
The edit can be undone.
Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.
This page supports semantic in-text annotations (e.g. "[[Is specified as::World Heritage Site]]") to build structured and queryable content provided by Semantic MediaWiki. For a comprehensive description on how to use annotations or the #ask parser function, please have a look at the getting started, in-text annotation, or inline queries help pages.
Latest revision | Your text | ||
Line 22: | Line 22: | ||
|stages min=14 | |stages min=14 | ||
|stages max=19 | |stages max=19 | ||
− | |isa=x86-64 | + | |isa=IA-32 |
+ | |isa 2=x86-64 | ||
|extension=MOVBE | |extension=MOVBE | ||
|extension 2=MMX | |extension 2=MMX | ||
Line 138: | Line 139: | ||
==== CPU changes ==== | ==== CPU changes ==== | ||
− | Haswell can | + | Haswell can execute more classes of instructions with 4 ops/cycle throughput. SandyBridge/Ivybridge could do so only for NOPs, CLC, some vector MOVs and some zeroing instructions (SUB, XOR and vector analogs). |
* MOVSX and MOVZX have 4 op/cycle throughput for 8->32, 8->64 and 16->64 bit forms. | * MOVSX and MOVZX have 4 op/cycle throughput for 8->32, 8->64 and 16->64 bit forms. | ||
− | * | + | * Some ALU operations have 4 op/cycle throughput for 32-bit registers: XOR, OR, NEG, NOT, although not all (ADD, SUB, CMP and AND don't). |
* Variable shifts and rotates (SHL r32, CL etc) latency increased from 1 cycle to 2 cycles, variable SHLD/SHRD from 2 cycles to 4 cycles. | * Variable shifts and rotates (SHL r32, CL etc) latency increased from 1 cycle to 2 cycles, variable SHLD/SHRD from 2 cycles to 4 cycles. | ||
* REP MOVS copy is twice as fast: now ~52 bytes/cycle. | * REP MOVS copy is twice as fast: now ~52 bytes/cycle. | ||
Line 152: | Line 153: | ||
====New instructions ==== | ====New instructions ==== | ||
+ | {{main|#Added instructions|l1=See #Added_instructions for the complete list}} | ||
Haswell introduced a number of new instructions: | Haswell introduced a number of new instructions: | ||
* {{x86|AVX2|<code>AVX2</code>}} - Advanced Vector Extensions 2; an extension that extends most integer instructions to 256 bits vectors. | * {{x86|AVX2|<code>AVX2</code>}} - Advanced Vector Extensions 2; an extension that extends most integer instructions to 256 bits vectors. | ||
+ | ** Vector Gather supprt | ||
+ | ** Any-to-Any permutes | ||
+ | ** Vector-Vector Shifts | ||
* {{x86|BMI1|<code>BMI1</code>}} - Bit Manipulation Instructions Sets 1 | * {{x86|BMI1|<code>BMI1</code>}} - Bit Manipulation Instructions Sets 1 | ||
* {{x86|BMI2|<code>BMI2</code>}} - Bit Manipulation Instructions Sets 2 | * {{x86|BMI2|<code>BMI2</code>}} - Bit Manipulation Instructions Sets 2 | ||
Line 159: | Line 164: | ||
* {{x86|FMA3|<code>FMA3</code>}} - Floating Point Multiply Accumulate, 3 operands | * {{x86|FMA3|<code>FMA3</code>}} - Floating Point Multiply Accumulate, 3 operands | ||
* {{x86|TSX|<code>TSX</code>}} - Transactional Synchronization Extensions | * {{x86|TSX|<code>TSX</code>}} - Transactional Synchronization Extensions | ||
− | |||
− | |||
=== Block Diagram === | === Block Diagram === | ||
Line 168: | Line 171: | ||
=== Memory Hierarchy === | === Memory Hierarchy === | ||
− | The memory hierarchy in Haswell had a number of changes from its predecessor. The cache bandwidth for both load and store have been doubled (64B/cycle for load and 32B/cycle for store; up from 32/16 respectively). Significant enhancements have been done to support the new gather instructions and transactional memory. With | + | The memory hierarchy in Haswell had a number of changes from its predecessor. The cache bandwidth for both load and store have been doubled (64B/cycle for load and 32B/cycle for store; up from 32/16 respectively). Significant enhancements have been done to support the new gather instructions and transactional memory. With haswell new port 7 which adds an address generation for stores, up to two loads and one store are possible each cycle. |
* Cache | * Cache | ||
Line 226: | Line 229: | ||
==== Front-end ==== | ==== Front-end ==== | ||
− | The front-end is the complicated part of the microarchitecture | + | The front-end is the complicated part of the microarchitecture has it deals with variable length x86 instructions ranging from 1 to 15 bytes. The main goal here is to fetch and decode correctly the next set of instructions. The caches have not changed in Haswell from {{\\|Ivy Bridge}}, with the [[L1i$]] still 32KB , 8-way set associative shared dynamically by the two threads. Instruction cache instruction fetching remains 16B/cycle. [[TLB]] is also still 128-entries, 4-way for 4KB pages and 8-entries, [[fully associative]] for 2MB page mode. The fetched instructions are then moved on to an instruction queue which has 40 entries, 20 for each thread. Haswell continued to improve the branch misses although the exact details have not been made public. |
Haswell has the same µOps cache as Ivy Bridge - 1,536 entries organized in 32 sets of 8 cache lines with 6 µOps each. Hits can yield up to 4-µOps/cycle. The cache supports microcoded instructions (being pointers to ROM entries). Cache is shared by the two threads. | Haswell has the same µOps cache as Ivy Bridge - 1,536 entries organized in 32 sets of 8 cache lines with 6 µOps each. Hits can yield up to 4-µOps/cycle. The cache supports microcoded instructions (being pointers to ROM entries). Cache is shared by the two threads. | ||
Line 233: | Line 236: | ||
==== Execution engine ==== | ==== Execution engine ==== | ||
− | Continuing with the decoder is the [[register renaming]] stage. This is crucial for out-of-order execution. In this stage the architectural x86 registers get mapped into one of the many physical registers. The integer physical register file (PRF) has been enlarged by 8 addition registers for a total 168. Likewise the FP PRF was extended by 24 registers bringing it too to 168 registers. The larger increase in the FP PRF is likely to accommodate the new {{x86|AVX2}} extension. The [[reorder buffer|ROB]] in Haswell has been increased to 192 entries (from 168 in Ivy) where each entry corresponds to a single µOp. The | + | Continuing with the decoder is the [[register renaming]] stage. This is crucial for out-of-order execution. In this stage the architectural x86 registers get mapped into one of the many physical registers. The integer physical register file (PRF) has been enlarged by 8 addition registers for a total 168. Likewise the FP PRF was extended by 24 registers bringing it too to 168 registers. The larger increase in the FP PRF is likely to accommodate the new {{x86|AVX2}} extension. The [[reorder buffer|ROB]] in Haswell has been increased to 192 entries (from 168 in Ivy) where each entry corresponds to a single µOp. The ROD is fixed split between the two threads. Additional scheduler resources get allocated as well - this includes stores, loads, and branch buffer entries. Note that due to how dependencies are handled, there may be more or less µOps than what was fed in. For the most part, the renamer is unified and deals with both integers and vectors. Resources, however, are partitioned between the two threads. Finally, as a last step, the µOps are matched with a port depending on their intended execution purpose. Up to 4 fused µOps may be renamed and handled per thread per cycle. Both the load and store in-flight units were increased to 72 and 42 entries respectively. |
Haswell continues to use a unified scheduler for all µOps which holds 60 entries. µOps at this stage sit idle until they are cleared to be executed via their assigned dispatch port. µOps may be held due to resource unavailability. | Haswell continues to use a unified scheduler for all µOps which holds 60 entries. µOps at this stage sit idle until they are cleared to be executed via their assigned dispatch port. µOps may be held due to resource unavailability. | ||
Line 250: | Line 253: | ||
{{oc warning}} | {{oc warning}} | ||
− | Overclocking needs to be done on an unlocked part such as the [[Core i7-5820K]], [[Core i7-5930K]], or [[Core i7-5960X]] Extreme Edition. Additionally those chips | + | Overclocking needs to be done on an unlocked part such as the [[Core i7-5820K]], [[Core i7-5930K]], or [[Core i7-5960X]] Extreme Edition. Additionally those chips needs to be paired with the Intel X99 Chipset. |
[[File:haswell oc chips.png|500px|left]] | [[File:haswell oc chips.png|500px|left]] | ||
− | The 5930K and the 5820K are [[ | + | The 5930K and the 5820K are [[hex-core]] parts whereas the [[5960X]] is an octa-core part. Between 28 and 40 [[PCIe]] lanes are possible with a core ratio of up to x80 the [[BCLK]]. |
[[File:haswell bclk.png|300px|right]] | [[File:haswell bclk.png|300px|right]] | ||
Line 275: | Line 278: | ||
Client die come in [[dual-core|2]], [[quad-core|4]], or [[octa-core|8]] cores setup with dual/quad being mainstream models and the [[octa-core]] being the high-end desktop. | Client die come in [[dual-core|2]], [[quad-core|4]], or [[octa-core|8]] cores setup with dual/quad being mainstream models and the [[octa-core]] being the high-end desktop. | ||
− | ==== Dual-core | + | ====Dual-core ==== |
− | |||
− | |||
− | |||
− | |||
− | + | : [[File:haswell die (dual-core).jpg|850px]] | |
− | |||
− | |||
− | |||
− | |||
− | + | ====Quad-core ==== | |
− | |||
− | ====Quad-core | ||
* [[22 nm process]] | * [[22 nm process]] | ||
* 1,400,000,000 transistors | * 1,400,000,000 transistors | ||
Line 300: | Line 293: | ||
: [[File:haswell die (quad-core) (annotated).png|850px]] | : [[File:haswell die (quad-core) (annotated).png|850px]] | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
====Octa-core ==== | ====Octa-core ==== | ||
Line 319: | Line 306: | ||
:[[File:haswell (octa-core) die shot (annotated).png|650px]] | :[[File:haswell (octa-core) die shot (annotated).png|650px]] | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
== Added instructions == | == Added instructions == | ||
Line 659: | Line 636: | ||
created and tagged accordingly. | created and tagged accordingly. | ||
− | Missing a chip? please dump its name here: | + | Missing a chip? please dump its name here: http://en.wikichip.org/wiki/WikiChip:wanted_chips |
--> | --> | ||
− | + | <table class="wikitable sortable"> | |
− | <table class=" | + | <tr><th colspan="12" style="background:#D6D6FF;">Haswell Chips</th></tr> |
− | <tr | + | <tr><th colspan="9">Main processor</th><th colspan="3">IGP</th></tr> |
− | <tr | + | <tr><th>Model</th><th>µarch</th><th>Platform</th><th>Core</th><th>Launched</th><th>SDP</th><th>TDP</th><th>Freq</th><th>Max Mem</th><th>Name</th><th>Freq</th><th>Max Freq</th></tr> |
− | + | {{#ask: [[Category:microprocessor models by intel]] [[instance of::microprocessor]] [[microarchitecture::Haswell]] | |
− | < | ||
− | {{#ask: [[Category:microprocessor models by intel]] [[instance of::microprocessor]] [[microarchitecture::Haswell | ||
|?full page name | |?full page name | ||
|?model number | |?model number | ||
− | |? | + | |?microarchitecture |
− | |? | + | |?platform |
− | |||
|?core name | |?core name | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
|?first launched | |?first launched | ||
− | |? | + | |?sdp |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
|?tdp | |?tdp | ||
− | |?base frequency | + | |?base frequency |
− | + | |?max memory | |
− | |||
− | |||
− | |||
− | |?max memory | ||
|?integrated gpu | |?integrated gpu | ||
|?integrated gpu base frequency | |?integrated gpu base frequency | ||
|?integrated gpu max frequency | |?integrated gpu max frequency | ||
|format=template | |format=template | ||
− | |template=proc table | + | |template=proc table 2 |
− | + | |userparam=13 | |
− | |||
− | |||
− | |userparam= | ||
|mainlabel=- | |mainlabel=- | ||
− | |||
}} | }} | ||
− | {{ | + | <tr><th colspan="12">Count: {{#ask:[[Category:microprocessor models by intel]][[instance of::microprocessor]][[microarchitecture::Haswell]]|format=count}}</th></tr> |
</table> | </table> | ||
− | |||
== References == | == References == |
Facts about "Haswell - Microarchitectures - Intel"
codename | Haswell + |
core count | 2 +, 4 +, 6 +, 8 +, 16 +, 10 +, 12 +, 14 + and 18 + |
designer | Intel + |
first launched | June 4, 2013 + |
full page name | intel/microarchitectures/haswell (client) + |
instance of | microarchitecture + |
instruction set architecture | x86-64 + |
manufacturer | Intel + |
microarchitecture type | CPU + |
name | Haswell + |
phase-out | 2015 + |
pipeline stages (max) | 19 + |
pipeline stages (min) | 14 + |
process | 22 nm (0.022 μm, 2.2e-5 mm) + |