Difference between revisions of "intel/microarchitectures/haswell (client)"

(Execution Units: to low the congestion Port 0 -> to lower the congestion for Port 0)
(56 intermediate revisions by 19 users not shown)
Line 1: Line 1:
 
{{intel title|Haswell|arch}}
 
{{intel title|Haswell|arch}}
 
{{microarchitecture
 
{{microarchitecture
| name             = Haswell
+
|atype=CPU
| manufacturer     = Intel
+
|name=Haswell
| introduction     = June 4, 2013
+
|designer=Intel
| phase-out       = 2015
+
|manufacturer=Intel
| process         = 22 nm
+
|introduction=June 4, 2013
| cores           = 2
+
|phase-out=2015
| cores 2         = 4
+
|process=22 nm
| cores 3         = 6
+
|cores=2
| cores 4         = 8
+
|cores 2=4
| cores 5         = 16
+
|cores 3=6
| cores 6          = 32
+
|cores 4=8
 
+
|cores 5=16
| pipeline        = Yes
+
|type=Superscalar
| type             = Superscalar
+
|speculative=Yes
| OoOE            = Yes
+
|renaming=Yes
| speculative     = Yes
+
|stages min=14
| renaming         = Yes
+
|stages max=19
| isa              = IA-32
+
|isa=x86-64
| isa 2            = x86-64
+
|extension=MOVBE
| stages min       = 14
+
|extension 2=MMX
| stages max       = 19
+
|extension 3=SSE
| issues          = 4
+
|extension 4=SSE2
 
+
|extension 5=SSE3
| inst            = Yes
+
|extension 6=SSSE3
| feature          =  
+
|extension 7=SSE4.1
| extension       = MOVBE
+
|extension 8=SSE4.2
| extension 2     = MMX
+
|extension 9=POPCNT
| extension 3     = SSE
+
|extension 10=AVX
| extension 4     = SSE2
+
|extension 11=AVX2
| extension 5     = SSE3
+
|extension 12=AES
| extension 6     = SSSE3
+
|extension 13=PCLMUL
| extension 7     = SSE4.1
+
|extension 14=FSGSBASE
| extension 8     = SSE4.2
+
|extension 15=RDRND
| extension 9     = POPCNT
+
|extension 16=FMA3
| extension 10     = AVX
+
|extension 17=BMI
| extension 11     = AVX2
+
|extension 18=BMI2
| extension 12     = AES
+
|extension 19=F16C
| extension 13     = PCLMUL
+
|extension 20=VT-x
| extension 14     = FSGSBASE
+
|extension 21=VT-d
| extension 15     = RDRND
+
|extension 22=TXT
| extension 16     = FMA3
+
|extension 23=TSX
| extension 17     = BMI
+
|l1i=32 KB
| extension 18     = BMI2
+
|l1i per=core
| extension 19     = F16C
+
|l1i desc=8-way set associative
| extension 20     = VT-x
+
|l1d=32 KB
| extension 21     = VT-d
+
|l1d per=core
| extension 22     = TXT
+
|l1d desc=8-way set associative
| extension 23     = TSX
+
|l2=256 KB
 
+
|l2 per=core
| cache            = Yes
+
|l2 desc=8-way set associative
| l1i             = 32 KB
+
|l3=1.5 MB
| l1i per         = core
+
|l3 per=core
| l1i desc         = 8-way set associative
+
|l4=128 MB
| l1d             = 32 KB
+
|l4 per=package
| l1d per         = core
+
|l4 desc=on Iris Pro GPUs only
| l1d desc         = 8-way set associative
+
|core name=Haswell H
| l2               = 256 KB
+
|core name 2=Haswell E
| l2 per           = core
+
|core name 3=Haswell EP
| l2 desc         = 8-way set associative
+
|core name 4=Haswell EX
| l3               = 1.5 MB
+
|core name 5=Haswell DT
| l3 per           = core
+
|core name 6=Haswell MB
| l3 desc          =
+
|core name 7=Haswell ULT
| l4               = 128 MB
+
|core name 8=Haswell ULX
| l4 per           = package
+
|predecessor=Ivy Bridge
| l4 desc         = on Iris Pro GPUs only
+
|predecessor link=intel/microarchitectures/ivy bridge
 
+
|successor=Broadwell
| core names      = Yes
+
|successor link=intel/microarchitectures/broadwell
| core name       = Haswell H
+
|pipeline=Yes
| core name 2     = Haswell E
+
|OoOE=Yes
| core name 3     = Haswell EP
+
|issues=4
| core name 4     = Haswell EX
+
|inst=Yes
| core name 5     = Haswell DT
+
|cache=Yes
| core name 6     = Haswell MB
+
|core names=Yes
| core name 7     = Haswell ULT
+
|succession=Yes
| core name 8     = Haswell ULX
 
 
 
| succession      = Yes
 
| predecessor     = Ivy Bridge
 
| predecessor link = intel/microarchitectures/ivy bridge
 
| successor       = Broadwell
 
| successor link   = intel/microarchitectures/broadwell
 
 
}}
 
}}
'''Haswell''' ('''HSW''') is [[Intel]]'s  [[microarchitecture]] based on the [[22 nm process]] for mobile, desktops, and servers. Haswell, which was introduced in 2013, became the successor to {{\\|Ivy Bridge}}. Haswell is named after [[wikipedia:Haswell, Colorado|Haswell, Colorado]] (Originally Molalla after [[wikipedia:Molalla, Oregon|Molalla, Oregon]], it was later renamed due to the difficult pronunciation). In 2014 Intel introduced {{\\|Broadwell}}.
+
'''Haswell''' ('''HSW''') is [[Intel]]'s  [[microarchitecture]] based on the [[22 nm process]] for mobile, desktops, and servers. Haswell, which was introduced in 2013, became the successor to {{\\|Ivy Bridge}}. Haswell is named after [[wikipedia:Haswell, Colorado|Haswell, Colorado]] (Originally Molalla after [[wikipedia:Molalla, Oregon|Molalla, Oregon]], it was later renamed due to the difficult pronunciation). In 2014 Intel introduced Haswell's successor, {{\\|Broadwell}}.
  
 +
For desktop and mobile, Haswell is branded as 4th Generation Intel {{intel|Core}} processors. For server class processors, Intel branded it as {{intel|Xeon E3|Xeon E3 v3}}, {{intel|Xeon E5|Xeon E5 v3}}, and {{intel|Xeon E7|Xeon E7 v3}}.
 
== Codenames ==
 
== Codenames ==
 
{| class="wikitable"
 
{| class="wikitable"
Line 106: Line 100:
 
| Haswell E || HSW-E || High-End Desktops (HEDT)
 
| Haswell E || HSW-E || High-End Desktops (HEDT)
 
|}
 
|}
 +
 +
== Process Technology ==
 +
{{main|intel/microarchitectures/ivy bridge#Process_Technology|l1=Ivy Bridge § Process Technology}}
 +
Haswell-based chips are manufactured on Intel's [[22 nm process]].
  
 
== Architecture ==
 
== Architecture ==
 
While sharing many similarities with its predecessor {{\\|Ivy Bridge}}, Haswell introduces many new enhancements and features. Haswell is Intel's first desktop x86 line tailored for a [[system on chip]] architecture. This is a significant move that will continue to be developed over the next couple of microarchitectures. Overall, Haswell shares the same basic flow as {{\\|Sandy Bridge}} and {{\\|Ivy Bridge|Ivy}} but expands on them considerably in the execution engine, with wider execution units and additional scheduler ports.
 
While sharing many similarities with its predecessor {{\\|Ivy Bridge}}, Haswell introduces many new enhancements and features. Haswell is Intel's first desktop x86 line tailored for a [[system on chip]] architecture. This is a significant move that will continue to be developed over the next couple of microarchitectures. Overall, Haswell shares the same basic flow as {{\\|Sandy Bridge}} and {{\\|Ivy Bridge|Ivy}} but expands on them considerably in the execution engine, with wider execution units and additional scheduler ports.
 +
 +
=== Key changes from {{\\|Ivy Bridge}} ===
 +
[[File:haswell buff window.png|right|350px]]
 +
* 3.5x performance/watt over {{\\|Nehalem}}
 
* Platform Controller Hub (PCH)
 
* Platform Controller Hub (PCH)
 
** Shrink from [[65 nm]] to [[32 nm]]
 
** Shrink from [[65 nm]] to [[32 nm]]
 
* Support for DDR4 (server/enthusiast segments)
 
* Support for DDR4 (server/enthusiast segments)
* Integrated voltage regulator (IVR)
+
* Fully Integrated Voltage Regulator (FIVR)
 
* New C6 & C7 sleep states
 
* New C6 & C7 sleep states
 
* Cache
 
* Cache
Line 129: Line 131:
 
** 2 additional execution ports (see [[#Execution_Units]])
 
** 2 additional execution ports (see [[#Execution_Units]])
 
* New memory model for {{x86|TSX|Transactional Synchronization Extensions}}
 
* New memory model for {{x86|TSX|Transactional Synchronization Extensions}}
 +
 +
==== CPU changes ====
 +
Haswell can execute many general-purpose instructions at a throughput of 4 ops/cycle. Sandy Bridge and Ivy Bridge could do so only for NOPs, CLC, some vector MOVs, and some zeroing instructions (SUB, XOR, and their vector analogs).
 +
* MOVSX and MOVZX have 4 op/cycle throughput for 8->32, 8->64 and 16->64 bit forms.
 +
* Many ALU operations have 4 op/cycle throughput for GP registers: XOR, OR, NEG, NOT, ADD, SUB, CMP, AND, etc.
 +
* Latency of variable shifts and rotates (SHL r32, CL, etc.) increased from 1 cycle to 2 cycles, and of variable SHLD/SHRD from 2 cycles to 4 cycles.
 +
* REP MOVS copy is twice as fast: now ~52 bytes/cycle.
 +
* REP STOS fill is twice as fast: now ~30 bytes/cycle.
  
 
==== GPU changes ====
 
==== GPU changes ====
Line 137: Line 147:
  
 
====New instructions ====
 
====New instructions ====
{{main|#Added instructions|l1=See #Added_instructions for the complete list}}
 
 
Haswell introduced a number of new instructions:
 
Haswell introduced a number of new instructions:
 
* {{x86|AVX2|<code>AVX2</code>}} - Advanced Vector Extensions 2; an extension that extends most integer instructions to 256-bit vectors.
 
* {{x86|AVX2|<code>AVX2</code>}} - Advanced Vector Extensions 2; an extension that extends most integer instructions to 256-bit vectors.
** Vector Gather supprt
 
** Any-to-Any permutes
 
** Vector-Vector Shifts
 
 
* {{x86|BMI1|<code>BMI1</code>}} - Bit Manipulation Instruction Set 1
 
* {{x86|BMI1|<code>BMI1</code>}} - Bit Manipulation Instruction Set 1
 
* {{x86|BMI2|<code>BMI2</code>}} - Bit Manipulation Instruction Set 2
 
* {{x86|BMI2|<code>BMI2</code>}} - Bit Manipulation Instruction Set 2
Line 148: Line 154:
 
* {{x86|FMA3|<code>FMA3</code>}} - Fused Multiply-Add, 3 operands
 
* {{x86|FMA3|<code>FMA3</code>}} - Fused Multiply-Add, 3 operands
 
* {{x86|TSX|<code>TSX</code>}} - Transactional Synchronization Extensions
 
* {{x86|TSX|<code>TSX</code>}} - Transactional Synchronization Extensions
 +
* {{x86|INVPCID|<code>INVPCID</code>}} - Invalidate Process-Context Identifier
 +
* {{x86|LZCNT|<code>LZCNT</code>}} - [[Leading zero count]]
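
As a rough illustration of two of the bit-manipulation operations listed above, the following Python sketch models the results of <code>LZCNT</code> and of <code>TZCNT</code> (a BMI1 instruction). It is a software model only, not how the hardware implements them; the function names and the default 32-bit operand width are assumptions made for this example.

```python
def lzcnt(value: int, width: int = 32) -> int:
    """Model of LZCNT: number of leading zero bits in a `width`-bit value.

    A zero input yields `width`, matching the architectural result.
    """
    value &= (1 << width) - 1           # keep only the low `width` bits
    return width - value.bit_length()   # bit_length() of 0 is 0


def tzcnt(value: int, width: int = 32) -> int:
    """Model of TZCNT (BMI1): number of trailing zero bits.

    A zero input yields `width`.
    """
    value &= (1 << width) - 1
    if value == 0:
        return width
    lowest_set_bit = value & -value      # isolates the lowest set bit
    return lowest_set_bit.bit_length() - 1


if __name__ == "__main__":
    for sample in (0, 1, 8, 0x80000000):
        print(hex(sample), lzcnt(sample), tzcnt(sample))
```

Native code would reach these instructions through compiler intrinsics or builtins rather than a model like this one.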
  
 
=== Block Diagram ===
 
=== Block Diagram ===
Due to the success of the front end in {{\\|Ivy Bridge}}, very few changes were done in Haswell.
 
  
 +
==== Individual Core ====
 
[[File:haswell block diagram.svg]]
 
[[File:haswell block diagram.svg]]
  
 
=== Memory Hierarchy ===
 
=== Memory Hierarchy ===
The memory hierarchy in Haswell had a number of changes from its predecessor. The cache bandwidth for both load and store have been doubled (64B/cycle for load and 32B/cycle for store; up from 32/16 respectively). Significant enhancements have been done to support the new gather instructions and transactional memory. With haswell new port 7 which adds an address generation for stores, up to two loads and one store are possible each cycle.
+
The memory hierarchy in Haswell received a number of changes from its predecessor. The cache bandwidth for both loads and stores has been doubled (64 B/cycle for loads and 32 B/cycle for stores, up from 32 B/cycle and 16 B/cycle respectively). Significant enhancements were made to support the new gather instructions and transactional memory. With Haswell's new port 7, which adds an address generation unit for stores, up to two loads and one store are possible each cycle.
  
 
* Cache
 
* Cache
Line 161: Line 169:
 
*** 32 KB 8-way [[set associative]]
 
*** 32 KB 8-way [[set associative]]
 
**** 64 B line size
 
**** 64 B line size
 +
**** Write-back policy
 
**** shared by the two threads, per core
 
**** shared by the two threads, per core
 
** L1D Cache:
 
** L1D Cache:
Line 169: Line 178:
 
**** 64 Bytes/cycle load bandwidth
 
**** 64 Bytes/cycle load bandwidth
 
**** 32 Bytes/cycle store bandwidth
 
**** 32 Bytes/cycle store bandwidth
 +
**** Write-back policy
 
** L2 Cache:
 
** L2 Cache:
 
*** unified, 256 KB 8-way set associative
 
*** unified, 256 KB 8-way set associative
 
*** 11 cycles for fastest load-to-use
 
*** 11 cycles for fastest load-to-use
 
*** 64B/cycle bandwidth to L1$
 
*** 64B/cycle bandwidth to L1$
 +
*** Write-back policy
 
** L3 Cache:
 
** L3 Cache:
*** 1.5 MB
+
*** 1.5 - 3 MB
 +
*** Write-back policy
 
*** Per core
 
*** Per core
 
** L4 Cache:
 
** L4 Cache:
Line 180: Line 192:
 
*** Per package
 
*** Per package
 
*** Only on the {{intel|Iris Pro}} GPUs
 
*** Only on the {{intel|Iris Pro}} GPUs
** TLBs:
 
*** ITLB
 
**** 4KB page translations:
 
***** 128 entries; 4-way set associative
 
***** fixed partition; divided between the two threads
 
**** 2MB/4MB page translations:
 
***** 8 entries; fully associative
 
***** Duplicated for each thread
 
*** DTLB
 
**** 4KB page translations:
 
***** 64 entries; 4-way set associative
 
***** fixed partition; divided between the two threads
 
**** 2MB/4MB page translations:
 
***** 32 entries; 4-way set associative
 
**** 1G page translations:
 
***** 4 entries; 4-way set associative
 
*** STLB
 
**** 4KB+2M page translations:
 
***** 1024 entries; 8-way set associative
 
***** shared
 
  
 +
Haswell's TLB hierarchy consists of a dedicated first-level TLB for the instruction cache and another for the data cache. Additionally, there is a unified second-level TLB.
 +
 +
* TLBs:
 +
** ITLB
 +
*** 4KB page translations:
 +
**** 128 entries; 4-way set associative
 +
**** dynamic partition; divided between the two threads
 +
*** 2MB/4MB page translations:
 +
**** 8 entries; fully associative
 +
**** Duplicated for each thread
 +
** DTLB
 +
*** 4KB page translations:
 +
**** 64 entries; 4-way set associative
 +
**** fixed partition; divided between the two threads
 +
*** 2MB/4MB page translations:
 +
**** 32 entries; 4-way set associative
 +
*** 1G page translations:
 +
**** 4 entries; 4-way set associative
 +
** STLB
 +
*** 4KB+2M page translations:
 +
**** 1024 entries; 8-way set associative
 +
**** shared
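
The entry counts above translate directly into how much memory each TLB level can map without a miss (its "reach") and into how many sets each structure has. The sketch below works this out from the figures in the list; the helper names are hypothetical, and treating the STLB as holding only 4 KB or only 2 MB pages gives the two ends of its range rather than a realistic mix.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach_bytes(entries: int, page_bytes: int) -> int:
    """Memory covered when every entry maps a distinct page."""
    return entries * page_bytes


def tlb_sets(entries: int, ways: int) -> int:
    """Number of sets in a set-associative structure."""
    return entries // ways


# (name, entries, ways, page size) taken from the list above.
# A fully associative TLB is modelled as ways == entries.
TLBS = [
    ("ITLB 4KB", 128, 4, 4 * KIB),
    ("ITLB 2MB", 8, 8, 2 * MIB),
    ("DTLB 4KB", 64, 4, 4 * KIB),
    ("DTLB 2MB", 32, 4, 2 * MIB),
    ("DTLB 1GB", 4, 4, 1 * GIB),
    ("STLB 4KB", 1024, 8, 4 * KIB),
    ("STLB 2MB", 1024, 8, 2 * MIB),
]

if __name__ == "__main__":
    for name, entries, ways, page in TLBS:
        reach_mib = tlb_reach_bytes(entries, page) / MIB
        print(f"{name}: {tlb_sets(entries, ways)} sets, reach {reach_mib:g} MiB")
```

For example, the 64-entry DTLB covers 256 KiB with 4 KB pages, while the 1024-entry STLB covers 4 MiB with the same page size.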
 +
 +
== Core ==
 
=== Pipeline ===
 
=== Pipeline ===
 
Haswell, like its predecessor Ivy Bridge, also has a dual-threaded and out-of-order pipeline.  
 
Haswell, like its predecessor Ivy Bridge, also has a dual-threaded and out-of-order pipeline.  
  
 
==== Front-end ====
 
==== Front-end ====
The front-end is the complicated part of the microarchitecture has it deals with variable length x86 instructions ranging from 1 to 15 bytes. The main goal here is to fetch and decode correctly the next set of instructions. The caches have not changed in Haswell from {{\\|Ivy Bridge}}, with the [[L1i$]] still 32KB , 8-way set associative shared dynamically by the two threads. Instruction cache instruction fetching remains 16B/cycle. [[TLB]] is also still 128-entries, 4-way for 4KB pages and 8-entries, [[fully associative]] for 2MB page mode. The fetched instructions are then moved on to an instruction queue which has 40 entries, 20 for each thread. Haswell continued to improve the branch misses although the exact details have not been made public.
+
The front-end is the complicated part of the microarchitecture, as it deals with variable-length x86 instructions ranging from 1 to 15 bytes. Its main goal is to correctly fetch and decode the next set of instructions. The caches have not changed in Haswell from {{\\|Ivy Bridge}}: the [[L1i$]] is still 32 KB, 8-way set associative, and shared dynamically by the two threads. Instruction fetch from the instruction cache remains 16 B/cycle. The [[TLB]] is also still 128 entries, 4-way, for 4 KB pages and 8 entries, [[fully associative]], for 2 MB pages. The fetched instructions are then moved to an instruction queue with 40 entries, 20 for each thread. Haswell continued to reduce branch misses, although the exact details have not been made public.
  
 
Haswell has the same µOps cache as Ivy Bridge - 1,536 entries organized in 32 sets of 8 cache lines with 6 µOps each. Hits can yield up to 4-µOps/cycle. The cache supports microcoded instructions (being pointers to ROM entries). Cache is shared by the two threads.
 
Haswell has the same µOps cache as Ivy Bridge - 1,536 entries organized in 32 sets of 8 cache lines with 6 µOps each. Hits can yield up to 4-µOps/cycle. The cache supports microcoded instructions (being pointers to ROM entries). Cache is shared by the two threads.
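
The stated capacity follows from the organization given above. The short sketch below reproduces the arithmetic, using only the figures from this paragraph; the function name is hypothetical.

```python
SETS = 32
LINES_PER_SET = 8     # 8 cache lines (ways) per set
UOPS_PER_LINE = 6
HIT_BANDWIDTH = 4     # up to 4 uops delivered per cycle on a hit


def uop_cache_capacity(sets: int = SETS,
                       lines_per_set: int = LINES_PER_SET,
                       uops_per_line: int = UOPS_PER_LINE) -> int:
    """Total uop slots: sets x lines per set x uops per line."""
    return sets * lines_per_set * uops_per_line


if __name__ == "__main__":
    capacity = uop_cache_capacity()
    print("uop cache entries:", capacity)
    # Lower bound on the cycles needed to stream the whole cache at peak rate.
    print("cycles to drain at peak:", capacity // HIT_BANDWIDTH)
```

Real utilization is lower than this upper bound, since a line is not always filled with all 6 µOps.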
Line 212: Line 228:
  
 
==== Execution engine ====
 
==== Execution engine ====
Continuing with the decoder is the [[register renaming]] stage. This is crucial for out-of-order execution. In this stage the architectural x86 registers get mapped into one of the many physical registers. The integer physical register file (PRF) has been enlarged by 8 addition registers for a total 168. Likewise the FP PRF was extended by 24 registers bringing it too to 168 registers. The larger increase in the FP PRF is likely to accommodate the new {{x86|AVX2}} extension. The [[reorder buffer|ROB]] in Haswell has been increased to 192 entries (from 168 in Ivy) where each entry corresponds to a single µOp. The ROD is fixed split between the two threads. Additional scheduler resources get allocated as well - this includes stores, loads, and branch buffer entries. Note that due to how dependencies are handled, there may be more or less µOps than what was fed in. For the most part, the renamer is unified and deals with both integers and vectors. Resources, however, are partitioned between the two threads. Finally, as a last step, the µOps are matched with a port depending on their intended execution purpose. Up to 4 fused µOps may be renamed and handled per thread per cycle. Both the load and store in-flight units were increased to 72 and 42 entries respectively.
+
Following the decoders is the [[register renaming]] stage, which is crucial for out-of-order execution. In this stage the architectural x86 registers are mapped onto the many physical registers. The integer physical register file (PRF) has been enlarged by 8 additional registers for a total of 168. Likewise, the FP PRF was extended by 24 registers, also bringing it to 168 registers. The larger increase in the FP PRF is likely to accommodate the new {{x86|AVX2}} extension. The [[reorder buffer|ROB]] in Haswell has been increased to 192 entries (from 168 in Ivy Bridge), where each entry corresponds to a single µOp. The ROB is statically split between the two threads. Additional scheduler resources get allocated as well, including store, load, and branch buffer entries. Note that due to how dependencies are handled, there may be more or fewer µOps than what was fed in. For the most part, the renamer is unified and deals with both integers and vectors. Resources, however, are partitioned between the two threads. Finally, as a last step, the µOps are matched with a port depending on their intended execution purpose. Up to 4 fused µOps may be renamed and handled per thread per cycle. The numbers of in-flight loads and stores were increased to 72 and 42 entries respectively.
  
 
Haswell continues to use a unified scheduler for all µOps which holds 60 entries. µOps at this stage sit idle until they are cleared to be  executed via their assigned dispatch port. µOps may be held due to resource unavailability.
 
Haswell continues to use a unified scheduler for all µOps which holds 60 entries. µOps at this stage sit idle until they are cleared to be  executed via their assigned dispatch port. µOps may be held due to resource unavailability.
Line 219: Line 235:
  
 
===== Execution Units =====
 
===== Execution Units =====
Some of the biggest architectural changes were done in the area of the execution units. Haswell widened the scheduler by two ports - one new integer dispatch port and a new memory port bringing the total to 8 µOps/cycle. The various ports have also been rebalanced. The new port 6 adds another Integer ALU designs to improve integer workloads freeing up Port 0 and 1 for vector works. It also adds a second branch unit to low the congestion Port 0. The second port that was added, Port 7 adds a new [[address generation unit|AGU]]. This is largely due to the improvements for {{x86|AVX2}} that roughly doubled its throughput. Port 0 had its ALU/Mul/shifter extended to 256-bits; same is true for the vector ALU on port 1 and the ALU/shuffle on port 5. Additionally a 256-bit FMA unit were added to both port 0 and port 1. The change makes it possible for FMAs and FMULs to issue on both ports. In theory, Haswell can peak at over double the performance of {{\|Sandy Bridge}}, with 16 double / 32 single precision [[FLOP]]/cycle + Integer ALU option +  Vector operation.
+
Some of the biggest architectural changes were made in the execution units. Haswell widened the scheduler by two ports, one new integer dispatch port and one new memory port, bringing the total to 8 µOps/cycle. The various ports have also been rebalanced. The new port 6 adds another integer ALU, designed to improve integer workloads and free up ports 0 and 1 for vector work. It also adds a second branch unit to lower the congestion for Port 0. The second port that was added, port 7, adds a new [[address generation unit|AGU]]. This is largely due to the improvements for {{x86|AVX2}} that roughly doubled its throughput. Port 0 had its ALU/Mul/shifter extended to 256 bits; the same is true for the vector ALU on port 1 and the ALU/shuffle on port 5. Additionally, a 256-bit FMA unit was added to both port 0 and port 1, which makes it possible for FMAs and FMULs to issue on both ports. In theory, Haswell can peak at over double the performance of {{\|Sandy Bridge}}, with 16 double-precision or 32 single-precision [[FLOP]]s/cycle, plus an integer ALU operation and a vector operation.
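
The peak FLOP figures quoted above can be derived from the two 256-bit FMA units. The sketch below shows the arithmetic, assuming the usual convention that one fused multiply-add counts as two floating-point operations; the function names, and the clock speed and core count in the usage lines, are hypothetical values chosen only for illustration.

```python
def peak_flops_per_cycle(element_bits: int,
                         vector_bits: int = 256,
                         fma_units: int = 2,
                         flops_per_fma: int = 2) -> int:
    """Peak FLOPs per cycle per core.

    lanes         = elements that fit in one vector register
    fma_units     = FMA-capable ports (ports 0 and 1 on Haswell)
    flops_per_fma = 2, counting the multiply and the add separately
    """
    lanes = vector_bits // element_bits
    return lanes * fma_units * flops_per_fma


def peak_gflops(flops_per_cycle: int, ghz: float, cores: int) -> float:
    """Theoretical chip-level peak for a given clock and core count."""
    return flops_per_cycle * ghz * cores


if __name__ == "__main__":
    dp = peak_flops_per_cycle(64)   # double precision
    sp = peak_flops_per_cycle(32)   # single precision
    print("DP FLOP/cycle:", dp)
    print("SP FLOP/cycle:", sp)
    # Hypothetical 4-core part at 3.5 GHz, ignoring any AVX clock reduction.
    print("DP GFLOPS:", peak_gflops(dp, ghz=3.5, cores=4))
```

These are theoretical ceilings; sustained throughput depends on keeping both FMA ports fed from the caches.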
  
 
The scheduler dispatches up to 8 ready µOps/cycle in [[FIFO]] order through the dispatch ports. µOps involving computational operations are sent to ports 0, 1, 5, and 6 to the appropriate unit. Likewise ports 2, 3, 4 and 7 are used for load/store and address calculations.
 
The scheduler dispatches up to 8 ready µOps/cycle in [[FIFO]] order through the dispatch ports. µOps involving computational operations are sent to ports 0, 1, 5, and 6 to the appropriate unit. Likewise ports 2, 3, 4 and 7 are used for load/store and address calculations.
 +
 +
==Clock domains==
 +
{{empty section}}
 +
=== Overclocking ===
 +
{{see also|intel/xmp|l1=Intel's XMP}}
 +
{{oc warning}}
 +
 +
Overclocking needs to be done on an unlocked part such as the [[Core i7-5820K]], [[Core i7-5930K]], or [[Core i7-5960X]] Extreme Edition. Additionally those chips need to be paired with the Intel X99 Chipset.
 +
 +
[[File:haswell oc chips.png|500px|left]]
 +
 +
The 5930K and the 5820K are [[hexa-core]] parts whereas the [[5960X]] is an octa-core part. Between 28 and 40 [[PCIe]] lanes are possible with a core ratio of up to x80 the [[BCLK]].
 +
 +
[[File:haswell bclk.png|300px|right]]
 +
Haswell provides coarse BCLK settings of 100 MHz, 125 MHz, or 167 MHz (this was subsequently changed in {{intel|Skylake#Overclocking|Skylake}}). The clock is generated internally by the chipset, but motherboard ODMs could generate it independently. A single BCLK from the PCH can be adjusted in steps of less than 1 MHz; in practice, however, the input is very much limited by the PCI Express and DMI PLL interface. This works out to 100 MHz ±5-7% with PEG/DMI at 5:5, 125 MHz ±5-7% with PEG/DMI at 5:4, and 166.66 MHz ±5-7% with PEG/DMI at 5:3.
 +
 +
<div style="display: table; padding: 5px;">
 +
* '''f<sub>CORE</sub>''' = [[BCLK]] × [Core Ratio]
 +
* '''f<sub>RING</sub>''' = BCLK × [Ring Ratio]
 +
* '''f<sub>DDR</sub>''' = BCLK × [1.33/1.00] × [DDR Ratio]
 +
</div>
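
The three relations above can be put into a small calculator. This is a minimal sketch: the class and field names are hypothetical, the ratio values in the usage lines are illustrative rather than recommended settings, and the only limit enforced is the core ratio ceiling of 80 mentioned in this section.

```python
from dataclasses import dataclass

MAX_CORE_RATIO = 80  # upper bound on the core multiplier quoted in this section


@dataclass
class ClockConfig:
    bclk_mhz: float          # base clock, e.g. 100, 125 or 166.66
    core_ratio: int
    ring_ratio: int
    ddr_ratio: float
    mem_strap: float = 1.00  # the 1.33 or 1.00 factor in the f_DDR formula

    def core_mhz(self) -> float:
        """f_CORE = BCLK x core ratio."""
        if not 1 <= self.core_ratio <= MAX_CORE_RATIO:
            raise ValueError(f"core ratio must be within 1..{MAX_CORE_RATIO}")
        return self.bclk_mhz * self.core_ratio

    def ring_mhz(self) -> float:
        """f_RING = BCLK x ring ratio."""
        return self.bclk_mhz * self.ring_ratio

    def ddr_mhz(self) -> float:
        """f_DDR = BCLK x (1.33 or 1.00) x DDR ratio."""
        return self.bclk_mhz * self.mem_strap * self.ddr_ratio


if __name__ == "__main__":
    cfg = ClockConfig(bclk_mhz=125.0, core_ratio=32, ring_ratio=28,
                      ddr_ratio=16, mem_strap=1.33)
    print("core:", round(cfg.core_mhz(), 1), "MHz")
    print("ring:", round(cfg.ring_mhz(), 1), "MHz")
    print("ddr: ", round(cfg.ddr_mhz(), 1), "MHz")
```

Because every domain scales with BCLK, raising BCLK alone moves the core, ring, and memory clocks together; the ratios are what decouple them.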
 +
 +
All the clock domains in Haswell are derived from the BCLK (also called DMICLK). In the diagram on the right, '''(xC)''' refers to the core frequency and is represented as a multiple of BCLK (Core Frequency = BCLK × Core Freq Multiplier, up to x80). Likewise, '''(xM)''' refers to the memory ratio (up to 2667 MT/s, in steps of 200 and 266 MHz). Two additional multipliers adjust the PEG (PCIe & Graphics) and DMI links, which should remain at a nominal frequency of 100 MHz.
 +
 +
Voltage control is handled by Haswell's new FIVR (Fully Integrated Voltage Regulator) architecture. Voltage arrives from the motherboard via the V<sub>CCin</sub> input into the processor and onto the voltage regulator (V<sub>CCin</sub> = [[SVID]] 1.8 V nominal, up to 2.3 V+). Internally, the various voltage planes are all derived from there, including V<sub>CORE</sub>, V<sub>RING</sub>, and V<sub>SA</sub>. The memory voltage (V<sub>DDQ</sub> = 1.2 V nominal) is provided by the motherboard on its own rail.
 +
 +
{{clear}}
  
 
== Die ==
 
== Die ==
Quad-core Haswell die:
+
=== Client Die ===
: [[File:haswell die (quad-core).png|850px]]
+
Client dies come in [[dual-core|2]]-, [[quad-core|4]]-, or [[octa-core|8]]-core configurations, with the dual- and quad-core dies being mainstream models and the [[octa-core]] die being the high-end desktop part.
 +
 
 +
==== Dual-core GT2 ====
 +
* 22 nm process
 +
* 960,000,000 transistors
 +
* 131 mm² die size
 +
* 2 CPU cores
 +
 
 +
==== Dual-core GT3 ====
 +
* 22 nm process
 +
* 1,300,000,000 transistors
 +
* 181 mm² die size
 +
* 2 CPU cores
  
: [[File:haswell die (quad-core) (annotated).png|850px]]
+
: [[File:haswell gt3 die (dual-core).jpg|850px]]
  
 +
====Quad-core GT2 ====
 +
* [[22 nm process]]
 
* 1,400,000,000 transistors
 
* 1,400,000,000 transistors
* 177 mm<sup>2</sup>
+
* 177 mm² die size
 
* 4 CPU cores
 
* 4 CPU cores
 
* 1 GPU core
 
* 1 GPU core
 
** 2x10xEU (80 ALUs)
 
** 2x10xEU (80 ALUs)
 +
 +
: [[File:haswell die (quad-core).png|850px]]
 +
 +
: [[File:haswell die (quad-core) (annotated).png|850px]]
 +
 +
====Quad-core GT3 ====
 
* [[22 nm process]]
 
* [[22 nm process]]
 +
* 1,700,000,000 transistors
 +
* 260 mm² die size
 +
* 4 CPU cores
 +
 +
====Octa-core ====
 +
* {{intel|Core i7-5960X}}
 +
* [[Octa-core]] processor
 +
* [[22 nm process]]
 +
* 2,600,000,000 transistors
 +
* 355.52 mm² die size
 +
* 17.6 mm x 20.2 mm
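
The octa-core die area follows from the listed dimensions, and the transistor counts and areas in this section give a rough transistor density for each die. The sketch below works both out, using only figures listed in this section; the dictionary keys and function names are hypothetical labels, and the densities are whole-die averages that mix CPU, GPU, cache, and I/O area.

```python
# (transistor count, die area in mm^2) as listed in this section
DIES = {
    "dual-core GT2": (960_000_000, 131.0),
    "dual-core GT3": (1_300_000_000, 181.0),
    "quad-core GT2": (1_400_000_000, 177.0),
    "quad-core GT3": (1_700_000_000, 260.0),
    "octa-core": (2_600_000_000, 355.52),
    "octadeca-core": (5_690_000_000, 622.0),
}


def die_area_mm2(width_mm: float, height_mm: float) -> float:
    """Area of a rectangular die, rounded to two decimals."""
    return round(width_mm * height_mm, 2)


def density_mtr_per_mm2(transistors: int, area_mm2: float) -> float:
    """Average density in millions of transistors per mm^2."""
    return transistors / area_mm2 / 1e6


if __name__ == "__main__":
    print("octa-core area:", die_area_mm2(17.6, 20.2), "mm^2")
    for name, (transistors, area) in DIES.items():
        print(f"{name}: {density_mtr_per_mm2(transistors, area):.2f} MTr/mm^2")
```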
 +
 +
:[[File:haswell (octa-core) die shot.png|650px]]
 +
 +
 +
:[[File:haswell (octa-core) die shot (annotated).png|650px]]
 +
 +
=== Server Die ===
 +
 +
====Octadeca-core====
 +
* [[18 cores]] processor
 +
* [[22 nm process]]
 +
* 5,690,000,000 transistors
 +
* 622 mm² die size
 +
 +
:[[File:intel xeon e7 v3.jpg|850px]]
  
 
== Added instructions ==
 
== Added instructions ==
Line 557: Line 646:
  
 
== Cores ==
 
== Cores ==
 +
{{empty section}}
 +
 
== All Haswell Chips ==
 
== All Haswell Chips ==
 
<!-- NOTE:  
 
<!-- NOTE:  
Line 563: Line 654:
 
           created and tagged accordingly.
 
           created and tagged accordingly.
  
           Missing a chip? please dump its name here: http://en.wikichip.org/wiki/WikiChip:wanted_chips
+
           Missing a chip? please dump its name here: https://en.wikichip.org/wiki/WikiChip:wanted_chips
 
-->
 
-->
<table class="wikitable sortable">
+
{{comp table start}}
<tr><th colspan="12" style="background:#D6D6FF;">Haswell Chips</th></tr>
+
<table class="comptable sortable tc6 tc7 tc20 tc21 tc22 tc23 tc24 tc25">
<tr><th colspan="9">Main processor</th><th colspan="3">IGP</th></tr>
+
<tr class="comptable-header"><th>&nbsp;</th><th colspan="19">List of Haswell Processors</th></tr>
<tr><th>Model</th><th>µarch</th><th>Platform</th><th>Core</th><th>Launched</th><th>SDP</th><th>TDP</th><th>Freq</th><th>Max Mem</th><th>Name</th><th>Freq</th><th>Max Freq</th></tr>
+
<tr class="comptable-header"><th>&nbsp;</th><th colspan="9">Main processor</th><th colspan="5">{{intel|Turbo Boost}}</th><th>Mem</th><th colspan="3">IGP</th></tr>
{{#ask: [[Category:microprocessor models by intel]] [[microarchitecture::Haswell]]
+
{{comp table header 1|cols=Launched, Price, Family, Core Name, Cores, Threads, %L2$, %L3$, TDP, %Frequency, 1 Core, 2 Cores, 3 Cores, 4 Cores, Max Mem, GPU, %Frequency, Turbo}}
 +
<tr class="comptable-header comptable-header-sep"><th>&nbsp;</th><th colspan="20">[[Uniprocessors]]</th></tr>
 +
{{#ask: [[Category:microprocessor models by intel]] [[instance of::microprocessor]] [[microarchitecture::Haswell]] [[max cpu count::1]]
 
  |?full page name
 
  |?full page name
 
  |?model number
 
  |?model number
  |?microarchitecture
+
  |?first launched
  |?platform
+
  |?release price
 +
|?microprocessor family
 
  |?core name
 
  |?core name
 +
|?core count
 +
|?thread count
 +
|?l2$ size
 +
|?l3$ size
 +
|?tdp
 +
|?base frequency#GHz
 +
|?turbo frequency (1 core)#GHz
 +
|?turbo frequency (2 cores)#GHz
 +
|?turbo frequency (3 cores)#GHz
 +
|?turbo frequency (4 cores)#GHz
 +
|?max memory#GiB
 +
|?integrated gpu
 +
|?integrated gpu base frequency
 +
|?integrated gpu max frequency
 +
|format=template
 +
|template=proc table 3
 +
|searchlabel=
 +
|sort=microprocessor family, model number
 +
|order=asc,asc
 +
|userparam=20
 +
|mainlabel=-
 +
|limit=200
 +
}}
 +
<tr class="comptable-header comptable-header-sep"><th>&nbsp;</th><th colspan="20">[[Multiprocessors]] (2-way)</th></tr>
 +
{{#ask: [[Category:microprocessor models by intel]] [[instance of::microprocessor]] [[microarchitecture::Haswell]] [[max cpu count::2]]
 +
|?full page name
 +
|?model number
 
  |?first launched
 
  |?first launched
  |?sdp
+
  |?release price
 +
|?microprocessor family
 +
|?core name
 +
|?core count
 +
|?thread count
 +
|?l2$ size
 +
|?l3$ size
 
  |?tdp
 
  |?tdp
  |?base frequency
+
  |?base frequency#GHz
  |?max memory
+
|?turbo frequency (1 core)#GHz
 +
|?turbo frequency (2 cores)#GHz
 +
|?turbo frequency (3 cores)#GHz
 +
|?turbo frequency (4 cores)#GHz
 +
  |?max memory#GiB
 
  |?integrated gpu
 
  |?integrated gpu
 
  |?integrated gpu base frequency
 
  |?integrated gpu base frequency
 
  |?integrated gpu max frequency
 
  |?integrated gpu max frequency
 
  |format=template
 
  |format=template
  |template=proc table 2
+
  |template=proc table 3
  |userparam=13
+
|searchlabel=
 +
|sort=microprocessor family, model number
 +
|order=asc,asc
 +
  |userparam=20
 
  |mainlabel=-
 
  |mainlabel=-
 +
|limit=200
 
}}
 
}}
 +
<tr class="comptable-header comptable-header-sep"><th>&nbsp;</th><th colspan="20">[[Multiprocessors]] (4-way)</th></tr>
 +
{{#ask: [[Category:microprocessor models by intel]] [[instance of::microprocessor]] [[microarchitecture::Haswell]] [[max cpu count::4]]
 +
|?full page name
 +
|?model number
 +
|?first launched
 +
|?release price
 +
|?microprocessor family
 +
|?core name
 +
|?core count
 +
|?thread count
 +
|?l2$ size
 +
|?l3$ size
 +
|?tdp
 +
|?base frequency#GHz
 +
|?turbo frequency (1 core)#GHz
 +
|?turbo frequency (2 cores)#GHz
 +
|?turbo frequency (3 cores)#GHz
 +
|?turbo frequency (4 cores)#GHz
 +
|?max memory#GiB
 +
|?integrated gpu
 +
|?integrated gpu base frequency
 +
|?integrated gpu max frequency
 +
|format=template
 +
|template=proc table 3
 +
|searchlabel=
 +
|sort=microprocessor family, model number
 +
|order=asc,asc
 +
|userparam=20
 +
|mainlabel=-
 +
|limit=200
 +
}}
 +
<tr class="comptable-header comptable-header-sep"><th>&nbsp;</th><th colspan="20">[[Multiprocessors]] (8-way)</th></tr>
 +
{{#ask: [[Category:microprocessor models by intel]] [[instance of::microprocessor]] [[microarchitecture::Haswell]] [[max cpu count::8]]
 +
|?full page name
 +
|?model number
 +
|?first launched
 +
|?release price
 +
|?microprocessor family
 +
|?core name
 +
|?core count
 +
|?thread count
 +
|?l2$ size
 +
|?l3$ size
 +
|?tdp
 +
|?base frequency#GHz
 +
|?turbo frequency (1 core)#GHz
 +
|?turbo frequency (2 cores)#GHz
 +
|?turbo frequency (3 cores)#GHz
 +
|?turbo frequency (4 cores)#GHz
 +
|?max memory#GiB
 +
|?integrated gpu
 +
|?integrated gpu base frequency
 +
|?integrated gpu max frequency
 +
|format=template
 +
|template=proc table 3
 +
|searchlabel=
 +
|sort=microprocessor family, model number
 +
|order=asc,asc
 +
|userparam=20
 +
|mainlabel=-
 +
|limit=200
 +
}}
 +
{{comp table count|ask=[[Category:microprocessor models by intel]] [[instance of::microprocessor]] [[microarchitecture::Haswell]]}}
 
</table>
 
</table>
 +
{{comp table end}}
 +
 +
== References ==
 +
* Hammarlund, Per, et al. "Haswell: The fourth-generation intel core processor." IEEE Micro 34.2 (2014): 6-20.
 +
* Dan Ragland, Overclocking System Architect, 2015 IDF, in San Francisco, Session RPCS001 ("Overclocking 6th Generation Intel® Core™ Processors!"), August 18, 2015
 +
 +
== Documents ==
 +
* [[:File:haswell isa extension.pdf|Haswell new ISA extensions]]

Revision as of 03:37, 20 June 2020

Haswell µarch
General Info
Arch Type: CPU
Designer: Intel
Manufacturer: Intel
Introduction: June 4, 2013
Phase-out: 2015
Process: 22 nm
Core Configs: 2, 4, 6, 8, 16
Pipeline
Type: Superscalar
Speculative: Yes
Reg Renaming: Yes
Stages: 14-19
Instructions
ISA: x86-64
Extensions: MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA3, BMI, BMI2, F16C, VT-x, VT-d, TXT, TSX
Cache
L1I Cache: 32 KB/core, 8-way set associative
L1D Cache: 32 KB/core, 8-way set associative
L2 Cache: 256 KB/core, 8-way set associative
L3 Cache: 1.5 MB/core
L4 Cache: 128 MB/package, on Iris Pro GPUs only
Cores
Core Names: Haswell H, Haswell E, Haswell EP, Haswell EX, Haswell DT, Haswell MB, Haswell ULT, Haswell ULX
Succession

Haswell (HSW) is Intel's 22 nm microarchitecture for mobile, desktop, and server processors. Introduced in 2013, Haswell succeeded Ivy Bridge. It is named after Haswell, Colorado (it was originally called Molalla, after Molalla, Oregon, but was later renamed because the name was difficult to pronounce). In 2014 Intel introduced Haswell's successor, Broadwell.

For desktop and mobile, Haswell is branded as 4th Generation Intel Core processors. For server class processors, Intel branded it as Xeon E3 v3, Xeon E5 v3, and Xeon E7 v3.

Codenames

Core | Abbrev | Target
Haswell DT | HSW-DT | Desktops
Haswell MB | HSW-MB | Mobile/Laptops
Haswell H | HSW-H | All-in-ones
Haswell ULT | HSW-ULT | UltraBooks (MCPs)
Haswell ULX | HSW-ULX | Tablets/UltraBooks (SoCs)
Haswell EP | HSW-EP | Xeon chips
Haswell EX | HSW-EX | Xeon chips, QP
Haswell E | HSW-E | High-End Desktops (HEDT)

Process Technology

Main article: Ivy Bridge § Process Technology

Haswell-based chips are manufactured on Intel's 22 nm process.

Architecture

While sharing many similarities with its predecessor, Ivy Bridge, Haswell introduces many enhancements and new features. Haswell is Intel's first desktop x86 line tailored for a system-on-chip architecture, a significant move that continues to be developed over the next couple of microarchitectures. Overall, Haswell shares the same basic flow as Sandy Bridge and Ivy Bridge but expands on them considerably in the execution engine, with wider execution units and additional scheduler ports.

Key changes from Ivy Bridge

haswell buff window.png
  • 3.5x performance/watt over Nehalem
  • Platform Controller Hub (PCH)
  • Support for DDR4 (server/enthusiast segments)
  • Fully Integrated Voltage Regulator (FIVR)
  • New C6 & C7 sleep states
  • Cache
    • L1D$ has double the bandwidth
      • Load: 64B/cycle (up from 32B/cycle)
      • Store: 32B/cycle (up from 16B/cycle)
    • L2$ bandwidth to L1 is doubled
      • 64B/cycle (up from 32B/cycle)
    • STLB now supports 2 MB pages
      • Table has been doubled to 1,024 entries, 8-way (up from 512 entries, 4-way)
  • Reorder Buffer (ROB) was increased to 192 entries (up from 168)
  • Scheduler has been widened (see #Execution engine)
    • Increased to 60 entries (up from 54)
    • Integer register file up 8 entries to 168
    • FP register file up 24 entries to 168
    • 2 additional execution ports (see #Execution_Units)
  • New memory model for Transactional Synchronization Extensions

CPU changes

Haswell can execute many general-purpose instructions at a throughput of 4 ops/cycle. Sandy Bridge and Ivy Bridge could do so only for NOPs, CLC, some vector MOVs, and some zeroing instructions (SUB, XOR, and their vector analogs).

  • MOVSX and MOVZX have 4 op/cycle throughput for 8->32, 8->64 and 16->64 bit forms.
  • Many ALU operations have 4 op/cycle throughput for GP registers: XOR, OR, NEG, NOT, ADD, SUB, CMP, AND, etc.
  • Latency of variable shifts and rotates (SHL r32, CL, etc.) increased from 1 cycle to 2 cycles, and latency of variable SHLD/SHRD increased from 2 cycles to 4 cycles.
  • REP MOVS copy is twice as fast: now ~52 bytes/cycle.
  • REP STOS fill is twice as fast: now ~30 bytes/cycle.

GPU changes

  • Direct3D 11.1
  • OpenGL 4.3
  • OpenCL 1.2
  • Four GPU options codenamed GT1, GT2, GT3, and GT3e (with GT3e having a dedicated eDRAM L4$)

New instructions

Haswell introduced a number of new instructions:

  • AVX2 - Advanced Vector Extensions 2; extends most integer instructions to 256-bit vectors.
  • BMI1 - Bit Manipulation Instruction Set 1
  • BMI2 - Bit Manipulation Instruction Set 2
  • MOVBE - Move Big-Endian instruction
  • FMA3 - Floating Point Multiply Accumulate, 3 operands
  • TSX - Transactional Synchronization Extensions
  • INVPCID - Invalidate Process-Context Identifier
  • LZCNT - Leading zero count

Block Diagram

Individual Core

haswell block diagram.svg

Memory Hierarchy

The memory hierarchy in Haswell changed in a number of ways from its predecessor. The L1D cache bandwidth for both loads and stores has been doubled (64 B/cycle for loads and 32 B/cycle for stores, up from 32 B/cycle and 16 B/cycle respectively). Significant enhancements were made to support the new gather instructions and transactional memory. With Haswell's new port 7, which adds an address generation unit for stores, up to two loads and one store are possible each cycle.
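
As a rough illustration of what the per-cycle figures mean, the sketch below converts them into peak per-core bandwidth. The 3.0 GHz clock is a hypothetical value chosen only for the example, and the function name is illustrative; sustained bandwidth in real code will be lower than this theoretical peak.

```python
# Per-cycle L1D figures quoted in the text above.
L1D_LOAD_BYTES_PER_CYCLE = 64
L1D_STORE_BYTES_PER_CYCLE = 32


def peak_bandwidth_gb_s(bytes_per_cycle: int, core_clock_ghz: float) -> float:
    """Theoretical peak bandwidth of one core in GB/s (10^9 bytes per second)."""
    # bytes/cycle * 10^9 cycles/s = bytes/s; dividing by 10^9 gives GB/s.
    return bytes_per_cycle * core_clock_ghz


if __name__ == "__main__":
    clock_ghz = 3.0  # hypothetical core clock, for illustration only
    load = peak_bandwidth_gb_s(L1D_LOAD_BYTES_PER_CYCLE, clock_ghz)
    store = peak_bandwidth_gb_s(L1D_STORE_BYTES_PER_CYCLE, clock_ghz)
    print(f"L1D load peak:  {load:.1f} GB/s per core")
    print(f"L1D store peak: {store:.1f} GB/s per core")
```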

  • Cache
    • L1I Cache:
      • 32 KB 8-way set associative
        • 64 B line size
        • Write-back policy
        • shared by the two threads, per core
    • L1D Cache:
      • 32 KB 8-way set associative
        • 64 B line size
        • shared by the two threads, per core
        • 4 cycles for fastest load-to-use
        • 64 Bytes/cycle load bandwidth
        • 32 Bytes/cycle store bandwidth
        • Write-back policy
    • L2 Cache:
      • unified, 256 KB 8-way set associative
      • 11 cycles for fastest load-to-use
      • 64B/cycle bandwidth to L1$
      • Write-back policy
    • L3 Cache:
      • 1.5 - 3 MB
      • Write-back policy
      • Per core
    • L4 Cache:
      • 128 MB
      • Per package
      • Only on the Iris Pro GPUs
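
The capacity, associativity, and line size listed above determine how many sets each cache has. The following is a minimal sketch of that arithmetic; the helper name is illustrative, and the 64 B line size for the L2 is an assumption, since the list states the line size only for the L1 caches.

```python
KIB = 1024


def num_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets = capacity / (associativity * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


# L1I and L1D: 32 KB, 8-way, 64 B lines (from the list above).
L1_SETS = num_sets(32 * KIB, ways=8, line_bytes=64)

# L2: 256 KB, 8-way; the 64 B line size here is an assumption.
L2_SETS = num_sets(256 * KIB, ways=8, line_bytes=64)

if __name__ == "__main__":
    print(f"L1 sets: {L1_SETS}")
    print(f"L2 sets: {L2_SETS}")
```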

Haswell's TLB consists of a dedicated first-level TLB for the instruction cache and another for the data cache. Additionally, there is a unified second-level TLB.

  • TLBs:
    • ITLB
      • 4KB page translations:
        • 128 entries; 4-way set associative
        • dynamic partition; divided between the two threads
      • 2MB/4MB page translations:
        • 8 entries; fully associative
        • Duplicated for each thread
    • DTLB
      • 4KB page translations:
        • 64 entries; 4-way set associative
        • fixed partition; divided between the two threads
      • 2MB/4MB page translations:
        • 32 entries; 4-way set associative
      • 1GB page translations:
        • 4 entries; 4-way set associative
    • STLB
      • 4KB+2MB page translations:
        • 1024 entries; 8-way set associative
        • shared
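
One way to read the entry counts above is as TLB reach, the amount of memory that can be mapped without a miss at that level. The sketch below computes it as entries times page size. The function and dictionary names are illustrative, and the STLB figures assume every entry holds a page of the stated size, which is a simplification because its entries are shared between 4 KB and 2 MB pages.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of address space covered when every entry maps one page."""
    return entries * page_bytes


# Entry counts taken from the list above.
REACH = {
    "ITLB, 4 KB pages": tlb_reach(128, 4 * KIB),
    "DTLB, 4 KB pages": tlb_reach(64, 4 * KIB),
    "DTLB, 2 MB pages": tlb_reach(32, 2 * MIB),
    "STLB, 4 KB pages": tlb_reach(1024, 4 * KIB),
    "STLB, 2 MB pages": tlb_reach(1024, 2 * MIB),
}

if __name__ == "__main__":
    for name, reach in REACH.items():
        print(f"{name}: {reach // KIB} KiB")
```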

Core

Pipeline

Haswell, like its predecessor Ivy Bridge, also has a dual-threaded and out-of-order pipeline.

Front-end

The front-end is the complicated part of the microarchitecture, as it deals with variable-length x86 instructions ranging from 1 to 15 bytes. Its main goal is to correctly fetch and decode the next set of instructions. The caches have not changed from Ivy Bridge: the L1I$ is still 32 KB, 8-way set associative, and shared dynamically by the two threads. Instruction fetch from the instruction cache remains 16 B/cycle. The ITLB is also still 128 entries, 4-way set associative for 4 KB pages and 8 entries, fully associative for 2 MB pages. Fetched instructions are then moved to an instruction queue which has 40 entries, 20 for each thread. Haswell continued to improve branch prediction, although the exact details have not been made public.

Haswell has the same µOps cache as Ivy Bridge: 1,536 entries organized in 32 sets of 8 cache lines with 6 µOps each. Hits can yield up to 4 µOps/cycle. The cache supports microcoded instructions (stored as pointers to ROM entries). The cache is shared by the two threads.
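
The 1,536-entry figure follows directly from the organization described above. A minimal check of that arithmetic, with illustrative constant and function names:

```python
# Organization of the µOps cache as described above.
SETS = 32
LINES_PER_SET = 8
UOPS_PER_LINE = 6


def uop_cache_capacity(sets: int, lines_per_set: int, uops_per_line: int) -> int:
    """Total number of µOps the cache can hold."""
    return sets * lines_per_set * uops_per_line


if __name__ == "__main__":
    total = uop_cache_capacity(SETS, LINES_PER_SET, UOPS_PER_LINE)
    print(f"µOps cache capacity: {total} entries")
```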

Following the instruction queue, instructions are decoded by the complex 4-way decoder, which consists of 3 simple decoders and 1 complex decoder. In total, they are capable of emitting 3 single fused µOps and an additional 1-4 fused µOps. The unit handles both micro-fusion and macro-fusion. With macro-fusion, compatible adjacent instructions may be merged into a single µOp. Pushes and pops as well as calls and returns are also handled at this stage. The decoders handle 4 instructions per cycle, but with the aid of macro-fusion up to 5 instructions can be decoded each cycle.

Execution engine

Following the decoder is the register renaming stage, which is crucial for out-of-order execution. In this stage the architectural x86 registers are mapped onto one of the many physical registers. The integer physical register file (PRF) has been enlarged by 8 additional registers for a total of 168. Likewise, the FP PRF was extended by 24 registers, bringing it to 168 registers as well. The larger increase in the FP PRF is likely meant to accommodate the new AVX2 extension. The ROB in Haswell has been increased to 192 entries (from 168 in Ivy Bridge), where each entry corresponds to a single µOp. The ROB is statically split between the two threads. Additional scheduler resources are allocated as well, including store, load, and branch buffer entries. Note that due to how dependencies are handled, there may be more or fewer µOps than what was fed in. For the most part, the renamer is unified and deals with both integers and vectors. Resources, however, are partitioned between the two threads. Finally, as a last step, the µOps are matched with a port depending on their intended execution purpose. Up to 4 fused µOps may be renamed and handled per thread per cycle. The load and store in-flight buffers were increased to 72 and 42 entries respectively.

Haswell continues to use a unified scheduler for all µOps which holds 60 entries. µOps at this stage sit idle until they are cleared to be executed via their assigned dispatch port. µOps may be held due to resource unavailability.

Following a successful execution, µOps retire at a rate of up to 4 fused µOps/cycle. Retirement is once again in-order and frees up any reserved resources (ROB entries, PRF entries, and various other buffers).

Execution Units

Some of the biggest architectural changes were made in the execution units. Haswell widened the scheduler by two ports, one new integer dispatch port and one new memory port, bringing the total to 8 µOps/cycle. The various ports have also been rebalanced. The new port 6 adds another integer ALU to improve integer workloads, freeing up ports 0 and 1 for vector work. It also adds a second branch unit to lower the congestion on port 0. The second added port, port 7, adds a new AGU. This is largely due to the AVX2 improvements, which roughly doubled throughput. Port 0 had its ALU/Mul/shifter extended to 256 bits; the same is true for the vector ALU on port 1 and the ALU/shuffle on port 5. Additionally, a 256-bit FMA unit was added to both port 0 and port 1. This change makes it possible for FMAs and FMULs to issue on both ports. In theory, Haswell can peak at over double the performance of Sandy Bridge, with 16 double-precision or 32 single-precision FLOP/cycle, plus an integer ALU operation and a vector operation.
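
The 16 double / 32 single precision FLOP/cycle figures can be reproduced from the two 256-bit FMA units, counting each fused multiply-add as two floating-point operations per lane. The sketch below shows that arithmetic; the function name is illustrative, and the result is a theoretical per-core peak rather than a measured value.

```python
def peak_flops_per_cycle(fma_units: int, vector_bits: int, element_bits: int) -> int:
    """Theoretical peak FLOP/cycle for one core.

    Each FMA unit processes vector_bits / element_bits lanes per cycle, and
    each fused multiply-add counts as 2 floating-point operations.
    """
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2


# Haswell: 256-bit FMA units on port 0 and port 1.
DOUBLE_PRECISION = peak_flops_per_cycle(fma_units=2, vector_bits=256, element_bits=64)
SINGLE_PRECISION = peak_flops_per_cycle(fma_units=2, vector_bits=256, element_bits=32)

if __name__ == "__main__":
    print(f"Peak DP FLOP/cycle: {DOUBLE_PRECISION}")
    print(f"Peak SP FLOP/cycle: {SINGLE_PRECISION}")
```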

The scheduler dispatches up to 8 ready µOps/cycle in FIFO order through the dispatch ports. µOps involving computational operations are sent to the appropriate unit on ports 0, 1, 5, and 6. Likewise, ports 2, 3, 4, and 7 are used for loads, stores, and address calculations.

Overclocking

See also: Intel's XMP
Warning: Overclocking can result in better performance for many types of workloads but it does so by pushing the system beyond its rated specifications. This can reduce the life of the chip, affect system data integrity, reduce system stability, and cause system components to fail.

Overclocking needs to be done on an unlocked part such as the Core i7-5820K, Core i7-5930K, or Core i7-5960X Extreme Edition. Additionally those chips need to be paired with the Intel X99 Chipset.

haswell oc chips.png

The 5930K and the 5820K are hexa-core parts, whereas the 5960X is an octa-core part. Between 28 and 40 PCIe lanes are available, and the core ratio can go up to 80x the BCLK.

haswell bclk.png

Haswell provides coarse BCLK settings of 100 MHz, 125 MHz, or 167 MHz (this was subsequently changed in Skylake). The clock is generated internally by the chipset, but motherboard ODMs can generate it independently. A single BCLK from the PCH is adjustable in steps of less than 1 MHz; in practice, however, the range is very much limited by the PCI Express and DMI PLL interface. This works out to 100 MHz ±5-7% with PEG/DMI @ 5:5, 125 MHz ±5-7% with PEG/DMI @ 5:4, and 166.66 MHz ±5-7% with PEG/DMI @ 5:3.

  • fCORE = BCLK × [Core Ratio]
  • fRING = BCLK × [Ring Ratio]
  • fDDR = BCLK × [1.33/1.00] × [DDR Ratio]
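
The three relations above can be written as a small calculator. This is a sketch only: the function names and the example ratios are hypothetical, the x80 limit is the maximum core ratio mentioned in this section, and the 1.33/1.00 values are used exactly as written in the formula above.

```python
MAX_CORE_RATIO = 80  # "up to x80" per this section
DDR_REFERENCE_MULTIPLIERS = (1.33, 1.00)  # values as written in the formula


def core_frequency_mhz(bclk_mhz: float, core_ratio: int) -> float:
    """fCORE = BCLK x core ratio."""
    if not 1 <= core_ratio <= MAX_CORE_RATIO:
        raise ValueError(f"core ratio must be between 1 and {MAX_CORE_RATIO}")
    return bclk_mhz * core_ratio


def ring_frequency_mhz(bclk_mhz: float, ring_ratio: int) -> float:
    """fRING = BCLK x ring ratio."""
    return bclk_mhz * ring_ratio


def ddr_frequency_mhz(bclk_mhz: float, reference_multiplier: float, ddr_ratio: int) -> float:
    """fDDR = BCLK x (1.33 or 1.00) x DDR ratio."""
    if reference_multiplier not in DDR_REFERENCE_MULTIPLIERS:
        raise ValueError("reference multiplier must be 1.33 or 1.00")
    return bclk_mhz * reference_multiplier * ddr_ratio


if __name__ == "__main__":
    # Hypothetical settings, chosen only to exercise the formulas.
    print(core_frequency_mhz(100.0, 35))       # core clock in MHz
    print(ring_frequency_mhz(125.0, 28))       # ring clock in MHz
    print(ddr_frequency_mhz(100.0, 1.00, 24))  # memory clock figure in MHz
```

Keeping the ratio checks in the functions makes an out-of-range setting fail loudly rather than silently producing a frequency the text says is not available.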

All the clock domains in Haswell are derived from the BCLK (also called DMICLK). In the diagram on the right, (xC) refers to the core frequency, expressed as a multiple of BCLK (Core Frequency = BCLK × Core Freq Multiplier, up to x80). Likewise, (xM) refers to the memory ratio (up to 2667 MT/s, in steps of 200 and 266 MHz). Two additional multipliers adjust the PEG (PCIe & Graphics)/DMI links, which should remain at a nominal frequency of 100 MHz.

Voltage control is handled by Haswell's new FIVR (Fully Integrated Voltage Regulator) based architecture. Voltage arrives from the motherboard via the VCCin input to the processor and on to the voltage regulator (VCCin = SVID 1.8 V nominal, up to 2.3 V+). Internally, the various voltage planes are all derived from it, including VCORE, VRING, and VSA. The memory voltage (VDDQ = 1.2 V nominal) is provided by the motherboard on its own rail.

Die

Client Die

Client dies come in 2-, 4-, or 8-core configurations, with dual- and quad-core being the mainstream models and the octa-core being the high-end desktop part.

Dual-core GT2

  • 22 nm process
  • 960,000,000 transistors
  • 131 mm² die size
  • 2 CPU cores

Dual-core GT3

  • 22 nm process
  • 1,300,000,000 transistors
  • 181 mm² die size
  • 2 CPU cores
haswell gt3 die (dual-core).jpg

Quad-core GT2

  • 22 nm process
  • 1,400,000,000 transistors
  • 177 mm² die size
  • 4 CPU cores
  • 1 GPU core
    • 2x10xEU (80 ALUs)
haswell die (quad-core).png
haswell die (quad-core) (annotated).png

Quad-core GT3

  • 22 nm process
  • 1,700,000,000 transistors
  • 260 mm² die size
  • 4 CPU cores

Octa-core

haswell (octa-core) die shot.png


haswell (octa-core) die shot (annotated).png

Server Die

Octadeca-core

intel xeon e7 v3.jpg

Added instructions

AVX2 - Integer data types were extended to 256-bit SIMD.

BMI1 / BMI2 - Bit Manipulation Instruction Sets

FMA3 - Fused Multiply-Add instructions, 3 operands

MOVBE - Move Big-Endian instruction

TSX - Transactional Synchronization Extensions

All Haswell Chips

 List of Haswell Processors
Model | Launched | Price | Family | Core Name | Cores | Threads | L2$ | L3$ | TDP | Frequency | Turbo (1 Core) | Turbo (2 Cores) | Turbo (3 Cores) | Turbo (4 Cores) | Max Mem | GPU | GPU Frequency | GPU Turbo
 Uniprocessors
i5-4570R | 4 June 2013 | - | Core i5 | - | 4 | 4 | 1 MiB | 4 MiB | 65 W | 2.7 GHz | 3.2 GHz | 3.1 GHz | 3 GHz | 3 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 1,150 MHz
i5-4670R | 4 June 2013 | - | Core i5 | - | 4 | 4 | 1 MiB | 4 MiB | 65 W | 3 GHz | 3.7 GHz | 3.6 GHz | 3.5 GHz | 3.5 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 1,300 MHz
i7-4750HQ | 2 June 2013 | - | Core i7 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 2 GHz | 3.2 GHz | 3.1 GHz | 3 GHz | 3 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 1,200 MHz
i7-4760HQ | 14 April 2014 | - | Core i7 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 2.1 GHz | 3.3 GHz | 3.2 GHz | 3.1 GHz | 3.1 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 1,200 MHz
i7-4770HQ | 20 July 2014 | - | Core i7 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 2.2 GHz | 3.4 GHz | 3.3 GHz | 3.2 GHz | 3.2 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 1,200 MHz
i7-4770R | 4 June 2013 | - | Core i7 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 3.2 GHz | 3.9 GHz | 3.8 GHz | 3.7 GHz | 3.7 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 1,300 MHz
i7-4850EQ | 20 February 2014 | - | Core i7 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 1.6 GHz | 3.2 GHz | 3.1 GHz | 3 GHz | 3 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 650 MHz
i7-4850HQ | 4 June 2013 | - | Core i7 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 2.3 GHz | 3.5 GHz | 3.4 GHz | 3.3 GHz | 3.3 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 1,200 MHz
i7-4860EQ | 20 February 2014 | - | Core i7 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 1.8 GHz | 3.2 GHz | 3.1 GHz | 3 GHz | 3 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 750 MHz
i7-4860HQ | 19 January 2014 | - | Core i7 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 2.3 GHz | 2.4 GHz | 3.6 GHz | 3.5 GHz | 3.4 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 1,200 MHz
i7-4870HQ | 20 July 2014 | - | Core i7 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 2.5 GHz | 3.7 GHz | 3.6 GHz | 3.5 GHz | 3.5 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 1,200 MHz
i7-4950HQ | 4 June 2013 | - | Core i7 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 2.4 GHz | 3.6 GHz | 3.5 GHz | 3.4 GHz | 3.4 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 1,300 MHz
i7-4960HQ | 1 September 2013 | - | Core i7 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 2.6 GHz | 3.8 GHz | 3.7 GHz | 3.6 GHz | 3.4 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 1,300 MHz
i7-4980HQ | 1 September 2014 | - | Core i7 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 2.8 GHz | 4 GHz | 3.9 GHz | 3.6 GHz | 3.6 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 1,300 MHz
i7-4930MX | 2 June 2013 | - | Core i7EE | Haswell | 4 | 8 | 1 MiB | 8 MiB | 57 W | 3 GHz | 3.9 GHz | 3.8 GHz | 3.7 GHz | 3.7 GHz | 32 GiB | Intel HD Graphics 4600 | 400 MHz | 1,350 MHz
i7-4940MX | 19 January 2014 | - | Core i7EE | Haswell | 4 | 8 | 1 MiB | 8 MiB | 57 W | 3.1 GHz | 4 GHz | 3.9 GHz | 3.8 GHz | 3.8 GHz | 32 GiB | Intel HD Graphics 4600 | 400 MHz | 1,350 MHz
i7-5960X | 29 August 2014 | - | Core i7EE | Haswell E | 8 | 16 | 2 MiB | 20 MiB | 140 W | 3 GHz | 3.5 GHz | 3.5 GHz | 3.3 GHz | 3.3 GHz | 64 GiB | - | - | -
E3-1284L v3 | 1 October 2014 | - | Xeon E3 | - | 4 | 8 | 1 MiB | 6 MiB | 47 W | 1.8 GHz | 3.2 GHz | 3.1 GHz | 3 GHz | 3 GHz | 32 GiB | Intel Iris Pro Graphics 5200 | 200 MHz | 750 MHz
 Multiprocessors (2-way)
E5-2670 v3 | 8 September 2014 | $ 1,589.00 | Xeon E5 | Haswell EP | 12 | 24 | 3 MiB | 30 MiB | 120 W | 2.3 GHz | 3.1 GHz | 3.1 GHz | 2.9 GHz | 2.8 GHz | 786 GiB | - | - | -
 Multiprocessors (4-way)
 Multiprocessors (8-way)
Count: 19

References

  • Hammarlund, Per, et al. "Haswell: The fourth-generation intel core processor." IEEE Micro 34.2 (2014): 6-20.
  • Dan Ragland, Overclocking System Architect, 2015 IDF, in San Francisco, Session RPCS001 ("Overclocking 6th Generation Intel® Core™ Processors!"), August 18, 2015

Documents

  • Haswell new ISA extensions (File:haswell isa extension.pdf)
