From WikiChip
Difference between revisions of "macro-operation fusion"


Revision as of 20:29, 28 July 2017

Not to be confused with micro-operation fusion.

Macro-Operation Fusion (also Macro-Op Fusion, MOP Fusion, or Macrofusion) is a hardware optimization technique found in Intel's x86 microarchitectures whereby a pair of adjacent macro-operations is merged into a single macro-operation prior to decoding. The fused instructions are later decoded into fused-µOPs.

History

The technique for fusing instructions is owned by Intel and is protected by Patent US6675376 ("System and method for fusing instructions") originally filed in December 2000. MOP Fusion was first introduced in 2006 in the Core microarchitecture and has been featured in every Intel microarch since.

Motivation

A fused instruction remains fused throughout its lifetime. Fused instructions can therefore represent more work with fewer bits, free up execution units and tracking resources (e.g. in the rename unit), save pipeline bandwidth in all stages from decode to retire, and consequently save power. Because fusion is done before decoding, even decode bandwidth is saved.

Conditional branches are a very common operation in almost all workloads; by Intel's estimates they make up 15% of all instructions. Macro-op fusion also helps workloads that are not compiled, as is the case with many interpreted programming languages (e.g. PHP, the software running WikiChip).

Mechanism

[Figures: core mopf off.png and core mopf on.png. Slides from Intel's Core microarchitecture presentation.]

After the boundaries of macro-ops are found and marked, they are delivered to the instruction queue before being fed to the decoders. At that stage of the pipeline, macro-operation fusion opportunities can be identified and exploited.

A pair of dependent instructions is first compared against a set of criteria. For example, either the first source or destination operand must be a register, and the second source operand (if one exists) must be an immediate value or a non-RIP-relative memory operand. Fusion replaces the two instructions with a single instruction representing both operations behaviorally.

Fusion is performed on a compare or other flag-modifying instruction (e.g., CMP or ADD) followed by a conditional jump instruction. The output is a single operation-and-branch instruction. The fused instruction remains fused for the rest of its lifetime; that is, it stays fused throughout the pipeline until it reaches the execution units, where it may be executed on a single port or dual-issued on two appropriate ports.

  • The two instructions must be adjacent, with no other instruction in between
  • First instruction must be one of the following: CMP, TEST, ADD, SUB, INC, DEC, or AND.
  • Second instruction must be a conditional jump (e.g., JA, JAE, JE, JNE)
  • Fusion cannot take place if the first instruction ends on byte 63 of a cache line and the second instruction starts at byte 0 of the next line.

Additionally, at most one macrofusion can take place each cycle. If two macrofusions are possible in the same cycle, only the first pair is fused; the second pair continues unfused.
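The criteria above can be sketched as a small Python model. This is an illustrative sketch only, not a description of Intel's hardware: the `Instr` record, the function names, and the idea of passing in the group of instructions seen in one cycle are all assumptions made for the example. The 64-byte cache line size is inferred from the "byte 63" rule, and the sketch ignores the per-pair restrictions of the Macro-Fusibility table.

```python
from dataclasses import dataclass

CACHE_LINE_BYTES = 64  # assumption: inferred from the "byte 63" rule in the text

# First-instruction and conditional-jump mnemonics as listed in the article.
FUSIBLE_FIRST = {"CMP", "TEST", "ADD", "SUB", "INC", "DEC", "AND"}
CONDITIONAL_JUMPS = {
    "JO", "JNO", "JC", "JB", "JAE", "JNB", "JNC", "JNAE",
    "JE", "JZ", "JNE", "JNZ", "JNA", "JBE", "JA", "JNBE",
    "JS", "JNS", "JP", "JPE", "JNP", "JPO",
    "JL", "JNGE", "JGE", "JNL", "JLE", "JNG", "JG", "JNLE",
}


@dataclass(frozen=True)
class Instr:
    """Hypothetical record: mnemonic, address of first byte, length in bytes."""
    mnemonic: str
    address: int
    length: int


def straddles_cache_line(first: Instr, second: Instr) -> bool:
    """True if `first` ends on the last byte of a line and `second` starts the next."""
    last_byte = first.address + first.length - 1
    return (last_byte % CACHE_LINE_BYTES == CACHE_LINE_BYTES - 1
            and second.address % CACHE_LINE_BYTES == 0)


def may_fuse(first: Instr, second: Instr) -> bool:
    """Apply the listed criteria to two adjacent instructions."""
    return (first.mnemonic in FUSIBLE_FIRST
            and second.mnemonic in CONDITIONAL_JUMPS
            and not straddles_cache_line(first, second))


def fuse_one_per_cycle(group: list) -> list:
    """Fuse only the first eligible adjacent pair in one cycle's group.

    A fused pair is represented as a tuple; everything else passes through.
    """
    out, fused, i = [], False, 0
    while i < len(group):
        if not fused and i + 1 < len(group) and may_fuse(group[i], group[i + 1]):
            out.append((group[i], group[i + 1]))
            fused = True
            i += 2
        else:
            out.append(group[i])
            i += 1
    return out
```

For a group containing two candidate pairs, such as CMP/JE followed by TEST/JNZ, this model fuses only the first pair and passes the second pair through unfused, matching the one-fusion-per-cycle limit described above.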

Macro-Fusibility
Instruction TEST CMP AND ADD SUB INC DEC
JO/JNO
JC/JB/JAE/JNB
JE/JZ/JNE/JNZ
JNA/JBE/JA/JNBE
JS/JNS/JP/JPE/JNP/JPO
JL/JNGE/JGE/JNL/JLE/JNG/JG/JNLE

Prior limitations

Nehalem µarch limitations

In Nehalem, Intel introduced a number of enhancements:

Core µarch limitations

The original implementation in the Core microarchitecture was much more limited than in recent processors.

  • First instruction must be either CMP or TEST
  • Macro-fusion is restricted to 16-bit and 32-bit modes only (including the 32-bit compatibility sub-mode in x86-64).
  • CMP and TEST can fuse when comparing:
    • REG-REG. (e.g., CMP EAX,ECX; JZ label)
    • REG-IMM. (e.g., CMP EAX,0x80; JZ label)
    • REG-MEM. (e.g., CMP EAX,[ECX]; JZ label)
    • MEM-REG. (e.g., CMP [EAX],ECX; JZ label)
  • CMP and TEST cannot be fused when comparing MEM-IMM (e.g., CMP [EAX],0x80; JZ label)
  • TEST can be fused with all conditional jumps
  • CMP can only be fused with Carry Flag (CF) / Zero Flag (ZF) conditional jumps: JA, JNBE, JAE, JNB, JNC, JE, JZ, JNA, JBE, JNAE, JC, JB, JNE, JNZ
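
The Core-era restrictions listed above can be expressed as a single predicate. The following Python sketch is a hypothetical model written for illustration: the function name `core_may_fuse`, its parameters, and the operand-kind labels `"REG"`, `"IMM"`, and `"MEM"` are assumptions, not part of any Intel interface. It encodes only the rules stated in the list.

```python
# Jumps that CMP may fuse with on the Core microarchitecture (from the list above).
CF_ZF_JUMPS = {
    "JA", "JNBE", "JAE", "JNB", "JNC", "JE", "JZ",
    "JNA", "JBE", "JNAE", "JC", "JB", "JNE", "JNZ",
}

# All conditional jumps named in the article; TEST may fuse with any of them.
ALL_CONDITIONAL_JUMPS = CF_ZF_JUMPS | {
    "JO", "JNO", "JS", "JNS", "JP", "JPE", "JNP", "JPO",
    "JL", "JNGE", "JGE", "JNL", "JLE", "JNG", "JG", "JNLE",
}

# (first operand, second operand) forms that can fuse; MEM-IMM is excluded.
FUSIBLE_OPERAND_FORMS = {
    ("REG", "REG"), ("REG", "IMM"), ("REG", "MEM"), ("MEM", "REG"),
}


def core_may_fuse(first, operand1, operand2, jump, in_64bit_mode=False):
    """Return True if the pair satisfies the Core-era rules listed above.

    `first` and `jump` are mnemonics; `operand1`/`operand2` are the
    hypothetical operand-kind labels "REG", "IMM", or "MEM".
    """
    if in_64bit_mode:
        return False  # fusion is limited to 16-bit and 32-bit modes
    if first not in ("CMP", "TEST"):
        return False
    if (operand1, operand2) not in FUSIBLE_OPERAND_FORMS:
        return False
    if first == "TEST":
        return jump in ALL_CONDITIONAL_JUMPS
    return jump in CF_ZF_JUMPS  # CMP: only CF/ZF-based jumps
```

Under this model, `CMP EAX,0x80; JZ label` qualifies, while `CMP [EAX],0x80; JZ label` (MEM-IMM) and `CMP EAX,ECX; JL label` (a jump that is not CF/ZF-based) do not.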

See also