From WikiChip
Editing macro-operation fusion
Latest revision | Your text | ||
Line 1: | Line 1: | ||
{{title|Macro-Operation Fusion (MOP Fusion)}}{{confuse|micro-operation fusion}} | {{title|Macro-Operation Fusion (MOP Fusion)}}{{confuse|micro-operation fusion}} | ||
− | '''Macro-Operation Fusion''' (also '''Macro-Op Fusion''', '''MOP Fusion''', or '''Macrofusion''') is a hardware optimization technique found in | + | '''Macro-Operation Fusion''' (also '''Macro-Op Fusion''', '''MOP Fusion''', or '''Macrofusion''') is a hardware optimization technique found in [[Intel]]'s [[x86]] [[microarchitectures]] whereby a pair of adjacent [[macro-operations]] are merged into a single macro-operation. |
− | == | + | == History == |
− | + | The technique for fusing instructions is owned by [[Intel]] under [https://www.google.com/patents/US6675376 Patent US6675376] ("System and method for fusing instructions") originally filed in December [[2000]]. MOP Fusion was first introduced in [[2006]] in the {{intel|Core|l=arch}} microarchitecture and has been featured in every Intel microarch since. | |
− | A | + | == Motivation == |
+ | A fused instruction remains fused throughout its lifetime. Fused instructions can therefore represent more work with fewer bits, free up execution units and tracking resources (e.g. in the [[register renaming|rename unit]]), save pipeline bandwidth in all stages from decode to retire, and consequently save power. Because fusion is done before decoding, even decoding bandwidth is saved. | ||
− | + | Conditional branches are a very common operation in almost all workloads; by Intel's estimates they make up 15% of all instructions. Macro-op fusion also helps workloads that are not compiled, as is the case with many [[interpreted programming languages]] (e.g. [[PHP]], the software running WikiChip). In those programs, conditional branches are seldom fused as they would be by a static [[compiler]]. |
− | |||
− | + | == Mechanism == | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
<div style="float: right; text-align: center; margin: 10px;"> | <div style="float: right; text-align: center; margin: 10px;"> | ||
− | [[File:core mopf off.png| | + | [[File:core mopf off.png|450px]] |
− | [[File:core mopf on.png| | + | [[File:core mopf on.png|450px]] |
<small>Slides from Intel's {{intel|Core|l=arch}} microarchitecture presentation.</small> | <small>Slides from Intel's {{intel|Core|l=arch}} microarchitecture presentation.</small> | ||
</div> | </div> | ||
− | After the boundaries of [[macro-ops]] are found and marked, they are delivered to the [[instruction queue]] before being fed to the [[instruction decode|decoders]]. At that stage of the [[pipeline]], macro-operation fusion opportunities can be identified and exploited | + | After the boundaries of [[macro-ops]] are found and marked, they are delivered to the [[instruction queue]] before being fed to the [[instruction decode|decoders]]. At that stage of the [[pipeline]], macro-operation fusion opportunities can be identified and exploited. |
− | + | A pair of [[dependent instructions]] is first compared against a set of criteria. For example, either the first source or destination operand must be a [[register]] and the second source operand (if one exists) must be an [[immediate value]] or a non-{{x86|RIP-Relative Addressing|RIP-relative memory}} operand. Fusion replaces the two instructions with a single instruction representing both operations behaviorally. |
− | Fusion is done on a compare or flag-modifying instruction (e.g., <code>{{x86|CMP}}</code> or <code>{{x86|ADD}}</code>) with a subsequent conditional [[jump instruction]]. The produced output is a single operation-and-branch instruction. The final fused instruction remains as such for its remaining lifetime; that is, the fused instruction will stay fused throughout the [[pipeline]] until it reaches the execution units, where it may be executed on a single port or | + | Fusion is done on a compare or flag-modifying instruction (e.g., <code>{{x86|CMP}}</code> or <code>{{x86|ADD}}</code>) with a subsequent conditional [[jump instruction]]. The produced output is a single operation-and-branch instruction. The final fused instruction remains as such for its remaining lifetime; that is, the fused instruction will stay fused throughout the [[pipeline]] until it reaches the execution units, where it may be executed on a single port or dual-issued on two appropriate ports. |
* Two instructions must be right next to each other, with no other instruction in between | * Two instructions must be right next to each other, with no other instruction in between | ||
Line 96: | Line 49: | ||
|} | |} | ||
− | + | === Prior limitations === | |
− | + | ==== Nehalem µarch limitations ==== | |
In {{intel|Nehalem|l=arch}}, Intel introduced a number of enhancements: | In {{intel|Nehalem|l=arch}}, Intel introduced a number of enhancements: | ||
Line 104: | Line 57: | ||
* Supported on {{x86|x86-64}} mode | * Supported on {{x86|x86-64}} mode | ||
− | + | ==== Core µarch limitations ==== | |
The original implementation in the {{intel|Core|l=arch}} microarchitecture was much more limited than in recent processors. | The original implementation in the {{intel|Core|l=arch}} microarchitecture was much more limited than in recent processors. | ||
Line 117: | Line 70: | ||
* <code>{{x86|TEST}}</code> can be fused with all conditional jumps | * <code>{{x86|TEST}}</code> can be fused with all conditional jumps | ||
* <code>{{x86|CMP}}</code> can only be fused with {{x86|Carry Flag}} ({{x86|CF}}) / {{x86|Zero Flag}} ({{x86|ZF}}) conditional jumps: <code>{{x86|JA}}</code>, <code>{{x86|JNBE}}</code>, <code>{{x86|JAE}}</code>, <code>{{x86|JNB}}</code>, <code>{{x86|JNC}}</code>, <code>{{x86|JE}}</code>, <code>{{x86|JZ}}</code>, <code>{{x86|JNA}}</code>, <code>{{x86|JBE}}</code>, <code>{{x86|JNAE}}</code>, <code>{{x86|JC}}</code>, <code>{{x86|JB}}</code>, <code>{{x86|JNE}}</code>, <code>{{x86|JNZ}}</code> | * <code>{{x86|CMP}}</code> can only be fused with {{x86|Carry Flag}} ({{x86|CF}}) / {{x86|Zero Flag}} ({{x86|ZF}}) conditional jumps: <code>{{x86|JA}}</code>, <code>{{x86|JNBE}}</code>, <code>{{x86|JAE}}</code>, <code>{{x86|JNB}}</code>, <code>{{x86|JNC}}</code>, <code>{{x86|JE}}</code>, <code>{{x86|JZ}}</code>, <code>{{x86|JNA}}</code>, <code>{{x86|JBE}}</code>, <code>{{x86|JNAE}}</code>, <code>{{x86|JC}}</code>, <code>{{x86|JB}}</code>, <code>{{x86|JNE}}</code>, <code>{{x86|JNZ}}</code> | ||
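The two pairing rules above for the {{intel|Core|l=arch}} microarchitecture can be sketched as a small lookup. This is only an illustrative model, not Intel's decoder logic: the names <code>can_fuse_core</code>, <code>fusible_pairs</code>, <code>CMP_FUSIBLE_JUMPS</code> and <code>FLAG_JUMPS</code> are hypothetical, "all conditional jumps" is assumed to mean the flag-based <code>Jcc</code> mnemonics (the <code>JCXZ</code> family is left out), and operand-type and operating-mode restrictions are deliberately ignored.

```python
# Illustrative model of the Core-era pairing rules listed above.
# All names here are hypothetical; this is not Intel's implementation.

# Jumps that read only CF and/or ZF (the list given for CMP above).
CMP_FUSIBLE_JUMPS = frozenset({
    "JA", "JNBE", "JAE", "JNB", "JNC", "JE", "JZ",
    "JNA", "JBE", "JNAE", "JC", "JB", "JNE", "JNZ",
})

# Assumption: "all conditional jumps" means the flag-based Jcc mnemonics.
FLAG_JUMPS = CMP_FUSIBLE_JUMPS | frozenset({
    "JO", "JNO", "JS", "JNS", "JP", "JPE", "JNP", "JPO",
    "JL", "JNGE", "JGE", "JNL", "JLE", "JNG", "JG", "JNLE",
})


def can_fuse_core(first: str, second: str) -> bool:
    """Return True if the mnemonic pair matches the rules sketched above."""
    first, second = first.upper(), second.upper()
    if second not in FLAG_JUMPS:
        return False
    if first == "TEST":
        return True                      # TEST pairs with any conditional jump
    if first == "CMP":
        return second in CMP_FUSIBLE_JUMPS  # CMP pairs only with CF/ZF jumps
    return False                         # assumption: nothing else fuses here


def fusible_pairs(mnemonics):
    """Scan adjacent mnemonics and return the start index of each fusible pair."""
    pairs = []
    i = 0
    while i < len(mnemonics) - 1:
        if can_fuse_core(mnemonics[i], mnemonics[i + 1]):
            pairs.append(i)
            i += 2                       # both instructions are consumed by the pair
        else:
            i += 1
    return pairs
```

For instance, under these assumptions <code>can_fuse_core("CMP", "JG")</code> is false because <code>JG</code> depends on the sign and overflow flags, while <code>can_fuse_core("TEST", "JG")</code> is true. Only strictly adjacent instructions are considered, matching the requirement that no other instruction sits between the two.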
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− |