Editing macro-operation fusion

{{title|Macro-Operation Fusion (MOP Fusion)}}{{confuse|micro-operation fusion}}
'''Macro-Operation Fusion''' (also '''Macro-Op Fusion''', '''MOP Fusion''', or '''Macrofusion''') is a hardware optimization technique found in [[Intel]]'s [[x86]] [[microarchitectures]] whereby a pair of adjacent [[macro-operations]] are merged into a single macro-operation.

== History ==
The technique for fusing instructions is owned by [[Intel]] under [https://www.google.com/patents/US6675376 Patent US6675376] ("System and method for fusing instructions") originally filed in December [[2000]]. MOP Fusion was first introduced in [[2006]] in the {{intel|Core|l=arch}} microarchitecture and has been featured in every Intel microarch since.

== Motivation ==
A fused instruction remains fused throughout its lifetime. Therefore fused instructions can represent more work with less bits, free up execution units, tracking information (e.g. in the [[register renaming|rename unit]]), save pipeline bandwidth in all stages from decode to retire, and consequently save power. Note that this is done before decoding, therefore even decoding bandwidth is save.

Conditional branching are a very common operation in almost all workloads; by Intel estimates it makes up 15% of all instructions. Macro-op fusion also helps workloads that are not compiled such as in the case of many [[interpreted programming languages]] (e.g. [[PHP]], the software running WikiChip). In those programs, conditional branching is seldomly fused as they would by a static [[compiler]].

== Mechanism ==
<div style="float: right; text-align: center; margin: 10px;">
[[File:core mopf off.png|450px]]

[[File:core mopf on.png|450px]]

<small>Slides from Intel's {{intel|Core|l=arch}} microarchitecture presentation.</small>
</div>
After the boundaries of [[macro-ops]] are found and marked, they are delivered to the [[instruction queue]] before being fed to the [[instruction decode|decoders]]. At that stage of the [[pipeline]], macro-operation fusion opportunities can be identified and exploited.

A pair of two [[dependent instructions]] are first compared against a set of criteria. For example, either the first source or destination operand must be a [[register]] and the second source operand (if one exists) must be an [[immediate value]] or a non-{{x86|RIP-Relative Addressing|RIP-relative memory}}. Fusion replaces the two instructions with a single instruction representing both operations behaviorally.

Fusion is done on compare flag-modifying instruction (e.g., <code>{{x86|CMP}}</code> or <code>{{x86|ADD}}</code>) with a subsequent conditional [[jump instruction]]. The produced output is a single single operation-and-branch instruction. The final fused instruction remains as such for its remaining lifetime; that is the fused instruction will stay fused throughout the [[pipeline]] until execution units where it may be executed on a single port or duel-issued on two appropriate ports.

* Two instructions must be right next to each other, with no other instruction in between
* First instruction must be one of the following: <code>{{x86|CMP}}</code>, <code>{{x86|TEST}}</code>, <code>{{x86|ADD}}</code>, <code>{{x86|SUB}}</code>, <code>{{x86|INC}}</code>, <code>{{x86|DEC}}</code>, or <code>{{x86|AND}}</code>.
* Second instruction must be a conditional jump (e.g., <code>{{x86|JA}}</code>, <code>{{x86|JAE}}</code>, <code>{{x86|JE}}</code>, <code>{{x86|JNE}}</code>)
* Fusion cannot take place if the first instruction ends on byte 63 of a cache line and the second instruction starts at byte 0 of the next line.

Additionally, only up to 1 macrofusion can take place each cycle. If there it's possible to perform 2 macrofusions, only the first pair will be fused. The second pair will continue unfused.

{| class="wikitable tc2 tc3 tc4 tc5 tc6 tc7"
! colspan="8" | Macro-Fusibility
|-
! [[Instruction]] !! {{x86|TEST}} !! {{x86|CMP}} !! {{x86|AND}} !! {{x86|ADD}} !! {{x86|SUB}} !! {{x86|INC}} !! {{x86|DEC}}
|-
| {{x86|JO}}/{{x86|JNO}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘
|-
| {{x86|JC}}/{{x86|JB}}/{{x86|JAE}}/{{x86|JNB}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘
|-
| {{x86|JE}}/{{x86|JZ}}/{{x86|JNE}}/{{x86|JNZ}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔  || style="background-color: #d6ffd8;" | ✔
|-
| {{x86|JNA}}/{{x86|JBE}}/{{x86|JA}}/{{x86|JNBE}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘
|-
| {{x86|JS}}/{{x86|JNS}}/{{x86|JP}}/{{x86|JPE}}/{{x86|JNP}}/{{x86|JPO}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘
|-
| {{x86|JL}}/{{x86|JNGE}}/{{x86|JGE}}/{{x86|JNL}}/{{x86|JLE}}/{{x86|JNG}}/{{x86|JG}}/{{x86|JNLE}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔
|}

=== Prior limitations ===

==== Nehalem µarch limitations ====
In {{intel|Nehalem|l=arch}}, Intel introduced a number of enhancements:

* <code>{{x86|CMP}}</code> can be fused with: <code>{{x86|JL}}</code>, <code>{{x86|JNGE}}</code>, <code>{{x86|JGE}}</code>, <code>{{x86|JNL}}</code>, <code>{{x86|JLE}}</code>, <code>{{x86|JNG}}</code>, <code>{{x86|JG}}</code>, <code>{{x86|JNLE}}</code>
* Supported on {{x86|x86-64}} mode

==== Core µarch limitations ====
The original implementation in the {{intel|Core|l=arch}} microarchitecture was much more limited than in recent processors.

* First instruction must be one of the following: <code>{{x86|CMP}}</code> and <code>{{x86|TEST}}</code>
* Macro Fusion is restricted to {{x86|x86-16|16-bit}} and {{x86|x86-32|32-bit mode}} only (including 32-bit compatibility sub-mode in {{x86|x86-64}}).
* <code>{{x86|CMP}}</code> and <code>{{x86|TEST}}</code> can fuse when comparing:
** [[register|REG]]-REG. (e.g, CMP EAX,ECX; JZ label)
** REG-[[immediate value|IMM]]. (e.g., CMP EAX,0x80; JZ label)
** REG-[[memory|MEM]]. (e.g., CMP EAX,[ECX]; JZ label)
** MEM-REG. (e.g., CMP [EAX],ECX; JZ label)
* <code>{{x86|CMP}}</code> and <code>{{x86|TEST}}</code> can not be fused when comparing MEM-IMM (e.g. CMP [EAX],0x80; JZ label)
* <code>{{x86|TEST}}</code> can fused with all conditional jumps
* <code>{{x86|CMP}}</code> can only be fused with {{x86|Carry Flag}} ({{x86|CF}}) / {{x86|Zero Flag}} ({{x86|ZF}}) conditional jumps: <code>{{x86|JA}}</code>, <code>{{x86|JNBE}}</code>, <code>{{x86|JAE}}</code>, <code>{{x86|JNB}}</code>, <code>{{x86|JNC}}</code>, <code>{{x86|JE}}</code>, <code>{{x86|JZ}}</code>, <code>{{x86|JNA}}</code>, <code>{{x86|JBE}}</code>, <code>{{x86|JNAE}}</code>, <code>{{x86|JC}}</code>, <code>{{x86|JB}}</code>, <code>{{x86|JNE}}</code>, <code>{{x86|JNZ}}</code>
@@ Line 1: / Line 1: @@
 {{title|Macro-Operation Fusion (MOP Fusion)}}{{confuse|micro-operation fusion}}
-'''Macro-Operation Fusion''' (also '''Macro-Op Fusion''', '''MOP Fusion''', or '''Macrofusion''') is a hardware optimization technique found in many modern [[microarchitectures]] whereby a series of adjacent [[macro-operations]] are merged into a single macro-operation prior or during decoding. Those instructions are later decoded into fused-µOPs.
+'''Macro-Operation Fusion''' (also '''Macro-Op Fusion''', '''MOP Fusion''', or '''Macrofusion''') is a hardware optimization technique found in [[Intel]]'s [[x86]] [[microarchitectures]] whereby a pair of adjacent [[macro-operations]] are merged into a single macro-operation.
-== Overview & Motivation ==
+== History ==
-One of the three [[microprocessor performance|performance knobs of a microprocessor]] is the [[instruction count]]. By reducing the number of instructions that must be executed, more work can be done with fewer resources. The idea behind macro-operation fusion is to combine multiple adjacent instructions into a single instruction. A fused instruction typically remains fused throughout its lifetime. Therefore fused instructions can represent more work with fewer bits, free up execution units, tracking information (e.g. in the [[register renaming|rename unit]]), save pipeline bandwidth in all stages from decode to retire, and consequently save power.
+The technique for fusing instructions is owned by [[Intel]] under [https://www.google.com/patents/US6675376 Patent US6675376] ("System and method for fusing instructions") originally filed in December [[2000]]. MOP Fusion was first introduced in [[2006]] in the {{intel|Core|l=arch}} microarchitecture and has been featured in every Intel microarch since.
-A unique aspect of macro-op fusion is that it also helps workloads that are not compiled such as in the case of many [[interpreted programming languages]] (e.g. [[PHP]], the software running WikiChip).
+== Motivation ==
+A fused instruction remains fused throughout its lifetime. Therefore fused instructions can represent more work with less bits, free up execution units, tracking information (e.g. in the [[register renaming|rename unit]]), save pipeline bandwidth in all stages from decode to retire, and consequently save power. Note that this is done before decoding, therefore even decoding bandwidth is save.
-== Arm ==
+Conditional branching are a very common operation in almost all workloads; by Intel estimates it makes up 15% of all instructions. Macro-op fusion also helps workloads that are not compiled such as in the case of many [[interpreted programming languages]] (e.g. [[PHP]], the software running WikiChip). In those programs, conditional branching is seldomly fused as they would by a static [[compiler]].
-Arm supports a number of macro-op fusion operations in their recent microarchitectures.
-* movw + movt
+== Mechanism ==
-* aese + aesmc
-* aesd + aesimc
-== RISC-V ==
-The use of macro-op fusion in RISC-V was proposed in a 2016 Berkeley paper<ref>Celio et al</ref> where a renewed case was made for the use of macro-operation fusion over bloating the ISA with more complex instructions. The paper compared the RISC-V isa performance in terms of instruction count on the popular [[SPEC CPU2006]] benchmark where it is found to be slightly behind contemporary ISAs. In their paper<ref>Celio et al</ref>, it's claimed that the RV64G and RV64GC effective instruction count can be reduced by 5.4% on average by leveraging macro-op fusion, thereby closing much of the deficiency gap. The used of macro-op fusion has gained larger support in the RISC-V community in favor of the microarchitecture taking care of this aspect rather than bloating the ISA with more complex instructions.
-=== Proposed fusion operations ===
-Some of the common operations are:
-{| class="wikitable"
-|-
-! Pattern !! Result
-|-
-| // rd = array[offset]<br>add rd, rs1, rs2<br>ld rd, 0(rd) || Fused into an indexed load
-|-
-| // &(array[offset])<br>slli rd, rs1, {1,2,3}<br>add rd, rd, rs2 || Fused into a load effective address
-|-
-| // rd = array[offset]<br>slli rd, rs1, {1,2,3}<br>add rd, rd, rs2<br>ld rd, 0(rd) || Three-instruction fused into a load effective address
-|-
-| // rd = rs1 & 0xffffffff<br>slli rd, rs1, 32<br>srli rd, rd, 32 || Clear upper word
-|-
-|// rd = imm[31:0]<br>lui rd, imm[31:12]<br>addi rd, rd, imm[11:0] || Load upper immediate
-|-
-| // rd = *(imm[31:0])<br>lui rd, imm[31:12]<br>ld rd, imm[11:0](rd) || Load upper immediate
-|-
-|// l[dw] rd, symbol[31:0]<br>auipc rd, symbol[31:12]<br>l[dw] rd, symbol[11:0](rd) || Load global immediate
-|-
-| // far jump (1 MB) (AUIPC+JALR)<br>auipc t, imm20<br>jalr ra, imm12(t) || Fused far jump and link with calculated target address
-|-
-| addiw rd, rs1, imm12<br>slli rd, rs1, 32<br>SRLI rd, rs1, 32 || Fused into a single 32-bit zero extending add operation
-|-
-|mulh[[S]U] rdh, rs1, rs2<br>mul rdl, rs1, rs2 || Fused into a wide multiply
-|-
-| div[U] rdq, rs1, rs2<br>rem[U] rdr, rs1, rs2 || Fused into a wide divide
-|-
-|// ldpair rd1,rd2, [imm(rs1)]<br>ld rd1, imm(rs1)<br>ld rd2, imm+8(rs1) || Fused into a load-pair
-|-
-|// ldia rd, imm(rs1)<br>ld rd, imm(rs1)<br>add rs1, rs1, 8 || Fused into a post-indexed load
-|}
-== x86 ==
-=== Intel ===
-Intel uses macro-op fusion in all their modern {{intel|microarchitectures}} since {{intel|Core|l=arch}}.
-==== History ====
-The technique for fusing instructions is owned by [[Intel]] and is protected by [https://www.google.com/patents/US6675376 Patent US6675376] ("System and method for fusing instructions") originally filed in December [[2000]]. MOP Fusion was first introduced in [[2006]] in the {{intel|Core|l=arch}} microarchitecture and has been featured in every Intel microarch since.
-==== Mechanism ====
 <div style="float: right; text-align: center; margin: 10px;">
-[[File:core mopf off.png|350px]]
+[[File:core mopf off.png|450px]]
-[[File:core mopf on.png|350px]]
+[[File:core mopf on.png|450px]]
 <small>Slides from Intel's {{intel|Core|l=arch}} microarchitecture presentation.</small>
 </div>
-After the boundaries of [[macro-ops]] are found and marked, they are delivered to the [[instruction queue]] before being fed to the [[instruction decode|decoders]]. At that stage of the [[pipeline]], macro-operation fusion opportunities can be identified and exploited. Note that this is done before decoding, therefore even decoding bandwidth is saved.
+After the boundaries of [[macro-ops]] are found and marked, they are delivered to the [[instruction queue]] before being fed to the [[instruction decode|decoders]]. At that stage of the [[pipeline]], macro-operation fusion opportunities can be identified and exploited.
-Conditional branching are a very common operation in almost all workloads; by Intel estimates it makes up 15% of all instructions. A pair of two [[dependent instructions]] are first compared against a set of criteria. For example, either the first source or destination operand must be a [[register]] and the second source operand (if one exists) must be an [[immediate value]] or a non-{{x86|RIP-Relative Addressing|RIP-relative memory}}. Fusion replaces the two instructions with a single instruction representing both operations behaviorally.
+A pair of two [[dependent instructions]] are first compared against a set of criteria. For example, either the first source or destination operand must be a [[register]] and the second source operand (if one exists) must be an [[immediate value]] or a non-{{x86|RIP-Relative Addressing|RIP-relative memory}}. Fusion replaces the two instructions with a single instruction representing both operations behaviorally.
-Fusion is done on compare flag-modifying instruction (e.g., <code>{{x86|CMP}}</code> or <code>{{x86|ADD}}</code>) with a subsequent conditional [[jump instruction]]. The produced output is a single operation-and-branch instruction. The final fused instruction remains as such for its remaining lifetime; that is the fused instruction will stay fused throughout the [[pipeline]] until execution units where it may be executed on a single port or dual-issued on two appropriate ports.
+Fusion is done on compare flag-modifying instruction (e.g., <code>{{x86|CMP}}</code> or <code>{{x86|ADD}}</code>) with a subsequent conditional [[jump instruction]]. The produced output is a single single operation-and-branch instruction. The final fused instruction remains as such for its remaining lifetime; that is the fused instruction will stay fused throughout the [[pipeline]] until execution units where it may be executed on a single port or duel-issued on two appropriate ports.
 * Two instructions must be right next to each other, with no other instruction in between
@@ Line 96: / Line 49: @@
 |}
-===== Prior limitations =====
+=== Prior limitations ===
-====== Nehalem µarch limitations ======
+==== Nehalem µarch limitations ====
 In {{intel|Nehalem|l=arch}}, Intel introduced a number of enhancements:
@@ Line 104: / Line 57: @@
 * Supported on {{x86|x86-64}} mode
-====== Core µarch limitations ======
+==== Core µarch limitations ====
 The original implementation in the {{intel|Core|l=arch}} microarchitecture was much more limited than in recent processors.
@@ Line 117: / Line 70: @@
 * <code>{{x86|TEST}}</code> can fused with all conditional jumps
 * <code>{{x86|CMP}}</code> can only be fused with {{x86|Carry Flag}} ({{x86|CF}}) / {{x86|Zero Flag}} ({{x86|ZF}}) conditional jumps: <code>{{x86|JA}}</code>, <code>{{x86|JNBE}}</code>, <code>{{x86|JAE}}</code>, <code>{{x86|JNB}}</code>, <code>{{x86|JNC}}</code>, <code>{{x86|JE}}</code>, <code>{{x86|JZ}}</code>, <code>{{x86|JNA}}</code>, <code>{{x86|JBE}}</code>, <code>{{x86|JNAE}}</code>, <code>{{x86|JC}}</code>, <code>{{x86|JB}}</code>, <code>{{x86|JNE}}</code>, <code>{{x86|JNZ}}</code>
-=== Centaur ===
-[[Centaur Technology]] also implements macro-op fusion in its architectures, including in its most recent server SoC, {{centtech|CHA|l=arch}}.
-== Bibliography ==
-* Celio, Christopher, et al. "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V." arXiv preprint arXiv:1607.02318 (2016).
-{{reflist}}
-== See also ==
-* [[micro-operation fusion]]
-* [[zeroing idioms]]