From WikiChip
Editing macro-operation fusion

Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.

This page supports semantic in-text annotations (e.g. "[[Is specified as::World Heritage Site]]") to build structured and queryable content provided by Semantic MediaWiki. For a comprehensive description on how to use annotations or the #ask parser function, please have a look at the getting started, in-text annotation, or inline queries help pages.

Latest revision Your text
Line 1: Line 1:
 
{{title|Macro-Operation Fusion (MOP Fusion)}}{{confuse|micro-operation fusion}}
 
{{title|Macro-Operation Fusion (MOP Fusion)}}{{confuse|micro-operation fusion}}
'''Macro-Operation Fusion''' (also '''Macro-Op Fusion''', '''MOP Fusion''', or '''Macrofusion''') is a hardware optimization technique found in many modern [[microarchitectures]] whereby a series of adjacent [[macro-operations]] are merged into a single macro-operation prior or during decoding. Those instructions are later decoded into fused-µOPs.
+
'''Macro-Operation Fusion''' ('''Macro-Op Fusion''' or '''MOP Fusion''') is a hardware optimization technique found in [[Intel]]'s [[x86]] [[microarchitectures]] whereby a pair of [[macro-operations]] are merged into a single macro-operation.
  
== Overview & Motivation ==
+
== History ==
One of the three [[microprocessor performance|performance knobs of a microprocessor]] is the [[instruction count]]. By reducing the number of instructions that must be executed, more work can be done with fewer resources. The idea behind macro-operation fusion is to combine multiple adjacent instructions into a single instruction. A fused instruction typically remains fused throughout its lifetime. Therefore fused instructions can represent more work with fewer bits, free up execution units, tracking information (e.g. in the [[register renaming|rename unit]]), save pipeline bandwidth in all stages from decode to retire, and consequently save power.
+
The technique for fusing instructions is owned by [[Intel]] under [https://www.google.com/patents/US6675376 Patent US6675376] ("System and method for fusing instructions") originally filed in December [[2000]]. MOP Fusion was first introduced in the {{intel|Core|l=arch}} microarchitecture and has been featured in every Intel microarch since.
  
A unique aspect of macro-op fusion is that it also helps workloads that are not compiled such as in the case of many [[interpreted programming languages]] (e.g. [[PHP]], the software running WikiChip).
+
== Motivation ==
 +
A fused instruction remains fused throughout its lifetime. Therefore fused instructions can represent more work with less bits, free up execution units, tracking information (e.g. in the [[register renaming|rename unit]]), save pipeline bandwidth in all stages from decode to retire, and consequently save power. Note that this is done before decoding, therefore even decoding bandwidth is save.
  
== Arm ==
+
Conditional branching are a very common operation in almost all workloads. Macro-op fusion also helps workloads that are not compiled such as in the case of many [[interpreted programming languages]] (e.g. [[PHP]], the software running WikiChip). In those programs, conditional branching is seldomly fused as they would by a static [[compiler]].
Arm supports a number of macro-op fusion operations in their recent microarchitectures.
 
  
* movw + movt
+
== Mechanism ==
* aese + aesmc
+
After the boundaries of [[macro-ops]] are found and marked, they are delivered to the [[instruction queue]] before being fed to the [[instruction decode|decoders]]. At that stage of the [[pipeline]], macro-operation fusion opportunities can be identified and exploited.
* aesd + aesimc
 
  
== RISC-V ==
+
A pair of two [[dependent instructions]] are first compared against a set of criteria. For example, if the second instruction is commutative (i.e., the order of operands does not affect the result) or if the destination of the [[operand]] of the first instruction is used as the source operand of the second instruction than the instruction may qualify for fusion. Additionally either the first source or destination operand must be a [[register]] and the second source operand (if one exists) must be an [[immediate value]] or a non-{{x86|RIP-Relative Addressing|RIP-relative memory}}. Fusion replaces the two instructions with a single instruction representing both operations behaviorally.
The use of macro-op fusion in RISC-V was proposed in a 2016 Berkeley paper<ref>Celio et al</ref> where a renewed case was made for the use of macro-operation fusion over bloating the ISA with more complex instructions. The paper compared the RISC-V isa performance in terms of instruction count on the popular [[SPEC CPU2006]] benchmark where it is found to be slightly behind contemporary ISAs. In their paper<ref>Celio et al</ref>, it's claimed that the RV64G and RV64GC effective instruction count can be reduced by 5.4% on average by leveraging macro-op fusion, thereby closing much of the deficiency gap. The used of macro-op fusion has gained larger support in the RISC-V community in favor of the microarchitecture taking care of this aspect rather than bloating the ISA with more complex instructions.
 
  
=== Proposed fusion operations ===
+
Fusion is done on compare (<code>{{x86|CMP}}</code>) and test (<code>{{x86|TEST}}</code>) instruction with a subsequent conditional [[jump instruction]]. The produced output is a single single compare-and-branch instruction. The final fused instruction remains as such for its remaining lifetime; that is the fused instruction will stay fused throughout the [[pipeline]] and execute on a single port in the [[back-end]] that can handle both operations.
Some of the common operations are:
 
 
 
{| class="wikitable"
 
|-
 
! Pattern !! Result
 
|-
 
| // rd = array[offset]<br>add rd, rs1, rs2<br>ld rd, 0(rd) || Fused into an indexed load
 
|-
 
| // &(array[offset])<br>slli rd, rs1, {1,2,3}<br>add rd, rd, rs2 || Fused into a load effective address
 
|-
 
| // rd = array[offset]<br>slli rd, rs1, {1,2,3}<br>add rd, rd, rs2<br>ld rd, 0(rd) || Three-instruction fused into a load effective address
 
|-
 
| // rd = rs1 & 0xffffffff<br>slli rd, rs1, 32<br>srli rd, rd, 32 || Clear upper word
 
|-
 
|// rd = imm[31:0]<br>lui rd, imm[31:12]<br>addi rd, rd, imm[11:0] || Load upper immediate
 
|-
 
| // rd = *(imm[31:0])<br>lui rd, imm[31:12]<br>ld rd, imm[11:0](rd) || Load upper immediate
 
|-
 
|// l[dw] rd, symbol[31:0]<br>auipc rd, symbol[31:12]<br>l[dw] rd, symbol[11:0](rd) || Load global immediate
 
|-
 
| // far jump (1 MB) (AUIPC+JALR)<br>auipc t, imm20<br>jalr ra, imm12(t) || Fused far jump and link with calculated target address
 
|-
 
| addiw rd, rs1, imm12<br>slli rd, rs1, 32<br>SRLI rd, rs1, 32 || Fused into a single 32-bit zero extending add operation
 
|-
 
|mulh[[S]U] rdh, rs1, rs2<br>mul rdl, rs1, rs2 || Fused into a wide multiply
 
|-
 
| div[U] rdq, rs1, rs2<br>rem[U] rdr, rs1, rs2 || Fused into a wide divide
 
|-
 
|// ldpair rd1,rd2, [imm(rs1)]<br>ld rd1, imm(rs1)<br>ld rd2, imm+8(rs1) || Fused into a load-pair
 
|-
 
|// ldia rd, imm(rs1)<br>ld rd, imm(rs1)<br>add rs1, rs1, 8 || Fused into a post-indexed load
 
|}
 
 
 
== x86 ==
 
=== Intel ===
 
Intel uses macro-op fusion in all their modern {{intel|microarchitectures}} since {{intel|Core|l=arch}}.
 
==== History ====
 
The technique for fusing instructions is owned by [[Intel]] and is protected by [https://www.google.com/patents/US6675376 Patent US6675376] ("System and method for fusing instructions") originally filed in December [[2000]]. MOP Fusion was first introduced in [[2006]] in the {{intel|Core|l=arch}} microarchitecture and has been featured in every Intel microarch since.
 
 
 
==== Mechanism ====
 
<div style="float: right; text-align: center; margin: 10px;">
 
[[File:core mopf off.png|350px]]
 
 
 
[[File:core mopf on.png|350px]]
 
 
 
<small>Slides from Intel's {{intel|Core|l=arch}} microarchitecture presentation.</small>
 
</div>
 
After the boundaries of [[macro-ops]] are found and marked, they are delivered to the [[instruction queue]] before being fed to the [[instruction decode|decoders]]. At that stage of the [[pipeline]], macro-operation fusion opportunities can be identified and exploited. Note that this is done before decoding, therefore even decoding bandwidth is saved.
 
 
 
Conditional branching are a very common operation in almost all workloads; by Intel estimates it makes up 15% of all instructions. A pair of two [[dependent instructions]] are first compared against a set of criteria. For example, either the first source or destination operand must be a [[register]] and the second source operand (if one exists) must be an [[immediate value]] or a non-{{x86|RIP-Relative Addressing|RIP-relative memory}}. Fusion replaces the two instructions with a single instruction representing both operations behaviorally.
 
 
 
Fusion is done on compare flag-modifying instruction (e.g., <code>{{x86|CMP}}</code> or <code>{{x86|ADD}}</code>) with a subsequent conditional [[jump instruction]]. The produced output is a single operation-and-branch instruction. The final fused instruction remains as such for its remaining lifetime; that is the fused instruction will stay fused throughout the [[pipeline]] until execution units where it may be executed on a single port or dual-issued on two appropriate ports.
 
 
 
* Two instructions must be right next to each other, with no other instruction in between
 
* First instruction must be one of the following: <code>{{x86|CMP}}</code>, <code>{{x86|TEST}}</code>, <code>{{x86|ADD}}</code>, <code>{{x86|SUB}}</code>, <code>{{x86|INC}}</code>, <code>{{x86|DEC}}</code>, or <code>{{x86|AND}}</code>.
 
* Second instruction must be a conditional jump (e.g., <code>{{x86|JA}}</code>, <code>{{x86|JAE}}</code>, <code>{{x86|JE}}</code>, <code>{{x86|JNE}}</code>)
 
* Fusion cannot take place if the first instruction ends on byte 63 of a cache line and the second instruction starts at byte 0 of the next line.
 
 
 
Additionally, only up to 1 macrofusion can take place each cycle. If there it's possible to perform 2 macrofusions, only the first pair will be fused. The second pair will continue unfused.
 
 
 
{| class="wikitable tc2 tc3 tc4 tc5 tc6 tc7"
 
! colspan="8" | Macro-Fusibility
 
|-
 
! [[Instruction]] !! {{x86|TEST}} !! {{x86|CMP}} !! {{x86|AND}} !! {{x86|ADD}} !! {{x86|SUB}} !! {{x86|INC}} !! {{x86|DEC}}
 
|-
 
| {{x86|JO}}/{{x86|JNO}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘
 
|-
 
| {{x86|JC}}/{{x86|JB}}/{{x86|JAE}}/{{x86|JNB}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘
 
|-
 
| {{x86|JE}}/{{x86|JZ}}/{{x86|JNE}}/{{x86|JNZ}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔  || style="background-color: #d6ffd8;" | ✔
 
|-
 
| {{x86|JNA}}/{{x86|JBE}}/{{x86|JA}}/{{x86|JNBE}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘
 
|-
 
| {{x86|JS}}/{{x86|JNS}}/{{x86|JP}}/{{x86|JPE}}/{{x86|JNP}}/{{x86|JPO}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘
 
|-
 
| {{x86|JL}}/{{x86|JNGE}}/{{x86|JGE}}/{{x86|JNL}}/{{x86|JLE}}/{{x86|JNG}}/{{x86|JG}}/{{x86|JNLE}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔
 
|}
 
 
 
===== Prior limitations =====
 
 
 
====== Nehalem µarch limitations ======
 
In {{intel|Nehalem|l=arch}}, Intel introduced a number of enhancements:
 
 
 
* <code>{{x86|CMP}}</code> can be fused with: <code>{{x86|JL}}</code>, <code>{{x86|JNGE}}</code>, <code>{{x86|JGE}}</code>, <code>{{x86|JNL}}</code>, <code>{{x86|JLE}}</code>, <code>{{x86|JNG}}</code>, <code>{{x86|JG}}</code>, <code>{{x86|JNLE}}</code>
 
* Supported on {{x86|x86-64}} mode
 
 
 
====== Core µarch limitations ======
 
The original implementation in the {{intel|Core|l=arch}} microarchitecture was much more limited than in recent processors.
 
 
 
* First instruction must be one of the following: <code>{{x86|CMP}}</code> and <code>{{x86|TEST}}</code>
 
* Macro Fusion is restricted to {{x86|x86-16|16-bit}} and {{x86|x86-32|32-bit mode}} only (including 32-bit compatibility sub-mode in {{x86|x86-64}}).
 
* <code>{{x86|CMP}}</code> and <code>{{x86|TEST}}</code> can fuse when comparing:
 
** [[register|REG]]-REG. (e.g, CMP EAX,ECX; JZ label)
 
** REG-[[immediate value|IMM]]. (e.g., CMP EAX,0x80; JZ label)
 
** REG-[[memory|MEM]]. (e.g., CMP EAX,[ECX]; JZ label)
 
** MEM-REG. (e.g., CMP [EAX],ECX; JZ label)
 
* <code>{{x86|CMP}}</code> and <code>{{x86|TEST}}</code> can not be fused when comparing MEM-IMM (e.g. CMP [EAX],0x80; JZ label)
 
* <code>{{x86|TEST}}</code> can fused with all conditional jumps
 
* <code>{{x86|CMP}}</code> can only be fused with {{x86|Carry Flag}} ({{x86|CF}}) / {{x86|Zero Flag}} ({{x86|ZF}}) conditional jumps: <code>{{x86|JA}}</code>, <code>{{x86|JNBE}}</code>, <code>{{x86|JAE}}</code>, <code>{{x86|JNB}}</code>, <code>{{x86|JNC}}</code>, <code>{{x86|JE}}</code>, <code>{{x86|JZ}}</code>, <code>{{x86|JNA}}</code>, <code>{{x86|JBE}}</code>, <code>{{x86|JNAE}}</code>, <code>{{x86|JC}}</code>, <code>{{x86|JB}}</code>, <code>{{x86|JNE}}</code>, <code>{{x86|JNZ}}</code>
 
 
 
=== Centaur ===
 
[[Centaur Technology]] also implements macro-op fusion in its architectures, including in its most recent server SoC, {{centtech|CHA|l=arch}}.
 
 
 
== Bibliography ==
 
* Celio, Christopher, et al. "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V." arXiv preprint arXiv:1607.02318 (2016).
 
 
 
{{reflist}}
 
 
 
== See also ==
 
* [[micro-operation fusion]]
 
* [[zeroing idioms]]
 

Please note that all contributions to WikiChip may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see WikiChip:Copyrights for details). Do not submit copyrighted work without permission!

Cancel | Editing help (opens in new window)