Latest revision |
Your text |
Line 1: |
Line 1: |
| {{title|Macro-Operation Fusion (MOP Fusion)}}{{confuse|micro-operation fusion}} | | {{title|Macro-Operation Fusion (MOP Fusion)}}{{confuse|micro-operation fusion}} |
− | '''Macro-Operation Fusion''' (also '''Macro-Op Fusion''', '''MOP Fusion''', or '''Macrofusion''') is a hardware optimization technique found in many modern [[microarchitectures]] whereby a series of adjacent [[macro-operations]] are merged into a single macro-operation prior or during decoding. Those instructions are later decoded into fused-µOPs. | + | '''Macro-Operation Fusion''' ('''Macro-Op Fusion''' or '''MOP Fusion''') is a hardware optimization technique found in [[Intel]]'s [[x86]] [[microarchitectures]] whereby a pair of [[macro-operations]] are merged into a single macro-operation. |
| | | |
− | == Overview & Motivation == | + | == History == |
− | One of the three [[microprocessor performance|performance knobs of a microprocessor]] is the [[instruction count]]. By reducing the number of instructions that must be executed, more work can be done with fewer resources. The idea behind macro-operation fusion is to combine multiple adjacent instructions into a single instruction. A fused instruction typically remains fused throughout its lifetime. Therefore fused instructions can represent more work with fewer bits, free up execution units, tracking information (e.g. in the [[register renaming|rename unit]]), save pipeline bandwidth in all stages from decode to retire, and consequently save power.
| + | The technique for fusing instructions is owned by [[Intel]] under [https://www.google.com/patents/US6675376 Patent US6675376] ("System and method for fusing instructions") originally filed in December [[2000]]. MOP Fusion was first introduced in the {{intel|Core|l=arch}} microarchitecture and has been featured in every Intel microarch since. |
| | | |
− | A unique aspect of macro-op fusion is that it also helps workloads that are not compiled such as in the case of many [[interpreted programming languages]] (e.g. [[PHP]], the software running WikiChip).
| + | == Mechanism == |
| + | After the boundaries of [[macro-ops]] are found and marked, they are delivered to the [[instruction queue]] before being fed to the [[instruction decode|decoders]]. At that stage of the [[pipeline]], macro-operation fusion opportunities can be identified and exploited. |
| | | |
− | == Arm ==
| + | A pair of two [[dependent instructions]] are first compared against a set of criteria. For example, if the second instruction is commutative (i.e., the order of operands does not affect the result) or if the destination of the [[operand]] of the first instruction is used as the source operand of the second instruction than the instruction may qualify for fusion. Additionally either the first source or destination operand must be a [[register]] and the second source operand (if one exists) must be an [[immediate value]] or a non-{{x86|RIP-Relative Addressing|RIP-relative memory}}. Fusion replaces the two instructions with a single instruction representing both operations behaviorally. |
− | Arm supports a number of macro-op fusion operations in their recent microarchitectures.
| |
| | | |
− | * movw + movt
| + | Fusion is done on compare (<code>{{x86|CMP}}</code>) and test (<code>{{x86|TEST}}</code>) instruction with a subsequent conditional [[jump instruction]]. The produced output is a single single compare-and-branch instruction. The final fused instruction remains as such for its remaining lifetime; that is the fused instruction will stay fused throughout the [[pipeline]] and execute on a single port in the [[back-end]] that can handle both operations. |
− | * aese + aesmc
| |
− | * aesd + aesimc
| |
| | | |
− | == RISC-V == | + | == Motivation == |
− | The use of macro-op fusion in RISC-V was proposed in a 2016 Berkeley paper<ref>Celio et al</ref> where a renewed case was made for the use of macro-operation fusion over bloating the ISA with more complex instructions. The paper compared the RISC-V isa performance in terms of instruction count on the popular [[SPEC CPU2006]] benchmark where it is found to be slightly behind contemporary ISAs. In their paper<ref>Celio et al</ref>, it's claimed that the RV64G and RV64GC effective instruction count can be reduced by 5.4% on average by leveraging macro-op fusion, thereby closing much of the deficiency gap. The used of macro-op fusion has gained larger support in the RISC-V community in favor of the microarchitecture taking care of this aspect rather than bloating the ISA with more complex instructions.
| + | A fused instruction remains fused throughout its lifetime. Therefore fused instructions can represent more work with less bits, free up execution units, tracking information (e.g. in the [[register renaming|rename unit]]), save pipeline bandwidth in all stages from decode to retire, and consequently saves power. Note that this is done before decoding, therefore even decoding bandwidth is save. |
| | | |
− | === Proposed fusion operations ===
| + | Conditional branching are a very common operation in almost all workloads. Macro-op fusion also helps workloads that are not compiled such as in the case of many [[interpreted programming languages]] (e.g. [[PHP]], the software running WikiChip). In those programs, conditional branching is seldomly fused as they would by a static [[compiler]]. |
− | Some of the common operations are:
| |
− | | |
− | {| class="wikitable"
| |
− | |-
| |
− | ! Pattern !! Result
| |
− | |-
| |
− | | // rd = array[offset]<br>add rd, rs1, rs2<br>ld rd, 0(rd) || Fused into an indexed load
| |
− | |-
| |
− | | // &(array[offset])<br>slli rd, rs1, {1,2,3}<br>add rd, rd, rs2 || Fused into a load effective address
| |
− | |-
| |
− | | // rd = array[offset]<br>slli rd, rs1, {1,2,3}<br>add rd, rd, rs2<br>ld rd, 0(rd) || Three-instruction fused into a load effective address
| |
− | |-
| |
− | | // rd = rs1 & 0xffffffff<br>slli rd, rs1, 32<br>srli rd, rd, 32 || Clear upper word
| |
− | |-
| |
− | |// rd = imm[31:0]<br>lui rd, imm[31:12]<br>addi rd, rd, imm[11:0] || Load upper immediate
| |
− | |-
| |
− | | // rd = *(imm[31:0])<br>lui rd, imm[31:12]<br>ld rd, imm[11:0](rd) || Load upper immediate
| |
− | |-
| |
− | |// l[dw] rd, symbol[31:0]<br>auipc rd, symbol[31:12]<br>l[dw] rd, symbol[11:0](rd) || Load global immediate
| |
− | |-
| |
− | | // far jump (1 MB) (AUIPC+JALR)<br>auipc t, imm20<br>jalr ra, imm12(t) || Fused far jump and link with calculated target address
| |
− | |-
| |
− | | addiw rd, rs1, imm12<br>slli rd, rs1, 32<br>SRLI rd, rs1, 32 || Fused into a single 32-bit zero extending add operation
| |
− | |-
| |
− | |mulh[[S]U] rdh, rs1, rs2<br>mul rdl, rs1, rs2 || Fused into a wide multiply
| |
− | |-
| |
− | | div[U] rdq, rs1, rs2<br>rem[U] rdr, rs1, rs2 || Fused into a wide divide
| |
− | |-
| |
− | |// ldpair rd1,rd2, [imm(rs1)]<br>ld rd1, imm(rs1)<br>ld rd2, imm+8(rs1) || Fused into a load-pair
| |
− | |-
| |
− | |// ldia rd, imm(rs1)<br>ld rd, imm(rs1)<br>add rs1, rs1, 8 || Fused into a post-indexed load
| |
− | |}
| |
− | | |
− | == x86 ==
| |
− | === Intel ===
| |
− | Intel uses macro-op fusion in all their modern {{intel|microarchitectures}} since {{intel|Core|l=arch}}.
| |
− | ==== History ====
| |
− | The technique for fusing instructions is owned by [[Intel]] and is protected by [https://www.google.com/patents/US6675376 Patent US6675376] ("System and method for fusing instructions") originally filed in December [[2000]]. MOP Fusion was first introduced in [[2006]] in the {{intel|Core|l=arch}} microarchitecture and has been featured in every Intel microarch since.
| |
− | | |
− | ==== Mechanism ====
| |
− | <div style="float: right; text-align: center; margin: 10px;">
| |
− | [[File:core mopf off.png|350px]]
| |
− | | |
− | [[File:core mopf on.png|350px]]
| |
− | | |
− | <small>Slides from Intel's {{intel|Core|l=arch}} microarchitecture presentation.</small>
| |
− | </div>
| |
− | After the boundaries of [[macro-ops]] are found and marked, they are delivered to the [[instruction queue]] before being fed to the [[instruction decode|decoders]]. At that stage of the [[pipeline]], macro-operation fusion opportunities can be identified and exploited. Note that this is done before decoding, therefore even decoding bandwidth is saved.
| |
− | | |
− | Conditional branching are a very common operation in almost all workloads; by Intel estimates it makes up 15% of all instructions. A pair of two [[dependent instructions]] are first compared against a set of criteria. For example, either the first source or destination operand must be a [[register]] and the second source operand (if one exists) must be an [[immediate value]] or a non-{{x86|RIP-Relative Addressing|RIP-relative memory}}. Fusion replaces the two instructions with a single instruction representing both operations behaviorally. | |
− | | |
− | Fusion is done on compare flag-modifying instruction (e.g., <code>{{x86|CMP}}</code> or <code>{{x86|ADD}}</code>) with a subsequent conditional [[jump instruction]]. The produced output is a single operation-and-branch instruction. The final fused instruction remains as such for its remaining lifetime; that is the fused instruction will stay fused throughout the [[pipeline]] until execution units where it may be executed on a single port or dual-issued on two appropriate ports.
| |
− | | |
− | * Two instructions must be right next to each other, with no other instruction in between
| |
− | * First instruction must be one of the following: <code>{{x86|CMP}}</code>, <code>{{x86|TEST}}</code>, <code>{{x86|ADD}}</code>, <code>{{x86|SUB}}</code>, <code>{{x86|INC}}</code>, <code>{{x86|DEC}}</code>, or <code>{{x86|AND}}</code>.
| |
− | * Second instruction must be a conditional jump (e.g., <code>{{x86|JA}}</code>, <code>{{x86|JAE}}</code>, <code>{{x86|JE}}</code>, <code>{{x86|JNE}}</code>)
| |
− | * Fusion cannot take place if the first instruction ends on byte 63 of a cache line and the second instruction starts at byte 0 of the next line.
| |
− | | |
− | Additionally, only up to 1 macrofusion can take place each cycle. If there it's possible to perform 2 macrofusions, only the first pair will be fused. The second pair will continue unfused.
| |
− | | |
− | {| class="wikitable tc2 tc3 tc4 tc5 tc6 tc7"
| |
− | ! colspan="8" | Macro-Fusibility
| |
− | |-
| |
− | ! [[Instruction]] !! {{x86|TEST}} !! {{x86|CMP}} !! {{x86|AND}} !! {{x86|ADD}} !! {{x86|SUB}} !! {{x86|INC}} !! {{x86|DEC}}
| |
− | |-
| |
− | | {{x86|JO}}/{{x86|JNO}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘
| |
− | |-
| |
− | | {{x86|JC}}/{{x86|JB}}/{{x86|JAE}}/{{x86|JNB}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘
| |
− | |-
| |
− | | {{x86|JE}}/{{x86|JZ}}/{{x86|JNE}}/{{x86|JNZ}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔
| |
− | |-
| |
− | | {{x86|JNA}}/{{x86|JBE}}/{{x86|JA}}/{{x86|JNBE}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘
| |
− | |-
| |
− | | {{x86|JS}}/{{x86|JNS}}/{{x86|JP}}/{{x86|JPE}}/{{x86|JNP}}/{{x86|JPO}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #ffdad6;" | ✘
| |
− | |-
| |
− | | {{x86|JL}}/{{x86|JNGE}}/{{x86|JGE}}/{{x86|JNL}}/{{x86|JLE}}/{{x86|JNG}}/{{x86|JG}}/{{x86|JNLE}} || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔
| |
− | |}
| |
− | | |
− | ===== Prior limitations =====
| |
− | | |
− | ====== Nehalem µarch limitations ======
| |
− | In {{intel|Nehalem|l=arch}}, Intel introduced a number of enhancements:
| |
− | | |
− | * <code>{{x86|CMP}}</code> can be fused with: <code>{{x86|JL}}</code>, <code>{{x86|JNGE}}</code>, <code>{{x86|JGE}}</code>, <code>{{x86|JNL}}</code>, <code>{{x86|JLE}}</code>, <code>{{x86|JNG}}</code>, <code>{{x86|JG}}</code>, <code>{{x86|JNLE}}</code>
| |
− | * Supported on {{x86|x86-64}} mode
| |
− | | |
− | ====== Core µarch limitations ======
| |
− | The original implementation in the {{intel|Core|l=arch}} microarchitecture was much more limited than in recent processors.
| |
− | | |
− | * First instruction must be one of the following: <code>{{x86|CMP}}</code> and <code>{{x86|TEST}}</code>
| |
− | * Macro Fusion is restricted to {{x86|x86-16|16-bit}} and {{x86|x86-32|32-bit mode}} only (including 32-bit compatibility sub-mode in {{x86|x86-64}}).
| |
− | * <code>{{x86|CMP}}</code> and <code>{{x86|TEST}}</code> can fuse when comparing:
| |
− | ** [[register|REG]]-REG. (e.g, CMP EAX,ECX; JZ label)
| |
− | ** REG-[[immediate value|IMM]]. (e.g., CMP EAX,0x80; JZ label)
| |
− | ** REG-[[memory|MEM]]. (e.g., CMP EAX,[ECX]; JZ label)
| |
− | ** MEM-REG. (e.g., CMP [EAX],ECX; JZ label)
| |
− | * <code>{{x86|CMP}}</code> and <code>{{x86|TEST}}</code> can not be fused when comparing MEM-IMM (e.g. CMP [EAX],0x80; JZ label)
| |
− | * <code>{{x86|TEST}}</code> can fused with all conditional jumps
| |
− | * <code>{{x86|CMP}}</code> can only be fused with {{x86|Carry Flag}} ({{x86|CF}}) / {{x86|Zero Flag}} ({{x86|ZF}}) conditional jumps: <code>{{x86|JA}}</code>, <code>{{x86|JNBE}}</code>, <code>{{x86|JAE}}</code>, <code>{{x86|JNB}}</code>, <code>{{x86|JNC}}</code>, <code>{{x86|JE}}</code>, <code>{{x86|JZ}}</code>, <code>{{x86|JNA}}</code>, <code>{{x86|JBE}}</code>, <code>{{x86|JNAE}}</code>, <code>{{x86|JC}}</code>, <code>{{x86|JB}}</code>, <code>{{x86|JNE}}</code>, <code>{{x86|JNZ}}</code>
| |
− | | |
− | === Centaur ===
| |
− | [[Centaur Technology]] also implements macro-op fusion in its architectures, including in its most recent server SoC, {{centtech|CHA|l=arch}}. | |
− | | |
− | == Bibliography ==
| |
− | * Celio, Christopher, et al. "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V." arXiv preprint arXiv:1607.02318 (2016).
| |
− | | |
− | {{reflist}}
| |
− | | |
− | == See also ==
| |
− | * [[micro-operation fusion]]
| |
− | * [[zeroing idioms]]
| |