Macro-Operation Fusion (MOP Fusion)

Not to be confused with micro-operation fusion.

Macro-Operation Fusion (also Macro-Op Fusion, MOP Fusion, or Macrofusion) is a hardware optimization technique found in many modern microarchitectures whereby a series of adjacent macro-operations is merged into a single macro-operation prior to or during decoding. Those instructions are later decoded into fused-µOPs.

Overview & Motivation

One of the three performance knobs of a microprocessor is the instruction count. By reducing the number of instructions that must be executed, more work can be done with fewer resources. The idea behind macro-operation fusion is to combine multiple adjacent instructions into a single instruction. A fused instruction typically remains fused throughout its lifetime. Fused instructions can therefore represent more work with fewer bits, free up execution units and tracking resources (e.g., in the rename unit), save pipeline bandwidth in all stages from decode to retire, and consequently save power.

A unique aspect of macro-op fusion is that it also helps workloads that are not compiled, as is the case for many interpreted programming languages (e.g., PHP, the software running WikiChip).

Arm

Arm supports a number of macro-op fusion operations in their recent microarchitectures.

movw + movt
aese + aesmc
aesd + aesimc

RISC-V

The use of macro-op fusion in RISC-V was proposed in a 2016 Berkeley paper^[1], which made a renewed case for macro-operation fusion over bloating the ISA with more complex instructions. The paper compared the RISC-V ISA in terms of instruction count on the popular SPEC CPU2006 benchmark, where it was found to be slightly behind contemporary ISAs. The paper^[2] claims that the effective instruction count of RV64G and RV64GC can be reduced by 5.4% on average by leveraging macro-op fusion, thereby closing much of the gap. The use of macro-op fusion has since gained wider support in the RISC-V community, which favors having the microarchitecture take care of this aspect rather than bloating the ISA with more complex instructions.

Proposed fusion operations

Some of the common operations are:

Pattern	Result
add rd, rs1, rs2; ld rd, 0(rd)  // rd = array[offset]	Fused into an indexed load
slli rd, rs1, {1,2,3}; add rd, rd, rs2  // &(array[offset])	Fused into a load effective address
slli rd, rs1, {1,2,3}; add rd, rd, rs2; ld rd, 0(rd)  // rd = array[offset]	Three-instruction sequence fused into a scaled indexed load
slli rd, rs1, 32; srli rd, rd, 32  // rd = rs1 & 0xffffffff	Clear upper word
lui rd, imm[31:12]; addi rd, rd, imm[11:0]  // rd = imm[31:0]	Load upper immediate
lui rd, imm[31:12]; ld rd, imm[11:0](rd)  // rd = *(imm[31:0])	Load upper immediate
auipc rd, symbol[31:12]; l[dw] rd, symbol[11:0](rd)  // l[dw] rd, symbol[31:0]	Load global immediate
auipc t, imm20; jalr ra, imm12(t)  // far jump (> 1 MB) (AUIPC+JALR)	Fused far jump and link with calculated target address
addiw rd, rs1, imm12; slli rd, rd, 32; srli rd, rd, 32	Fused into a single 32-bit zero-extending add operation
mulh[[S]U] rdh, rs1, rs2; mul rdl, rs1, rs2	Fused into a wide multiply
div[U] rdq, rs1, rs2; rem[U] rdr, rs1, rs2	Fused into a wide divide
ld rd1, imm(rs1); ld rd2, imm+8(rs1)  // ldpair rd1,rd2, [imm(rs1)]	Fused into a load-pair
ld rd, imm(rs1); addi rs1, rs1, 8  // ldia rd, imm(rs1)	Fused into a post-indexed load
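
To make the pairing conditions in the table more concrete, the following is a minimal Python sketch of a pairwise fusion pass over a decoded instruction stream. It is an illustration only and not how any particular RISC-V core implements fusion. The tuple encoding, the function names `try_fuse` and `fuse_stream`, and the `fused.*` pseudo-op names are all hypothetical, and only three of the two-instruction patterns are covered.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding for this sketch: (mnemonic, rd, operand1, operand2).
# A load "ld rd, off(base)" is written as ("ld", rd, off, base).
Instr = Tuple[str, ...]


def try_fuse(a: Instr, b: Instr) -> Optional[Instr]:
    """Return a fused pseudo-op for the adjacent pair (a, b), or None."""
    if len(a) != 4 or len(b) != 4:
        return None
    op_a, rd_a, x_a, y_a = a
    op_b, rd_b, x_b, y_b = b
    # The second instruction must overwrite the first one's destination,
    # so the intermediate value is never architecturally visible.
    if rd_a != rd_b:
        return None
    # add rd, rs1, rs2 ; ld rd, 0(rd)  ->  indexed load
    if op_a == "add" and op_b == "ld" and x_b == "0" and y_b == rd_a:
        return ("fused.ldx", rd_a, x_a, y_a)
    # slli rd, rs1, {1,2,3} ; add rd, rd, rs2  ->  load effective address
    # (rs2 must not be rd, otherwise the add reads the shifted value twice)
    if (op_a == "slli" and op_b == "add" and y_a in ("1", "2", "3")
            and x_b == rd_a and y_b != rd_a):
        return ("fused.lea", rd_a, x_a, y_a, y_b)
    # slli rd, rs1, 32 ; srli rd, rd, 32  ->  clear upper word
    if (op_a == "slli" and op_b == "srli" and y_a == "32" and y_b == "32"
            and x_b == rd_a):
        return ("fused.zext32", rd_a, x_a)
    return None


def fuse_stream(instrs: List[Instr]) -> List[Instr]:
    """Greedily fuse adjacent pairs, scanning in program order."""
    out: List[Instr] = []
    i = 0
    while i < len(instrs):
        fused = try_fuse(instrs[i], instrs[i + 1]) if i + 1 < len(instrs) else None
        if fused is not None:
            out.append(fused)
            i += 2  # both instructions are consumed by the fused op
        else:
            out.append(instrs[i])
            i += 1
    return out
```

For example, `fuse_stream([("slli", "a0", "a1", "3"), ("add", "a0", "a0", "a2")])` yields a single `fused.lea` entry, while the same pair with a different destination register in the `add` is left unfused.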

x86

Intel

Intel has used macro-op fusion in all of its microarchitectures since Core.

History

The technique for fusing instructions is owned by Intel and is protected by Patent US6675376 ("System and method for fusing instructions"), originally filed in December 2000. MOP Fusion was first introduced in 2006 in the Core microarchitecture and has been featured in every Intel microarchitecture since.

Mechanism

Slides from Intel's Core microarchitecture presentation.

After the boundaries of macro-ops are found and marked, they are delivered to the instruction queue before being fed to the decoders. At that stage of the pipeline, macro-operation fusion opportunities can be identified and exploited. Note that this is done before decoding, therefore even decoding bandwidth is saved.

Conditional branches are a very common operation in almost all workloads; by Intel's estimate they make up 15% of all instructions. A pair of dependent instructions is first checked against a set of criteria. For example, either the first source or destination operand must be a register, and the second source operand (if one exists) must be an immediate value or a non-RIP-relative memory operand. Fusion replaces the two instructions with a single instruction representing both operations behaviorally.

Fusion is performed on a compare or other flag-modifying instruction (e.g., CMP or ADD) followed by a conditional jump instruction. The output is a single operation-and-branch instruction. The fused instruction remains fused for the rest of its lifetime; that is, it stays fused throughout the pipeline until it reaches the execution units, where it may be executed on a single port or dual-issued on two appropriate ports.

The two instructions must be adjacent, with no other instruction in between.
The first instruction must be one of the following: CMP, TEST, ADD, SUB, INC, DEC, or AND.
The second instruction must be a conditional jump (e.g., JA, JAE, JE, JNE).
Fusion cannot take place if the first instruction ends on byte 63 of a cache line and the second instruction starts at byte 0 of the next line.

Additionally, at most one macrofusion can take place each cycle. If two macrofusions are possible in the same cycle, only the first pair is fused; the second pair continues unfused.
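
The structural rules above (adjacency, the cache-line boundary restriction, and the one-fusion-per-cycle limit) can be sketched as a small Python model. This is a simplified illustration under stated assumptions, not Intel's decoder logic: the `Insn` record, the function names, and the idea of passing one cycle's worth of instructions as a list are hypothetical, and any mnemonic starting with `J` other than `JMP` is treated as a conditional jump without consulting per-condition restrictions.

```python
from dataclasses import dataclass
from typing import List, Tuple

CACHE_LINE = 64  # bytes per cache line, as implied by "byte 63" in the text
FIRST_OPS = {"CMP", "TEST", "ADD", "SUB", "INC", "DEC", "AND"}


@dataclass
class Insn:
    mnemonic: str
    addr: int    # address of the first byte
    length: int  # encoded length in bytes


def is_cond_jump(mnemonic: str) -> bool:
    # Simplification for this sketch: any J* mnemonic except JMP.
    return mnemonic.startswith("J") and mnemonic != "JMP"


def splits_cache_line(first: Insn, second: Insn) -> bool:
    """True if `first` ends on byte 63 of a line and `second` starts at byte 0."""
    last_byte = first.addr + first.length - 1
    return (last_byte % CACHE_LINE == CACHE_LINE - 1
            and second.addr % CACHE_LINE == 0)


def fuse_in_cycle(group: List[Insn]) -> List[Tuple[int, int]]:
    """Return the index pairs fused in one cycle (at most one pair)."""
    for i in range(len(group) - 1):
        a, b = group[i], group[i + 1]
        adjacent = a.addr + a.length == b.addr
        if (a.mnemonic in FIRST_OPS and is_cond_jump(b.mnemonic)
                and adjacent and not splits_cache_line(a, b)):
            return [(i, i + 1)]  # only the first eligible pair is fused
    return []
```

Given a group containing two eligible pairs, such as CMP/JNE followed by TEST/JE, `fuse_in_cycle` reports only the first pair, which mirrors the one-per-cycle limit described above.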

Macro-Fusibility
Instruction	TEST	CMP	AND	ADD	SUB	INC	DEC
JO/JNO	✔	✘	✔	✘	✘	✘	✘
JC/JB/JAE/JNB	✔	✔	✔	✔	✔	✘	✘
JE/JZ/JNE/JNZ	✔	✔	✔	✔	✔	✔	✔
JNA/JBE/JA/JNBE	✔	✔	✔	✔	✔	✘	✘
JS/JNS/JP/JPE/JNP/JPO	✔	✘	✔	✘	✘	✘	✘
JL/JNGE/JGE/JNL/JLE/JNG/JG/JNLE	✔	✔	✔	✔	✔	✔	✔
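
The table above maps naturally onto a lookup structure. The Python sketch below transcribes it row by row; the names `FIRST`, `ROWS`, `FUSIBLE`, and `can_fuse` are hypothetical and exist only for this illustration. Only the mnemonics that appear in the table are recognized, so aliases not listed there (for example JNC) are reported as not fusible by this sketch.

```python
# Column order of the macro-fusibility table.
FIRST = ("TEST", "CMP", "AND", "ADD", "SUB", "INC", "DEC")

# One entry per table row: the jump mnemonics in that row, and one
# Y/N flag per column of FIRST (Y = fusible, N = not fusible).
ROWS = {
    ("JO", "JNO"): "YNYNNNN",
    ("JC", "JB", "JAE", "JNB"): "YYYYYNN",
    ("JE", "JZ", "JNE", "JNZ"): "YYYYYYY",
    ("JNA", "JBE", "JA", "JNBE"): "YYYYYNN",
    ("JS", "JNS", "JP", "JPE", "JNP", "JPO"): "YNYNNNN",
    ("JL", "JNGE", "JGE", "JNL", "JLE", "JNG", "JG", "JNLE"): "YYYYYYY",
}

# Flatten the rows into a set of (first instruction, jump) pairs.
FUSIBLE = {
    (first, jcc)
    for jccs, flags in ROWS.items()
    for jcc in jccs
    for first, flag in zip(FIRST, flags)
    if flag == "Y"
}


def can_fuse(first: str, jcc: str) -> bool:
    """Look up whether `first` followed by `jcc` is marked fusible in the table."""
    return (first.upper(), jcc.upper()) in FUSIBLE
```

For instance, `can_fuse("CMP", "JE")` is true, while `can_fuse("INC", "JC")` is false because INC and DEC are not marked fusible with the carry-flag jumps.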

Prior limitations

Nehalem µarch limitations

In Nehalem, Intel introduced a number of enhancements:

CMP can be fused with: JL, JNGE, JGE, JNL, JLE, JNG, JG, JNLE
Macro-op fusion is supported in x86-64 mode

Core µarch limitations

The original implementation in the Core microarchitecture was much more limited than in recent processors.

First instruction must be one of the following: CMP or TEST
Macro Fusion is restricted to 16-bit and 32-bit mode only (including 32-bit compatibility sub-mode in x86-64).
CMP and TEST can fuse when comparing:
- REG-REG. (e.g., CMP EAX,ECX; JZ label)
- REG-IMM. (e.g., CMP EAX,0x80; JZ label)
- REG-MEM. (e.g., CMP EAX,[ECX]; JZ label)
- MEM-REG. (e.g., CMP [EAX],ECX; JZ label)
CMP and TEST cannot be fused when comparing MEM-IMM (e.g., CMP [EAX],0x80; JZ label)
TEST can be fused with all conditional jumps
CMP can only be fused with Carry Flag (CF) / Zero Flag (ZF) conditional jumps: JA, JNBE, JAE, JNB, JNC, JE, JZ, JNA, JBE, JNAE, JC, JB, JNE, JNZ
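
The Core-era restrictions listed above can be condensed into a single predicate. The following Python sketch is an illustration only; the function name `core_can_fuse`, its parameters, and the `"REG"`/`"MEM"`/`"IMM"` operand-kind strings are hypothetical, and the caller is assumed to pass a conditional jump mnemonic as `jcc`.

```python
# Conditional jumps that depend only on the Carry Flag and/or Zero Flag,
# taken from the list in the text.
CF_ZF_JUMPS = {
    "JA", "JNBE", "JAE", "JNB", "JNC", "JE", "JZ",
    "JNA", "JBE", "JNAE", "JC", "JB", "JNE", "JNZ",
}

# Operand combinations (first operand, second operand) that may fuse.
# MEM-IMM is deliberately absent.
ALLOWED_OPERANDS = {
    ("REG", "REG"), ("REG", "IMM"), ("REG", "MEM"), ("MEM", "REG"),
}


def core_can_fuse(first: str, op1: str, op2: str, jcc: str, mode_bits: int) -> bool:
    """Apply the Core microarchitecture restrictions described above.

    first:     mnemonic of the first instruction
    op1, op2:  operand kinds, each "REG", "MEM", or "IMM"
    jcc:       mnemonic of the conditional jump (assumed to be a Jcc)
    mode_bits: 16, 32, or 64 (operating mode)
    """
    if mode_bits not in (16, 32):
        return False  # no fusion in 64-bit mode on Core
    if first not in ("CMP", "TEST"):
        return False
    if (op1, op2) not in ALLOWED_OPERANDS:
        return False
    if first == "CMP" and jcc not in CF_ZF_JUMPS:
        return False  # CMP fuses only with CF/ZF jumps; TEST with any Jcc
    return True
```

Under this model, `core_can_fuse("CMP", "REG", "IMM", "JZ", 32)` is true, whereas the MEM-IMM form, a signed jump such as JL after CMP, or any pair in 64-bit mode is rejected.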

Centaur

Centaur Technology also implements macro-op fusion in its architectures, including in its most recent server SoC, CHA.

Bibliography

Celio, Christopher, et al. "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V." arXiv preprint arXiv:1607.02318 (2016).

[1] Celio et al.
[2] Celio et al.
