- 10 stage pipeline
- IPG: Get next instruction pointer
- FET: Fetch from instruction cache
- ROT: Instruction rotation, decoupling buffer
- EXP: Instruction dispersal
- REN: Register remapping
- WLD: Word line decode
- REG: Register file read
- EXE: Execute
- DET: Exception detection
- WRB: Writeback
- Branch Predictor
- Early zero bubble predictor using Target Address Registers controlled by the compiler
- Two-level predictor with 4 bits of local history and a 512 entry prediction table
- Indirect branches handled with 64 entry Multiway Branch Prediction Table
- 64-entry Target Address Cache
- 8 entry Return Address Stack
- Branch predictor can resteer at the ROT stage using the loop exit predictor, or compiler-provided prediction hints for the third slot in a bundle
- Branch predictor can resteer at EXP stage for any branch
- 16K 4-way L1 Instruction Cache
- Fetch and Decode
- Two bundles, each containing three instructions, fetched from the instruction cache every cycle
- No decoder necessary
- Execution Engine
- Instructions from bundles dispersed to issue ports
- Scoreboarding resolves dependencies with compiler hints
- FPU can be accessed from the integer side by floating point get and set instructions.
- Transfer from FPU to integer side takes two clocks
- Transfer from integer side to FPU takes 9 clocks
- Memory Subsystem
- L1D has two cycle latency
- Software can issue "advanced loads", which go into an Advanced Load Address Table that checks for conflicting stores. Software must check the ALAT before using the load result. If there's a conflict, a software handler has to re-execute the load.
- The FPU is directly fed by the dual-ported L2 cache, with 9 cycle load latency.
Merced tries to achieve high performance by ditching almost all hardware scheduling. With the compiler handling instruction scheduling, Merced is free to spend power and die area to create a very wide core with plenty of execution resources.
Merced's frontend is very simple. High performance x86 chips have a relatively complex frontend that must determine variable-length instruction boundaries and translate x86 instructions into internal micro-ops. In contrast, Merced fetches fixed-length bundles of instructions. These instructions map directly to execution units and do not require decoding into micro-ops.
Every cycle, Merced fetches two bundles. Each bundle contains three instructions, meaning Merced's frontend can deliver six instructions per clock. Bundles are then queued into an 8-bundle decoupling buffer. This buffer can help hide fetch latency if there's an instruction cache miss. If the backend stalls, the decoupling buffer can allow fetch to keep running ahead.
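The fetch/backend decoupling described above can be sketched with a small queue model. This is a hypothetical illustration, not Merced's actual control logic; the consumption rate and stall pattern are assumptions.

```python
from collections import deque

# Sketch: an 8-bundle decoupling buffer between fetch and the backend.
# Fetch pushes up to two bundles per cycle; when the backend stalls,
# fetch keeps running ahead until the buffer fills.
BUFFER_DEPTH = 8
FETCH_PER_CYCLE = 2

def simulate(cycles, backend_stalled):
    """backend_stalled: function cycle -> bool. Returns bundles consumed."""
    buffer = deque()
    consumed = 0
    for cycle in range(cycles):
        # Fetch side: push up to two bundles unless the buffer is full.
        for _ in range(FETCH_PER_CYCLE):
            if len(buffer) < BUFFER_DEPTH:
                buffer.append(("bundle", cycle))
        # Backend side: consume up to two bundles when not stalled.
        if not backend_stalled(cycle):
            for _ in range(2):
                if buffer:
                    buffer.popleft()
                    consumed += 1
    return consumed

# With the backend stalled on odd cycles, fetch keeps filling the buffer,
# so the backend can drain it at full rate on the cycles it does run.
print(simulate(10, lambda c: c % 2 == 1))
```

The buffer absorbs short backend stalls without starving fetch, which is exactly why placing it between ROT and EXP helps hide instruction cache miss latency.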
To avoid instruction cache misses, Merced can prefetch code from L2 into a streaming buffer with eight 32-byte entries. Unlike x86 chips, instruction prefetch is controlled by software.
Merced features a complex hierarchy of branch predictors to avoid the pipeline flush penalty associated with mispredicts.
The fastest branch prediction method is controlled by software. Four target address registers (TARs) can be programmed by the compiler. If the instruction pointer matches a TAR, Merced uses the specified target address. Because this prediction method happens at the instruction pointer generation stage, it allows taken branches to be handled with zero bubbles. These TARs thus function like a small L0 BTB controlled by compiler hints.
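The TAR mechanism above can be sketched as a tiny compiler-managed lookup at instruction pointer generation. The entry format, replacement policy, and bundle size arithmetic here are illustrative assumptions, not Merced's actual design.

```python
# Hypothetical sketch of Target Address Register (TAR) matching at the
# IPG stage. The compiler programs up to four (branch address, target)
# pairs; if the current instruction pointer hits one, the next fetch
# address is the stored target with zero bubbles.
class TargetAddressRegisters:
    def __init__(self, size=4):
        self.size = size
        self.entries = []  # list of (branch_ip, target_ip)

    def program(self, branch_ip, target_ip):
        # Compiler hint: install a branch/target pair, evicting the
        # oldest entry when all four registers are in use (assumption).
        if len(self.entries) == self.size:
            self.entries.pop(0)
        self.entries.append((branch_ip, target_ip))

    def next_ip(self, ip, bundle_bytes=16):
        # On a TAR hit, redirect to the stored target; otherwise fall
        # through to the sequentially next bundle.
        for branch_ip, target_ip in self.entries:
            if branch_ip == ip:
                return target_ip
        return ip + bundle_bytes

tars = TargetAddressRegisters()
tars.program(0x1000, 0x2000)      # e.g. compiler marks a loop back-edge
print(hex(tars.next_ip(0x1000)))  # hit: redirect to 0x2000
print(hex(tars.next_ip(0x1010)))  # miss: sequential fetch
```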
For dynamic branch prediction, Merced uses a 512-entry branch prediction table. This uses a two-level prediction scheme with 4 bits of local history, somewhat like the mechanism used by the Pentium Pro. Indirect branches are handled with a 64-entry multiway branch prediction table. Instead of a large BTB, Merced uses a relatively small 64-entry target address cache to provide taken branch targets. Software hints can update target address cache entries.
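A two-level predictor with local history can be sketched as follows. The index hash, counter width, and initialization are illustrative assumptions; only the 512-entry capacity and 4-bit history come from the description above.

```python
# Hypothetical sketch of a two-level predictor with 4 bits of local
# history, in the spirit of Merced's 512-entry branch prediction table.
class TwoLevelPredictor:
    def __init__(self, entries=512, history_bits=4):
        self.entries = entries
        self.history_bits = history_bits
        # First level: per-branch local history registers.
        self.history = [0] * entries
        # Second level: 2-bit saturating counters, one set of
        # 2^history_bits counters per entry (init weakly taken).
        self.counters = [[2] * (1 << history_bits) for _ in range(entries)]

    def _index(self, ip):
        return ip % self.entries  # simplistic index hash (assumption)

    def predict(self, ip):
        i = self._index(ip)
        return self.counters[i][self.history[i]] >= 2  # True = taken

    def update(self, ip, taken):
        i = self._index(ip)
        h = self.history[i]
        c = self.counters[i][h]
        self.counters[i][h] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the local history register.
        mask = (1 << self.history_bits) - 1
        self.history[i] = ((h << 1) | int(taken)) & mask

p = TwoLevelPredictor()
# Train on an alternating taken/not-taken pattern; 4 bits of local
# history are enough to learn this period-2 behavior perfectly.
for n in range(100):
    p.update(0x4000, n % 2 == 0)
print(p.predict(0x4000))  # prints True: n=100 is even, i.e. taken
```

The point of local history is visible here: a simple bimodal counter would thrash on an alternating branch, while the pattern table keyed by history learns it exactly.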
Merced can resteer the frontend at the ROT stage. Here, Merced has a compiler-assisted loop predictor and branch address calculator.
Each instruction bundle contains three instructions in certain allowed combinations. This simplifies hardware that binds instructions to execution ports.
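The "allowed combinations" are IA-64 bundle templates, and the hardware simplification amounts to a table lookup. The template subset and unit naming below are simplified assumptions for illustration.

```python
# Hypothetical sketch of template-based dispersal. Each 128-bit IA-64
# bundle carries a template field naming an allowed slot combination
# (e.g. MII = memory, integer, integer), so hardware can route slots
# to unit types with a lookup instead of per-instruction analysis.
TEMPLATES = {
    "MII": ("M", "I", "I"),   # memory op plus two integer ops
    "MMI": ("M", "M", "I"),
    "MFI": ("M", "F", "I"),
    "MIB": ("M", "I", "B"),
    "MMF": ("M", "M", "F"),
    "BBB": ("B", "B", "B"),   # three branches
}

def disperse(template, slots):
    """Pair each instruction slot with the unit type its template
    position requires. Raises KeyError on an unknown template."""
    unit_types = TEMPLATES[template]
    return list(zip(slots, unit_types))

print(disperse("MII", ["ld8", "add", "cmp"]))
# prints [('ld8', 'M'), ('add', 'I'), ('cmp', 'I')]
```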
By avoiding OOO execution, Merced fields an impressive number of execution resources. The execution stage is 9 ports wide. Four ports are available for common integer instructions. Two of these integer ports handle memory operations and transfers to/from the floating point unit. Three ports are available for branch execution, and two more handle floating point instructions.
After instructions are dispersed to ports, Merced executes them in-order. To hide cache and memory latency, Merced supports nonblocking loads with a register scoreboard that enforces dependencies. Thus, a load will only stall the pipeline when the result is consumed.
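The stall-on-use behavior of a register scoreboard can be sketched with a ready-cycle table. Timing values are illustrative; only the principle (loads don't block until their result is consumed) comes from the description above.

```python
# Hypothetical sketch of a register scoreboard for nonblocking loads:
# a load marks its destination register busy until its data arrives,
# and the pipeline only stalls when an instruction tries to read a
# register that is still busy.
class Scoreboard:
    def __init__(self):
        self.ready_at = {}  # register -> cycle its value becomes ready

    def issue_load(self, dest_reg, cycle, latency=2):
        # Nonblocking: record when the data arrives and keep issuing.
        self.ready_at[dest_reg] = cycle + latency

    def stall_cycles(self, src_reg, cycle):
        # Stall only on use: a consumer waits out any remaining latency.
        return max(0, self.ready_at.get(src_reg, 0) - cycle)

sb = Scoreboard()
sb.issue_load("r4", cycle=0, latency=2)   # L1D hit: two cycle latency
print(sb.stall_cycles("r4", cycle=1))     # consumed too early: 1 cycle stall
print(sb.stall_cycles("r4", cycle=3))     # consumed later: no stall
print(sb.stall_cycles("r5", cycle=1))     # independent register: no stall
```

This is why compiler scheduling matters so much on Merced: hoisting the load far enough above its consumer makes the scoreboard stall disappear entirely.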
To further hide load latency, software can issue speculative loads. Merced queues these into a 32-entry Advanced Load Address Table (ALAT). The ALAT acts like a software-controlled load buffer, where speculative load addresses are compared against store addresses. While Merced does this address overlap check in hardware, software must check for conflicts and replay the load when one occurs. Merced thus allows load speculation while minimizing load/store unit complexity.
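The allocate/invalidate/check lifecycle of an ALAT entry can be sketched as follows. The entry format, eviction policy, and exact-address conflict check are simplified assumptions; real hardware checks for partial overlaps too.

```python
# Hypothetical sketch of ALAT behavior: an advanced load allocates an
# entry; an intervening store to a conflicting address invalidates it;
# the software check then decides whether the speculative result can
# be used or a handler must re-execute the load.
class ALAT:
    def __init__(self, size=32):
        self.size = size
        self.entries = {}  # destination register -> load address

    def advanced_load(self, reg, addr):
        if len(self.entries) >= self.size:
            self.entries.pop(next(iter(self.entries)))  # evict oldest
        self.entries[reg] = addr

    def store(self, addr):
        # Hardware side: drop any entry whose address conflicts.
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, reg):
        # Software side: True means the speculative result is safe;
        # False means the load must be replayed.
        return reg in self.entries

alat = ALAT()
alat.advanced_load("r7", 0x100)  # load hoisted above a store
alat.store(0x200)                # store to a different address
print(alat.check("r7"))          # prints True: no conflict, result usable
alat.store(0x100)                # conflicting store
print(alat.check("r7"))          # prints False: entry invalidated, replay
```

The division of labor is the interesting part: hardware does only the cheap associative address check, while the expensive recovery path lives in software.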
Finally, Merced can retire six instructions each cycle.
- H. Sharangpani, "Itanium Processor Microarchitecture," IEEE Micro, Sep. 2000.
| Core count | 1 |
| First launched | June 2001 |
| Instruction set architecture | IA-64 |
| Microarchitecture type | CPU |
| Process | 180 nm (0.18 μm) |