| successor link = intel/microarchitectures/mckinley
}}
'''Merced''' was the first [[Itanium]] microarchitecture designed by [[Intel]]. The architecture avoids expensive out-of-order execution hardware by moving scheduling responsibilities to the compiler.
== Architecture ==
** The FPU is directly fed by the dual-ported L2 cache, with 9 cycle load latency.
== Block Diagram ==

[[File:merced.png]]
=== Pipeline ===

Merced tries to achieve high performance by ditching almost all hardware scheduling. With the compiler handling instruction scheduling, Merced is free to spend power and die area on a very wide core with plenty of execution resources.
==== Frontend ====

Merced's frontend is very simple. High performance x86 chips have a relatively complex frontend that must determine variable length instruction boundaries and translate x86 instructions into internal micro-ops. In contrast, Merced fetches fixed length bundles of instructions. These instructions map directly to execution units and do not require decoding into micro-ops.
===== Fetch =====

Every cycle, Merced fetches two bundles. Each bundle contains three instructions, meaning Merced's frontend can deliver six instructions per clock. Bundles are then queued into an 8-bundle decoupling buffer. This buffer can help hide fetch latency if there's an instruction cache miss, and if the backend stalls, it allows fetch to keep running ahead.

To avoid instruction cache misses, Merced can prefetch code from L2 into a streaming buffer with eight 32-byte entries. Unlike x86 chips, instruction prefetch is controlled by software.
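The fetch-side decoupling described above can be sketched as a toy cycle-by-cycle simulation. Only the two-bundles-per-cycle fetch width and the 8-bundle buffer depth come from the text; everything else (function names, stall modeling) is a hypothetical simplification:

```python
# Toy model of a fetch decoupling buffer (illustrative, not from Intel docs).
# Fetch pushes up to two 3-instruction bundles per cycle into an 8-bundle
# queue; the backend drains up to two bundles per cycle but may stall.
# While the backend is stalled, fetch keeps filling the buffer, and the
# buffered bundles can later cover an instruction cache miss.

from collections import deque

BUFFER_BUNDLES = 8      # decoupling buffer capacity
BUNDLES_PER_CYCLE = 2   # fetch and drain width

def simulate(cycles, backend_stalled, icache_miss):
    """backend_stalled / icache_miss are sets of cycle numbers."""
    buf = deque()
    delivered = 0
    for cycle in range(cycles):
        # Frontend: fetch unless the instruction cache missed this cycle.
        if cycle not in icache_miss:
            for _ in range(BUNDLES_PER_CYCLE):
                if len(buf) < BUFFER_BUNDLES:
                    buf.append(cycle)
        # Backend: consume bundles unless stalled.
        if cycle not in backend_stalled:
            for _ in range(BUNDLES_PER_CYCLE):
                if buf:
                    buf.popleft()
                    delivered += 1
    return delivered

# Backend stalls on cycles 2-3 let the buffer fill; those buffered bundles
# then feed the backend through the I-cache miss on cycle 5.
delivered = simulate(10, backend_stalled={2, 3}, icache_miss={5})
```

With the stall and miss patterns above, the backend is never starved: it receives bundles on every non-stalled cycle.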
===== Branch Predictor =====

Merced features a complex hierarchy of branch predictors to avoid the pipeline flush penalty associated with mispredicts.
The fastest branch prediction method is controlled by software. Four target address registers (TARs) can be programmed by the compiler. If the instruction pointer matches a TAR, Merced uses the specified target address. Because this prediction happens at the instruction pointer generation stage, taken branches can be handled with zero bubbles. The TARs thus function like a small L0 BTB controlled by compiler hints.

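A minimal sketch of the TAR idea, with only the register count (four) and the hit-before-fall-through behavior taken from the text; the class, data layout, and 16-byte bundle stride used for the sequential path are illustrative assumptions:

```python
# Hypothetical sketch of TAR-style prediction: four compiler-programmed
# (match IP -> target IP) pairs consulted at instruction pointer generation.

class TargetAddressRegisters:
    MAX_ENTRIES = 4

    def __init__(self):
        self.entries = {}              # match IP -> predicted target

    def program(self, ip, target):
        # Compiler hint installs a pair; only four registers exist.
        if len(self.entries) >= self.MAX_ENTRIES and ip not in self.entries:
            raise ValueError("only four TARs available")
        self.entries[ip] = target

    def next_ip(self, ip, bundle_bytes=16):
        # Zero-bubble redirect: a TAR hit supplies the taken target in the
        # same stage that would otherwise compute the fall-through address.
        return self.entries.get(ip, ip + bundle_bytes)

tars = TargetAddressRegisters()
tars.program(0x1000, 0x2000)             # compiler marks a loop back edge
assert tars.next_ip(0x1000) == 0x2000    # predicted taken, no bubble
assert tars.next_ip(0x1010) == 0x1020    # no hint: sequential fetch
```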
For dynamic branch prediction, Merced uses a 512 entry branch prediction table. This uses a two-level prediction scheme with 4 bits of local history, somewhat like the mechanism used by the Pentium Pro. Indirect branches are handled with a 64 entry multiway branch prediction table. Instead of a large BTB, Merced uses a relatively small 64 entry target address cache to provide taken branch targets. Software hints can update target address cache entries.

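The two-level local-history scheme can be illustrated as follows. The 512-entry table size and 4 history bits come from the text; the 2-bit saturating counters, per-entry pattern tables, and simple PC indexing are conventional choices for this predictor family, not details confirmed for Merced:

```python
# Sketch of a two-level local-history branch predictor: each tracked branch
# keeps its last 4 outcomes, and that 4-bit pattern selects a 2-bit
# saturating counter that supplies the taken/not-taken prediction.

HISTORY_BITS = 4

class LocalHistoryPredictor:
    def __init__(self, entries=512):
        self.entries = entries
        self.history = [0] * entries                   # per-branch 4-bit history
        self.counters = [[2] * (1 << HISTORY_BITS)     # start weakly taken
                         for _ in range(entries)]

    def _index(self, pc):
        return pc % self.entries

    def predict(self, pc):
        i = self._index(pc)
        return self.counters[i][self.history[i]] >= 2  # True = taken

    def update(self, pc, taken):
        i = self._index(pc)
        pat = self.history[i]
        c = self.counters[i][pat]
        self.counters[i][pat] = min(3, c + 1) if taken else max(0, c - 1)
        self.history[i] = ((pat << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)

# A strictly alternating branch defeats a simple last-outcome predictor, but
# once each 4-bit pattern's counter trains, this predictor follows it exactly.
p = LocalHistoryPredictor()
outcomes = [bool(i % 2) for i in range(100)]
hits = 0
for taken in outcomes:
    hits += p.predict(0x40) == taken
    p.update(0x40, taken)
```

After a few warm-up mispredictions, every remaining prediction on the alternating pattern is correct.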
Merced can resteer the frontend at the ROT stage, where it has a compiler-assisted loop predictor and a branch address calculator.

==== Execution Engine ====

Each instruction bundle contains three instructions in certain allowed combinations, which simplifies the hardware that binds instructions to execution ports.
By avoiding out-of-order execution, Merced fields an impressive number of execution resources. The execution stage is 9 ports wide. Four ports are available for common integer instructions; two of these also handle memory operations and transfers to and from the floating point unit. Three ports are available for branch execution, and two more handle floating point instructions.

After instructions are dispersed to ports, Merced executes them in-order. To hide cache and memory latency, Merced supports nonblocking loads with a register scoreboard that enforces dependencies. Thus, a load will only stall the pipeline when its result is consumed.

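The scoreboard behavior reduces to a small amount of bookkeeping, sketched below with invented names: issuing a load marks its destination register busy, and only an instruction that reads a busy register stalls.

```python
# Minimal register scoreboard sketch for nonblocking loads (names invented).

class Scoreboard:
    def __init__(self):
        self.busy = set()            # registers with an outstanding load

    def issue_load(self, dest):
        self.busy.add(dest)          # nonblocking: the pipeline keeps going

    def load_returns(self, dest):
        self.busy.discard(dest)      # data arrived, dependency cleared

    def must_stall(self, srcs):
        # Stall only when an instruction consumes a pending load's result.
        return any(r in self.busy for r in srcs)

sb = Scoreboard()
sb.issue_load("r8")                       # load into r8 misses in cache
assert not sb.must_stall(["r2", "r3"])    # independent work continues
assert sb.must_stall(["r8", "r2"])        # first consumer of r8 stalls
sb.load_returns("r8")
assert not sb.must_stall(["r8"])
```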
To further hide load latency, software can issue speculative loads. Merced records these in a 32-entry Advanced Load Address Table (ALAT). The ALAT acts like a software controlled load buffer, where speculative load addresses can be compared with store addresses. While Merced does this address overlap check in hardware, software must handle conflicts by replaying the load. Merced thus allows load speculation while minimizing load/store unit complexity.

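The ALAT mechanism can be sketched as follows. The 32-entry size and the division of labor (hardware overlap check, software replay) come from the text; the entry format, 8-byte store size, and eviction policy are simplifications:

```python
# Sketch of the ALAT: an advanced load records its address; stores invalidate
# overlapping entries; a later check tells software whether the speculation
# held or whether it must branch to recovery code and replay the load.

class ALAT:
    SIZE = 32

    def __init__(self):
        self.entries = {}               # dest register -> load address

    def advanced_load(self, reg, addr):
        if len(self.entries) >= self.SIZE:
            self.entries.pop(next(iter(self.entries)))   # evict oldest entry
        self.entries[reg] = addr

    def store(self, addr, size=8):
        # Hardware overlap check: drop any speculative load the store may alias.
        self.entries = {r: a for r, a in self.entries.items()
                        if not (addr <= a < addr + size)}

    def check(self, reg):
        # True: speculation held. False: software replays the load.
        return reg in self.entries

alat = ALAT()
alat.advanced_load("r4", 0x1000)    # load hoisted above a possibly-aliasing store
alat.store(0x2000)                  # no overlap: entry survives
assert alat.check("r4")
alat.advanced_load("r5", 0x3000)
alat.store(0x3000)                  # overlap: speculation failed
assert not alat.check("r5")
```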
Finally, Merced can retire six instructions each cycle.
== References ==

* H. Sharangpani, "Itanium Processor Microarchitecture," ''IEEE Micro'', Sep. 2000.