'''Merced''' was the first [[Itanium]] microarchitecture designed by [[Intel]]. The architecture avoids expensive out-of-order execution hardware by moving scheduling responsibilities to the compiler.
 
 
== Architecture ==
 
* 10-stage pipeline
 
** IPG: Get next instruction pointer
 
** FET: Fetch from instruction cache
 
** ROT: Instruction rotation, decoupling buffer
 
** EXP: Instruction dispersal
 
** REN: Register remapping
 
** WLD: Word line decode
 
** REG: Register file read
 
** EXE: Execute
 
** DET: Exception detection
 
** WRB: Writeback
 
* Branch Predictor
 
** Early zero-bubble predictor using Target Address Registers controlled by the compiler
 
** Two-level predictor with 4 bits of local history and a 512-entry prediction table
 
** Indirect branches handled with a 64-entry Multiway Branch Prediction Table
 
** 64-entry Target Address Cache
 
** 8-entry Return Address Stack
 
** Branch predictor can resteer at ROT stage using the loop exit predictor, or compiler provided prediction hints for the third slot in a bundle.
 
** Branch predictor can resteer at EXP stage for any branch
 
* 16K 4-way L1 Instruction Cache
 
* Fetch and Decode
 
** Two bundles, each containing three instructions, fetched from the instruction cache every cycle
 
** No decoder necessary
 
* Execution Engine
 
** Instructions from bundles dispersed to issue ports
 
** Scoreboarding resolves dependencies with compiler hints
 
** The FPU can be accessed from the integer side via floating-point get and set instructions.
 
*** Transfer from FPU to integer side takes two clocks
 
*** Transfer from integer side to FPU takes nine clocks
 
* Memory Subsystem
 
** L1D has two cycle latency
 
** Software can issue "advanced loads", which go into an Advanced Load Address Table (ALAT) that checks for conflicting stores. Software needs to check the ALAT before using the load result. If there is a conflict, a software handler has to reissue the load.
 
** The FPU is fed directly by the dual-ported L2 cache, with a 9-cycle load latency.
 
 
 
== Block Diagram ==
 
[[File:merced.png]]
 
 
 
=== Pipeline ===
 
Merced aims for high performance by omitting almost all hardware scheduling. With the compiler handling instruction scheduling, Merced is free to spend power and die area on a very wide core with plenty of execution resources.
 
 
 
==== Frontend ====
 
Merced's frontend is very simple. High performance x86 chips have a relatively complex frontend that must determine variable length instruction boundaries and translate x86 instructions into internal micro-ops. In contrast, Merced fetches fixed length bundles of instructions. These instructions map directly to execution units and do not require decoding into micro-ops.
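
The sketch below (in C, with hypothetical type and function names) illustrates why decoding fixed-length bundles is so cheap: an IA-64 bundle is 128 bits containing a 5-bit template field and three 41-bit instruction slots, so extracting a slot is just fixed shifts and masks rather than the sequential length decoding x86 requires.

<syntaxhighlight lang="c">
#include <stdint.h>

/* An IA-64 bundle is 128 bits: a 5-bit template field followed by three
 * 41-bit instruction slots. Extraction is fixed shifts and masks.
 * The names (ia64_bundle, decode_bundle) are illustrative only. */
typedef struct {
    uint64_t lo;   /* bits   0..63  */
    uint64_t hi;   /* bits  64..127 */
} ia64_bundle;

typedef struct {
    uint8_t  template_id;  /* selects the allowed slot-type combination */
    uint64_t slot[3];      /* three 41-bit instructions */
} decoded_bundle;

static decoded_bundle decode_bundle(ia64_bundle b)
{
    decoded_bundle d;
    d.template_id = (uint8_t)(b.lo & 0x1f);                          /* bits 0..4   */
    d.slot[0] = (b.lo >> 5) & ((1ULL << 41) - 1);                    /* bits 5..45  */
    d.slot[1] = ((b.lo >> 46) | (b.hi << 18)) & ((1ULL << 41) - 1);  /* bits 46..86 */
    d.slot[2] = (b.hi >> 23) & ((1ULL << 41) - 1);                   /* bits 87..127 */
    return d;
}
</syntaxhighlight>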
 
 
 
===== Fetch =====
 
Every cycle, Merced fetches two bundles. Each bundle contains three instructions, meaning Merced's frontend can deliver six instructions per clock. Bundles are then queued into an 8-bundle decoupling buffer. This buffer can help hide fetch latency if there's an instruction cache miss. If the backend stalls, the decoupling buffer can allow fetch to keep running ahead.
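
A minimal model of that 8-bundle decoupling buffer is sketched below as a simple ring queue (names and structure are hypothetical): fetch pushes bundles while the backend pops them, so the buffer can drain during an instruction cache miss or fill up while the backend is stalled.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the 8-bundle decoupling buffer between fetch
 * and dispersal. Fetch pushes up to two bundles per cycle; the backend
 * pops bundles when it is not stalled. */
#define DECOUPLE_DEPTH 8

typedef struct { unsigned long long raw[2]; } bundle_t;  /* placeholder 128-bit bundle */

typedef struct {
    bundle_t entries[DECOUPLE_DEPTH];
    size_t   head, tail, count;
} decouple_buf;

static bool buf_push(decouple_buf *b, bundle_t bun)
{
    if (b->count == DECOUPLE_DEPTH)
        return false;            /* buffer full: fetch must wait */
    b->entries[b->tail] = bun;
    b->tail = (b->tail + 1) % DECOUPLE_DEPTH;
    b->count++;
    return true;
}

static bool buf_pop(decouple_buf *b, bundle_t *out)
{
    if (b->count == 0)
        return false;            /* buffer empty: nothing to disperse */
    *out = b->entries[b->head];
    b->head = (b->head + 1) % DECOUPLE_DEPTH;
    b->count--;
    return true;
}
</syntaxhighlight>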
 
 
 
To avoid instruction cache misses, Merced can prefetch code from L2 into a streaming buffer with eight 32-byte entries. Unlike on x86 chips, this instruction prefetch is controlled by software.
 
 
 
===== Branch Predictor =====
 
Merced features a complex hierarchy of branch predictors to avoid the pipeline flush penalty associated with mispredicts.
 
 
 
The fastest branch prediction method is controlled by software. Four target address registers (TARs) can be programmed by the compiler. If the instruction pointer matches a TAR, Merced uses the specified target address. Because this prediction method happens at the instruction pointer generation stage, it allows taken branches to be handled with zero bubbles. These TARs thus function like a small L0 BTB controlled by compiler hints.
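
The following sketch models that TAR lookup at the IPG stage (structure and function names are hypothetical; in the real design the registers are programmed through compiler-inserted branch prediction hints): if the current instruction pointer matches a programmed entry, fetch is redirected to the hinted target with no bubble.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of the four Target Address Registers (TARs). The
 * compiler programs (branch address, target) pairs ahead of time, and
 * the IPG stage compares the current instruction pointer against them. */
typedef struct {
    bool     valid;
    uint64_t branch_ip;   /* address of the hinted branch */
    uint64_t target_ip;   /* predicted taken target */
} tar_entry;

static tar_entry tar[4];

/* Returns the next fetch address for the IPG stage. */
static uint64_t next_ip(uint64_t ip, uint64_t sequential_ip)
{
    for (int i = 0; i < 4; i++) {
        if (tar[i].valid && tar[i].branch_ip == ip)
            return tar[i].target_ip;   /* hit: follow the taken branch with no bubble */
    }
    return sequential_ip;              /* miss: fall through to the next bundle pair */
}
</syntaxhighlight>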
 
 
 
For dynamic branch prediction, Merced uses a 512-entry branch prediction table with a two-level scheme and 4 bits of local history, somewhat like the mechanism used by the Pentium Pro. Indirect branches are handled with a 64-entry multiway branch prediction table. Instead of a large BTB, Merced uses a relatively small 64-entry target address cache to provide taken branch targets. Software hints can update target address cache entries.
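
One common way to organize such a two-level local-history predictor is sketched below: each of the 512 entries keeps 4 bits of per-branch history that select one of sixteen 2-bit saturating counters. Merced's exact table organization may differ; this only shows the mechanism, and all names are illustrative.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

#define BPT_ENTRIES 512

typedef struct {
    uint8_t history;        /* last 4 outcomes of this branch (1 = taken) */
    uint8_t counters[16];   /* 2-bit saturating counters, one per history pattern */
} bpt_entry;

static bpt_entry bpt[BPT_ENTRIES];

static bool predict(uint64_t ip)
{
    bpt_entry *e = &bpt[(ip >> 4) % BPT_ENTRIES];   /* index by 16-byte bundle address */
    return e->counters[e->history & 0xf] >= 2;      /* counter MSB set = predict taken */
}

static void update(uint64_t ip, bool taken)
{
    bpt_entry *e = &bpt[(ip >> 4) % BPT_ENTRIES];
    uint8_t *c = &e->counters[e->history & 0xf];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
    e->history = (uint8_t)(((e->history << 1) | (taken ? 1 : 0)) & 0xf);
}
</syntaxhighlight>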
 
 
 
Merced can resteer the frontend at the ROT stage, where it has a compiler-assisted loop exit predictor and a branch address calculator. Any branch can trigger a further resteer at the EXP stage.
 
 
 
==== Execution Engine ====
 
Each instruction bundle contains three instructions in certain allowed combinations, specified by the bundle's template. This simplifies the hardware that binds instructions to execution ports.
 
 
 
By avoiding out-of-order execution, Merced fields an impressive number of execution resources. The execution stage is nine ports wide. Four ports are available for common integer instructions; two of these also handle memory operations and transfers to and from the floating point unit. Three ports are available for branch execution, and two more handle floating point instructions.
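
A rough sketch of that dispersal logic is given below. The port grouping follows the description above (four integer/memory ports, two floating point ports, three branch ports); the port names, the fall-back of simple integer operations onto a memory port, and the function names are assumptions made for illustration, not a description of the exact binding rules.

<syntaxhighlight lang="c">
/* Hypothetical dispersal sketch: each slot in a bundle has a type
 * determined by the bundle template, and is bound to the next free
 * port of a matching kind. */
typedef enum { SLOT_INT, SLOT_MEM, SLOT_FP, SLOT_BR } slot_type;

typedef struct {
    int m_free;   /* memory-capable integer ports */
    int i_free;   /* integer-only ports */
    int f_free;   /* floating point ports */
    int b_free;   /* branch ports */
} ports;

/* Returns 1 if the slot could be bound to a port this cycle,
 * 0 if dispersal must stop for this cycle. */
static int disperse(ports *p, slot_type t)
{
    switch (t) {
    case SLOT_MEM: return p->m_free ? (p->m_free--, 1) : 0;
    case SLOT_INT: /* assume simple integer ops may also use a spare memory port */
        if (p->i_free) { p->i_free--; return 1; }
        return p->m_free ? (p->m_free--, 1) : 0;
    case SLOT_FP:  return p->f_free ? (p->f_free--, 1) : 0;
    case SLOT_BR:  return p->b_free ? (p->b_free--, 1) : 0;
    }
    return 0;
}
</syntaxhighlight>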
 
 
 
After instructions are dispersed to ports, Merced executes them in order. To hide cache and memory latency, Merced supports non-blocking loads with a register scoreboard that enforces dependencies. A load therefore only stalls the pipeline when its result is consumed before the data has returned.
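
A minimal sketch of such a scoreboard is shown below: each register has a "pending" bit set when a load to it issues and cleared when the data returns, and later instructions stall only if they consume a still-pending register. The register count reflects IA-64's 128 general registers; the function names are illustrative.

<syntaxhighlight lang="c">
#include <stdbool.h>

#define NUM_REGS 128

static bool pending[NUM_REGS];

static void issue_load(int dest_reg)         { pending[dest_reg] = true;  }
static void load_data_returned(int dest_reg) { pending[dest_reg] = false; }

/* Checked for a consuming instruction before execution. */
static bool must_stall(const int *src_regs, int n_srcs)
{
    for (int i = 0; i < n_srcs; i++)
        if (pending[src_regs[i]])
            return true;    /* use before fill: stall until the load completes */
    return false;
}
</syntaxhighlight>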
 
 
 
To further hide load latency, software can issue speculative loads. Merced queues these into a 32-entry Advanced Load Address Table (ALAT). The ALAT acts like a software-controlled load buffer in which speculative load addresses are compared against store addresses. While Merced performs this address overlap check in hardware, software must handle any conflict and replay the load. Merced thus allows load speculation while minimizing load/store unit complexity.
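
The sketch below models that protocol: an advanced load (IA-64's ld.a) allocates an ALAT entry, every store removes overlapping entries, and the later check (chk.a) succeeds only if the entry survived; otherwise software branches to recovery code and redoes the load. The table layout and indexing here are simplified assumptions, not Merced's actual organization.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

#define ALAT_ENTRIES 32

typedef struct {
    bool     valid;
    int      reg;       /* destination register of the advanced load */
    uint64_t addr;      /* load address */
    uint64_t size;
} alat_entry;

static alat_entry alat[ALAT_ENTRIES];

/* ld.a: allocate an entry for a speculative (advanced) load. */
static void advanced_load(int reg, uint64_t addr, uint64_t size)
{
    alat_entry *e = &alat[reg % ALAT_ENTRIES];      /* simplified indexing */
    e->valid = true; e->reg = reg; e->addr = addr; e->size = size;
}

/* Every store searches the table and drops overlapping entries. */
static void store_snoop(uint64_t addr, uint64_t size)
{
    for (int i = 0; i < ALAT_ENTRIES; i++)
        if (alat[i].valid &&
            addr < alat[i].addr + alat[i].size &&
            alat[i].addr < addr + size)
            alat[i].valid = false;                  /* conflicting store */
}

/* chk.a: true means the speculative value is safe to use; false means
 * software must branch to recovery code and reissue the load. */
static bool check_advanced_load(int reg)
{
    alat_entry *e = &alat[reg % ALAT_ENTRIES];
    return e->valid && e->reg == reg;
}
</syntaxhighlight>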
 
 
 
Finally, Merced can retire six instructions each cycle.
 
 
 
== References ==
 
* H. Sharangpani, "Itanium Processor Microarchitecture," ''IEEE Micro'', Sep. 2000.
 
