Editing intel/microarchitectures/merced

{{intel title|Merced|arch}}
{{microarchitecture
| atype         = CPU
| name          = Merced
| designer      = Intel
| manufacturer  = Intel
| introduction  = June, 2001
| phase-out     = 
| process       = 180 nm
| cores         = 1
| pipeline      = <!-- yes for following options -->
| type          = <!-- e.g. "Superscalar" -->
| type 2        = 
| type N        = 
| OoOE          = <!-- Yes or No only -->
| speculative   = <!-- Yes or No only -->
| renaming      = <!-- Yes or No only -->
| stages        = <!-- ONLY IF FIXED SIZE, otherwise use below for range -->
| stages min    = 
| stages max    =
| issues        = 

| inst          = <!-- yes for instructions options -->
| isa           = IA-64
| feature       = 
| extension     = 
| extension 2   = 
| extension N   = 

| cache         = <!-- yes for cache info -->
| l1i           =
| l1i per       =
| l1i desc      =
| l1d           = 
| l1d per       = 
| l1d desc      =
| l2            = 
| l2 per        = 
| l2 desc       = 
| l3            = 
| l3 per        = 
| l3 desc       = 

| core names       = <!-- Yes if specify -->
| core name        =
| core name 2      =
| core name N      =

| succession       = Yes
| predecessor      = 
| predecessor link = 
| successor        = McKinley
| successor link   = intel/microarchitectures/mckinley
}}
'''Merced''' was the first [[Itanium]] microarchitecture designed by [[Intel]].

== Architecture ==
* 10 stage pipeline
** IPG: Get next instruction pointer
** FET: Fetch from instruction cache
** ROT: Instruction rotation, decoupling buffer
** EXP: Instruction dispersal
** REN: Register remapping
** WLD: Word line decode
** REG: Register file read
** EXE: Execute
** DET: Exception detection
** WRB: Writeback
* Branch Predictor
** Early zero bubble predictor using Target Address Registers controlled by the compiler
** Two-level predictor with 4 bits of local history and a 512 entry prediction table
** Indirect branches handled with 64 entry Multiway Branch Prediction Table
** 64-entry Target Address Cache
** 8 entry Return Address Stack
** Branch predictor can resteer at ROT stage using the loop exit predictor, or compiler provided prediction hints for the third slot in a bundle. 
** Branch predictor can resteer at EXP stage for any branch
* 16K 4-way L1 Instruction Cache
* Fetch and Decode
** Two bundles, each containing three instructions, fetched from the instruction cache every cycle
** No decoder necessary
* Execution Engine
** Instructions from bundles dispersed to issue ports
** Scoreboarding resolves dependencies with compiler hints
** FPU can be accessed from the integer side by floating point get and set instructions. 
*** Transfer from FPU to integer side takes two clocks
*** Transfer from integer side to FPU takes 9 clocks
* Memory Subsystem
** L1D has two cycle latency
** Software can issue "advanced loads", which go into a Advanced Load Address Table that checks for conflicting stores. Software needs to check the ALAT before using the load result. If there's a conflict, a software handler has to reissue the conflicting load.
** The FPU is directly fed by the dual-ported L2 cache, with 9 cycle load latency. 

[[File:merced.png]]
@@ Line 53: / Line 53: @@
 | successor link   = intel/microarchitectures/mckinley
 }}
-'''Merced''' was the first [[Itanium]] microarchitecture designed by [[Intel]]. The architecture avoids expensive out-of-order execution hardware by moving scheduling responsibilities to the compiler.
+'''Merced''' was the first [[Itanium]] microarchitecture designed by [[Intel]].
 == Architecture ==
@@ Line 90: / Line 90: @@
 ** The FPU is directly fed by the dual-ported L2 cache, with 9 cycle load latency.
-== Block Diagram ==
 [[File:merced.png]]
-=== Pipeline ===
-Merced tries to achieve high performance by ditching almost all hardware scheduling. With the compiler handling instruction scheduling, Merced is free to spend power and die area to create a very wide core with plenty of execution resources.
-==== Frontend ====
-Merced's frontend is very simple. High performance x86 chips have a relatively complex frontend that must determine variable length instruction boundaries and translate x86 instructions into internal micro-ops. In contrast, Merced fetches fixed length bundles of instructions. These instructions map directly to execution units and do not require decoding into micro-ops.
-===== Fetch =====
-Every cycle, Merced fetches two bundles. Each bundle contains three instructions, meaning Merced's frontend can deliver six instructions per clock. Bundles are then queued into an 8-bundle decoupling buffer. This buffer can help hide fetch latency if there's an instruction cache miss. If the backend stalls, the decoupling buffer can allow fetch to keep running ahead.
-To avoid instruction cache misses, Merced can prefetch code from L2 into a streaming buffer with eight 32-byte entries. Unlike x86 chips, instruction prefetch is controlled by software.
-===== Branch Predictor =====
-Merced features a complex hierarchy of branch predictors to avoid the pipeline flush penalty associated with mispredicts.
-The fastest branch prediction method is controlled by software. Four target address registers (TARs) can be programmed by the compiler. If the instruction pointer matches a TAR, Merced uses the specified target address. Because this prediction method happens at the instruction pointer generation stage, it allows taken branches to be handled with zero bubbles. These TARs thus function like a small L0 BTB controlled by compiler hints.
-For dynamic branch prediction, Merced uses a 512 entry branch prediction table. This uses a two-level prediction scheme with 4 bits of local history, somewhat like the mechanism used by Pentium Pro. Indirect branches are handled with a 64 entry multiway branch prediction table. Instead of a large BTB, Merced uses a relatively small 64 entry target address cache to provide taken branch targets. Software hints can update target address cache entries.
-Merced can resteer the frontend at the ROT stage. Here, Merced has a compiler-assisted loop predictor and branch address calculator.
-==== Execution Engine ====
-Each instruction bundle contains three instructions in certain allowed combinations. This simplifies hardware that binds instructions to execution ports.
-By avoiding OOO execution, Merced fields an impressive number of execution resources. The execution stage is 9 ports wide. Four ports are available for common integer instructions. Two of these integer ports handle memory operations and transfers to/from the floating point unit. Three ports are available for branch execution, and two more handle floating point instructions.
-After instructions are dispersed to ports, Merced executes them in-order. To hide cache and memory latency, Merced supports nonblocking loads with a register scoreboard that enforces dependencies. Thus, a load will only stall the pipeline when the result is consumed.
-To further hide load latency, software can issue speculative loads. Merced queues these into a 32-entry Advanced Load Address Table (ALAT). The ALAT acts like a software controlled load buffer, where speculative load address can be compared with store addresses. While Merced does this address overlap check in hardware, software must handle these conflicts and replay the load. Merced thus allows load speculation while minimizing load/store unit complexity.
-Finally, Merced can retire six instructions each cycle.
-== References ==
-* H. Sharangpani, “ITANIUM PROCESSOR MICROARCHITECTURE.” IEEE MICRO, Sep. 2000.
codename	Merced +
core count	1 +
designer	Intel +
first launched	June 2001 +
full page name	intel/microarchitectures/merced +
instance of	microarchitecture +
instruction set architecture	IA-64 +
manufacturer	Intel +
microarchitecture type	CPU +
name	Merced +
process	180 nm (0.18 μm, 1.8e-4 mm) +