From WikiChip
Au1 - Microarchitectures - Alchemy

Au1 µarch
General Info
  Arch Type: CPU
  Designer: Alchemy
  Manufacturer: TSMC
  Introduction: June 13, 2000
  Process: 180 nm, 130 nm
  Core Configs: 1
Pipeline
  OoOE: No
  Speculative: No
  Reg Renaming: No
  Stages: 5
  Decode: 1-way
Instructions
  ISA: MIPS32

Au1 is a microarchitecture developed by Alchemy Semiconductor for its Alchemy family of high-performance, ultra-low-power embedded microprocessors. Details about Au1 were disclosed at the Embedded Processor Forum in San Jose, CA, on June 13, 2000.

The first processor using an Au1 CPU core, the Alchemy Au1000 SoC, is rated for core frequencies up to 500 MHz. At 400 MHz it operates at 1.5 V and consumes no more than 500 mW, with a performance of over 900 Dhrystone 2.1 MIPS/Watt according to the company. Au1000 and Au1500 processors were fabricated on a TSMC 180 nm LV logic 1.5 V/3.3 V 1P6M process; the Au1100 reduced power consumption further by moving to a TSMC 130 nm process. Manufacturing details of later models are unknown.

AMD acquired Alchemy Semiconductor in 2002 and transferred its Alchemy assets to RMI in 2006; both companies expanded the Alchemy processor family. RMI announced the last processors with an Au1 core in early 2009.

Architecture

Au1 is a scalar, in-order RISC microarchitecture with a 5-stage pipeline implementing the MIPS32 ISA, Release 1.

It supports the MIPS operating modes User Mode, Kernel Mode, and Debug Mode. The optional Supervisor Mode is not implemented. Virtual address translation is TLB-based.

Au1 implements Coprocessor 0 (PRA) as required by the MIPS standard, but no CP1 (FPU), CP2, or CP3 instructions. All floating-point instructions raise the Reserved Instruction exception and can therefore be emulated in software. Code compression (MIPS16) is not implemented, nor are performance counters. The core does, however, support the MIPS EJTAG interface with hardware breakpoints and watchpoints.

The WAIT instruction can place the core in one of two low-power modes. In IDLE0 mode the data cache continues to snoop the internal System Bus to maintain data coherency. In IDLE1 mode clocks to all core units are stopped. The GPRs and CP0 registers are preserved in both modes, but the CP0 Count register increments at an unpredictable rate (an RTC that is unaffected by these modes is integrated as a peripheral on all processors with the Au1 core).

Au1 implements interrupts and exceptions compliant with the MIPS specification. The core supports eight interrupt sources with prioritization by software. It does not recognize Soft Reset, Non-Maskable Interrupt, or Cache Error exception conditions.
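
Because prioritization is left to software, an interrupt handler has to choose among the pending sources itself. The Python sketch below models that choice. It assumes the standard MIPS32 register layout, in which the pending bits (Cause.IP) and the enable mask (Status.IM) occupy bits 15:8; the priority order and the function names are hypothetical and not taken from the Au1 documentation.

```python
# Model of software interrupt prioritization on a MIPS32-style core.
# Assumption: Cause.IP[7:0] and Status.IM[7:0] sit in bits 15:8.
IP_SHIFT = 8
IP_MASK = 0xFF

# Hypothetical priority order chosen by software, highest priority first.
PRIORITY_ORDER = (7, 6, 5, 4, 3, 2, 1, 0)


def pending_sources(cause: int, status: int) -> int:
    """Return the 8-bit set of sources that are both pending and enabled."""
    return ((cause & status) >> IP_SHIFT) & IP_MASK


def select_interrupt(cause: int, status: int, order=PRIORITY_ORDER):
    """Pick the first pending, enabled source in the software-defined order."""
    pending = pending_sources(cause, status)
    for source in order:
        if pending & (1 << source):
            return source
    return None  # nothing to service


if __name__ == "__main__":
    cause = 0b00100100 << IP_SHIFT    # sources 2 and 5 pending
    status = 0b11111011 << IP_SHIFT   # source 2 masked
    print(select_interrupt(cause, status))
```

Changing `order` changes the policy without touching the hardware model, which is the flexibility software prioritization provides.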

The first released chip, Au1000 stepping DA, uses revision 1 of the Au1 core; all later chips apparently use revision 2. Errata are documented in "AMD Alchemy™ Au1000™ Processor Specification Update", AMD Publ. #27348, Rev. E, June 2005.

Memory Hierarchy

Data and Instruction Caches

  • L1I Cache:
    • 16 KiB, 4-way set associative
    • 32 Byte line size
  • L1D Cache:
    • 16 KiB, 4-way set associative
    • 32 Byte line size
    • Write-back, read-allocate policy

Translation Lookaside Buffers

  • ITLB
    • 4 entries, fully associative
    • 4 KiB to 16 MiB page sizes in power-of-four steps
  • Unified main TLB
    • 32 entries, fully associative
    • 4 KiB to 16 MiB page sizes in power-of-four steps

The ITLB improves instruction fetch performance; it is fully coherent with the main TLB and transparent to software. The main TLB is compliant with the MIPS specification and produces 36-bit physical addresses. There is no hardware page table walker; software loads TLB entries in a fast TLB miss exception handler.
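
The "power-of-four steps" above give seven page sizes. The following Python sketch enumerates them and, as an illustration, derives a PageMask value for each. The PageMask formula assumes the standard MIPS32 encoding (one TLB entry maps an even/odd page pair, and the mask starts at address bit 13); it has not been checked against Au1 documentation, and the function names are made up for this example.

```python
# Enumerate the page sizes quoted above: 4 KiB to 16 MiB in power-of-four steps.
KIB = 1024
MIN_PAGE = 4 * KIB
MAX_PAGE = 16 * KIB * KIB


def page_sizes():
    """Return all supported page sizes in bytes, smallest first."""
    sizes = []
    size = MIN_PAGE
    while size <= MAX_PAGE:
        sizes.append(size)
        size *= 4
    return sizes


def page_mask(size: int) -> int:
    """PageMask register value for a page size.

    Assumption: standard MIPS32 encoding, where each entry maps a pair of
    pages, so the mask covers 2 * size bytes above address bit 12.
    """
    return (2 * size - 1) & ~0x1FFF


if __name__ == "__main__":
    for s in page_sizes():
        print(f"{s // KIB:>6} KiB  PageMask=0x{page_mask(s):07X}")
```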

System DRAM

All processors with an Au1 core integrate a memory controller with a 16/32-bit interface supporting SDR, DDR, and/or DDR2 SDRAM, depending on the model. ECC is not supported. The CPU core, the memory controller, and other integrated peripherals are linked by an internal System Bus (SBUS) which carries a 36-bit physical address, 32-bit data, and a byte mask, and runs at a configurable ratio of 1/2, 1/3, or 1/4 of the core frequency. SDRAMs run at 1/1 or 1/2 of the SBUS clock on models with a DDR or DDR2 compatible controller, and at 1/2 of the SBUS clock otherwise.
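
The clock ratios above can be turned into concrete frequencies. The Python sketch below computes the possible SBUS and SDRAM clocks for a given core frequency, using exact fractions to avoid rounding. It models a DDR/DDR2-capable controller (both SDRAM ratios); for other models only the 1/2 ratio applies. The function name and data layout are illustrative.

```python
from fractions import Fraction

SBUS_DIVIDERS = (2, 3, 4)   # SBUS runs at 1/2, 1/3, or 1/4 of the core clock
SDRAM_DIVIDERS = (1, 2)     # SDRAM runs at 1/1 or 1/2 of SBUS (DDR/DDR2 controllers)


def clock_tree(core_mhz: int):
    """Return {sbus_divider: (sbus_mhz, [sdram_mhz, ...])} as exact fractions."""
    tree = {}
    for divider in SBUS_DIVIDERS:
        sbus = Fraction(core_mhz, divider)
        tree[divider] = (sbus, [sbus / s for s in SDRAM_DIVIDERS])
    return tree


if __name__ == "__main__":
    for divider, (sbus, sdram) in clock_tree(400).items():
        options = ", ".join(f"{float(f):.1f}" for f in sdram)
        print(f"core/{divider}: SBUS {float(sbus):.1f} MHz, SDRAM {options} MHz")
```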

Pipeline

The Au1 microarchitecture implements a classic five-stage RISC pipeline with several optimizations. The core includes an instruction cache, a write-back data cache, a register file, and a write buffer. A branch prediction unit is not present or needed. All pipeline stages complete in one cycle when data is available, and all pipeline hazards and dependencies are enforced by hardware interlocks, so software generally does not need to insert delay slot instructions to avoid hazards.

The Fetch stage retrieves the next instruction from the instruction cache, passing the fetch address through the ITLB. The instruction and data caches are both tagged by physical address bits 31:12. Because the minimum page size is 4 KiB and each cache way also spans 4 KiB (16 KiB / 4 ways), the set index lies entirely within the page offset, so address translation and cache lookup can overlap. On a cache miss the Fetch stage forwards the fetch address to the virtual memory unit, i.e. the main TLB, to fulfill the request and stalls until the instruction is available. The Fetch stage also decodes register numbers in preparation for a GPRF access in the Decode stage.

The Decode stage performs several operations in parallel. It decodes the instruction and generates control bits for subsequent pipeline stages. It reads operands from the General Purpose Register File (GPRF), or receives them by forwarding from earlier instructions that are now in later pipeline stages. An example is the LUI instruction, which can supply a constant to a dependent instruction entering the Decode stage in the next cycle. The GPRF has two read ports and one write port. The write port is shared between the Write Back stage and data cache loads completing after a cache miss. For branches the Decode stage computes the target address; for loads and stores it computes the effective address by adding base and displacement. Supposedly, at the end of the Decode stage a new program counter value is sent to the Fetch stage for the next instruction fetch cycle.

The Fetch stage does not stall when a branch instruction is in the Decode stage. The instruction fetched in that cycle occupies the MIPS branch delay slot: it follows the branch and is executed even if the branch is taken. It is generally a NOP or, preferably, an instruction performing useful work. The Decode stage stalls if resources are not yet available.

The Execute stage executes ALU instructions. Most instructions complete in a single cycle; a few require multiple cycles (e.g. CLO, CLZ, MUL). Multiply and divide instructions are forwarded to the Multiply Accumulate unit. The Execute stage also evaluates branch conditions, passing the target address to the Fetch stage if the branch is taken. For loads and stores it initiates a data cache lookup, passing the effective address through the main TLB for translation. All exception conditions (arithmetic, TLB, interrupt, etc.) are posted by the end of the Execute stage so that exceptions can be signaled in the Cache stage.

In the Cache stage, loads which hit in the data cache obtain their data and forward it to dependent instructions in the pipeline. On a cache miss, or if the address is uncacheable, a request is sent to the System Bus after a check for pending stores in the Write Buffer. A store which hits writes its data into the cache; otherwise the store is forwarded to the Write Buffer. If any exceptions are posted, an exception is signaled and the Au1 core is directed to fetch instructions at the appropriate exception vector address.

In the Write Back stage, results are stored in the GPRF and forwarded to other stages as needed.
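
The stage sequence described above can be visualized with a small pipeline diagram. The Python sketch below assumes the ideal case only: one instruction enters the pipeline per cycle and nothing stalls (no cache misses, interlocks, or multi-cycle instructions). The helper names are illustrative and not part of any Au1 tooling.

```python
# Ideal-case occupancy of the five Au1 pipeline stages.
STAGES = ("Fetch", "Decode", "Execute", "Cache", "Write Back")


def stage_at(instr_index: int, cycle: int):
    """Stage occupied by instruction number instr_index in a cycle, or None."""
    position = cycle - instr_index
    return STAGES[position] if 0 <= position < len(STAGES) else None


def total_cycles(n_instructions: int) -> int:
    """Cycles needed to drain n instructions through a stall-free pipeline."""
    return len(STAGES) + n_instructions - 1 if n_instructions else 0


def diagram(n_instructions: int):
    """One text row per instruction; each column is a cycle (F D E C W)."""
    cycles = total_cycles(n_instructions)
    rows = []
    for i in range(n_instructions):
        cells = [(stage_at(i, c) or "-")[0] for c in range(cycles)]
        rows.append(f"i{i}: " + " ".join(cells))
    return rows


if __name__ == "__main__":
    print("\n".join(diagram(3)))
```

In cycle 1 the first instruction is in Decode while the second is already in Fetch, which is the situation that gives rise to the branch delay slot when the first instruction is a branch.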

Multiply Accumulate Unit

The Multiply Accumulate (MAC) unit executes all multiply and divide instructions. It is composed of a 32 × 16 bit pipelined array multiplier with early out detection, a divide block, and the MIPS HI and LO registers. Instructions in the main pipeline which do not depend on MAC results can execute simultaneously with instructions in the MAC unit.

A 16 × 16 or 32 × 16 bit multiplication can complete in one cycle. The 32 × 16 bit multiply must have the sign-extended 16-bit value in register operand rt of the instruction. 32 × 32 bit multiplies can be started every other CPU cycle and complete in two cycles. Instructions writing to the HI/LO registers (multiply and accumulate 64 bits) take two additional cycles. Divide instructions complete in a maximum of 35 cycles.
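
The timing rules above can be summarized in a rough latency model. The Python sketch below is one reading of them: a multiply takes one cycle if operand rt holds a sign-extended 16-bit value and two cycles otherwise, and a result written to HI/LO adds two cycles on top of that. Treating the two extra cycles as additive is an assumption, and the function names are hypothetical.

```python
# Rough latency model of the Au1 MAC unit, based on the figures quoted above.
DIVIDE_MAX_CYCLES = 35


def fits_signed16(value: int) -> bool:
    """True if a 32-bit register value is a sign-extended 16-bit quantity."""
    value &= 0xFFFFFFFF
    signed = value - (1 << 32) if value & 0x80000000 else value
    return -(1 << 15) <= signed < (1 << 15)


def multiply_cycles(rt: int, writes_hi_lo: bool = False) -> int:
    """Estimated cycles for a multiply with the given rt operand.

    Assumption: the two cycles for HI/LO writes are added to the base latency.
    """
    base = 1 if fits_signed16(rt) else 2
    return base + (2 if writes_hi_lo else 0)


if __name__ == "__main__":
    for rt in (100, 0xFFFF8000, 0x00008000, 0x12345678):
        print(f"rt=0x{rt:08X}: {multiply_cycles(rt)} cycle(s), "
              f"{multiply_cycles(rt, writes_hi_lo=True)} with HI/LO write")
```

Under this model, code that multiplies by small constants benefits from placing the constant in rt rather than rs.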

Cache Operations

The Au1 core contains a 16 KiB Instruction Cache (IC) and a 16 KiB Data Cache (DC), both organized as 128 sets, four-way set associative, with each line holding 32 bytes from a 32-byte aligned memory address. IC and DC lines are tagged with physical address bits 31:12, so addresses beyond 4 GiB are not cacheable. The data cache follows a write-back, read-allocate policy, i.e. stores go to the cache on a hit and to a write buffer on a miss, rather than filling and modifying a cache line. Data cache lookups are not stalled by a load miss (hit-under-miss). If a load hits while an earlier load miss is outstanding, its data is delivered from the cache immediately after the earlier load has received its data from memory. The MIPS ISA permits only naturally aligned loads and stores.
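
The cache geometry above determines how a physical address is split into tag, set index, and line offset. The Python sketch below derives the field widths from the quoted sizes and decomposes a 32-bit address. The tag range (bits 31:12) comes from the text; the index range (bits 11:5) and offset range (bits 4:0) follow arithmetically from 128 sets and 32-byte lines. Names are illustrative.

```python
# Derive the Au1 cache address fields from the quoted geometry.
CACHE_BYTES = 16 * 1024
WAYS = 4
LINE_BYTES = 32

SETS = CACHE_BYTES // (WAYS * LINE_BYTES)     # number of sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1     # log2(32)
INDEX_BITS = SETS.bit_length() - 1            # log2(SETS)
TAG_MASK = 0xFFFFF                            # 20 tag bits: address bits 31:12


def split_address(addr: int):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = (addr >> (OFFSET_BITS + INDEX_BITS)) & TAG_MASK
    return tag, index, offset


if __name__ == "__main__":
    tag, index, offset = split_address(0x12345678)
    print(f"sets={SETS}, tag=0x{tag:05X}, index={index}, offset={offset}")
```

Index and offset together occupy 12 bits, the same as the offset within a minimum-size 4 KiB page, which is why the set can be selected before address translation finishes.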

IC and DC lines are filled with a burst memory read. Depending on Cache Coherency Attributes, a burst read generally retrieves the critical word first. True LRU logic replaces lines unless software locks individual lines with a CACHE instruction. A line can be locked in up to three cache ways; one way always remains unlocked for replacement. A modified DC line is moved to a cast-out buffer and written back to memory after the line has been filled with the new data. In streaming mode, to reduce IC or DC pollution, code or data is stored only in the first way. Streaming mode is enabled by a variant of the PREF (prefetch) instruction or a per-page streaming bit in the TLB. A hardware prefetcher is not present.
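
The interaction of LRU replacement and line locking can be sketched for a single cache set. The Python model below is a simplification under stated assumptions: it tracks tags only (no data, dirty state, or streaming mode), and the class and method names are invented for illustration. It enforces the rule that at most three of the four ways can be locked.

```python
class CacheSet:
    """One 4-way set with true LRU replacement and per-way locks."""

    WAYS = 4
    MAX_LOCKED = 3  # one way always stays available for replacement

    def __init__(self):
        self.tags = [None] * self.WAYS
        self.locked = [False] * self.WAYS
        self.lru = list(range(self.WAYS))  # least recently used way first

    def _touch(self, way: int) -> None:
        """Mark a way as most recently used."""
        self.lru.remove(way)
        self.lru.append(way)

    def access(self, tag) -> bool:
        """Return True on a hit; on a miss, fill the LRU way that is unlocked."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        victim = next(w for w in self.lru if not self.locked[w])
        self.tags[victim] = tag
        self._touch(victim)
        return False

    def lock(self, tag) -> bool:
        """Lock the line holding tag; refuse if three ways are already locked."""
        if tag not in self.tags or sum(self.locked) >= self.MAX_LOCKED:
            return False
        self.locked[self.tags.index(tag)] = True
        return True


if __name__ == "__main__":
    s = CacheSet()
    for t in "ABCD":
        s.access(t)
    print([s.lock(t) for t in "ABCD"])  # the fourth lock is refused
    s.access("E")                       # can only replace the unlocked way
    print(s.tags)
```

With three ways locked, every miss in the set evicts the single remaining way, so heavy locking effectively turns the set into a direct-mapped one for unlocked data.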

The data cache snoops SBUS transactions that are flagged coherent in order to maintain coherency with other System Bus masters, e.g. a DMA engine, providing data on a read hit and updating itself on a write hit. The instruction cache does not enforce coherency; however, instructions are available to invalidate IC and DC lines, write DC lines back to memory, and flush the write buffer. The data cache snoops IC line fills, so an explicit write-back of code held in the data cache is not necessary once software has invalidated the corresponding lines in the instruction cache.

Write Buffer

The Write Buffer consists of a write-combining stage and a 16-entry store queue, each entry holding a 36-bit physical memory address (presumably only bits 35:2, since entries are word-aligned), a 32-bit data word, and a byte mask. Write-combining merges successive stores to the same 32-bit aligned word in memory. If a store to a different word arrives, the currently latched address, data, and byte mask are entered into the store queue. The unit removes data from the queue in FIFO order and performs a 32-bit write to memory after System Bus arbitration. If stores with consecutive ascending addresses were entered, they are written as a burst of two to eight 32-bit words.
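
The write-combining behavior described above can be modeled in a few lines. The Python sketch below keeps one latched entry that absorbs successive stores to the same 32-bit word and pushes it into a FIFO when a store to a different word arrives. Several details are assumptions made for illustration: byte lanes are little-endian, a full queue is reported by returning False (standing in for a pipeline stall), and burst writes are not modeled. All names are hypothetical.

```python
from collections import deque

QUEUE_DEPTH = 16  # store queue entries, as quoted above


class WriteBuffer:
    """Write-combining latch in front of a FIFO store queue."""

    def __init__(self):
        self.latch = None     # [word_address, data, byte_mask] being combined
        self.queue = deque()  # entries waiting for the System Bus

    def store(self, addr: int, value: int, size: int) -> bool:
        """Accept a naturally aligned store of 1, 2, or 4 bytes.

        Returns False if the queue is full and the store cannot be accepted
        (assumption: the real pipeline would stall here).
        """
        word_addr = addr & ~0x3
        shift = (addr & 0x3) * 8                      # little-endian lanes assumed
        lane_mask = ((1 << (8 * size)) - 1) << shift
        byte_mask = ((1 << size) - 1) << (addr & 0x3)
        data = (value << shift) & lane_mask
        if self.latch and self.latch[0] == word_addr:
            # Same 32-bit word: merge into the latched entry.
            self.latch[1] = (self.latch[1] & ~lane_mask) | data
            self.latch[2] |= byte_mask
            return True
        if self.latch:
            if len(self.queue) >= QUEUE_DEPTH:
                return False
            self.queue.append(tuple(self.latch))
        self.latch = [word_addr, data, byte_mask]
        return True

    def drain_one(self):
        """Remove the oldest queued entry (FIFO order), or None if empty."""
        return self.queue.popleft() if self.queue else None


if __name__ == "__main__":
    wb = WriteBuffer()
    wb.store(0x1000, 0xAA, 1)
    wb.store(0x1001, 0xBB, 1)        # merges with the previous byte store
    wb.store(0x1004, 0x11223344, 4)  # different word: pushes the merged entry
    print(wb.drain_one())
```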

Loads and stores which hit in the data cache can bypass older stores in the Write Buffer. Store-to-load forwarding is not implemented: if a load misses in the data cache but matches an address in the buffer, the load stalls and the buffer drains entries until all conflicting data is committed to memory. The Write Buffer does not snoop the System Bus, but software can issue a SYNC instruction to flush it. Stores go through the Write Buffer regardless of whether the segment or page containing the address is flagged as cacheable. The buffer can, however, be disabled entirely, or only write-combining can be disabled. Write-combining and burst writes can also be disabled per page with a non-mergeable and non-gatherable flag in the TLB.

All Processors Using Au1

See Alchemy family.
