Xiaomi - Microarchitectures - Phytium
Revision as of 23:00, 1 May 2018 by Inject (talk | contribs) (Physical Design)

Xiaomi µarch
General Info
Arch Type: CPU
Designer: Phytium
Manufacturer: TSMC
Introduction: 2017
Process: 28 nm
Pipeline
Type: Superscalar
Speculative: Yes
Reg Renaming: Yes
Instructions
ISA: ARMv8
Cache
L1I Cache: 32 KiB/Core
L1D Cache: 32 KiB/Core
L2 Cache: 4 MiB/Panel
L3 Cache: 16 MiB/CMC
Cores
Core Names: FTC660, FTC661

Xiaomi is an ARM microarchitecture designed in-house by Phytium for its consumer and server microprocessors.

Brands

Codename Brand Description
Mars FT-2000
  • High performance
  • High bandwidth, Large memory
  • High bandwidth I/O
  • Large scale cache coherency
Earth FT-1500A
  • Moderate performance
  • High power efficiency
  • High density computing
  • Low cost

Architecture

Overview

  • Fully ARMv8 compatible
    • Support AArch32 and AArch64 modes
    • EL0-EL3 supported
    • ASIMD-128
  • 28 nm process
  • Scalable design
    • 4 to 64 cores
  • Mesh topology network-on-chip
  • Panel-based (grid) architecture
  • Global cache coherency
  • 2x DDR3-1600 channels per panel
  • 2x 16-lane PCIe 3.0

Panel Architecture

xiaomi panel-based data affinity architecture.png

Phytium organizes its processors in a grid layout of units called panels, an approach it calls the panel-based data affinity architecture. Each panel consists of 8 independent ARMv8-compatible cores. The Phytium "Mars" processor consists of 8 such panels for a total of 64 cores. Panels are interconnected by a 2-dimensional mesh network-on-chip, and each panel has 4 MiB of level 2 cache, for a total of 32 MiB.

In addition to the main die, Mars uses auxiliary Cache & Memory Chips (CMCs). Eight such chips are connected to the main die, each providing 16 MiB of level 3 cache for a total of 128 MiB. Together they also provide 8 dual-channel DDR3-1600 memory controllers (16 channels) for a maximum total bandwidth of about 204 GB/s. Mars also provides two 16-lane PCIe 3.0 interfaces. The chip incorporates ECC and parity protection on all caches, tags, and TLBs.
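The totals quoted above can be re-derived from the per-panel and per-CMC figures. The following is a minimal sketch; the variable names are illustrative, and the 64-bit (8-byte) channel width is an assumption based on standard DDR3, not something stated in the source. Under that assumption the peak memory bandwidth comes to 204.8 GB/s in decimal gigabytes.

```python
# Figures from the text; names are illustrative only.
PANELS = 8
CORES_PER_PANEL = 8
L2_PER_PANEL_MIB = 4
CMC_CHIPS = 8
L3_PER_CMC_MIB = 16
DDR_CHANNELS_PER_CMC = 2
DDR3_1600_MT_PER_S = 1600  # mega-transfers per second
BUS_WIDTH_BYTES = 8        # ASSUMPTION: standard 64-bit DDR3 channel

total_cores = PANELS * CORES_PER_PANEL                # 64
total_l2_mib = PANELS * L2_PER_PANEL_MIB              # 32
total_l3_mib = CMC_CHIPS * L3_PER_CMC_MIB             # 128
total_channels = CMC_CHIPS * DDR_CHANNELS_PER_CMC     # 16

# Peak bandwidth per channel in decimal GB/s, then for all channels.
per_channel_gb_s = DDR3_1600_MT_PER_S * BUS_WIDTH_BYTES / 1000
peak_bandwidth_gb_s = total_channels * per_channel_gb_s

print(total_cores, total_l2_mib, total_l3_mib, total_channels, peak_bandwidth_gb_s)
```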

Panel

Each panel consists of 8 cores, each ARMv8-compatible and supporting the AArch32 and AArch64 modes, Exception Levels EL0-EL3, and ASIMD-128 operations. Each core has its own inclusive L1 cache, and the cores of a panel share an L2 cache (4 MiB per panel). Each panel contains two Directory Control Units (DCUs), which are in charge of maintaining directory-based cache coherency, and one routing cell, which manages inter-panel communication.

On TSMC's 28 nm process, a panel is 6,000 µm x 10,600 µm (63.6 mm²).

xiaomi panel.png   xiaomi panel die (28nm).png

Cache & Memory Chip (CMC)

xiaomi cmc.png

To solve the complexity involved in having more than eight memory controllers on a chip, Xiaomi uses a coupled auxiliary Cache & Memory Chip (CMC) to scale the bandwidth with computing power. The Phytium "Mars" chip, which contains 64 cores on 8 panels, uses eight CMCs. Together they provide 16 DDR3 controllers (8x2), and each CMC holds 16 MiB of L3 cache data along with 2 MiB of ECC data. A proprietary Phytium interface is used between the processor and the CMC.

xiaomi latency.png
Memory access              Latency (ns)
Local L1 cache hit         ~2
Local L2 cache hit         ~8
Affinitive L2 cache hit    ~20
Affinitive L3 cache hit    ~36
Affinitive DDR access      ~70
  • Panel and NoC operate at 2 GHz
  • CMC operates at 1.5 GHz
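
The latencies above are given in nanoseconds; converting them to core cycles at the 2 GHz panel clock makes them easier to compare with pipeline figures. This is a minimal sketch with illustrative names. Note that the ~2 ns local L1 hit corresponds to 4 cycles, which agrees with the 4-cycle load-to-use figure listed under Memory Hierarchy.

```python
PANEL_CLOCK_GHZ = 2.0  # panel and NoC clock from the text

# Approximate latencies in nanoseconds, copied from the table above.
latency_ns = {
    "Local L1 cache hit": 2,
    "Local L2 cache hit": 8,
    "Affinitive L2 cache hit": 20,
    "Affinitive L3 cache hit": 36,
    "Affinitive DDR access": 70,
}

def ns_to_cycles(ns, clock_ghz=PANEL_CLOCK_GHZ):
    """Convert a latency in ns to clock cycles (1 GHz = 1 cycle per ns)."""
    return ns * clock_ghz

for name, ns in latency_ns.items():
    print(f"{name}: ~{ns} ns = ~{ns_to_cycles(ns):.0f} cycles")
```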

Block Diagram

xiaomi block diagram.svg

Memory Hierarchy

  • Cache
    • ECC and parity protection on all caches, tags, and TLBs
    • L1I Cache
      • 32 KiB
    • L1D Cache
      • 32 KiB
      • 4 cycles for fastest load-to-use
    • L2 Cache
      • 4 MiB per panel, shared by the panel's 8 cores
    • L3 Cache
      • 16 MiB per CMC

Pipeline

Each Xiaomi core is an ARMv8-compatible core implemented as a superscalar, out-of-order, 4-decode/4-dispatch pipeline with hybrid branch prediction.

Front End

The front-end consists of the instruction cache and prefetcher, instruction fetch, and decode. Xiaomi cores contain a 32 KiB L1 instruction cache with a prefetcher designed to reduce cache misses. On a hit, 2 cycles are required to retrieve instructions from the L1. Xiaomi has a hybrid branch predictor made of a TAGE predictor and a 512-entry indirect predictor. The BP unit also has a 48-entry Return Address Stack (RAS) for speculative subroutine returns and a 2K-entry BTB. Up to four instructions can be fetched each cycle into the instruction buffer, which is 32 entries in size.

The instruction buffer also serves as a loop detection buffer: it detects loop patterns and holds them in the buffer, bypassing the cache so they can be sent directly to decode. From the instruction buffer, up to four instructions can be decoded, up to four renamed, and up to four dispatched each cycle. Everything is done in-order up to this point.
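
To illustrate how a Return Address Stack such as the 48-entry one mentioned above typically behaves, here is a minimal sketch of a generic circular RAS. It is not Phytium's implementation; the class name, the overwrite-oldest overflow policy, and the interface are all assumptions made for illustration.

```python
class ReturnAddressStack:
    """Generic circular return address stack (illustrative, not Phytium's design)."""

    def __init__(self, entries=48):
        self.entries = entries
        self.stack = [0] * entries
        self.top = 0    # index of the next free slot
        self.count = 0  # number of valid entries

    def push(self, return_addr):
        """On a call: record the return address, overwriting the oldest on overflow."""
        self.stack[self.top] = return_addr
        self.top = (self.top + 1) % self.entries
        self.count = min(self.count + 1, self.entries)

    def pop(self):
        """On a return: predict the target, or None if the stack is empty."""
        if self.count == 0:
            return None
        self.top = (self.top - 1) % self.entries
        self.count -= 1
        return self.stack[self.top]


ras = ReturnAddressStack()
ras.push(0x1004)  # outer call
ras.push(0x2008)  # nested call
print(hex(ras.pop()), hex(ras.pop()))
```

Nested calls are predicted in last-in, first-out order; once call depth exceeds the number of entries, the oldest return addresses are lost and those returns mispredict.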

Back End

The back-end performs operations out-of-order for the most part and is in charge of queuing instructions, executing them, and retiring them. Dispatch contains a 160-entry ReOrder Buffer (ROB) and can dispatch up to 4 instructions per cycle. Note that over 210 instructions can be in flight throughout the entire pipeline. Operand values can be read from Xiaomi's physical register file and from an architectural register file in order to remove the various dependencies. Values are copied from the physical register file to the architectural register file only when the corresponding instructions retire. The physical register file contains 192 physical registers and supports operand reads for up to four instructions in parallel. Because ARM has instructions with up to four operands, the register file would need 16 read ports to support four such instructions simultaneously. To reduce some of the complexity, Phytium chose to reduce the number of ports to 12, where some of the ports are dedicated to each instruction and others are multiplexed. Phytium explained that this decision resulted in a 2.5% reduction in area at a performance cost of 0.017%.
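
The port arithmetic above can be sketched as follows. The source only says that some of the 12 ports are dedicated per instruction and the rest are multiplexed; the split used here (2 dedicated ports per instruction slot plus 4 shared ports) is purely an assumption for illustration, as are all names.

```python
SLOTS = 4          # instructions reading operands per cycle (from the text)
MAX_OPERANDS = 4   # maximum source operands per instruction (from the text)
TOTAL_PORTS = 12   # reduced read-port count (from the text)

# Ports needed if every slot could read four operands at once.
full_ports = SLOTS * MAX_OPERANDS  # 16

def fits_in_ports(operand_counts, dedicated_per_slot=2, total_ports=TOTAL_PORTS):
    """Check whether a group of instructions can read all operands in one cycle.

    ASSUMPTION: each slot owns `dedicated_per_slot` ports and any extra
    operands compete for the remaining shared (multiplexed) ports.
    """
    shared = total_ports - dedicated_per_slot * SLOTS
    overflow = sum(max(0, n - dedicated_per_slot) for n in operand_counts)
    return overflow <= shared

print(full_ports)
print(fits_in_ports([2, 2, 2, 2]))  # common case: two sources each
print(fits_in_ports([4, 4, 4, 2]))  # three four-operand instructions together
```

Under this assumed split, a group stalls only when more than four operands beyond the dedicated ones are needed in the same cycle, which is consistent with the very small performance cost Phytium reported.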

From dispatch, instructions go into 4 discrete scheduling queues: 2x Integer/SIMD, 1x FP/SIMD, and 1x Load/Store. The Int/FP queues are each 16 entries deep. Of the two Integer/SIMD queues, the first can issue two 64-bit single-cycle integer instructions or one 128-bit single-cycle integer instruction (with the two units locked together). One of its units can also perform branch operations. The second queue handles two multi-cycle integer/SIMD operations; like the single-cycle units, the multi-cycle units can be combined to handle one 128-bit operation. The single floating-point/SIMD queue feeds two units, both supporting FMA, which can execute two 64-bit FP operations or be combined for one 128-bit FP operation.

The Load/Store queue is slightly larger than the Int and FP queues, with 24 entries. Two loads or 1 load + 1 store can be issued each cycle. Like the level 1 instruction cache, the level 1 data cache is 32 KiB; it supports six outstanding loads. Next-line and stride-based data prefetching are supported.
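
As a rough illustration of the two prefetch schemes named above, the sketch below shows a generic next-line prefetch and a generic single-stream stride detector. Neither reflects Phytium's actual hardware: the 64-byte line size, the two-matching-strides confirmation rule, and all names are assumptions. Real designs usually track strides per load instruction rather than globally.

```python
def next_line_prefetch(addr, line_bytes=64):
    """Return the address of the cache line following the one holding addr.

    ASSUMPTION: 64-byte lines; the source does not state the line size.
    """
    return (addr // line_bytes + 1) * line_bytes


class StridePrefetcher:
    """Generic stride detector: predicts addr + stride once a stride repeats."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def access(self, addr):
        """Record a load address; return a prefetch address or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Only predict when the same non-zero stride is seen twice in a row.
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction


p = StridePrefetcher()
for a in (0x100, 0x140, 0x180):
    print(hex(a), p.access(a))
```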

Interconnects & Hawk

Hawk is Phytium's cache coherence protocol, which implements distributed directory-based global cache coherency across all the panels. Hawk is a MOESI-like package-based protocol. The network has a node on each panel called a Directory Control Unit (DCU), which is responsible for interfacing the L2 caches in each panel with the CMCs (see § Panel Architecture). Phytium noted that the protocol is optimized for exclusive atomic accesses.
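
The details of Hawk are not public beyond the description above, so the following is only a generic sketch of what a directory-based protocol tracks: for each cache line, which panel owns it and which panels hold shared copies. The class, its methods, and the simplified two-operation model are assumptions for illustration and do not describe Hawk itself.

```python
class Directory:
    """Generic directory for cache lines (illustrative; not the Hawk protocol)."""

    def __init__(self):
        # line address -> {"owner": panel id or None, "sharers": set of panel ids}
        self.entries = {}

    def _entry(self, line):
        return self.entries.setdefault(line, {"owner": None, "sharers": set()})

    def read(self, line, panel):
        """Handle a read: return who supplies the data and record the sharer."""
        e = self._entry(line)
        source = e["owner"] if e["owner"] is not None else "memory"
        e["sharers"].add(panel)
        return source

    def write(self, line, panel):
        """Handle a write: return the panels to invalidate; requester becomes owner."""
        e = self._entry(line)
        holders = set(e["sharers"])
        if e["owner"] is not None:
            holders.add(e["owner"])
        to_invalidate = holders - {panel}
        e["owner"] = panel
        e["sharers"] = set()
        return to_invalidate


d = Directory()
d.write(0x40, panel=0)         # panel 0 takes ownership of the line
print(d.read(0x40, panel=3))   # data is supplied by the owner, panel 0
print(d.write(0x40, panel=5))  # panels 0 and 3 must be invalidated
```

The point of a directory is visible in the last call: only the panels recorded as holders receive invalidations, instead of every panel being snooped.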

Xiaomi implements an on-die 2D concentrated mesh connecting the panels. The Phytium "Mars" chip contains 8 panels organized in two rows of four panels each. Switching is relatively low latency at 3 cycles per hop. On average, a packet sees a latency of around 9 cycles between panels. The network provides a bandwidth of 384 GiB/s per cell.

xiaomi 2d network.png
Destination    Latency (cycles)
0              3
1              6
2              9
3              12
4              15
5              12
6              9
7              6
Average        9
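
The average in the table can be checked directly, and the per-destination latencies can be converted into hop counts using the 3-cycles-per-hop figure. This is a minimal sketch with illustrative names; the conversion to nanoseconds assumes the NoC runs at the 2 GHz stated for the panel and NoC.

```python
CYCLES_PER_HOP = 3   # from the text
NOC_CLOCK_GHZ = 2.0  # panel and NoC clock from the text

# Latency in cycles to destinations 0..7, copied from the table above.
latency_cycles = [3, 6, 9, 12, 15, 12, 9, 6]

average_cycles = sum(latency_cycles) / len(latency_cycles)
hops = [c // CYCLES_PER_HOP for c in latency_cycles]
average_ns = average_cycles / NOC_CLOCK_GHZ

print(average_cycles)  # average latency in cycles
print(hops)            # implied hop count per destination
print(average_ns)      # average latency in nanoseconds
```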

Physical Design

  • Mars is fabricated on TSMC's 28 nm process
  • 10 metal layers
  • ~180 million instances
  • 639.576 mm² die size
  • FCBGA Package
    • ~3000 pins
  • 0.9 V core, 1.8 V I/O
  • 2 GHz, 120 W
xiaomi floor plan.png
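
Combining the die size above with the panel dimensions given in the Panel section gives a rough idea of how the area and power budget are divided. This is a back-of-the-envelope sketch with illustrative names; dividing the whole-chip 120 W evenly across cores is a crude simplification, since the figure also covers caches, the NoC, and I/O.

```python
PANELS = 8
CORES = 64
PANEL_W_UM = 6_000       # panel width in micrometres
PANEL_H_UM = 10_600      # panel height in micrometres
DIE_AREA_MM2 = 639.576
CHIP_POWER_W = 120

# 1 mm^2 = 1e6 um^2
panel_area_mm2 = PANEL_W_UM * PANEL_H_UM / 1e6
panels_area_mm2 = PANELS * panel_area_mm2
panel_share = panels_area_mm2 / DIE_AREA_MM2

# Crude average: whole-chip power divided by core count.
watts_per_core = CHIP_POWER_W / CORES

print(f"panel area: {panel_area_mm2:.1f} mm^2")
print(f"all panels: {panels_area_mm2:.1f} mm^2 ({panel_share:.1%} of the die)")
print(f"average power per core: {watts_per_core:.3f} W")
```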

Performance Claims

Benchmark                          Int     FP
SPEC_CPU2006_base (single copy)    19.2    17.8
SPEC_CPU2006_rate (64 copies)      672     585
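
One way to read the claimed scores is to compare the 64-copy rate result per copy against the single-copy result. The sketch below does that with illustrative names. The base and rate metrics are not strictly comparable, so the resulting ratio should be taken as a rough indication of multi-core scaling rather than a precise efficiency figure.

```python
COPIES = 64

# Claimed scores from the table above.
single_copy = {"int": 19.2, "fp": 17.8}
rate_64 = {"int": 672, "fp": 585}

# Throughput per copy when all 64 cores are busy.
per_copy = {k: rate_64[k] / COPIES for k in rate_64}

# Rough scaling ratio: per-copy rate score relative to the single-copy score.
scaling = {k: per_copy[k] / single_copy[k] for k in single_copy}

for k in ("int", "fp"):
    print(f"{k}: {per_copy[k]:.2f} per copy, {scaling[k]:.1%} of single-copy score")
```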

References

  • Zhang, C. (2015, August). Mars: A 64-core ARMv8 processor. In Hot Chips 27 Symposium (HCS), 2015 IEEE (pp. 1-23). IEEE.