Difference between revisions of "phytium/microarchitectures/xiaomi"

	Xiaomi µarch
	General Info
Arch Type	CPU
Designer	Phytium
Manufacturer	TSMC
Introduction	2017
Process	28 nm
	Pipeline
Type	Superscalar
OoOE	Yes
Speculative	Yes
Reg Renaming	Yes
	Instructions
ISA	ARMv8
	Cache
L1I Cache	32 KiB/Core
L1D Cache	32 KiB/Core
L2 Cache	4 MiB/Panel
L3 Cache	16 MiB/CMC
	Cores
Core Names	FTC660, FTC661, FTC662

Latest revision as of 00:12, 8 March 2021

Xiaomi is an ARM microarchitecture designed in-house by Phytium for their consumer market and server-based microprocessors.

Brands

Codename	Brand	Description
Mars	FT-2000, FT-2000+	High performance; high bandwidth, large memory; high-bandwidth I/O; large-scale cache coherency
Earth	FT-1500A	Moderate performance; high power efficiency; high-density computing; low cost

Architecture

Overview

Fully ARMv8 compatible
- Support AArch32 and AArch64 modes
- EL0-EL3 supported
- ASIMD-128
28 nm process
Scalable design
- 4 to 64 cores
Mesh topology network-on-chip
Panel-based (grid) architecture
Global cache coherency
2x DDR3-1600 channels per panel
- ECC support
2x 16-lane PCIe 3.0
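
The memory figures in the list above can be turned into a peak-bandwidth estimate. The following is a minimal sketch, not a vendor formula: it assumes a standard 64-bit (8-byte) DDR3 data bus per channel, and it takes the count of 8 memory-attached units for the 64-core "Mars" part from the earlier revision text of this page. The function name and constants are illustrative.

```python
# Peak theoretical DRAM bandwidth from channel count and transfer rate.
DDR3_1600_TRANSFERS_PER_S = 1600e6  # DDR3-1600: 1600 mega-transfers per second
BUS_WIDTH_BYTES = 8                 # assumption: 64-bit data bus, ECC bits not counted


def peak_bandwidth_gb_s(panels: int, channels_per_panel: int = 2) -> float:
    """Return the peak bandwidth in decimal GB/s for the given channel layout."""
    channels = panels * channels_per_panel
    return channels * DDR3_1600_TRANSFERS_PER_S * BUS_WIDTH_BYTES / 1e9


per_channel = peak_bandwidth_gb_s(panels=1, channels_per_panel=1)
mars_total = peak_bandwidth_gb_s(panels=8)  # 8 x 2 channels, per the earlier revision
print(f"{per_channel:.1f} GB/s per channel, {mars_total:.1f} GB/s in total")
```

Under these assumptions the total comes to roughly 204.8 GB/s, which is in line with the "204 GB/s" maximum quoted in the earlier revision of this article.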

Block Diagram

Memory Hierarchy

Cache
- ECC and parity protection on all caches, tags, and TLBs
- L1I Cache
  - 32 KiB
- L1D Cache
  - 32 KiB
  - 4 cycles for fastest load-to-use
- L2 Cache
  - 4 MiB/Panel (per 4 cores) (§ Panel)
- L3 Cache
  - 16 MiB/CMC (§ CMC)

Pipeline

Each Xiaomi core is an ARMv8-compatible core implemented as a superscalar, out-of-order, 4-decode/4-dispatch pipeline with hybrid branch prediction.

Front End

predictor

The front end consists of instruction caching and prefetching, instruction fetch, and decode. Xiaomi cores contain a 32 KiB L1 instruction cache with a prefetcher designed to reduce cache misses. On a hit, retrieving instructions from the L1 takes 2 cycles. Xiaomi has a hybrid branch predictor made of a TAGE predictor and a 512-entry indirect predictor. The BP unit also has a 48-entry Return Address Stack (RAS) for speculative subroutine returns and a 2K-entry BTB. Up to four instructions can be fetched each cycle into the 32-entry instruction buffer.

The instruction buffer also serves as a loop detection buffer: it detects loop patterns and holds them in the buffer, bypassing the cache so they can be sent directly to decode. From the instruction buffer, up to four instructions can be decoded, renamed, and dispatched each cycle. Everything is done in order up to this point.
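
To illustrate the role of the Return Address Stack mentioned above, here is a minimal behavioral sketch of a 48-entry RAS. It is a generic model, not Phytium's implementation: the class and method names are hypothetical, and the overflow policy (silently dropping the oldest entry) is an assumption, since the source does not describe it.

```python
from collections import deque
from typing import Optional


class ReturnAddressStack:
    """Bounded LIFO of predicted return addresses (behavioral model only)."""

    def __init__(self, entries: int = 48) -> None:
        # deque(maxlen=...) discards the oldest entry when full (assumed policy).
        self._stack = deque(maxlen=entries)

    def on_call(self, return_address: int) -> None:
        """Record the address of the instruction following a call."""
        self._stack.append(return_address)

    def predict_return(self) -> Optional[int]:
        """Pop the most recent return address, or None if the stack is empty."""
        return self._stack.pop() if self._stack else None


ras = ReturnAddressStack()
ras.on_call(0x1004)  # outer call
ras.on_call(0x2008)  # nested call
print(hex(ras.predict_return()))  # innermost return is predicted first
print(hex(ras.predict_return()))
```

With this model, call nesting deeper than 48 levels loses the outermost return addresses, so those returns would fall back to some other prediction mechanism.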

Back End

The back end performs operations out-of-order for the most part and is in charge of queuing instructions, executing them, and retiring them. Dispatch contains a 160-entry ReOrder Buffer (ROB) and can dispatch up to 4 instructions per cycle. Note that over 210 instructions can be in flight throughout the entire pipeline. Operand values can be read from Xiaomi's physical register file and an architectural register file in order to remove the various dependencies. Registers are only copied from the physical register file to the architectural register file when the corresponding instructions retire. The physical register file contains 192 physical registers and supports operand reads for up to four instructions in parallel. Because ARM has instructions with up to four operands, the register file would need 16 read ports to serve four such instructions simultaneously. To reduce some of the complexity, Phytium chose to reduce the number of ports to 12, where some of the ports are dedicated to each instruction and others are multiplexed. Phytium explained that this decision resulted in a 2.5% reduction in area while adding 0.017% in performance overhead.
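
The port arithmetic in the paragraph above can be sketched as follows. This is a simplified illustration: the helper names are hypothetical, and the `fits_in_ports` check only compares total operand counts against the 12 ports. It ignores the split between dedicated and multiplexed ports, which the source does not detail.

```python
from typing import Sequence


def read_ports_needed(issue_width: int, max_sources: int) -> int:
    """Worst-case read ports if every instruction reads max_sources registers."""
    return issue_width * max_sources


def fits_in_ports(source_counts: Sequence[int], ports: int = 12) -> bool:
    """Simplified check: the group fits if its total operand count is within ports.

    Assumption: any operand can use any port, which is looser than real hardware.
    """
    return sum(source_counts) <= ports


worst_case = read_ports_needed(issue_width=4, max_sources=4)
print(worst_case)                     # 16 ports for the full worst case
print(fits_in_ports([2, 2, 3, 2]))    # a typical mix fits within 12 ports
print(fits_in_ports([4, 4, 4, 4]))    # four 4-operand instructions do not
```

The small quoted performance cost suggests that groups needing more than 12 operand reads in a single cycle are rare in practice, which is the trade-off the reduced port count relies on.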

Execution Units

Floating-point execution unit

From dispatch, out-of-order instructions go into 4 discrete scheduling queues: 2x Integer/SIMD, 1x FP/SIMD, and 1x Load/Store. The Int/FP queues are each 16-entry deep. Xiaomi includes two separate Integer/SIMD queues. The first one is capable of executing two 64-bit single-cycle integer instructions or one 128-bit single-cycle integer (with the two units locked together). Additionally one of the units is also capable of performing branch operations. The second queue handles two multi-cycle integer/SIMD operations. Just like the single-cycle unit, the multi-cycle unit can also handle one 128-bit operation by combining both units. Xiaomi includes a single floating-point/SIMD queue with both units supporting FMA as well as two 64-bit FP operations or one 128-bit FP operation by combining two units.

At least on the FTC-662 (16 nm version), floating-point multiplication takes 3 cycles, addition takes 3 cycles, and the longest floating-point division takes 16 cycles.

With 24 entries, the Load/Store queue is slightly larger than the Int and FP queues. Two loads, or one load and one store, can be issued each cycle. Like the level 1 instruction cache, the level 1 data cache is 32 KiB; it supports six outstanding loads. Next-line and stride-based data prefetching are supported.
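
As a rough illustration of the queue sizes and the load/store issue rule described in this section, the sketch below totals the scheduler entries and checks whether a given mix of memory operations can issue in one cycle. The names are hypothetical, and treating any subset of "two loads" or "one load plus one store" as legal (for example, a lone store) is an assumption.

```python
# Scheduler queue depths as stated in the text.
INT_SIMD_QUEUES = 2
FP_SIMD_QUEUES = 1
INT_FP_QUEUE_DEPTH = 16
LOAD_STORE_QUEUE_DEPTH = 24


def total_scheduler_entries() -> int:
    """Sum the entries of all four scheduling queues."""
    int_fp = (INT_SIMD_QUEUES + FP_SIMD_QUEUES) * INT_FP_QUEUE_DEPTH
    return int_fp + LOAD_STORE_QUEUE_DEPTH


def lsu_can_issue(loads: int, stores: int) -> bool:
    """True if the mix fits 'two loads' or 'one load plus one store' per cycle."""
    if loads < 0 or stores < 0:
        return False
    two_loads = loads <= 2 and stores == 0
    load_and_store = loads <= 1 and stores <= 1
    return two_loads or load_and_store


print(total_scheduler_entries())          # 3 * 16 + 24
print(lsu_can_issue(loads=2, stores=0))   # allowed
print(lsu_can_issue(loads=1, stores=1))   # allowed
print(lsu_can_issue(loads=0, stores=2))   # not allowed: at most one store
```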

Performance Claims

	Int	FP
SPEC_CPU2006_base (Single core)	19.2	17.8
SPEC_CPU2006_rate (64 cores)	672	585
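
One way to read the table above is to compare the 64-core rate score with 64 times the single-core score. The sketch below does that arithmetic. Treat the result as a rough indicator only: SPEC base (speed) and rate (throughput) scores are measured differently and are not strictly comparable, and the function name is illustrative.

```python
def rate_scaling_ratio(rate_score: float, single_core_score: float, cores: int) -> float:
    """Ratio of the multi-core rate score to an ideal linear scale-up."""
    return rate_score / (single_core_score * cores)


int_ratio = rate_scaling_ratio(rate_score=672, single_core_score=19.2, cores=64)
fp_ratio = rate_scaling_ratio(rate_score=585, single_core_score=17.8, cores=64)
print(f"Int: {int_ratio:.1%} of linear, FP: {fp_ratio:.1%} of linear")
```

By this crude measure the claimed rate scores are a little over half of a perfectly linear scale-up, which is plausible for 64 cores sharing caches and memory bandwidth.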

All Xiaomi Processors

	List of Xiaomi-based Processors
Model	Launched	Cores	L2	Frequency	TDP
FT-1500A/16	26 July 2016	16	8 MiB	1.5 GHz	35 W
FT-1500A/4	26 July 2016	4	2 MiB	2 GHz	15 W
FT-2000+/64	2019	64	32 MiB	2.3 GHz	96 W
FT-2000/64	2017	64	32 MiB	2 GHz	120 W
Count: 4

Bibliography

Phytium, IEEE Hot Chips 27 Symposium (HCS) 2015.

@@ Line 1: / Line 1: @@
 {{phytium title|Xiaomi|arch}}
 {{microarchitecture
-| atype         = CPU
+|atype=CPU
-| name          = Xiaomi
+|name=Xiaomi
-| designer      = Phytium
+|designer=Phytium
-| manufacturer  = TSMC
+|manufacturer=TSMC
-| introduction  = 2017
+|introduction=2017
-| phase-out     =
+|process=28 nm
-| process       = 28 nm
+|type=Superscalar
-| cores         =
+|oooe=Yes
-| cores 2       =
+|speculative=Yes
-| cores N       =
+|renaming=Yes
+|isa=ARMv8
-| pipeline      = Yes
+|l1i=32 KiB
-| type          = Superscalar
+|l1i per=Core
-| type 2        =
+|l1d=32 KiB
-| type N        =
+|l1d per=Core
-| OoOE          = Yes
+|l2=4 MiB
-| speculative   = Yes
+|l2 per=Panel
-| renaming      = Yes
+|l3=16 MiB
-| stages        = <!-- ONLY IF FIXED SIZE, otherwise use below for range -->
+|l3 per=CMC
-| stages min    =
+|core name=FTC660
-| stages max    =
+|core name 2=FTC661
-| issues        =
+|core name 3=FTC662
+|pipeline=Yes
-| inst          = Yes
+|OoOE=Yes
-| isa           = ARMv8
+|inst=Yes
-| feature       =
+|cache=Yes
-| extension     =
+|core names=Yes
-| extension 2   =
-| extension N   =
-| cache         = <!-- yes for cache info -->
-| l1i           =
-| l1i per       =
-| l1i desc      =
-| l1d           =
-| l1d per       =
-| l1d desc      =
-| l2            =
-| l2 per        =
-| l2 desc       =
-| l3            =
-| l3 per        =
-| l3 desc       =
-| core names       = Yes
-| core name        = FTC660
-| core name 2      = FTC661
-| core name N      =
 }}
 '''Xiaomi''' is an [[ARM]] microarchitecture designed in-house by [[Phytium]] for their consumer market and server-based microprocessors.
@@ Line 56: / Line 35: @@
 ! Codename !! Brand !! Description
 |-
-| Mars || {{phytium|FT-2000}} ||
+| Mars || {{phytium|FT-2000}}<br>{{phytium|FT-2000+}} ||
 * High performance
 * High bandwidth, Large memory
@@ Line 85: / Line 64: @@
 ** [[ECC]] support
 * 2x 16-lane [[PCIe]] 3.0
-=== Panel Architecture ===
-[[File:xiaomi panel-based data affinity architecture.png|right|450px]]
-Phytium organizes their processors using a grid-layout they call '''Panels''' they call '''Panel-based data affinity architecture'''.  Each panel consists of 8 independent [[ARMv8]]-compatible cores. Phytium "Mars" processor consists of 8 such panels for a total of [[64 cores]]. Panels are interconnected with a 2-dimensional mesh network-on-a-chip [[level 2 cache]] with 4 MiB per panel for a total of 32 MiB.
-In addition to the main die, Mars uses an additional '''Cache & Memory chips''' ('''CMC''') auxiliary chips. "Mars" uses 8 such chips connected to the main die providing 16 MiB of [[level 3 cache]] for a total of 128 MiB as well as 8 dual-channel DDR3-1600 [[memory controller]]s for a total maximum bandwidth of 204 GB/s. Mars also provides two 16-lane PCIe 3.0 interfaces. The chips incorporates ECC and parity protection on all caches, tags, and TLBs.
-==== Panel ====
-Each Panel consists of 8 cores - each [[ARMv8]]-compatible, supporting AArch32 and AArch64 modes, Exception Levels EL0-EL3, as well as ASIMD-128 operations. Each core has its own inclusive [[L1 cache]] and a shared [[L2 cache]] (4 MiB per panel). Each panel contains two '''Directory Control Units''' ('''DCU''') which are in charge of maintaining directory-based [[cache coherency]] and one routing cell for managing the inter-panel communication.
-On TSMC's [[28 nm process]], a panel is 6,000 µm x 10,600 µm (63.6 mm²).
-{| style="border-spacing: 15px;"
-| [[File:xiaomi panel.png|400px]] || &nbsp; || [[File:xiaomi panel die (28nm).png|300px]]
-|}
 === Block Diagram ===
@@ Line 105: / Line 69: @@
 === Memory Hierarchy ===
-{{empty section}}
+* Cache
+** ECC and parity protection on all caches, tags, and TLBs
+** L1I Cache
+*** 32 KiB
+** L1D Cache
+*** 32 KiB
+*** 4 cycles for fastest load-to-use
+** L2 Cache
+*** 4 MiB/Panel (per 4 cores) ([[#Panel|§ Panel]])
+** L3 Cache
+*** 16 MiB/CMC ([[#Cache_.26_Memory_Chip_.28CMC.29|§ CMC]])
 === Pipeline ===
@@ Line 111: / Line 85: @@
 ==== Front End ====
+[[File:phytium xiaomi predictor.png|thumb|right|predictor]]
 The front-end consists of the instruction caches & prefetches, fetching of instructions, and decoding. Xiaomi cores contain a 32 [[KiB]] [[L1 instruction cache]] with a prefetcher designed to reduce caches misses. On hits, 2 cycles are required for retrieval of instructions from the L1. Xiaomi has a hybrid [[branch predictor]] made of a [[TAGE predictor]] and a 512-entry [[indirect predictor]]. The BP unit also has a 48-entry Return Address Stack (RAS) for speculative subroutine return and a 2K-entry BTB. Up to four instructions can be fetches each cycle into the instruction buffer which is 32 entries in size.
@@ Line 116: / Line 91: @@
 ==== Back End ====
-The back-end performs operations [[out-of-order]] for the most part and is in charge of queuing instructions, executing them and retiring them. Dispatch contains a 160-entry [[ReOrder Buffer]] (ROB) and can dispatch up to 4 instructions per cycle. Note that over 210 instructions can be in-flight throughout the entire pipeline. Operand values can be read from Xiaomi's [[physical register file]] and an [[architectural register file]] in order to remove the various dependencies. Registers are only updated from the physical register file to the architectural register file when the corresponding instructions require. The physical register file contains 192 physical [[registers]] supporting up to four parallel instruction reads. Because [[ARM]] has instructions with up to four [[operands]], the register file would need 16 ports to support those simultaneous ports. To reduce some of the complexity, Phytium chose to reduce the number of ports to 12 where some of the ports are dedicated to each instruction and others are [[multiplexed]]. Phytium explained this decision resulted in a 2.5% reduction in area while adding 0.017% in overhead performance.
+The back-end performs operations [[out-of-order]] for the most part and is in charge of queuing instructions, executing them and retiring them. Dispatch contains a 160-entry [[ReOrder Buffer]] (ROB) and can dispatch up to 4 instructions per cycle. Note that over 210 instructions can be in-flight throughout the entire pipeline. Operand values can be read from Xiaomi's [[physical register file]] and an [[architectural register file]] in order to remove the various dependencies. Registers are only updated from the physical register file to the architectural register file when the corresponding instructions require. The physical register file contains 192 physical [[registers]] supporting up to four parallel instruction reads. Because [[ARM]] has instructions with up to four [[operands]], the register file would need 16 ports to support those simultaneous instructions. To reduce some of the complexity, Phytium chose to reduce the number of ports to 12 where some of the ports are dedicated to each instruction and others are [[multiplexed]]. Phytium explained this decision resulted in a 2.5% reduction in area while adding 0.017% in performance overhead.
+===== Execution Units =====
+[[File:phytim xiaomi fp eu.png|thumb|right|Floating-point execution unt]]
 From dispatch, [[out-of-order]] instructions go into 4 discrete scheduling queues: 2x Integer/SIMD, 1x FP/SIMD, and 1x Load/Store. The Int/FP queues are each 16-entry deep. Xiaomi includes two separate [[Integer]]/[[SIMD]] queues. The first one is capable of executing two 64-bit single-cycle integer instructions or one 128-bit single-cycle integer (with the two units locked together). Additionally one of the units is also capable of performing branch operations. The second queue handles two multi-cycle integer/SIMD operations. Just like the single-cycle unit, the multi-cycle unit can also handle one 128-bit operation by combining both units. Xiaomi includes a single [[floating-point]]/SIMD queue with both units supporting FMA as well as two 64-bit FP operations or one 128-bit FP operation by combining two units.
+At least on the FTC-662 (16 nm version), floating multiplication is 3 cycles, addition is 3 cycles, and longest float division is 16 cycles.
 The Load/Store queue is slightly larger than the Int or FP queues with 24 entries. Two loads or 1 load + 1 store can be issued each cycle. As with the level 1 instruction cache, the [[level 1 data cache]] is also 32 [[KiB]] supporting six outstanding loads. Next line and stride detected data prefetch are supported.
-== References ==
+== Performance Claims ==
-* Zhang, C. (2015, August). Mars: A 64-core ARMv8 processor. In ''Hot Chips 27 Symposium'' (HCS), 2015 IEEE (pp. 1-23). IEEE.
+{| class="wikitable"
+|-
+! !! Int !! FP
+|-
+! SPEC_CPU2006_base<br>(Single core)
+|19.2
+|17.8
+|-
+! SPEC_CPU2006_rate<br>(64 cores)
+| 672
+| 585
+|}
+== All Xiaomi Processors ==
+<!-- NOTE:
+           This table is generated automatically from the data in the actual articles.
+           If a microprocessor is missing from the list, an appropriate article for it needs to be
+           created and tagged accordingly.
+           Missing a chip? please dump its name here: https://en.wikichip.org/wiki/WikiChip:wanted_chips
+-->
+{{comp table start}}
+<table class="comptable sortable tc4">
+{{comp table header|main|6:List of Xiaomi-based Processors}}
+{{comp table header|cols|Launched|Cores|L2|%Frequency|%TDP}}
+{{#ask: [[Category:microprocessor models by phytium]] [[microarchitecture::Xiaomi]]
+ |?full page name
+ |?model number
+ |?first launched
+ |?core count
+ |?l2$ size
+ |?base frequency#GHz
+ |?tdp#W
+ |format=template
+ |template=proc table 3
+ |userparam=7
+ |mainlabel=-
+}}
+{{comp table count|ask=[[Category:microprocessor models by phytium]] [[microarchitecture::Xiaomi]]}}
+</table>
+{{comp table end}}
+== Bibliography ==
+* {{bib|hc|27|Phytium}}

