Editing phytium/microarchitectures/xiaomi
Latest revision | Your text | ||
Line 1: | Line 1: | ||
{{phytium title|Xiaomi|arch}} | {{phytium title|Xiaomi|arch}} | ||
{{microarchitecture | {{microarchitecture | ||
− | |atype=CPU | + | | atype = CPU |
− | |name=Xiaomi | + | | name = Xiaomi |
− | |designer=Phytium | + | | designer = Phytium |
− | |manufacturer=TSMC | + | | manufacturer = TSMC |
− | |introduction=2017 | + | | introduction = 2017 |
− | |process=28 nm | + | | phase-out = |
− | |type=Superscalar | + | | process = 28 nm |
− | | | + | | cores = |
− | |speculative=Yes | + | | cores 2 = |
− | |renaming=Yes | + | | cores N = |
− | |isa=ARMv8 | + | |
− | |l1i= | + | | pipeline = Yes |
− | |l1i per= | + | | type = Superscalar |
− | |l1d= | + | | type 2 = |
− | |l1d per= | + | | type N = |
− | |l2= | + | | OoOE = Yes |
− | |l2 per= | + | | speculative = Yes |
− | |l3= | + | | renaming = Yes |
− | |l3 per= | + | | stages = <!-- ONLY IF FIXED SIZE, otherwise use below for range --> |
− | | | + | | stages min = |
− | |core | + | | stages max = |
− | |core name | + | | issues = |
− | | | + | |
− | + | | inst = Yes | |
− | + | | isa = ARMv8 | |
− | + | | feature = | |
− | |core | + | | extension = |
+ | | extension 2 = | ||
+ | | extension N = | ||
+ | |||
+ | | cache = <!-- yes for cache info --> | ||
+ | | l1i = | ||
+ | | l1i per = | ||
+ | | l1i desc = | ||
+ | | l1d = | ||
+ | | l1d per = | ||
+ | | l1d desc = | ||
+ | | l2 = | ||
+ | | l2 per = | ||
+ | | l2 desc = | ||
+ | | l3 = | ||
+ | | l3 per = | ||
+ | | l3 desc = | ||
+ | |||
+ | | core names = Yes | ||
+ | | core name = FTC660 | ||
+ | | core name 2 = FTC661 | ||
+ | | core name N = | ||
}} | }} | ||
'''Xiaomi''' is an [[ARM]] microarchitecture designed in-house by [[Phytium]] for their consumer market and server-based microprocessors. | '''Xiaomi''' is an [[ARM]] microarchitecture designed in-house by [[Phytium]] for their consumer market and server-based microprocessors. | ||
Line 35: | Line 56: | ||
! Codename !! Brand !! Description | ! Codename !! Brand !! Description | ||
|- | |- | ||
− | | Mars || {{phytium|FT-2000 | + | | Mars || {{phytium|FT-2000}} || |
* High performance | * High performance | ||
* High bandwidth, Large memory | * High bandwidth, Large memory | ||
Line 64: | Line 85: | ||
** [[ECC]] support | ** [[ECC]] support | ||
* 2x 16-lane [[PCIe]] 3.0 | * 2x 16-lane [[PCIe]] 3.0 | ||
+ | |||
+ | === Panel Architecture === | ||
+ | [[File:xiaomi panel-based data affinity architecture.png|right|450px]] | ||
+ | Phytium organizes their processors in a grid layout of '''Panels''', a design they call the '''Panel-based data affinity architecture'''. Each panel consists of 8 independent [[ARMv8]]-compatible cores. Phytium's "Mars" processor consists of 8 such panels for a total of [[64 cores]]. Panels are interconnected with a 2-dimensional mesh network-on-a-chip, and each panel has 4 MiB of [[level 2 cache]] for a total of 32 MiB. | ||
+ | |||
+ | In addition to the main die, Mars uses auxiliary '''Cache & Memory Chips''' ('''CMC'''). Eight such chips are connected to the main die, each providing 16 MiB of [[level 3 cache]] for a total of 128 MiB, along with 8 dual-channel DDR3-1600 [[memory controller]]s for a total maximum bandwidth of 204 GB/s. Mars also provides two 16-lane PCIe 3.0 interfaces. The chips incorporate ECC and parity protection on all caches, tags, and TLBs. | ||
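The totals quoted above follow from simple multiplication. The sketch below reproduces them in Python as a sanity check; the constant names are illustrative, and the 8-byte (64-bit) channel width is an assumption based on standard DDR3 channels rather than something stated in the source.

```python
# Figures taken from the text above; names are illustrative only.
PANELS = 8
CORES_PER_PANEL = 8
L2_PER_PANEL_MIB = 4
CMC_CHIPS = 8
L3_PER_CMC_MIB = 16
MEM_CONTROLLERS = 8
CHANNELS_PER_CONTROLLER = 2        # "dual-channel"
TRANSFERS_PER_SECOND = 1_600_000_000  # DDR3-1600 = 1600 MT/s
BYTES_PER_TRANSFER = 8             # assumption: standard 64-bit DDR3 channel


def mars_totals():
    """Return the aggregate core count, cache sizes, and peak memory bandwidth."""
    channels = MEM_CONTROLLERS * CHANNELS_PER_CONTROLLER
    return {
        "cores": PANELS * CORES_PER_PANEL,
        "l2_mib": PANELS * L2_PER_PANEL_MIB,
        "l3_mib": CMC_CHIPS * L3_PER_CMC_MIB,
        # Theoretical peak in GB/s (decimal gigabytes).
        "peak_bw_gbs": channels * TRANSFERS_PER_SECOND * BYTES_PER_TRANSFER / 1e9,
    }


if __name__ == "__main__":
    for key, value in mars_totals().items():
        print(f"{key}: {value}")
```

Under that channel-width assumption the peak works out to 204.8 GB/s, consistent with the rounded 204 GB/s figure.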
+ | |||
+ | ==== Panel ==== | ||
+ | Each panel consists of 8 cores. Each core is [[ARMv8]]-compatible, supporting AArch32 and AArch64 modes, Exception Levels EL0-EL3, and ASIMD-128 operations. Each core has its own inclusive [[L1 cache]], and the cores in a panel share an [[L2 cache]] (4 MiB per panel). Each panel also contains two '''Directory Control Units''' ('''DCU'''), which maintain directory-based [[cache coherency]], and one routing cell that manages inter-panel communication. | ||
+ | |||
+ | On TSMC's [[28 nm process]], a panel is 6,000 µm x 10,600 µm (63.6 mm²). | ||
+ | |||
+ | {| style="border-spacing: 15px;" | ||
+ | | [[File:xiaomi panel.png|400px]] || || [[File:xiaomi panel die (28nm).png|300px]] | ||
+ | |} | ||
=== Block Diagram === | === Block Diagram === | ||
Line 69: | Line 105: | ||
=== Memory Hierarchy === | === Memory Hierarchy === | ||
− | + | {{empty section}} | |
=== Pipeline === | === Pipeline === | ||
Line 85: | Line 111: | ||
==== Front End ==== | ==== Front End ==== | ||
− | |||
The front-end consists of the instruction caches and prefetchers, instruction fetch, and decode. Xiaomi cores contain a 32 [[KiB]] [[L1 instruction cache]] with a prefetcher designed to reduce cache misses. On hits, 2 cycles are required for retrieval of instructions from the L1. Xiaomi has a hybrid [[branch predictor]] made of a [[TAGE predictor]] and a 512-entry [[indirect predictor]]. The BP unit also has a 48-entry Return Address Stack (RAS) for speculative subroutine return and a 2K-entry BTB. Up to four instructions can be fetched each cycle into the instruction buffer, which is 32 entries in size. | The front-end consists of the instruction caches and prefetchers, instruction fetch, and decode. Xiaomi cores contain a 32 [[KiB]] [[L1 instruction cache]] with a prefetcher designed to reduce cache misses. On hits, 2 cycles are required for retrieval of instructions from the L1. Xiaomi has a hybrid [[branch predictor]] made of a [[TAGE predictor]] and a 512-entry [[indirect predictor]]. The BP unit also has a 48-entry Return Address Stack (RAS) for speculative subroutine return and a 2K-entry BTB. Up to four instructions can be fetched each cycle into the instruction buffer, which is 32 entries in size. | ||
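As a rough illustration of how a return address stack predicts subroutine returns, the following Python sketch models a 48-entry RAS. It is a generic textbook model, not Phytium's implementation: the class name, method names, and the drop-oldest overflow policy are all assumptions, since the source gives only the entry count.

```python
class ReturnAddressStack:
    """Toy model of a fixed-capacity return address stack (RAS)."""

    def __init__(self, capacity=48):
        self.capacity = capacity  # 48 entries, per the text above
        self.entries = []

    def push(self, return_address):
        """Record the return address when a call instruction is fetched."""
        if len(self.entries) == self.capacity:
            # Assumption: on overflow the oldest entry is discarded.
            self.entries.pop(0)
        self.entries.append(return_address)

    def predict_return(self):
        """Pop the predicted target for a return; None if the stack is empty."""
        return self.entries.pop() if self.entries else None


if __name__ == "__main__":
    ras = ReturnAddressStack()
    ras.push(0x1000)  # outer call
    ras.push(0x2000)  # nested call
    print(hex(ras.predict_return()))  # innermost return is predicted first
    print(hex(ras.predict_return()))
```

Because calls and returns nest, the most recently pushed address is the best guess for the next return, which is why a stack suits this prediction task.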
Line 91: | Line 116: | ||
==== Back End ==== | ==== Back End ==== | ||
− | The back-end performs operations [[out-of-order]] for the most part and is in charge of queuing instructions, executing them and retiring them. Dispatch contains a 160-entry [[ReOrder Buffer]] (ROB) and can dispatch up to 4 instructions per cycle. Note that over 210 instructions can be in-flight throughout the entire pipeline. Operand values can be read from Xiaomi's [[physical register file]] and an [[architectural register file]] in order to remove the various dependencies. Registers are only updated from the physical register file to the architectural register file when the corresponding instructions require. The physical register file contains 192 physical [[registers]] supporting up to four parallel instruction reads. Because [[ARM]] has instructions with up to four [[operands]], the register file would need 16 ports to support those simultaneous | + | The back-end performs operations [[out-of-order]] for the most part and is in charge of queuing instructions, executing them, and retiring them. Dispatch contains a 160-entry [[ReOrder Buffer]] (ROB) and can dispatch up to 4 instructions per cycle. Note that over 210 instructions can be in flight throughout the entire pipeline. Operand values can be read from Xiaomi's [[physical register file]] and an [[architectural register file]] in order to remove the various dependencies. Registers are only updated from the physical register file to the architectural register file when the corresponding instructions require. The physical register file contains 192 physical [[registers]] and supports operand reads for up to four instructions in parallel. Because [[ARM]] has instructions with up to four [[operands]], the register file would need 16 read ports to support all of those simultaneous reads. To reduce some of the complexity, Phytium chose to reduce the number of ports to 12, where some of the ports are dedicated to each instruction and others are [[multiplexed]].
Phytium explained that this decision resulted in a 2.5% reduction in area at the cost of a 0.017% performance overhead. |
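The port arithmetic above can be sketched as a simple budget check. The Python below is a simplified model under stated assumptions: it only compares the total operand reads in a cycle against the number of ports and does not model how the 12 ports are split between dedicated and multiplexed ones, which the source does not detail. The function and constant names are hypothetical.

```python
ISSUE_WIDTH = 4         # instructions reading operands per cycle
MAX_OPERANDS = 4        # most source operands a single instruction may have
FULL_PORTS = ISSUE_WIDTH * MAX_OPERANDS  # worst case: 16 read ports
IMPLEMENTED_PORTS = 12  # port count chosen by Phytium, per the text


def fits_read_ports(operand_counts, ports=IMPLEMENTED_PORTS):
    """Return True if one cycle's operand reads fit within the port budget.

    operand_counts holds the number of source operands for each instruction
    in the group. Only the total is checked (a simplifying assumption).
    """
    if len(operand_counts) > ISSUE_WIDTH:
        raise ValueError("more instructions than the issue width allows")
    if any(n < 0 or n > MAX_OPERANDS for n in operand_counts):
        raise ValueError("operand count out of range")
    return sum(operand_counts) <= ports


if __name__ == "__main__":
    print(FULL_PORTS)                     # worst-case port demand
    print(fits_read_ports([2, 2, 3, 1]))  # typical mix fits in 12 ports
    print(fits_read_ports([4, 4, 4, 4]))  # rare worst case does not
```

Only groups needing more than 12 reads in one cycle would be affected, which is consistent with the small performance cost reported.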
− | |||
− | |||
From dispatch, [[out-of-order]] instructions go into 4 discrete scheduling queues: 2x Integer/SIMD, 1x FP/SIMD, and 1x Load/Store. The Int/FP queues are each 16-entry deep. Xiaomi includes two separate [[Integer]]/[[SIMD]] queues. The first one is capable of executing two 64-bit single-cycle integer instructions or one 128-bit single-cycle integer (with the two units locked together). Additionally one of the units is also capable of performing branch operations. The second queue handles two multi-cycle integer/SIMD operations. Just like the single-cycle unit, the multi-cycle unit can also handle one 128-bit operation by combining both units. Xiaomi includes a single [[floating-point]]/SIMD queue with both units supporting FMA as well as two 64-bit FP operations or one 128-bit FP operation by combining two units. | From dispatch, [[out-of-order]] instructions go into 4 discrete scheduling queues: 2x Integer/SIMD, 1x FP/SIMD, and 1x Load/Store. The Int/FP queues are each 16-entry deep. Xiaomi includes two separate [[Integer]]/[[SIMD]] queues. The first one is capable of executing two 64-bit single-cycle integer instructions or one 128-bit single-cycle integer (with the two units locked together). Additionally one of the units is also capable of performing branch operations. The second queue handles two multi-cycle integer/SIMD operations. Just like the single-cycle unit, the multi-cycle unit can also handle one 128-bit operation by combining both units. Xiaomi includes a single [[floating-point]]/SIMD queue with both units supporting FMA as well as two 64-bit FP operations or one 128-bit FP operation by combining two units. | ||
− | |||
− | |||
The Load/Store queue is slightly larger than the Int or FP queues, with 24 entries. Two loads or 1 load + 1 store can be issued each cycle. As with the level 1 instruction cache, the [[level 1 data cache]] is also 32 [[KiB]] and supports six outstanding loads. Next-line and stride-detecting data prefetching are supported. | The Load/Store queue is slightly larger than the Int or FP queues, with 24 entries. Two loads or 1 load + 1 store can be issued each cycle. As with the level 1 instruction cache, the [[level 1 data cache]] is also 32 [[KiB]] and supports six outstanding loads. Next-line and stride-detecting data prefetching are supported. | ||
− | == | + | == References == |
− | + | * Zhang, C. (2015, August). Mars: A 64-core ARMv8 processor. In ''Hot Chips 27 Symposium'' (HCS), 2015 IEEE (pp. 1-23). IEEE. | |
Facts about "Xiaomi - Microarchitectures - Phytium"
codename | Xiaomi + |
designer | Phytium + |
first launched | 2017 + |
full page name | phytium/microarchitectures/xiaomi + |
instance of | microarchitecture + |
instruction set architecture | ARMv8 + |
manufacturer | TSMC + |
microarchitecture type | CPU + |
name | Xiaomi + |
process | 28 nm (0.028 μm, 2.8e-5 mm) + |