|L1I Cache||32 KiB/Core|
|L1D Cache||32 KiB/Core|
|L2 Cache||4 MiB/Panel|
|L3 Cache||16 MiB/CMC|
- Fully ARMv8 compatible
- Supports AArch32 and AArch64 modes
- EL0-EL3 supported
- 28 nm process
- Scalable design
- 4 to 64 cores
- Mesh topology network-on-chip
- Panel-based (grid) architecture
- Global cache coherency
- 2x DDR3-1600 channels per panel
- ECC support
- 2x 16-lane PCIe 3.0
The front-end consists of the instruction cache and its prefetcher, instruction fetch, and decode. Xiaomi cores contain a 32 KiB L1 instruction cache with a prefetcher designed to reduce cache misses. On a hit, retrieving instructions from the L1 takes 2 cycles. Xiaomi has a hybrid branch predictor made of a TAGE predictor and a 512-entry indirect predictor. The branch prediction (BP) unit also has a 48-entry Return Address Stack (RAS) for speculative subroutine returns and a 2K-entry branch target buffer (BTB). Up to four instructions can be fetched each cycle into the 32-entry instruction buffer.
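The role of the Return Address Stack can be illustrated with a minimal sketch. Only the 48-entry depth comes from the text. The class name, the overflow policy (dropping the oldest entry), and the behavior on an empty stack are assumptions made for illustration, not a description of Phytium's design.

```python
from collections import deque

RAS_ENTRIES = 48  # depth stated in the text


class ReturnAddressStack:
    """Toy model of a return address stack (overflow policy is assumed)."""

    def __init__(self, depth=RAS_ENTRIES):
        # A bounded deque drops the oldest entry when a push overflows it.
        self._stack = deque(maxlen=depth)

    def on_call(self, return_address):
        # A call pushes the address of the instruction that follows it.
        self._stack.append(return_address)

    def predict_return(self):
        # A return pops the most recent address; None means no prediction.
        return self._stack.pop() if self._stack else None


ras = ReturnAddressStack()
ras.on_call(0x1004)
ras.on_call(0x2008)
first = ras.predict_return()   # the innermost call returns first
second = ras.predict_return()
```

Because calls and returns nest, the most recently pushed address is the one predicted for the next return, which is why a stack fits this job better than a general-purpose target buffer.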
The buffer also serves as a loop detection buffer: it detects loop patterns and holds them in the instruction buffer, bypassing the cache so they can be sent directly to decode. From the instruction buffer, up to four instructions can be decoded, renamed, and dispatched each cycle. Everything is done in order up to this point.
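As a rough illustration of what the buffer size and pipeline width imply for loops, the sketch below uses the 32-entry buffer and the four-wide decode from the text. It assumes that a loop is captured only when its whole body fits in the buffer, and that decode groups do not span iteration boundaries. Neither detail is stated in the source, and the function names are hypothetical.

```python
import math

BUFFER_ENTRIES = 32  # instruction buffer size (from the text)
DECODE_WIDTH = 4     # instructions decoded per cycle (from the text)


def loop_fits(body_instructions):
    """Assumed capture rule: the whole loop body must fit in the buffer."""
    return 0 < body_instructions <= BUFFER_ENTRIES


def decode_cycles_per_iteration(body_instructions):
    """Lower bound on decode cycles for one iteration (no cross-iteration packing)."""
    return math.ceil(body_instructions / DECODE_WIDTH)
```

Under these assumptions a 10-instruction loop body would be held in the buffer and need at least three decode cycles per iteration.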
The back-end performs most operations out of order and is in charge of queuing instructions, executing them, and retiring them. Dispatch contains a 160-entry ReOrder Buffer (ROB) and can dispatch up to 4 instructions per cycle. Note that over 210 instructions can be in flight throughout the entire pipeline. Operand values can be read from Xiaomi's physical register file and an architectural register file in order to remove the various dependencies. Registers are only copied from the physical register file to the architectural register file when the corresponding instructions retire. The physical register file contains 192 physical registers and supports operand reads for up to four instructions in parallel. Because ARM has instructions with up to four operands, the register file would need 16 ports (4 instructions × 4 operands) to support those simultaneous reads. To reduce some of the complexity, Phytium chose to reduce the number of ports to 12, where some of the ports are dedicated to each instruction and others are multiplexed. According to Phytium, this decision reduced area by 2.5% at a performance cost of 0.017%.
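The port arithmetic can be worked through in a short sketch. The width, operand count, and 12-port figure come from the text. The split into 2 dedicated ports per instruction plus 4 shared ports is purely an assumption for illustration, since the source does not say how the 12 ports are divided, and `can_read_in_one_cycle` is a hypothetical helper.

```python
DISPATCH_WIDTH = 4      # instructions read in parallel (from the text)
MAX_OPERANDS = 4        # worst-case operands per instruction (from the text)
IMPLEMENTED_PORTS = 12  # port count Phytium chose (from the text)

# ASSUMPTION: the split between dedicated and shared ports is not given in
# the text; 2 dedicated ports per instruction plus 4 shared is hypothetical.
DEDICATED_PER_INSTRUCTION = 2
SHARED_PORTS = IMPLEMENTED_PORTS - DISPATCH_WIDTH * DEDICATED_PER_INSTRUCTION

full_ports = DISPATCH_WIDTH * MAX_OPERANDS   # 16 in a fully ported design
ports_saved = full_ports - IMPLEMENTED_PORTS


def can_read_in_one_cycle(operand_counts):
    """Return True if a group's operands fit the hypothetical port split."""
    if len(operand_counts) > DISPATCH_WIDTH:
        return False
    if any(n < 0 or n > MAX_OPERANDS for n in operand_counts):
        raise ValueError("operand count out of range")
    # Operands beyond an instruction's dedicated ports compete for shared ones.
    overflow = sum(max(0, n - DEDICATED_PER_INSTRUCTION) for n in operand_counts)
    return overflow <= SHARED_PORTS
```

With this assumed split, a group such as `[4, 4, 2, 2]` reads all operands in one cycle, while four 4-operand instructions would not. Groups of the latter kind are the rare case that a reduced port count penalizes.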
From dispatch, instructions go into four discrete scheduling queues, from which they issue out of order: 2x Integer/SIMD, 1x FP/SIMD, and 1x Load/Store. The Integer and FP queues are each 16 entries deep. Xiaomi includes two separate Integer/SIMD queues. The first is capable of executing two 64-bit single-cycle integer instructions or one 128-bit single-cycle integer instruction (with the two units locked together). Additionally, one of the units is also capable of performing branch operations. The second queue handles two multi-cycle integer/SIMD operations. Just like the single-cycle unit, the multi-cycle unit can also handle one 128-bit operation by combining both units. Xiaomi includes a single floating-point/SIMD queue with two units, both supporting FMA; it can execute two 64-bit FP operations or one 128-bit FP operation by combining the two units.
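The pairing of two 64-bit units into one 128-bit operation can be sketched as a simple packing problem. The model below is deliberately simplified: it packs operations greedily in program order and ignores latencies and dependencies, whereas the real scheduler issues out of order. The function name and the packing policy are assumptions.

```python
UNITS_PER_QUEUE = 2  # two 64-bit units behind each queue (from the text)


def cycles_to_issue(op_widths):
    """Greedy, in-order packing of 64/128-bit ops onto a pair of 64-bit units."""
    cycles = 0
    free_units = 0
    for width in op_widths:
        if width not in (64, 128):
            raise ValueError("only 64-bit and 128-bit operations are modelled")
        needed = width // 64  # a 128-bit op locks both units together
        if needed > free_units:
            # Not enough units left in this cycle: start a new one.
            cycles += 1
            free_units = UNITS_PER_QUEUE
        free_units -= needed
    return cycles
```

In this model `[64, 64]` and `[128]` each occupy one issue cycle, while `[64, 128, 64]` takes three because the 128-bit operation cannot share a cycle with either neighbor.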
At least on the FTC-662 (16 nm version), floating-point multiplication takes 3 cycles, addition takes 3 cycles, and the longest floating-point division takes 16 cycles.
The Load/Store queue is slightly larger than the Integer and FP queues, with 24 entries. Two loads, or one load and one store, can be issued each cycle. Like the level 1 instruction cache, the level 1 data cache is 32 KiB; it supports six outstanding loads. Next-line and stride-based data prefetching are supported.
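The per-cycle issue rule for memory operations can be written down as a small predicate. This is a sketch of the rule as stated (at most two operations, of which at most one is a store). It assumes a single store issuing alone is also allowed, which the text implies but does not state, and `valid_issue_group` is a hypothetical name.

```python
MAX_MEM_OPS_PER_CYCLE = 2  # "two loads or 1 load + 1 store" (from the text)
MAX_STORES_PER_CYCLE = 1   # at most one of the two may be a store


def valid_issue_group(loads, stores):
    """Return True if this mix of loads and stores can issue in one cycle."""
    if loads < 0 or stores < 0:
        raise ValueError("counts must be non-negative")
    return (loads + stores <= MAX_MEM_OPS_PER_CYCLE
            and stores <= MAX_STORES_PER_CYCLE)
```

For example, `valid_issue_group(2, 0)` and `valid_issue_group(1, 1)` hold, while two stores in one cycle do not.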
- IEEE Hot Chips 27 Symposium (HCS) 2015.
|first launched||2017|
|full page name||phytium/microarchitectures/xiaomi|
|instance of||microarchitecture|
|instruction set architecture||ARMv8|
|microarchitecture type||CPU|
|process||28 nm (0.028 μm, 2.8e-5 mm)|