Cortex-X1 (Hera) µarch | |
General Info | |
Arch Type | CPU |
Designer | ARM Holdings |
Manufacturer | TSMC |
Introduction | May 26, 2020 |
Process | 10 nm, 7 nm, 5 nm |
Core Configs | 1, 2, 4, 6, 8 |
Pipeline | |
Type | Superscalar, Pipelined |
OoOE | Yes |
Speculative | Yes |
Reg Renaming | Yes |
Stages | 13 |
Decode | 5-way |
Instructions | |
ISA | ARMv8.2 |
Extensions | FPU, NEON |
Cache | |
L1I Cache | 64 KiB/core 4-way set associative |
L1D Cache | 64 KiB/core 4-way set associative |
L2 Cache | 1 MiB/core 8-way set associative |
L3 Cache | 8 MiB/cluster 16-way set associative |
Cores | |
Core Names | Cortex-X1 |
Succession | |
Contemporary | Cortex-A78 (Hercules) |
Cortex-X1 (codename Hera) is a performance-enhanced version of the Cortex-A78 (Hercules), a low-power high-performance ARM microarchitecture designed by Arm for the mobile market. The Cortex-X1 was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable IP core and is licensed to other semiconductor companies to be implemented in their own chips.
The Cortex-X1, which implements the ARMv8.2 ISA, is a higher-performance core that is designed to be combined with the Cortex-A78 in a DynamIQ big.LITTLE configuration in order to provide even higher single-thread performance. This core, along with the Cortex-A78, is often combined with a number of lower-power cores (e.g. Cortex-A55) in order to achieve a better balance of energy and performance.
Cortex-X
Year | Cortex-X Core | Cortex-A Core |
---|---|---|
2020 | Cortex-X1 (Hera), Cortex-X1C (Hera-C) | Cortex-A78 (Hercules), Cortex-A78C (Hera Prime) |
2021 | Cortex-X2 (Matterhorn-ELP) | Cortex-A710 (Matterhorn), Cortex-A510 (Klein) |
2022 | Cortex-X3 (Makalu-ELP) | Cortex-A715 (Makalu) |
2023 | Cortex-X4 (Hunter-ELP) | Cortex-A720 (Hunter), Cortex-A520 (Hayes) |
2024 | Cortex-X925 (Blackhawk) | Cortex-A720AE (Hunter-AE), Cortex-A725 (Chaberton) |
2025 | Cortex-X930 (Travis) | Cortex-A730 (Gelas), Cortex-A530 (Nevis) |
Process Technology
Although the Cortex-X1 may be fabricated on various process nodes, it has been primarily designed for the 10 nm, 7 nm, and 5 nm process nodes, with performance, power, and area numbers mainly targeting the 5-nanometer node.
Architecture
Key changes from Cortex-A78
- See also: Cortex-A78 § Key changes from Cortex-A77
The Cortex-X1 is a custom performance-enhanced variant of the Cortex-A78 and therefore inherits most of the changes made in the Cortex-A78 relative to the Cortex-A77.
- Higher performance (See § Performance claims)
  - Arm self-reported around 30% higher peak performance over the Cortex-A77 (compared to +20% with the Cortex-A78)
  - 2.0x machine learning performance
- Silicon area
  - 15% more silicon area (on N5)
- Front-end
  - 1.25x wider decode (5-way, up from 4-way)
  - 1.33x wider decoded cache bandwidth (8 MOPs/cycle, up from 6 MOPs/cycle)
- Memory subsystem
  - Only 64 KiB L1I cache option (from 32-64 KiB)
  - Only 64 KiB L1D cache option (from 32-64 KiB)
  - Up to 1 MiB L2 cache option (from 512 KiB)
  - Up to 8 MiB L3 cache option (from 4 MiB)
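The front-end multipliers in the list above follow directly from the quoted widths. A minimal sketch of that arithmetic is shown below; the dictionary and the `widening` helper are names chosen here purely for illustration and do not come from Arm.

```python
from fractions import Fraction

# Figures quoted in the list above: (Cortex-A78 value, Cortex-X1 value).
FRONT_END = {
    "decode width (instructions/cycle)": (4, 5),
    "decoded cache bandwidth (MOPs/cycle)": (6, 8),
}


def widening(old: int, new: int) -> Fraction:
    """Return new/old as an exact fraction (hypothetical helper)."""
    return Fraction(new, old)


for name, (a78, x1) in FRONT_END.items():
    ratio = widening(a78, x1)
    # Two decimals reproduce the 1.25x and 1.33x figures from the list.
    print(f"{name}: {a78} -> {x1} = {float(ratio):.2f}x")
```

Using exact fractions keeps 8/6 as 4/3 until it is formatted, which is where the rounded 1.33x figure comes from.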
Comparison
- "Prime" core
Architecture | Cortex-A78 | Cortex-X1 | Cortex-X2 | Cortex-X3 | Cortex-X4 | Cortex-X925 | Cortex-X930 |
---|---|---|---|---|---|---|---|
Code name | Hercules | Hera | Matterhorn-ELP | Makalu-ELP | Hunter-ELP | Blackhawk | Travis |
ISA | ARMv8.2-A | ARMv9.0-A | ARMv9.2-A | ||||
Peak clock speed | ~3.0 GHz | ~3.3 GHz | ~3.4 GHz | ~3.8 GHz | ~4.2 GHz | ||
Max in-flight | 2x 160 | 2x 224 | 2x 288 | 2x 320 | 2x 384 | 2x 768 | |
L0 (Mops entries) | 1536 [1] | 3072 | 1536 | 0 | |||
L1-I + L1-D | 32+32 KiB | 64+64 KiB | 64+64 KiB | 64+64 KiB | |||
L2 | 128–512 KiB | 0.25–1 MiB | 0.5–2 MiB | 2–3 MiB | |||
L3 | 0–8 MiB [2] | 0–16 MiB | 0–32 MiB | ||||
Decode width | 4 | 5 | 6 | 10 [3] | 10 | ||
Dispatch | 6/cycle | 8/cycle | 10/cycle |
Performance claims
Compared to the Cortex-A77, the Cortex-X1 is said to be 30% faster in peak performance on SPEC CPU2006. The improvement comes from both architectural improvements and a higher clock frequency, the latter helped by the move from the 7 nm to the 5 nm process node.
Performance | |
---|---|
Cortex-A77 | Cortex-X1 |
1.0x | 1.3x |
2.6 GHz | 3.0 GHz |
7 nm (N7) | 5 nm (N5) |
Arm says that, at ISO-process and frequency, the Cortex-X1 achieves 22% higher integer performance (SPEC CPU2006) over the Cortex-A78 and 30% higher integer performance over the Cortex-A77. Likewise, due to the doubling of the number of NEON units, the Cortex-X1 can achieve twice the ML performance of both the A77 and A78.
Performance @ ISO-process/frequency | |
---|---|
Cortex-A77 | Cortex-X1 |
1.0x | 1.3x (integer performance) |
1.0x | 2.0x (ML performance) |
3.0 GHz | 3.0 GHz |
7 nm (N7) | 5 nm (N5) |
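The two ISO-process/frequency integer figures quoted above (22% over the Cortex-A78 and 30% over the Cortex-A77) can be combined to estimate how the Cortex-A78 compares with the Cortex-A77 under the same conditions. The sketch below is a back-of-the-envelope derivation only: it assumes both percentages were measured on the same benchmark and configuration, and the resulting number is not an Arm-published figure. The constant and function names are illustrative.

```python
# Arm's self-reported ISO-process/frequency integer uplifts quoted above.
X1_OVER_A77 = 1.30  # Cortex-X1 relative to Cortex-A77
X1_OVER_A78 = 1.22  # Cortex-X1 relative to Cortex-A78


def implied_ratio(x_over_base: float, x_over_mid: float) -> float:
    """Given X/base and X/mid, return mid/base (hypothetical helper)."""
    return x_over_base / x_over_mid


# Derived value, assuming the two claims are directly comparable.
a78_over_a77 = implied_ratio(X1_OVER_A77, X1_OVER_A78)
print(f"Implied Cortex-A78 vs Cortex-A77: {a78_over_a77:.3f}x")
```

Under these assumptions the implied Cortex-A78 uplift at equal process and frequency is in the region of 6 to 7%, which suggests that most of the Cortex-A78's quoted +20% comes from frequency and process rather than from the core design alone.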
Overview
The Cortex-X1 is a high-performance synthesizable core designed by Arm. It is delivered as a Register Transfer Level (RTL) description in Verilog and is designed to be integrated into customers' SoCs. This core supports the ARMv8.2 extension as well as a number of other partial extensions.
This is the first core from Arm's Cortex-X custom program. The X1 is a performance-enhanced version of the A78 and therefore uses the A78 as the starting point for its modifications.
The Cortex-X1 is built on top of the Cortex-A78 but enhances it in order to extract additional performance, albeit at the cost of slightly lower power efficiency and a larger area. To that end, whereas Hercules was said to provide a 20% sustained performance uplift over the Cortex-A77, the Cortex-X1 offers up to 30% higher peak performance. In other words, whereas the Cortex-A78 is designed for high sustained performance at high performance-efficiency, the Cortex-X1 is designed to supplement it with higher peak performance while relaxing the power and area constraints.
The Cortex-X1 is a fatter version of the Cortex-A78, relying on bigger buffers and a larger out-of-order window in order to extract further performance. To that end, the X1 features a 5-way decode, twice as many NEON units, and larger overall buffers in order to allow for a bigger out-of-order window with more in-flight operations. The Cortex-X1 enlarges the pipeline while still retaining the higher frequency that was introduced in the Cortex-A77.
The Cortex-X1 is intended to be combined with a number of Cortex-A78 cores in a DynamIQ Shared Unit (DSU) cluster, possibly along with other lower-power cores such as the Cortex-A55, to more efficiently support a wide range of workloads at various performance and power levels beyond what is possible with any one core.
DSU Cluster
The Cortex-X1 provides additional peak performance beyond what the Cortex-A78 can offer. Therefore the X1 is designed to be combined with a number of Cortex-A78 cores in a DynamIQ Shared Unit (DSU) cluster in order to provide a balance of both power and performance.
Compared to a quad-core Cortex-A77 cluster on 7 nm, a quad-core Cortex-A78 cluster provides a +20% sustained performance improvement while reducing the silicon area by about 15%. When one of those big Cortex-A78 cores is replaced with a single Cortex-X1 core, the cluster can provide a peak single-thread performance uplift of up to 30% versus the Cortex-A77 at the cost of 15% additional silicon area (or roughly area-neutral going from N7 to N5).
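The area-neutral remark can be checked with simple arithmetic. The sketch below normalizes the quad-core Cortex-A77 cluster on N7 to 1.0 and assumes that the "15% additional silicon area" is measured relative to the quad-core Cortex-A78 cluster; the source does not state the baseline explicitly, so this is an interpretation rather than a confirmed figure. All names are illustrative.

```python
A77_QUAD_AREA = 1.00    # baseline: quad-core Cortex-A77 cluster on N7
A78_QUAD_SAVING = 0.15  # "reducing the silicon area by about 15%"
X1_SWAP_COST = 0.15     # "15% additional silicon area" (assumed vs. the A78 cluster)


def cluster_area(baseline: float, saving: float, extra: float) -> tuple[float, float]:
    """Return (quad-A78 area, 1x X1 + 3x A78 area) relative to the baseline."""
    a78_quad = baseline * (1 - saving)
    # Assumption: the extra area applies on top of the A78 cluster area.
    mixed = a78_quad * (1 + extra)
    return a78_quad, mixed


a78_quad, mixed = cluster_area(A77_QUAD_AREA, A78_QUAD_SAVING, X1_SWAP_COST)
print(f"Quad Cortex-A78 cluster: {a78_quad:.2f} of baseline")
print(f"1x Cortex-X1 + 3x Cortex-A78 cluster: {mixed:.2f} of baseline")
```

Under this reading the mixed cluster lands at about 0.98 of the original Cortex-A77 cluster area, which is consistent with describing the change as approximately area-neutral from N7 to N5.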