| Cortex-X1 (Hera) µarch | |
| General Info | |
| Arch Type | CPU |
| Designer | ARM Holdings |
| Manufacturer | TSMC |
| Introduction | May 26, 2020 |
| Process | 10 nm, 7 nm, 5 nm |
| Core Configs | 1, 2, 4, 6, 8 |
| Pipeline | |
| Type | Superscalar, Pipelined |
| OoOE | Yes |
| Speculative | Yes |
| Reg Renaming | Yes |
| Stages | 13 |
| Decode | 5-way |
| Instructions | |
| ISA | ARMv8.2 |
| Extensions | FPU, NEON |
| Cache | |
| L1I Cache | 64 KiB/core 4-way set associative |
| L1D Cache | 64 KiB/core 4-way set associative |
| L2 Cache | 1 MiB/core 8-way set associative |
| L3 Cache | 8 MiB/cluster 16-way set associative |
| Cores | |
| Core Names | Cortex-X1 |
| Succession | |
| Contemporary | |
| Cortex-A78 (Hercules) | |
Cortex-X1 (codename Hera) is a performance-enhanced version of the Cortex-A78 (Hercules), a low-power, high-performance ARM microarchitecture designed by Arm for the mobile market. The Cortex-X1 was designed by Arm's Austin, Texas team. It is a synthesizable IP core that is licensed to other semiconductor companies for implementation in their own chips.
The Cortex-X1, which implements the ARMv8.2 ISA, is a higher-performance core designed to be combined with the Cortex-A78 in a DynamIQ big.LITTLE configuration in order to provide even higher single-thread performance. This core, along with the Cortex-A78, is often combined with a number of lower-power cores (e.g. Cortex-A55) in order to achieve a better balance of energy and performance.
Cortex-X
| Year | Cortex-X Core | Cortex-A Core |
|---|---|---|
| 2020 | Cortex-X1 (Hera), Cortex-X1C (Hera-C) | Cortex-A78 (Hercules), Cortex-A78C (Hera Prime) |
| 2021 | Cortex-X2 (Matterhorn-ELP) | Cortex-A710 (Matterhorn), Cortex-A510 (Klein) |
| 2022 | Cortex-X3 (Makalu-ELP) | Cortex-A715 (Makalu) |
| 2023 | Cortex-X4 (Hunter-ELP) | Cortex-A720 (Hunter), Cortex-A520 (Hayes) |
| 2024 | Cortex-X925 (Blackhawk) | Cortex-A720AE (Hunter-AE), Cortex-A725 (Chaberton) |
| 2025 | Cortex-X930 (Travis) | Cortex-A730 (Gelas), Cortex-A530 (Nevis) |
Process Technology
Although the Cortex-X1 may be fabricated on various process nodes, it has been primarily designed for the 10 nm, 7 nm, and 5 nm process nodes, with performance, power, and area numbers mainly targeting the 5-nanometer node.
Architecture
Key changes from Cortex-A78
See also: Cortex-A78 § Key changes from Cortex-A77
The Cortex-X1 is a custom performance-enhanced variant of the Cortex-A78; it therefore inherits most of the changes made to the Cortex-A78 relative to the Cortex-A77.
- Higher performance (see § Performance claims)
  - Arm self-reported around 30% higher performance over the Cortex-A77 (compared to +20% with the Cortex-A78)
  - 2.0x machine learning performance
- Silicon area
  - 15% more silicon area (on N5)
- Front-end
  - 1.25x wider decode (5-way, up from 4-way)
  - 1.33x wider decoded cache bandwidth (8 MOPs/cycle, up from 6 MOPs/cycle)
- Memory subsystem
  - Only 64 KiB L1I cache option (from 32-64 KiB)
  - Only 64 KiB L1D cache option (from 32-64 KiB)
  - Up to 1 MiB L2 cache option (from 512 KiB)
  - Up to 8 MiB L3 cache option (from 4 MiB)
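
The front-end multipliers in the list above are simple ratios of the quoted widths. The short Python sketch below recomputes them from the figures given in this section; the dictionary and function names are illustrative only and do not come from any Arm material.

```python
from fractions import Fraction

# Front-end widths quoted above, as (Cortex-A78, Cortex-X1) pairs.
FRONT_END = {
    "decode width (instructions/cycle)": (4, 5),
    "decoded cache bandwidth (MOPs/cycle)": (6, 8),
}


def widening(old: int, new: int) -> Fraction:
    """Return the exact new/old ratio for one front-end resource."""
    return Fraction(new, old)


for name, (a78, x1) in FRONT_END.items():
    ratio = widening(a78, x1)
    print(f"{name}: {a78} -> {x1} ({float(ratio):.2f}x)")
```

Exact fractions are used so that the 8/6 case is stored as 4/3 and only rounded to 1.33 when printed.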
Comparison
- "Prime" core
| Architecture | Cortex-A78 | Cortex-X1 | Cortex-X2 | Cortex-X3 | Cortex-X4 | Cortex-X925 | Cortex-X930 |
|---|---|---|---|---|---|---|---|
| Code name | Hercules | Hera | Matterhorn-ELP | Makalu-ELP | Hunter-ELP | Blackhawk | Travis |
| ISA | ARMv8.2-A | ARMv9.0-A | ARMv9.2-A | ||||
| Peak clock speed | ~3.0 GHz | ~3.3 GHz | ~3.4 GHz | ~3.8 GHz | ~4.2 GHz | ||
| Max in-flight | 2x 160 | 2x 224 | 2x 288 | 2x 320 | 2x 384 | 2x 768 | |
| L0 (Mops entries) | 1536 [1] | 3072 | 1536 | 0 | |||
| L1-I + L1-D | 32+32 KiB | 64+64 KiB | 64+64 KiB | 64+64 KiB | |||
| L2 | 128–512 KiB | 0.25–1 MiB | 0.5–2 MiB | 2–3 MiB | |||
| L3 | 0–8 MiB [2] | 0–16 MiB | 0–32 MiB | ||||
| Decode width | 4 | 5 | 6 | 10 [3] | 10 | ||
| Dispatch | 6/cycle | 8/cycle | 10/cycle | ||||
Performance claims
Compared to the Cortex-A77, the Cortex-X1 is said to be 30% faster in peak performance on SPEC CPU2006. The improvement comes from both architectural improvements and a frequency increase enabled by the move from the 7 nm to the 5 nm node.
| Performance | Cortex-A77 | Cortex-X1 |
|---|---|---|
| Relative performance | 1.0x | 1.3x |
| Frequency | 2.6 GHz | 3.0 GHz |
| Process | 7 nm (N7) | 5 nm (N5) |
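
To see how much of the quoted 1.3x could be attributed to clock speed alone, the figures in the table above can be split into a frequency factor and a remaining per-clock factor. This is only a back-of-the-envelope sketch: it assumes performance scales linearly with frequency, which real workloads rarely do, and the function name is hypothetical.

```python
def split_uplift(total: float, f_old_ghz: float, f_new_ghz: float) -> tuple[float, float]:
    """Split a total speedup into (frequency factor, per-clock factor).

    Assumption: performance scales linearly with clock frequency.
    """
    freq_gain = f_new_ghz / f_old_ghz
    per_clock_gain = total / freq_gain
    return freq_gain, per_clock_gain


# Figures from the table above: 1.3x total, 2.6 GHz -> 3.0 GHz.
freq_gain, per_clock_gain = split_uplift(1.3, 2.6, 3.0)
print(f"frequency factor: {freq_gain:.3f}x")
print(f"per-clock factor: {per_clock_gain:.3f}x")
```

Under that linear-scaling assumption, roughly half of the uplift would come from the higher clock and the rest from the core itself.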
Arm says that, at ISO-process and frequency, the Cortex-X1 achieves 22% higher integer performance (SPEC CPU2006) than the Cortex-A78 and 30% higher integer performance than the Cortex-A77. Likewise, due to the doubling of the number of NEON units, the Cortex-X1 can achieve twice the ML performance of both the A77 and A78.
| Performance @ ISO-process/frequency | Cortex-A77 | Cortex-X1 |
|---|---|---|
| Integer performance | 1.0x | 1.3x |
| ML performance | 1.0x | 2.0x |
| Frequency | 3.0 GHz | 3.0 GHz |
| Process | 7 nm (N7) | 5 nm (N5) |
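
The two integer claims above (+22% over the Cortex-A78, +30% over the Cortex-A77) can be chained to estimate the implied Cortex-A78 gain over the Cortex-A77 at the same process and frequency. The sketch below assumes both percentages were measured on the same benchmark under the same conditions, so that the ratios compose multiplicatively; the variable and function names are illustrative.

```python
# Claimed ISO-process/frequency integer speedups of the Cortex-X1.
X1_OVER_A78 = 1.22
X1_OVER_A77 = 1.30


def implied_ratio(x_over_base: float, x_over_mid: float) -> float:
    """Given X/base and X/mid, return mid/base (assumes ratios compose)."""
    return x_over_base / x_over_mid


a78_over_a77 = implied_ratio(X1_OVER_A77, X1_OVER_A78)
print(f"implied Cortex-A78 over Cortex-A77: {a78_over_a77:.3f}x")
```

The result is a single-digit percentage gain, noticeably smaller than the +20% sustained figure, which is quoted across a process and frequency change rather than at ISO conditions.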
Overview
The Cortex-X1 is a high-performance synthesizable core designed by Arm. It is delivered as a Register Transfer Level (RTL) description in Verilog and is designed to be integrated into customers' SoCs. This core supports the ARMv8.2 extension as well as a number of other partial extensions.
This is the first core from Arm's Cortex-X custom program. The X1 is a performance-enhanced version of the A78 and therefore uses the A78 as the starting point for its modifications.
The Cortex-X1 is built on top of the Cortex-A78 but enhances it in order to extract additional performance, albeit at a slight cost in power efficiency and area. Whereas Hercules was said to provide a 20% sustained performance uplift over the Cortex-A77, the Cortex-X1 offers up to a 30% peak performance uplift. In other words, whereas the Cortex-A78 is designed for high sustained performance at high performance efficiency, the Cortex-X1 is designed to supplement it with higher peak performance while relaxing the power and area constraints.
The Cortex-X1 is a fatter version of the Cortex-A78: it features a 5-way decode, twice as many NEON units, and larger buffers overall, allowing a bigger out-of-order window with more in-flight operations. The Cortex-X1 enlarges the pipeline while still retaining the higher frequency which was introduced in the Cortex-A77.
The Cortex-X1 is intended to be combined with a number of Cortex-A78 cores in a DynamIQ Shared Unit (DSU) cluster, possibly along with other lower-power cores such as the Cortex-A55, to more efficiently support a wide range of workloads at various performance and power levels beyond what is possible with any one core.
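
On a Linux system, one way to see such a mixed cluster from software is to count the `CPU part` values reported in `/proc/cpuinfo`. The sketch below is illustrative and not taken from this article: the part numbers are the MIDR values the Linux kernel uses for these three cores, the sample text describes a hypothetical 1+3+4 cluster, and the code ignores the `CPU implementer` field (it assumes every core is an Arm design, 0x41).

```python
import re
from collections import Counter

# MIDR part numbers for Arm-designed cores (implementer 0x41),
# as defined in the Linux kernel's arm64 cputype.h.
ARM_PARTS = {0xD05: "Cortex-A55", 0xD41: "Cortex-A78", 0xD44: "Cortex-X1"}

PART_LINE = re.compile(r"^CPU part\s*:\s*(0x[0-9a-fA-F]+)", re.MULTILINE)


def count_cores(cpuinfo_text: str) -> Counter:
    """Count cores by name from /proc/cpuinfo-style text."""
    counts = Counter()
    for match in PART_LINE.finditer(cpuinfo_text):
        part = int(match.group(1), 16)
        counts[ARM_PARTS.get(part, f"unknown (0x{part:03x})")] += 1
    return counts


# Hypothetical cluster: four Cortex-A55, three Cortex-A78, one Cortex-X1.
SAMPLE_CPUINFO = "\n".join(
    f"processor\t: {i}\nCPU implementer\t: 0x41\nCPU part\t: {part}\n"
    for i, part in enumerate(["0xd05"] * 4 + ["0xd41"] * 3 + ["0xd44"])
)

for name, n in sorted(count_cores(SAMPLE_CPUINFO).items()):
    print(f"{n} x {name}")
```

On real hardware the same function could be fed the contents of `/proc/cpuinfo` instead of the sample string.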
DSU Cluster
The Cortex-X1 provides additional peak performance beyond what the Cortex-A78 can offer. It is therefore designed to be combined with a number of Cortex-A78 cores in a DynamIQ Shared Unit (DSU) cluster in order to balance power and performance. Compared to a quad-core Cortex-A77 cluster on 7 nm, a quad-core Cortex-A78 cluster provides a 20% sustained performance improvement while reducing the silicon area by about 15%. When one of those big Cortex-A78 cores is replaced with a single Cortex-X1 core, the cluster can provide a peak single-thread performance uplift of up to 30% versus the Cortex-A77 at the cost of 15% additional silicon area (or roughly area-neutral going from N7 to N5).
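
The "area-neutral" remark follows from multiplying the two area figures quoted above. The sketch below assumes that the 15% reduction and the 15% increase both apply to whole-cluster area, and that the increase is relative to the quad-core Cortex-A78 cluster; the names are illustrative.

```python
# Cluster area relative to a quad-core Cortex-A77 cluster on 7 nm (= 1.0).
A77_CLUSTER_AREA = 1.0
A78_AREA_REDUCTION = 0.15  # quad Cortex-A78 cluster: about 15% smaller
X1_AREA_INCREASE = 0.15    # swapping in one Cortex-X1: 15% larger (assumed vs. the A78 cluster)

a78_cluster = A77_CLUSTER_AREA * (1 - A78_AREA_REDUCTION)
x1_cluster = a78_cluster * (1 + X1_AREA_INCREASE)

print(f"quad Cortex-A78 cluster:      {a78_cluster:.2f}")
print(f"1x Cortex-X1 + 3x Cortex-A78: {x1_cluster:.4f}")
```

Under these assumptions the mixed cluster ends up within a few percent of the original Cortex-A77 cluster area, which is consistent with the approximate area-neutral claim.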