From WikiChip
Cortex-X1 (Hera) - Microarchitectures - ARM

Cortex-X1 (Hera) µarch
General Info
Arch Type: CPU
Designer: ARM Holdings
Manufacturer: TSMC
Introduction: May 26, 2020
Process: 10 nm, 7 nm, 5 nm
Core Configs: 1, 2, 4, 6, 8
Pipeline
Type: Superscalar, Pipelined
OoOE: Yes
Speculative: Yes
Reg Renaming: Yes
Stages: 13
Decode: 5-way
Instructions
ISA: ARMv8.2
Extensions: FPU, NEON
Cache
L1I Cache: 64 KiB/core, 4-way set associative
L1D Cache: 64 KiB/core, 4-way set associative
L2 Cache: 1 MiB/core, 8-way set associative
L3 Cache: 8 MiB/cluster, 16-way set associative
Cores
Core Names: Cortex-X1
Succession
Contemporary: Cortex-A78 (Hercules)

Cortex-X1 (codename Hera) is a performance-enhanced version of the Cortex-A78 (Hercules), a low-power high-performance ARM microarchitecture designed by Arm for the mobile market. The Cortex-X1 was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable IP core and is licensed to other semiconductor companies to be implemented in their own chips.

The Cortex-X1, which implements the ARMv8.2 ISA, is a higher-performance core designed to be combined with the Cortex-A78 in a DynamIQ big.LITTLE configuration in order to provide even higher single-thread performance. This core, along with the Cortex-A78, is often combined with a number of lower-power cores (e.g. Cortex-A55) in order to achieve a better balance of energy and performance.

Cortex-X

ARM Cortex
Year  Cortex-X Core                Cortex-A Core
2020  Cortex-X1 (Hera)             Cortex-A78 (Hercules)
      Cortex-X1C (Hera-C)          Cortex-A78C (Hera Prime)
2021  Cortex-X2 (Matterhorn-ELP)   Cortex-A710 (Matterhorn)
                                   Cortex-A510 (Klein)
2022  Cortex-X3 (Makalu-ELP)       Cortex-A715 (Makalu)
2023  Cortex-X4 (Hunter-ELP)       Cortex-A720 (Hunter)
                                   Cortex-A520 (Hayes)
2024  Cortex-X5 (Chaberton-ELP)    Cortex-A720AE (Hunter-AE)
      Cortex-X925 (Blackhawk)      Cortex-A725 (Chaberton)
2025  Cortex-X930 (Travis)         Cortex-A730 (Gelas)
                                   Cortex-A530 (Nevis)

Process Technology

Although the Cortex-X1 may be fabricated on various process nodes, it has been primarily designed for the 10 nm, 7 nm, and 5 nm process nodes, with performance, power, and area numbers mainly targeting the 5-nanometer node.

Architecture

Key changes from Cortex-A78

See also: Cortex-A78 § Key changes from Cortex-A77

The Cortex-X1 is a custom performance-enhanced variant of the Cortex-A78. It therefore inherits most of the changes made to the Cortex-A78 relative to the Cortex-A77. The changes specific to the Cortex-X1 are:
  • Higher performance (See § Performance claims)
    • Arm self-reports around 30% higher performance than the Cortex-A77
      (compared to +20% for the Cortex-A78)
    • 2.0x machine learning performance
  • Silicon area
    • 15% more silicon area (on N5)
  • Front-end
    • 1.25x wider decode (5-way, up from 4-way)
    • 1.33x wider decoded cache bandwidth
      (8 MOPs/cycle, up from 6 MOPs/cycle)
  • Memory subsystem
    • Only 64 KiB L1I cache option (from 32-64 KiB)
    • Only 64 KiB L1D cache option (from 32-64 KiB)
    • Up to 1 MiB L2 cache option (from 512 KiB)
    • Up to 8 MiB L3 cache option (from 4 MiB)
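
The front-end multipliers in the list above follow directly from the quoted widths. A minimal Python sketch that recomputes them; the dictionary and function names are illustrative only, and the figures are the ones given in the list:

```python
# Widths quoted in the list above: (Cortex-A78 value, Cortex-X1 value).
FRONT_END = {
    "decode width (instructions/cycle)": (4, 5),
    "decoded cache bandwidth (MOPs/cycle)": (6, 8),
}


def widening(old: int, new: int) -> float:
    """Return the new/old ratio, rounded to two decimals."""
    return round(new / old, 2)


ratios = {name: widening(old, new) for name, (old, new) in FRONT_END.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

This prints 1.25x for the decode width and 1.33x for the decoded cache bandwidth, matching the figures in the list.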

Comparison

"Prime" core
Architecture Cortex-A78 Cortex-X1 Cortex-X2 Cortex-X3 Cortex-X4 Cortex-X925 Cortex-X930
Code name Hercules Hera Matterhorn-ELP Makalu-ELP Hunter-ELP Blackhawk Travis
ISA ARMv8.2-A ARMv9.0-A ARMv9.2-A
Peak clock speed ~3.0 GHz ~3.3 GHz ~3.4 GHz ~3.8 GHz ~4.2 GHz
Max in-flight 2x 160 2x 224 2x 288 2x 320 2x 384 2x 768
L0 (Mops entries) 1536 [1] 3072 1536 0
L1-I + L1-D 32+32 KiB 64+64 KiB 64+64 KiB 64+64 KiB
L2 128–512 KiB 0.25–1 MiB 0.5–2 MiB 2–3 MiB
L3 0–8 MiB [2] 0–16 MiB 0–32 MiB
Decode width 4 5 6 10 [3] 10
Dispatch 6/cycle 8/cycle 10/cycle

Performance claims

Arm's claimed performance improvement over the Cortex-A77 comes from both microarchitectural changes and a higher clock frequency, the latter helped by the move from the 7 nm to the 5 nm process node.
Performance
                      Cortex-A77    Cortex-X1
Relative performance  1.0x          1.3x
Frequency             2.6 GHz       3.0 GHz
Process               7 nm (N7)     5 nm (N5)
  • Cortex-X1: 1 MiB L2, 8 MiB L3 cache
  • Cortex-A77: 512 KiB L2, 4 MiB L3 cache

Arm says that, at ISO-process and frequency, the Cortex-X1 achieves 22% higher integer performance (SPEC CPU2006) than the Cortex-A78 and 30% higher integer performance than the Cortex-A77. Likewise, because the number of NEON units is doubled, the Cortex-X1 can achieve twice the ML performance of both the A77 and the A78.

Performance @ ISO-process/frequency
                      Cortex-A77    Cortex-X1
Integer performance   1.0x          1.3x
ML performance        1.0x          2.0x
Frequency             3.0 GHz       3.0 GHz
Process               7 nm (N7)     5 nm (N5)
  • Cortex-X1: 1 MiB L2, 8 MiB L3 cache
  • Cortex-A77: 512 KiB L2, 4 MiB L3 cache
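
The two ISO-process/frequency integer claims can be combined to estimate how much of the gain comes from the Cortex-A78 itself. The sketch below is a back-of-the-envelope calculation, not an Arm figure: it assumes both ratios were measured under the same conditions and compose multiplicatively, and the variable names are illustrative.

```python
# Arm's ISO-process/frequency integer claims, as quoted above.
X1_OVER_A77 = 1.30  # Cortex-X1 relative to Cortex-A77
X1_OVER_A78 = 1.22  # Cortex-X1 relative to Cortex-A78

# Assumption: both ratios share the same baseline conditions, so dividing
# them gives the implied Cortex-A78 to Cortex-A77 ratio.
a78_over_a77 = X1_OVER_A77 / X1_OVER_A78

print(f"Implied A78 over A77 at ISO conditions: {a78_over_a77:.3f}x")
```

Under these assumptions the implied Cortex-A78 uplift over the Cortex-A77 at equal process and frequency is roughly 7%, which would suggest that much of the A78's quoted +20% comes from process and frequency gains rather than from the microarchitecture alone.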

Overview

  • The Cortex-X1 is a high-performance synthesizable core designed by Arm. It is delivered as a Register
Transfer Level (RTL) description in Verilog and is designed to be integrated into customers' SoCs.
  • This core supports the ARMv8.2 extension as well as a number of other partial extensions.
It is the first core from Arm's Cortex-X custom program. The X1 is a performance-enhanced
version of the A78 and therefore uses the A78 as the starting point for its modifications.
  • The Cortex-X1 is built on top of the Cortex-A78, but enhances it in order to extract additional performance,
albeit at a slight cost in power efficiency and area. To that end, whereas Hercules was said to provide
a 20% sustained performance uplift over the Cortex-A77, the Cortex-X1 offers up to a 30% peak performance uplift.
  • In other words, whereas the Cortex-A78 is designed for high sustained performance at high performance-efficiency,
the Cortex-X1 is designed to supplement it with higher peak performance while relaxing the power and area constraints.
  • The Cortex-X1 is a fatter version of the Cortex-A78, relying on bigger buffers and a larger out-of-order window
in order to extract further performance. To that end, the X1 features a 5-way decode, twice as many NEON units,
and larger overall buffers in order to allow for a bigger out-of-order window with more in-flight operations.
  • The Cortex-X1 widens the pipeline while still retaining the higher frequency that was introduced in the Cortex-A77.
  • The Cortex-X1 is intended to be combined with a number of Cortex-A78 cores in a DynamIQ Shared Unit (DSU)
cluster, possibly along with other lower-power cores such as the Cortex-A55, to more efficiently support
a wide range of workloads at various performance and power levels beyond what is possible with any one core.

DSU Cluster

  • The Cortex-X1 provides additional peak performance beyond what the Cortex-A78 can offer.
Therefore the X1 is designed to be combined with a number of Cortex-A78 cores in a DynamIQ
Shared Unit (DSU) cluster in order to provide a balance of power and performance.
  • On its own, a cluster of big Cortex-A78 cores is said to deliver a 20% sustained performance
improvement over the Cortex-A77 while reducing the silicon area by about 15%.
  • When one of those big Cortex-A78 cores is replaced with a single Cortex-X1 core, the cluster
can provide a peak single-thread performance uplift of up to 30% versus the Cortex-A77
at the cost of 15% additional silicon area (or area-neutral when moving from N7 to N5).
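
The trade-off described above can be sketched as a toy model. This is only an illustration under stated assumptions: the `Cluster` class and the example core counts are hypothetical, and the relative performance values simply reuse Arm's quoted +20% (A78) and +30% (X1) figures versus the Cortex-A77, treating them as directly comparable even though one is a sustained figure and the other a peak figure.

```python
from dataclasses import dataclass

# Relative single-thread performance versus a Cortex-A77 baseline, taken from
# Arm's claims quoted in this article (+20% for the A78, up to +30% for the X1).
# Treating both as directly comparable is a simplifying assumption.
RELATIVE_PERF = {"Cortex-A78": 1.2, "Cortex-X1": 1.3}


@dataclass
class Cluster:
    """Hypothetical description of the big cores in one DSU cluster."""

    cores: dict  # core name -> number of cores of that type

    def peak_single_thread(self) -> float:
        # Peak single-thread performance is set by the fastest core present.
        return max(
            RELATIVE_PERF[name] for name, count in self.cores.items() if count > 0
        )


# Hypothetical configurations: four A78 cores versus one X1 plus three A78.
all_a78 = Cluster({"Cortex-A78": 4})
with_x1 = Cluster({"Cortex-X1": 1, "Cortex-A78": 3})

print(all_a78.peak_single_thread())
print(with_x1.peak_single_thread())
```

In this model, swapping a single Cortex-A78 for a Cortex-X1 raises the cluster's peak single-thread figure from 1.2x to 1.3x while the remaining A78 cores continue to serve sustained workloads.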

References

  1. Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence.
  2. Schor, David (2020-05-26). Arm Cortex-X1: The First From The Cortex-X Custom Program.
  3. (2023-05-29) Arm Cortex-X4, A720, and A520: 2024 smartphone CPUs deep dive.