Difference between revisions of "arm holdings/microarchitectures/cortex-x1"

	Cortex-X1 (Hera) µarch
	General Info
Arch Type	CPU
Designer	ARM Holdings
Manufacturer	TSMC
Introduction	May 26, 2020
Process	10 nm, 7 nm, 5 nm
Core Configs	1, 2, 4, 6, 8
	Pipeline
Type	Superscalar, Pipelined
OoOE	Yes
Speculative	Yes
Reg Renaming	Yes
Stages	13
Decode	5-way
	Instructions
ISA	ARMv8.2
Extensions	FPU, NEON
	Cache
L1I Cache	64 KiB/core; 4-way set associative
L1D Cache	64 KiB/core; 4-way set associative
L2 Cache	1 MiB/core; 8-way set associative
L3 Cache	8 MiB/cluster; 16-way set associative
	Cores
Core Names	Cortex-X1
	Succession
	Cortex-X2 (Matterhorn-ELP); Cortex-X3 (Makalu-ELP); Cortex-X4 (Hunter-ELP)
	Contemporary
	Cortex-A78 (Hercules)

Latest revision as of 19:43, 15 April 2025

Cortex-X1 (codename Hera) is a performance-enhanced version of the Cortex-A78 (Hercules), a low-power high-performance ARM microarchitecture designed by Arm for the mobile market. The Cortex-X1 was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable IP core and is licensed to other semiconductor companies to be implemented in their own chips.

The Cortex-X1, which implements the ARMv8.2 ISA, is a higher-performance core designed to be combined with the Cortex-A78 in a DynamIQ big.LITTLE configuration in order to provide even higher single-thread performance. This core, along with the Cortex-A78, is often combined with a number of lower-power cores (e.g. Cortex-A55) in order to achieve a better balance of energy and performance.

Year	Cortex-X Core	Cortex-A Core
2020	Cortex-X1 (Hera) Cortex-X1C (Hera-C)	Cortex-A78 (Hercules) Cortex-A78C (Hera Prime)
2021	Cortex-X2 (Matterhorn-ELP)	Cortex-A710 (Matterhorn) Cortex-A510 (Klein)
2022	Cortex-X3 (Makalu-ELP)	Cortex-A715 (Makalu)
2023	Cortex-X4 (Hunter-ELP)	Cortex-A720 (Hunter) Cortex-A520 (Hayes)
2024	~~Cortex-X5 (Chaberton-ELP)~~ Cortex-X925 (Blackhawk)	Cortex-A720AE (Hunter-AE) Cortex-A725 (Chaberton)
2025	Cortex-X930 (Travis)	Cortex-A730 (Gelas) Cortex-A530 (Nevis)

Process Technology

Although the Cortex-X1 may be fabricated on various process nodes, it has been primarily designed for the 10 nm, 7 nm, and 5 nm process nodes, with performance, power, and area numbers mainly targeting the 5-nanometer node.

Architecture

Key changes from Cortex-A78

See also: Cortex-A78 § Key changes from Cortex-A77

The Cortex-X1 is a custom performance-enhanced variant of the Cortex-A78 and therefore inherits most of the changes made to the Cortex-A78 relative to the Cortex-A77.

Higher performance (See § Performance claims)
- Arm self-reports around 30% higher performance than the Cortex-A77
  (compared to +20% for the Cortex-A78)
- 2.0x machine learning performance
Silicon area
- 15% more silicon area (on N5)
Front-end
- 1.25x wider decode (5-way, up from 4-way)
- 1.33x wider decoded cache bandwidth
  (8 MOPs/cycle, up from 6 MOPs/cycle)
Memory subsystem
- Only 64 KiB L1I cache option (from 32-64 KiB)
- Only 64 KiB L1D cache option (from 32-64 KiB)
- Up to 1 MiB L2 cache option (from 512 KiB)
- Up to 8 MiB L3 cache option (from 4 MiB)
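
The multipliers quoted in the list above follow directly from the old and new widths. The short Python sketch below simply recomputes them as a sanity check; the helper name `widening` and the dictionary layout are illustrative choices, not anything defined by Arm, and the "old" values are the comparison figures given in parentheses in the list.

```python
from fractions import Fraction

def widening(new: int, old: int) -> Fraction:
    """Return the exact new/old ratio for a resource (hypothetical helper)."""
    return Fraction(new, old)

# (new, old) pairs taken from the list above.
changes = {
    "decode width (instructions/cycle)": (5, 4),
    "decoded cache bandwidth (MOPs/cycle)": (8, 6),
    "max L2 size (KiB)": (1024, 512),
    "max L3 size (MiB)": (8, 4),
}

ratios = {name: widening(new, old) for name, (new, old) in changes.items()}

for name, ratio in ratios.items():
    # Print each ratio rounded to two decimals, e.g. 1.25x for the decoder.
    print(f"{name}: {float(ratio):.2f}x")
```

Using `Fraction` keeps the 8/6 ratio exact (4/3) until it is formatted, which is why it prints as 1.33x rather than accumulating rounding error.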

Comparison

"Prime" core

Architecture	Cortex-A78	Cortex-X1	Cortex-X2	Cortex-X3	Cortex-X4	Cortex-X925	Cortex-X930
Code name	Hercules	Hera	Matterhorn-ELP	Makalu-ELP	Hunter-ELP	Blackhawk	Travis
ISA	ARMv8.2-A		ARMv9.0-A		ARMv9.2-A
Peak clock speed	~3.0 GHz			~3.3 GHz	~3.4 GHz	~3.8 GHz	~4.2 GHz
Max in-flight	2x 160	2x 224	2x 288	2x 320	2x 384	2x 768
L0 (Mops entries)	1536 ^[1]	3072		1536	0
L1-I + L1-D	32+32 KiB	64+64 KiB		64+64 KiB		64+64 KiB
L2	128–512 KiB	0.25–1 MiB			0.5–2 MiB	2–3 MiB
L3	0–8 MiB ^[2]		0–16 MiB		0–32 MiB
Decode width	4	5		6	10 ^[3]	10
Dispatch	6/cycle	8/cycle			10/cycle

Performance claims

Compared to the Cortex-A77, the Cortex-X1 is said to be 30% faster in peak performance on SPEC CPU2006. The improvement comes from both architectural improvements and a higher frequency enabled by the process move from the 7 nm to the 5 nm node.

Performance
Cortex-A77	Cortex-X1
1.0x	1.3x
2.6 GHz	3.0 GHz
7 nm (N7)	5 nm (N5)
Cortex-X1: 1 MiB L2, 8 MiB L3 cache; Cortex-A77: 512 KiB L2, 4 MiB L3 cache

Arm says that, at ISO-process and frequency, the Cortex-X1 achieves 22% higher integer performance (SPEC CPU2006) than the Cortex-A78 and 30% higher integer performance than the Cortex-A77. Likewise, due to the doubling of the number of NEON units, the Cortex-X1 can achieve twice the ML performance of both the A77 and A78.

Performance @ ISO-process/frequency
Cortex-A77	Cortex-X1
1.0x	1.3x (integer performance)
1.0x	2.0x (ML performance)
3.0 GHz	3.0 GHz
7 nm (N7)	5 nm (N5)
Cortex-X1: 1 MiB L2, 8 MiB L3 cache; Cortex-A77: 512 KiB L2, 4 MiB L3 cache
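
The two ISO-process/frequency integer figures above (+30% over the Cortex-A77 and +22% over the Cortex-A78) together imply a figure for the Cortex-A78 relative to the Cortex-A77. The following Python sketch works that out. It assumes both claims refer to the same SPEC CPU2006 integer workload and the same conditions, so that the ratios can be divided; the function name `implied_ratio` is illustrative.

```python
def implied_ratio(x1_over_a77: float, x1_over_a78: float) -> float:
    """Derive A78/A77 from the two X1-relative claims.

    Assumes both claims use the same workload and conditions, so that
    (X1/A77) / (X1/A78) = A78/A77.
    """
    return x1_over_a77 / x1_over_a78

# Figures quoted in the text: X1 is 1.30x the A77 and 1.22x the A78.
a78_over_a77 = implied_ratio(1.30, 1.22)

print(f"Implied A78 vs. A77 at ISO-process/frequency: {a78_over_a77:.3f}x")
```

Under these assumptions the Cortex-A78 comes out roughly 6 to 7% ahead of the Cortex-A77 in integer performance at the same process and frequency. This is a different quantity from the 20% sustained uplift quoted elsewhere in this article, which compares the A78 on 5 nm against the A77 on 7 nm.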

Overview

The Cortex-X1 is a high-performance synthesizable core designed by Arm. It is delivered as a Register Transfer Level (RTL) description in Verilog and is designed to be integrated into customers' SoCs. This core supports the ARMv8.2 extension as well as a number of other partial extensions.

This is the first core from Arm's Cortex-X custom program. The X1 is a performance-enhanced version of the A78 and therefore uses the A78 as the starting point for its modifications.

The Cortex-X1 is built on top of the Cortex-A78 but enhances it in order to extract additional performance, albeit at a slight cost in power efficiency and area. To that end, whereas Hercules was said to provide a 20% sustained performance uplift over the Cortex-A77, the Cortex-X1 offers up to 30% higher peak performance. In other words, whereas the Cortex-A78 is designed for high sustained performance at high efficiency, the Cortex-X1 is designed to supplement it with higher peak performance while relaxing the power and area constraints.

The Cortex-X1 is a fatter version of the Cortex-A78, relying on bigger buffers and a larger out-of-order window in order to extract further performance. To that end, the X1 features a 5-way decode, twice as many NEON units, and larger overall buffers in order to allow for a bigger out-of-order window with more in-flight operations. The Cortex-X1 enlarges the pipeline while still retaining the higher frequency that was introduced in the Cortex-A77.

The Cortex-X1 is intended to be combined with a number of Cortex-A78 cores in a DynamIQ Shared Unit (DSU) cluster, possibly along with other lower-power cores such as the Cortex-A55, to more efficiently support a wide range of workloads at various performance and power levels beyond what is possible with any one core.

DSU Cluster

The Cortex-X1 provides additional peak performance beyond what the Cortex-A78 can offer. Therefore, the X1 is designed to be combined with a number of Cortex-A78 cores in a DynamIQ Shared Unit (DSU) cluster in order to provide a balance of power and performance.

Compared to a quad-core Cortex-A77 cluster on 7 nm, a quad-core Cortex-A78 cluster provides a 20% sustained performance improvement while reducing the silicon area by about 15%.

When one of those big Cortex-A78 cores is replaced with a single Cortex-X1 core, the cluster can provide up to 30% higher peak single-thread performance than the Cortex-A77 at the cost of 15% additional silicon area (or roughly area-neutral going from N7 to N5).
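
The "area-neutral" remark can be checked with a small calculation. The Python sketch below normalizes the quad-core Cortex-A77 cluster on N7 to 1.0 and applies the two percentages from the text. It assumes the 15% increase is measured relative to the quad-core Cortex-A78 cluster on N5; the text does not state the baseline explicitly, so treat that as an assumption. The variable names are illustrative.

```python
# Normalized silicon area of a quad-core Cortex-A77 cluster on N7.
A77_CLUSTER_AREA = 1.00

# Quad-core Cortex-A78 cluster on N5: about 15% smaller (from the text).
a78_cluster_area = A77_CLUSTER_AREA * (1 - 0.15)

# Swap one A78 for an X1: about 15% more area.
# ASSUMPTION: the 15% is relative to the A78 cluster, not the A77 cluster.
x1_cluster_area = a78_cluster_area * (1 + 0.15)

print(f"4x A78 cluster (N5):      {a78_cluster_area:.4f}")
print(f"1x X1 + 3x A78 (N5):      {x1_cluster_area:.4f}")
print(f"Change vs. 4x A77 (N7):   {x1_cluster_area - A77_CLUSTER_AREA:+.2%}")
```

Under that assumption the mixed cluster lands at about 0.98 of the original Cortex-A77 cluster area, which is consistent with describing the move from N7 to N5 as roughly area-neutral.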

References

[1] Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence.
[2] Schor, David (2020-05-26). Arm Cortex-X1: The First From The Cortex-X Custom Program.
[3] (2023-05-29) Arm Cortex-X4, A720, and A520: 2024 smartphone CPUs deep dive.

