From WikiChip
Difference between revisions of "arm holdings/microarchitectures/cortex-x1"

(fixed)
 
(2 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{armh title|Cortex-X1|arch}}
+
{{armh title|Cortex-X1 (Hera)|arch}}
 
{{microarchitecture
 
{{microarchitecture
|atype=CPU
+
| atype = CPU
|name=Cortex-X1
+
| name = Cortex-X1 (Hera)
|designer=ARM Holdings
+
| codename = Cortex-X1
|manufacturer=TSMC
+
| core name = '''Cortex-X1'''
|introduction=May 26, 2020
+
| designer = ARM Holdings
|process=10 nm
+
| manufacturer = TSMC
|process 2=7 nm
+
| introduction = May 26, 2020
|process 3=5 nm
+
| process = 10 nm
|cores=1
+
| process 2 = 7 nm
|cores 2=2
+
| process 3 = 5 nm
|cores 3=4
+
| cores = 1
|cores 4=6
+
| cores 2 = 2
|cores 5=8
+
| cores 3 = 4
|type=Superscalar
+
| cores 4 = 6
|type 2=Pipelined
+
| cores 5 = 8
|oooe=Yes
+
| type = Superscalar
|speculative=Yes
+
| type 2 = Pipelined
|renaming=Yes
+
| oooe = Yes
|stages=13
+
| speculative = Yes
|decode=5-way
+
| renaming = Yes
|isa=ARMv8.2
+
| stages = 13
|feature=Hardware virtualization
+
| decode = 5-way
|extension=FPU
+
| isa = ARMv8.2
|extension 2=NEON
+
| feature = Hardware virtualization
|l1i=32-64 KiB
+
| extension = FPU
|l1i per=core
+
| extension 2 = NEON
|l1i desc=4-way set associative
+
| l1i = 64 KiB
|l1d=32-64 KiB
+
| l1i per = core
|l1d per=core
+
| l1i desc = 4-way set associative
|l1d desc=4-way set associative
+
| l1d = 64 KiB
|l2=128-512 KiB
+
| l1d per = core
|l2 per=core
+
| l1d desc = 4-way set associative
|l2 desc=8-way set associative
+
| l2 = 1 MiB
|l3=0-4 MiB
+
| l2 per = core
|l3 per=Cluster
+
| l2 desc = 8-way set associative
|l3 desc=16-way set associative
+
| l3 = 8 MiB
|contemporary=Cortex-A78
+
| l3 per = cluster
|contemporary link=arm holdings/microarchitectures/cortex-a78
+
| l3 desc = 16-way set associative
 +
| successor = '''Cortex-X2''' (Matterhorn-ELP)
 +
| successor link = arm holdings/microarchitectures/cortex-x2
 +
| successor 2 = '''Cortex-X3''' (Makalu-ELP)
 +
| successor 2 link = arm holdings/microarchitectures/cortex-x3
 +
| successor 3 = '''Cortex-X4''' (Hunter-ELP)
 +
| successor 3 link = arm holdings/microarchitectures/hunter-elp
 +
| contemporary = '''Cortex-A78''' (Hercules)
 +
| contemporary link = arm holdings/microarchitectures/cortex-a78
 
}}
 
}}
'''Cortex-X1''' (codename '''Hera''') is a performance-enhanced version of the {{armh|Cortex-A78|l=arch}}, a low-power high-performance [[ARM]] [[microarchitecture]] designed by [[Arm]] for the mobile market. The Cortex-X1 was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable [[IP core]] and is licensed to other semiconductor companies to be implemented in their own chips.
+
'''Cortex-X1''' (codename ''Hera'') is a performance-enhanced version of the {{armh|Cortex-A78|l=arch}} ''(Hercules)'', a low-power high-performance [[ARM]] [[microarchitecture]] designed by [[Arm]] for the mobile market. The Cortex-X1 was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable [[IP core]] and is licensed to other semiconductor companies to be implemented in their own chips.
  
The Cortex-X1, which implements the {{arm|ARMv8.2}} ISA, is a higher performance core that is designed to be combined with the {{\\|Cortex-A78}} in a {{armh|DynamIQ big.LITTLE}} combination in order to provide even higher single-thread performance. This core, along with the {{\\|Cortex-A78}}, is often combined with a number of low(er) power cores (e.g. {{\\|Cortex-A55}}) in order to achieve better energy/performance.
+
The '''Cortex-X1''', which implements the {{arm|ARMv8}}.2 ISA, is a higher performance core that is designed to be combined with the {{\\|Cortex-A78}} in a {{armh|big.LITTLE|DynamIQ big.LITTLE}} combination in order to provide even higher single-thread performance. This core, along with the {{\\|Cortex-A78}}, is often combined with a number of low(er) power cores (e.g. {{\\|Cortex-A55}}) in order to achieve better energy/performance.
 +
 
 +
=== [[Cortex]]-X ===
 +
:;[[ARM]] • [[Cortex]]
 +
{| class="wikitable" style="text-align: center;"
 +
|-
 +
! Year !! Cortex-X Core !! Cortex-A Core
 +
|-
 +
| [[2020]] || {{armh|Cortex-X1|l=arch}} (''{{armh|Hera|l=arch}}'') <br>{{armh|Cortex-X1C|l=arch}} (''{{armh|Hera-C|l=arch}}'') || {{armh|Cortex-A78|l=arch}} (''{{armh|Hercules|l=arch}}'') <!--<br>{{armh|Cortex-A78AE|l=arch}} (''{{armh|Hercules-AE|l=arch}}'')--> <br>{{armh|Cortex-A78C|l=arch}} (''{{armh|Hera Prime|l=arch}}'')
 +
|-
 +
| [[2021]] || {{armh|Cortex-X2|l=arch}} <br>(''{{armh|Matterhorn-ELP|l=arch}}'') || {{armh|Cortex-A710|l=arch}} (''{{armh|Matterhorn|l=arch}}'') <br>{{armh|Cortex-A510|l=arch}} (''{{armh|Klein|l=arch}}'')
 +
|-
 +
| [[2022]] || {{armh|Cortex-X3|l=arch}} (''{{armh|Makalu-ELP|l=arch}}'') || {{armh|Cortex-A715|l=arch}} (''{{armh|Makalu|l=arch}}'')
 +
|-
 +
| [[2023]] || {{armh|Cortex-X4|l=arch}} (''{{armh|Hunter-ELP|l=arch}}'') || {{armh|Cortex-A720|l=arch}} (''{{armh|Hunter|l=arch}}'') <br>{{armh|Cortex-A520|l=arch}} (''{{armh|Hayes|l=arch}}'')
 +
|-
 +
| [[2024]] || <s>{{armh|Cortex-X5|l=arch}} (''{{armh|Chaberton-ELP|l=arch}}'')</s> <br>{{armh|Cortex-X925|l=arch}} (''{{armh|Blackhawk|l=arch}}'') || {{armh|Cortex-A720AE|l=arch}} (''{{armh|Hunter-AE|l=arch}}'') <br>{{armh|Cortex-A725|l=arch}} (''{{armh|Chaberton|l=arch}}'')
 +
|-
 +
| [[2025]] || {{armh|Cortex-X930|l=arch}} (''{{armh|Travis|l=arch}}'') || {{armh|Cortex-A730|l=arch}} (''{{armh|Gelas|l=arch}}'') <br>{{armh|Cortex-A530|l=arch}} (''{{armh|Nevis|l=arch}}'')
 +
|-
 +
|}
  
 
== Process Technology ==
 
== Process Technology ==
Although the Cortex-X1 may be fabricated on various [[process nodes]], it has been primarily designed for the [[10 nm]], [[7 nm]], and [[5 nm]] process nodes with performance, power and area numbers mainly targeting the [[5-nanometer node]].
+
Although the Cortex-X1 may be fabricated on various [[process nodes]], it has been primarily designed for the [[10 nm]], [[7 nm]],  
 +
:and [[5 nm]] process nodes with performance, power and area numbers mainly targeting the [[5-nanometer node]].
  
== Compiler support ==
+
== Architecture ==
{{empty section}}
 
  
== Architecture ==
 
 
=== Key changes from {{\\|Cortex-A78}} ===
 
=== Key changes from {{\\|Cortex-A78}} ===
 
{{see also|arm_holdings/microarchitectures/cortex-a78#Key_changes_from_Cortex-A77|l1=Cortex-A78 § Key changes from Cortex-A77}}
 
{{see also|arm_holdings/microarchitectures/cortex-a78#Key_changes_from_Cortex-A77|l1=Cortex-A78 § Key changes from Cortex-A77}}
The Cortex-X1 is a custom performance-enhanced variant of the {{\\|Cortex-A78|A78}}, therefore it inherits most of the changes that were done to the A78 from the A77.
+
The Cortex-X1 is a custom performance-enhanced variant of the {{\\|Cortex-A78}}, therefore it
 +
:inherits most of the changes that were done to the {{\\|Cortex-A78}} from the {{\\|Cortex-A77}}.
  
 
* Higher performance (See [[#Performance claims|§ Performance claims]])
 
* Higher performance (See [[#Performance claims|§ Performance claims]])
** [[Arm]] self-reported around 30% performance over the A77 (compared to +20% with the A78)
+
** [[Arm]] self-reported around 30% performance over the {{\\|Cortex-A77}} <br>(compared to +20% with the {{\\|Cortex-A78}})
*** 2.0x (machine learning) performance
+
** 2.0x (machine learning) performance
 
* Silicon area
 
* Silicon area
 
** 15% more silicon area (on [[N5]])
 
** 15% more silicon area (on [[N5]])
 
* Front-end
 
* Front-end
 
** 1.25x wider decode (5-way, up from 4-way)
 
** 1.25x wider decode (5-way, up from 4-way)
** 1.33x wider decoded cache bandwidth (8 MOPs/cycle, up from 6 MOPs/cycle)
+
** 1.33x wider decoded cache bandwidth <br>(8 MOPs/cycle, up from 6 MOPs/cycle)
 
* Memory subsystem
 
* Memory subsystem
** Only 64 KiB [[L1I cache]] option (from 32-64 KiB)
+
** Only 64 KiB L1I cache option (from 32-64 KiB)
** Only 64 KiB [[L1D cache]] option (from 32-64 KiB)
+
** Only 64 KiB L1D cache option (from 32-64 KiB)
** Up to 1 MiB [[L2 cache]] option (from 512 KiB)
+
** Up to 1 MiB L2 cache option (from 512 KiB)
** Up to 8 MiB [[L3 cache]] option (from 4 MiB)
+
** Up to 8 MiB L3 cache option (from 4 MiB)
  
{{expand list}}
+
=== Comparison ===
 +
 
 +
:;"Prime" core
 +
{| class="wikitable sortable" cellpadding="3px" style="border: 1px solid black; border-spacing: 0px; width: 100%; text-align:center;"
 +
|-
 +
![[Microarchitecture|Architecture]]
 +
!{{armh|Cortex-A78|l=arch}}
 +
!{{armh|Cortex-X1|l=arch}}
 +
!{{armh|Cortex-X2|l=arch}}
 +
!{{armh|Cortex-X3|l=arch}}
 +
!{{armh|Cortex-X4|l=arch}}
 +
!{{armh|Cortex-X925|l=arch}}
 +
!{{armh|Cortex-X930|l=arch}}
 +
|-
 +
!Code name
 +
|''{{armh|Hercules|l=arch}}''
 +
|''Hera''
 +
|''{{armh|Matterhorn|l=arch}}-ELP''
 +
|''{{armh|Makalu|l=arch}}-ELP''
 +
|''{{armh|Hunter-ELP|l=arch}}''
 +
|''Blackhawk''
 +
|''Travis''
 +
|-
 +
!ISA
 +
| colspan="2" |[[ARMv8]].2-A
 +
| colspan="2" |ARMv9.0-A
 +
| colspan="3" |ARMv9.2-A
 +
|-
 +
!Peak clock speed
 +
| colspan="3" |~3.0&nbsp;GHz
 +
|~3.3&nbsp;GHz
 +
|~3.4&nbsp;GHz
 +
|~3.8&nbsp;GHz
 +
|~4.2&nbsp;GHz
 +
|-
 +
!Max in-flight
 +
|2x 160
 +
|2x 224
 +
|2x 288
 +
|2x 320
 +
|2x 384
 +
|2x 768
 +
|
 +
|-
 +
!L0 (Mops entries)
 +
|1536 <ref>{{cite book |title=Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence |url=https://www.anandtech.com/show/15813/arm-cortex-a78-cortex-x1-cpu-ip-diverging }}</ref>
 +
| colspan="2" |3072
 +
|1536
 +
|0
 +
|
 +
|
 +
|-
 +
!L1-I + L1-D
 +
|32+32 KiB
 +
| colspan="2" |64+64 KiB
 +
| colspan="2" |64+64 KiB
 +
|64+64 KiB
 +
|
 +
|-
 +
!L2
 +
|128–512 KiB
 +
| colspan="3" |0.25–1 MiB
 +
|0.5–2 MiB
 +
|2–3 MiB
 +
|
 +
|-
 +
!L3
 +
| colspan="2" |0–8 MiB <ref>{{cite book |last=Schor |first=David |date=2020-05-26 |title=Arm Cortex-X1: The First From The Cortex-X Custom Program |url=https://fuse.wikichip.org/news/3543/arm-cortex-x1-the-first-from-the-cortex-x-custom-program/ |website=WikiChip Fuse }}</ref>
 +
| colspan="2" |0–16 MiB
 +
| colspan="2" |0–32 MiB
 +
|
 +
|-
 +
!Decode width
 +
|4
 +
| colspan="2" |5
 +
|6
 +
|10 <ref>{{cite book |date=2023-05-29 |title=Arm Cortex-X4, A720, and A520: 2024 smartphone CPUs deep dive |url=https://www.androidauthority.com/arm-cortex-x4-explained-3328008/ |website=Android Authority}}</ref>
 +
|10
 +
|
 +
|-
 +
!Dispatch
 +
|6/cycle
 +
| colspan="3" |8/cycle
 +
| colspan="2" |10/cycle
 +
|
 +
|-
 +
|}
  
 
== Performance claims ==
 
== Performance claims ==
Compared to the {{\\|Cortex-A77}}, the X1 is said to be 30% faster in peak performance on [[SPEC CPU2006]]. The improvement comes from both architectural improvements and frequency improvement with the help of process improvement moving from the [[N7|7 nm]] to the [[N5|5 nm node]].
+
*Compared to the {{\\|Cortex-A77}}, the Cortex-X1 is said to be 30% faster in peak performance on [[SPEC CPU2006]].  
 +
:The improvement comes from both architectural improvements and frequency improvement with the help  
 +
:of process improvement moving from the [[N7|7 nm]] to the [[N5|5 nm node]].
  
 
{| class="wikitable"
 
{| class="wikitable"
Line 82: Line 198:
 
| 1.0x || 1.3x
 
| 1.0x || 1.3x
 
|-
 
|-
| 2,600 MHz || 3,000 MHz
+
| 2.6 GHz || 3.0 GHz
 
|-
 
|-
 
| [[N7|7 nm (N7)]] || [[N5|5 nm(N5)]]
 
| [[N7|7 nm (N7)]] || [[N5|5 nm(N5)]]
 
|-
 
|-
 
| colspan="2" |
 
| colspan="2" |
* Cortex-X1 1 MiB L2, 8 MiB L3 cache
+
* '''Cortex-X1''' 1 MiB L2, 8 MiB L3 cache
 
* {{\\|Cortex-A77}} 512 KiB L2 , 4 MiB L3 cache
 
* {{\\|Cortex-A77}} 512 KiB L2 , 4 MiB L3 cache
 
|}
 
|}
  
Arm says that, at ISO-process and frequency, the Cortex-X1 achieves 22% higher integer performance ([[SPEC CPU2006]]) over the {{\\|Cortex-A78}} and 30% higher integer performance over the {{\\|Cortex-A77}}. Likewise, due to the doubling of the number of {{arm|NEON}} units, the Cortex-X1 can achieve twice the ML performance as both the {{\\|Cortex-A77|A77}} and {{\\|Cortex-A78|A78}}.
+
*Arm says that, at ISO-process and frequency, the Cortex-X1 achieves 22% higher integer performance ([[SPEC CPU2006]])  
 +
:over the {{\\|Cortex-A78}} and 30% higher integer performance over the {{\\|Cortex-A77}}. Likewise, due to the doubling  
 +
:of the number of ''NEON'' units, the Cortex-X1 can achieve twice the ML performance as both the {{\\|Cortex-A77|A77}} and {{\\|Cortex-A78|A78}}.
  
 
{| class="wikitable"
 
{| class="wikitable"
Line 103: Line 221:
 
| 1.0x || 2.0x (ML performance)
 
| 1.0x || 2.0x (ML performance)
 
|-
 
|-
| 3,000 MHz || 3,000 MHz
+
| 3.0 GHz || 3.0 GHz
 
|-
 
|-
 
| [[N7|7 nm (N7)]] || [[N5|5 nm(N5)]]
 
| [[N7|7 nm (N7)]] || [[N5|5 nm(N5)]]
 
|-
 
|-
 
| colspan="2" |
 
| colspan="2" |
* Cortex-X1 1 MiB L2, 8 MiB L3 cache
+
* '''Cortex-X1''' 1 MiB L2, 8 MiB L3 cache
 
* {{\\|Cortex-A77}} 512 KiB L2 , 4 MiB L3 cache
 
* {{\\|Cortex-A77}} 512 KiB L2 , 4 MiB L3 cache
 
|}
 
|}
  
 
== Overview ==
 
== Overview ==
The Cortex-X1 is a high-performance [[synthesizable core]] designed by [[Arm]]. It is delivered as Register Transfer Level (RTL) description in Verilog and is designed to be integrated into customer's SoCs. This core supports the {{arm|ARMv8.2}} extension as well as a number of other partial extensions. This is the first from [[Arm]]'s [[Cortex-X]] custom program. The X1 is a performance-enhanced version of the {{\\|Cortex-A78}}, it therefore uses the {{\\|Cortex-A78|A78}} as the starting point for its modifications.
+
*The Cortex-X1 is a high-performance synthesizable core designed by [[Arm]]. It is delivered as Register  
 +
:Transfer Level (RTL) description in Verilog and is designed to be integrated into customer's SoCs.  
  
The Cortex-X1 is built on top of the {{\\|Cortex-A78}}, but enhances it in order to extract additional performance, albeit at a slight reduction in power efficiency and area. To that end, whereas the {{\\|Hercules}} was said to provide a 20% sustained performance uplift over the {{\\|Cortex-A77|A77}}, the Cortex-X1 offers up to 30% peak performance. In other words, whereas the {{\\|Cortex-A78|A78}} is designed for high sustained performance at high performance-efficiency, the Cortex-X1 is designed to supplement it with higher peak performance while relaxing the power and area constraints.
+
*This core supports the {{arm|ARMv8}}.2 extension as well as a number of other partial extensions.  
 +
:This is the first from [[Arm]]'s [[Cortex]]-X custom program. The X1 is a performance-enhanced
 +
:version of the {{\\|Cortex-A78|A78}}, it therefore uses the {{\\|Cortex-A78|A78}} as the starting point for its modifications.
  
The Cortex-X1 is a fatter version of the {{\\|Cortex-A78|A78}}, relying on bigger buffers and a large out-of-order window in order to extract further performance. To that end, the X1 features a 5-way decode, twice as many NEON units, and larger overall buffers in order to allow for a bigger out-of-order window with more in-flight operations. The X1 enlarges the pipeline while still retaining the higher frequency which was introduced in the {{\\|Cortex-A77|A77}}. The X1 is intended to be combined with a number of {{\\|Cortex-A78|A78}} cores in a [[DynamIQ Shared Unit]] (DSU) cluster, possibly along with other lower-power cores such as the {{\\|Cortex-A55}}, to more efficiently support a wide range of workloads at various performance and power levels beyond what's possible with any one core.
+
*The Cortex-X1 is built on top of the {{\\|Cortex-A78}}, but enhances it in order to extract additional performance,
 +
:albeit at a slight reduction in power efficiency and area. To that end, whereas the {{\\|Hercules}} was said to provide
 +
:a 20% sustained performance uplift over the {{\\|Cortex-A77}}, the Cortex-X1 offers up to 30% peak performance.
 +
*In other words, whereas the {{\\|Cortex-A78}} is designed for high sustained performance at high performance-efficiency,
 +
:the Cortex-X1 is designed to supplement it with higher peak performance while relaxing the power and area constraints.
 +
*The Cortex-X1 is a fatter version of the {{\\|Cortex-A78}}, relying on bigger buffers and a large out-of-order window  
 +
:in order to extract further performance. To that end, the X1 features a 5-way decode, twice as many NEON units,  
 +
:and larger overall buffers in order to allow for a bigger out-of-order window with more in-flight operations.  
 +
*The Cortex-X1 enlarges the pipeline while still retaining the higher frequency which was introduced in the {{\\|Cortex-A77}}.  
 +
*The Cortex-X1 is intended to be combined with a number of {{\\|Cortex-A78}} cores in ''DynamIQ Shared Unit'' (DSU)  
 +
:cluster, possibly along with other lower-power cores such as the {{\\|Cortex-A55}}, to more efficiently support 
 +
:a wide range of workloads at various performance and power levels beyond what's possible with any one core.
  
 
=== DSU Cluster ===
 
=== DSU Cluster ===
The Cortex-X1 provides additional peak performance beyond what the {{\\|Cortex-A78}} can offer. Therefore the X1 is designed to be combined with a number of Cortex-A78 cores in [[DynamIQ Shared Unit]] (DSU) cluster in order to provide a balance in both power and performance. Compared to a quad-core {{\\|Cortex-A77|A77}} cluster on [[N7|7 nm]], a quad-core {{\\|Cortex-A78|A78}} cluster provides +20% sustained performance improvement while reducing the silicon area by about 15%. When replacing one of those [[big core|big]] {{\\|Cortex-A78|A78}} cores with a single Cortex-X1 core, the cluster can now provide a peak single-thread performance of up to 30% versus the {{\\|Cortex-A77|A77}} at the cost of 15% additional silicon area (or neutral area-wise from [[N7]] to [[N5]]).
+
*The Cortex-X1 provides additional peak performance beyond what the {{\\|Cortex-A78}} can offer.  
 +
:Therefore the X1 is designed to be combined with a number of Cortex-A78 cores in ''DynamIQ  
 +
:Shared Unit'' (DSU) cluster in order to provide a balance in both power and performance.  
 +
 
 +
*Compared to a quad-core {{\\|Cortex-A77}} cluster on [[N7|7 nm]], a quad-core {{\\|Cortex-A78}} cluster provides  
 +
:+20% sustained performance improvement while reducing the silicon area by about 15%.  
 +
 
 +
*When replacing one of those [[big core|big]] {{\\|Cortex-A78}} cores with a single Cortex-X1 core, the cluster  
 +
:can now provide a peak single-thread performance of up to 30% versus the {{\\|Cortex-A77}}  
 +
:at the cost of 15% additional silicon area (or neutral area-wise from [[N7]] to [[N5]]).
 +
 
 +
== References ==

Latest revision as of 20:43, 15 April 2025

Cortex-X1 (Hera) µarch
General Info
Arch Type: CPU
Designer: ARM Holdings
Manufacturer: TSMC
Introduction: May 26, 2020
Process: 10 nm, 7 nm, 5 nm
Core Configs: 1, 2, 4, 6, 8
Pipeline
Type: Superscalar, Pipelined
OoOE: Yes
Speculative: Yes
Reg Renaming: Yes
Stages: 13
Decode: 5-way
Instructions
ISA: ARMv8.2
Extensions: FPU, NEON
Cache
L1I Cache: 64 KiB/core, 4-way set associative
L1D Cache: 64 KiB/core, 4-way set associative
L2 Cache: 1 MiB/core, 8-way set associative
L3 Cache: 8 MiB/cluster, 16-way set associative
Cores
Core Names: Cortex-X1
Succession
Contemporary: Cortex-A78 (Hercules)

Cortex-X1 (codename Hera) is a performance-enhanced version of the Cortex-A78 (Hercules), a low-power high-performance ARM microarchitecture designed by Arm for the mobile market. The Cortex-X1 was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable IP core and is licensed to other semiconductor companies to be implemented in their own chips.

The Cortex-X1, which implements the ARMv8.2 ISA, is a higher performance core that is designed to be combined with the Cortex-A78 in a DynamIQ big.LITTLE combination in order to provide even higher single-thread performance. This core, along with the Cortex-A78, is often combined with a number of low(er) power cores (e.g. Cortex-A55) in order to achieve better energy/performance.

Cortex-X

ARM • Cortex
{| class="wikitable" style="text-align: center;"
|-
! Year !! Cortex-X Core !! Cortex-A Core
|-
| 2020 || Cortex-X1 (Hera) <br>Cortex-X1C (Hera-C) || Cortex-A78 (Hercules) <br>Cortex-A78C (Hera Prime)
|-
| 2021 || Cortex-X2 (Matterhorn-ELP) || Cortex-A710 (Matterhorn) <br>Cortex-A510 (Klein)
|-
| 2022 || Cortex-X3 (Makalu-ELP) || Cortex-A715 (Makalu)
|-
| 2023 || Cortex-X4 (Hunter-ELP) || Cortex-A720 (Hunter) <br>Cortex-A520 (Hayes)
|-
| 2024 || <s>Cortex-X5 (Chaberton-ELP)</s> <br>Cortex-X925 (Blackhawk) || Cortex-A720AE (Hunter-AE) <br>Cortex-A725 (Chaberton)
|-
| 2025 || Cortex-X930 (Travis) || Cortex-A730 (Gelas) <br>Cortex-A530 (Nevis)
|}

Process Technology

Although the Cortex-X1 may be fabricated on various process nodes, it has been primarily designed for the 10 nm, 7 nm, and 5 nm process nodes, with performance, power, and area numbers mainly targeting the 5-nanometer node.

Architecture

Key changes from Cortex-A78

See also: Cortex-A78 § Key changes from Cortex-A77

The Cortex-X1 is a custom performance-enhanced variant of the Cortex-A78; it therefore inherits most of the changes that were made to the Cortex-A78 relative to the Cortex-A77.
  • Higher performance (See § Performance claims)
    • Arm self-reported around 30% performance over the Cortex-A77
      (compared to +20% with the Cortex-A78)
    • 2.0x (machine learning) performance
  • Silicon area
    • 15% more silicon area (on N5)
  • Front-end
    • 1.25x wider decode (5-way, up from 4-way)
    • 1.33x wider decoded cache bandwidth
      (8 MOPs/cycle, up from 6 MOPs/cycle)
  • Memory subsystem
    • Only 64 KiB L1I cache option (from 32-64 KiB)
    • Only 64 KiB L1D cache option (from 32-64 KiB)
    • Up to 1 MiB L2 cache option (from 512 KiB)
    • Up to 8 MiB L3 cache option (from 4 MiB)
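
The multipliers quoted in the list above follow directly from the before and after figures. As a quick sanity check, the Python sketch below recomputes them. The dictionary keys and the `widening` helper are names made up for this illustration and are not part of any Arm tooling.

```python
from fractions import Fraction

# Figures copied from the list above as (Cortex-A78 value, Cortex-X1 value).
# The labels are illustrative names chosen for this sketch.
CHANGES = {
    "decode width (instructions/cycle)": (4, 5),
    "decoded cache bandwidth (MOPs/cycle)": (6, 8),
    "maximum L2 cache (KiB)": (512, 1024),
    "maximum L3 cache (MiB)": (4, 8),
}


def widening(old, new):
    """Return new/old as an exact fraction."""
    return Fraction(new, old)


for name, (a78, x1) in CHANGES.items():
    ratio = widening(a78, x1)
    print(f"{name}: {a78} -> {x1} ({float(ratio):.2f}x)")
```

The decode width works out to 5/4 = 1.25x and the decoded cache bandwidth to 8/6, or about 1.33x, which matches the figures in the list. The L2 and L3 maxima each double.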

Comparison

"Prime" core
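{| class="wikitable sortable" style="text-align: center;"
|-
! Architecture !! Cortex-A78 !! Cortex-X1 !! Cortex-X2 !! Cortex-X3 !! Cortex-X4 !! Cortex-X925 !! Cortex-X930
|-
! Code name
| Hercules || Hera || Matterhorn-ELP || Makalu-ELP || Hunter-ELP || Blackhawk || Travis
|-
! ISA
| colspan="2" | ARMv8.2-A || colspan="2" | ARMv9.0-A || colspan="3" | ARMv9.2-A
|-
! Peak clock speed
| colspan="3" | ~3.0 GHz || ~3.3 GHz || ~3.4 GHz || ~3.8 GHz || ~4.2 GHz
|-
! Max in-flight
| 2x 160 || 2x 224 || 2x 288 || 2x 320 || 2x 384 || 2x 768 ||
|-
! L0 (Mops entries)
| 1536 [1] || colspan="2" | 3072 || 1536 || 0 || ||
|-
! L1-I + L1-D
| 32+32 KiB || colspan="2" | 64+64 KiB || colspan="2" | 64+64 KiB || 64+64 KiB ||
|-
! L2
| 128–512 KiB || colspan="3" | 0.25–1 MiB || 0.5–2 MiB || 2–3 MiB ||
|-
! L3
| colspan="2" | 0–8 MiB [2] || colspan="2" | 0–16 MiB || colspan="2" | 0–32 MiB ||
|-
! Decode width
| 4 || colspan="2" | 5 || 6 || 10 [3] || 10 ||
|-
! Dispatch
| 6/cycle || colspan="3" | 8/cycle || colspan="2" | 10/cycle ||
|}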
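
The cache sizes here, together with the associativities listed in the infobox, determine how many sets each cache has. The sketch below works this out for the largest Cortex-X1 configuration. It assumes a 64-byte cache line, which this page does not state, and the names `num_sets` and `CACHES` are invented for the example.

```python
KIB = 1024
MIB = 1024 * KIB

# ASSUMPTION: 64-byte cache lines. The line size is not given on this page.
LINE_BYTES = 64


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Sets in a set-associative cache: size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("cache size is not a whole number of sets")
    return sets


# (size in bytes, associativity) taken from the infobox values.
CACHES = {
    "L1I": (64 * KIB, 4),
    "L1D": (64 * KIB, 4),
    "L2": (1 * MIB, 8),
    "L3": (8 * MIB, 16),
}

for name, (size, ways) in CACHES.items():
    print(f"{name}: {num_sets(size, ways)} sets")
```

Under that line-size assumption, each 64 KiB 4-way L1 cache has 256 sets, the 1 MiB 8-way L2 has 2048, and the 8 MiB 16-way L3 has 8192. A different line size would scale these counts inversely.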

Performance claims

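  • Compared to the Cortex-A77, the Cortex-X1 is said to be 30% faster in peak performance on SPEC CPU2006.
The improvement comes from both architectural improvements and frequency improvement with the help
of process improvement moving from the 7 nm to the 5 nm node.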
Performance
Cortex-A77 Cortex-X1
1.0x 1.3x
2.6 GHz 3.0 GHz
7 nm (N7) 5 nm(N5)
  • Cortex-X1 1 MiB L2, 8 MiB L3 cache
  • Cortex-A77 512 KiB L2 , 4 MiB L3 cache
  • Arm says that, at ISO-process and frequency, the Cortex-X1 achieves 22% higher integer performance (SPEC CPU2006)
over the Cortex-A78 and 30% higher integer performance over the Cortex-A77. Likewise, due to the doubling
of the number of NEON units, the Cortex-X1 can achieve twice the ML performance as both the A77 and A78.
Performance @ ISO-process/frequency
Cortex-A77 Cortex-X1
1.0x 1.3x (integer performance)
1.0x 2.0x (ML performance)
3.0 GHz 3.0 GHz
7 nm (N7) 5 nm(N5)
  • Cortex-X1 1 MiB L2, 8 MiB L3 cache
  • Cortex-A77 512 KiB L2 , 4 MiB L3 cache
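
The claims above can be combined with some simple arithmetic. The sketch below assumes a plain multiplicative model in which performance equals clock frequency times per-clock throughput. That model is an assumption made for illustration and is not something Arm states. The function names are likewise invented here.

```python
def split_speedup(total, f_old_ghz, f_new_ghz):
    """Split a total speedup into (frequency factor, per-clock factor).

    ASSUMPTION: performance scales as frequency * per-clock throughput.
    """
    freq_factor = f_new_ghz / f_old_ghz
    return freq_factor, total / freq_factor


def implied_ratio(x1_over_a77, x1_over_a78):
    """Cortex-A78 over Cortex-A77 ratio implied by two Cortex-X1 claims."""
    return x1_over_a77 / x1_over_a78


# First table: 1.3x overall, 2.6 GHz (Cortex-A77) versus 3.0 GHz (Cortex-X1).
freq, per_clock = split_speedup(1.3, 2.6, 3.0)
print(f"frequency factor {freq:.3f}, residual per-clock factor {per_clock:.3f}")

# Iso-frequency claims: +30% over the Cortex-A77 and +22% over the Cortex-A78.
print(f"implied A78 over A77 at iso-frequency: {implied_ratio(1.30, 1.22):.3f}")
```

Because the inputs are rounded vendor figures from two separate claims, the results should be read as rough estimates only.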

Overview

  • The Cortex-X1 is a high-performance synthesizable core designed by Arm. It is delivered as Register
Transfer Level (RTL) description in Verilog and is designed to be integrated into customer's SoCs.
  • This core supports the ARMv8.2 extension as well as a number of other partial extensions.
This is the first from Arm's Cortex-X custom program. The X1 is a performance-enhanced
version of the A78, it therefore uses the A78 as the starting point for its modifications.
  • The Cortex-X1 is built on top of the Cortex-A78, but enhances it in order to extract additional performance,
albeit at a slight reduction in power efficiency and area. To that end, whereas the Hercules was said to provide
a 20% sustained performance uplift over the Cortex-A77, the Cortex-X1 offers up to 30% peak performance.
  • In other words, whereas the Cortex-A78 is designed for high sustained performance at high performance-efficiency,
the Cortex-X1 is designed to supplement it with higher peak performance while relaxing the power and area constraints.
  • The Cortex-X1 is a fatter version of the Cortex-A78, relying on bigger buffers and a large out-of-order window
in order to extract further performance. To that end, the X1 features a 5-way decode, twice as many NEON units,
and larger overall buffers in order to allow for a bigger out-of-order window with more in-flight operations.
  • The Cortex-X1 enlarges the pipeline while still retaining the higher frequency which was introduced in the Cortex-A77.
  • The Cortex-X1 is intended to be combined with a number of Cortex-A78 cores in DynamIQ Shared Unit (DSU)
cluster, possibly along with other lower-power cores such as the Cortex-A55, to more efficiently support
a wide range of workloads at various performance and power levels beyond what's possible with any one core.

DSU Cluster

  • The Cortex-X1 provides additional peak performance beyond what the Cortex-A78 can offer.
Therefore the X1 is designed to be combined with a number of Cortex-A78 cores in DynamIQ
Shared Unit (DSU) cluster in order to provide a balance in both power and performance.
  • Compared to a quad-core Cortex-A77 cluster on 7 nm, a quad-core Cortex-A78 cluster provides
+20% sustained performance improvement while reducing the silicon area by about 15%.
  • When replacing one of those big Cortex-A78 cores with a single Cortex-X1 core, the cluster
can now provide a peak single-thread performance of up to 30% versus the Cortex-A77
at the cost of 15% additional silicon area (or neutral area-wise from N7 to N5).
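
The remark that the swap is roughly area-neutral from N7 to N5 can be checked with the two percentages given above. The sketch below assumes the 15% area increase is measured relative to the quad-core Cortex-A78 cluster, which is one reading of the text. The function name and parameters are invented for this example.

```python
def cluster_area_vs_a77(a78_cluster_vs_a77=0.85, x1_area_increase=0.15):
    """Relative area of a 1x Cortex-X1 + 3x Cortex-A78 cluster on N5.

    Baseline is a quad-core Cortex-A77 cluster on N7 (area 1.0).
    ASSUMPTION: the +15% is relative to the quad-core Cortex-A78 cluster.
    """
    return a78_cluster_vs_a77 * (1.0 + x1_area_increase)


area = cluster_area_vs_a77()
print(f"relative cluster area: {area:.4f}")
```

With these inputs the cluster comes out at about 0.98 of the Cortex-A77 baseline, which is consistent with the "neutral area-wise" description.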

References

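  1. Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence. https://www.anandtech.com/show/15813/arm-cortex-a78-cortex-x1-cpu-ip-diverging
  2. Schor, David (2020-05-26). Arm Cortex-X1: The First From The Cortex-X Custom Program. WikiChip Fuse. https://fuse.wikichip.org/news/3543/arm-cortex-x1-the-first-from-the-cortex-x-custom-program/
  3. (2023-05-29). Arm Cortex-X4, A720, and A520: 2024 smartphone CPUs deep dive. Android Authority. https://www.androidauthority.com/arm-cortex-x4-explained-3328008/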