(Cortex-X1) |
(fixed) |
||
(5 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
− | {{armh title|Cortex-X1|arch}} | + | {{armh title|Cortex-X1 (Hera)|arch}} |
{{microarchitecture | {{microarchitecture | ||
− | |atype=CPU | + | | atype = CPU |
− | |name=Cortex-X1 | + | | name = Cortex-X1 (Hera) |
− | |designer=ARM Holdings | + | | codename = Cortex-X1 |
− | |manufacturer=TSMC | + | | core name = '''Cortex-X1''' |
− | |introduction=May 26, 2020 | + | | designer = ARM Holdings |
− | |process=10 nm | + | | manufacturer = TSMC |
− | |process 2=7 nm | + | | introduction = May 26, 2020 |
− | |process 3=5 nm | + | | process = 10 nm |
− | |cores=1 | + | | process 2 = 7 nm |
− | |cores 2=2 | + | | process 3 = 5 nm |
− | |cores 3=4 | + | | cores = 1 |
− | |cores 4=6 | + | | cores 2 = 2 |
− | |cores 5=8 | + | | cores 3 = 4 |
− | |type=Superscalar | + | | cores 4 = 6 |
− | |type 2=Pipelined | + | | cores 5 = 8 |
− | |oooe=Yes | + | | type = Superscalar |
− | |speculative=Yes | + | | type 2 = Pipelined |
− | |renaming=Yes | + | | oooe = Yes |
− | |stages=13 | + | | speculative = Yes |
− | |decode=5-way | + | | renaming = Yes |
− | |isa=ARMv8.2 | + | | stages = 13 |
− | |feature=Hardware virtualization | + | | decode = 5-way |
− | |extension=FPU | + | | isa = ARMv8.2 |
− | |extension 2=NEON | + | | feature = Hardware virtualization |
− | |l1i= | + | | extension = FPU |
− | |l1i per=core | + | | extension 2 = NEON |
− | |l1i desc=4-way set associative | + | | l1i = 64 KiB |
− | |l1d= | + | | l1i per = core |
− | |l1d per=core | + | | l1i desc = 4-way set associative |
− | |l1d desc=4-way set associative | + | | l1d = 64 KiB |
− | |l2= | + | | l1d per = core |
− | |l2 per=core | + | | l1d desc = 4-way set associative |
− | |l2 desc=8-way set associative | + | | l2 = 1 MiB |
− | |l3= | + | | l2 per = core |
− | |l3 per= | + | | l2 desc = 8-way set associative |
− | |l3 desc=16-way set associative | + | | l3 = 8 MiB |
− | |contemporary=Cortex-A78 | + | | l3 per = cluster |
− | |contemporary link=arm holdings/microarchitectures/cortex-a78 | + | | l3 desc = 16-way set associative |
+ | | successor = '''Cortex-X2''' (Matterhorn-ELP) | ||
+ | | successor link = arm holdings/microarchitectures/cortex-x2 | ||
+ | | successor 2 = '''Cortex-X3''' (Makalu-ELP) | ||
+ | | successor 2 link = arm holdings/microarchitectures/cortex-x3 | ||
+ | | successor 3 = '''Cortex-X4''' (Hunter-ELP) | ||
+ | | successor 3 link = arm holdings/microarchitectures/hunter-elp | ||
+ | | contemporary = '''Cortex-A78''' (Hercules) | ||
+ | | contemporary link = arm holdings/microarchitectures/cortex-a78 | ||
}} | }} | ||
− | '''Cortex-X1''' (codename | + | '''Cortex-X1''' (codename ''Hera'') is a performance-enhanced version of the {{armh|Cortex-A78|l=arch}} ''(Hercules)'', a low-power high-performance [[ARM]] [[microarchitecture]] designed by [[Arm]] for the mobile market. The Cortex-X1 was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable [[IP core]] and is licensed to other semiconductor companies to be implemented in their own chips. |
− | The Cortex-X1, which implements the {{arm|ARMv8.2 | + | The '''Cortex-X1''', which implements the {{arm|ARMv8}}.2 ISA, is a higher performance core that is designed to be combined with the {{\\|Cortex-A78}} in a {{armh|big.LITTLE|DynamIQ big.LITTLE}} combination in order to provide even higher single-thread performance. This core, along with the {{\\|Cortex-A78}}, is often combined with a number of lower-power cores (e.g. {{\\|Cortex-A55}}) in order to achieve better energy efficiency. |
+ | |||
+ | === [[Cortex]]-X === | ||
+ | :;[[ARM]] • [[Cortex]] | ||
+ | {| class="wikitable" style="text-align: center;" | ||
+ | |- | ||
+ | ! Year !! Cortex-X Core !! Cortex-A Core | ||
+ | |- | ||
+ | | [[2020]] || {{armh|Cortex-X1|l=arch}} (''{{armh|Hera|l=arch}}'') <br>{{armh|Cortex-X1C|l=arch}} (''{{armh|Hera-C|l=arch}}'') || {{armh|Cortex-A78|l=arch}} (''{{armh|Hercules|l=arch}}'') <!--<br>{{armh|Cortex-A78AE|l=arch}} (''{{armh|Hercules-AE|l=arch}}'')--> <br>{{armh|Cortex-A78C|l=arch}} (''{{armh|Hera Prime|l=arch}}'') | ||
+ | |- | ||
+ | | [[2021]] || {{armh|Cortex-X2|l=arch}} <br>(''{{armh|Matterhorn-ELP|l=arch}}'') || {{armh|Cortex-A710|l=arch}} (''{{armh|Matterhorn|l=arch}}'') <br>{{armh|Cortex-A510|l=arch}} (''{{armh|Klein|l=arch}}'') | ||
+ | |- | ||
+ | | [[2022]] || {{armh|Cortex-X3|l=arch}} (''{{armh|Makalu-ELP|l=arch}}'') || {{armh|Cortex-A715|l=arch}} (''{{armh|Makalu|l=arch}}'') | ||
+ | |- | ||
+ | | [[2023]] || {{armh|Cortex-X4|l=arch}} (''{{armh|Hunter-ELP|l=arch}}'') || {{armh|Cortex-A720|l=arch}} (''{{armh|Hunter|l=arch}}'') <br>{{armh|Cortex-A520|l=arch}} (''{{armh|Hayes|l=arch}}'') | ||
+ | |- | ||
+ | | [[2024]] || <s>{{armh|Cortex-X5|l=arch}} (''{{armh|Chaberton-ELP|l=arch}}'')</s> <br>{{armh|Cortex-X925|l=arch}} (''{{armh|Blackhawk|l=arch}}'') || {{armh|Cortex-A720AE|l=arch}} (''{{armh|Hunter-AE|l=arch}}'') <br>{{armh|Cortex-A725|l=arch}} (''{{armh|Chaberton|l=arch}}'') | ||
+ | |- | ||
+ | | [[2025]] || {{armh|Cortex-X930|l=arch}} (''{{armh|Travis|l=arch}}'') || {{armh|Cortex-A730|l=arch}} (''{{armh|Gelas|l=arch}}'') <br>{{armh|Cortex-A530|l=arch}} (''{{armh|Nevis|l=arch}}'') | ||
+ | |- | ||
+ | |} | ||
== Process Technology == | == Process Technology == | ||
− | Although the Cortex-X1 may be fabricated on various [[process nodes]], it has been primarily designed for the [[10 nm]], [[7 nm]], and [[5 nm]] process nodes with performance, power and area numbers mainly targeting the [[5-nanometer node]]. | + | Although the Cortex-X1 may be fabricated on various [[process nodes]], it has been primarily designed for the [[10 nm]], [[7 nm]], |
+ | :and [[5 nm]] process nodes with performance, power and area numbers mainly targeting the [[5-nanometer node]]. | ||
− | == | + | == Architecture == |
− | |||
− | |||
=== Key changes from {{\\|Cortex-A78}} === | === Key changes from {{\\|Cortex-A78}} === | ||
{{see also|arm_holdings/microarchitectures/cortex-a78#Key_changes_from_Cortex-A77|l1=Cortex-A78 § Key changes from Cortex-A77}} | {{see also|arm_holdings/microarchitectures/cortex-a78#Key_changes_from_Cortex-A77|l1=Cortex-A78 § Key changes from Cortex-A77}} | ||
− | The Cortex-X1 is a custom performance-enhanced variant of the {{\\|Cortex- | + | The Cortex-X1 is a custom performance-enhanced variant of the {{\\|Cortex-A78}}, therefore it |
+ | :inherits most of the changes that were done to the {{\\|Cortex-A78}} from the {{\\|Cortex-A77}}. | ||
* Higher performance (See [[#Performance claims|§ Performance claims]]) | * Higher performance (See [[#Performance claims|§ Performance claims]]) | ||
− | ** [[Arm]] self-reported around 30% performance over the A77 (compared to +20% with the A78) | + | ** [[Arm]] self-reported around 30% performance over the {{\\|Cortex-A77}} <br>(compared to +20% with the {{\\|Cortex-A78}}) |
− | + | ** 2.0x (machine learning) performance | |
* Silicon area | * Silicon area | ||
** 15% more silicon area (on [[N5]]) | ** 15% more silicon area (on [[N5]]) | ||
* Front-end | * Front-end | ||
** 1.25x wider decode (5-way, up from 4-way) | ** 1.25x wider decode (5-way, up from 4-way) | ||
− | ** 1.33x wider decoded cache bandwidth (8 MOPs/cycle, up from 6 MOPs/cycle) | + | ** 1.33x wider decoded cache bandwidth <br>(8 MOPs/cycle, up from 6 MOPs/cycle) |
* Memory subsystem | * Memory subsystem | ||
− | ** Only 64 KiB | + | ** Only 64 KiB L1I cache option (from 32-64 KiB) |
− | ** Only 64 KiB | + | ** Only 64 KiB L1D cache option (from 32-64 KiB) |
− | ** Up to 1 MiB | + | ** Up to 1 MiB L2 cache option (from 512 KiB) |
− | ** Up to 8 MiB | + | ** Up to 8 MiB L3 cache option (from 4 MiB) |
− | {{ | + | === Comparison === |
+ | |||
+ | :;"Prime" core | ||
+ | {| class="wikitable sortable" cellpadding="3px" style="border: 1px solid black; border-spacing: 0px; width: 100%; text-align:center;" | ||
+ | |- | ||
+ | ![[Microarchitecture|Architecture]] | ||
+ | !{{armh|Cortex-A78|l=arch}} | ||
+ | !{{armh|Cortex-X1|l=arch}} | ||
+ | !{{armh|Cortex-X2|l=arch}} | ||
+ | !{{armh|Cortex-X3|l=arch}} | ||
+ | !{{armh|Cortex-X4|l=arch}} | ||
+ | !{{armh|Cortex-X925|l=arch}} | ||
+ | !{{armh|Cortex-X930|l=arch}} | ||
+ | |- | ||
+ | !Code name | ||
+ | |''{{armh|Hercules|l=arch}}'' | ||
+ | |''Hera'' | ||
+ | |''{{armh|Matterhorn|l=arch}}-ELP'' | ||
+ | |''{{armh|Makalu|l=arch}}-ELP'' | ||
+ | |''{{armh|Hunter-ELP|l=arch}}'' | ||
+ | |''Blackhawk'' | ||
+ | |''Travis'' | ||
+ | |- | ||
+ | !ISA | ||
+ | | colspan="2" |[[ARMv8]].2-A | ||
+ | | colspan="2" |ARMv9.0-A | ||
+ | | colspan="3" |ARMv9.2-A | ||
+ | |- | ||
+ | !Peak clock speed | ||
+ | | colspan="3" |~3.0 GHz | ||
+ | |~3.3 GHz | ||
+ | |~3.4 GHz | ||
+ | |~3.8 GHz | ||
+ | |~4.2 GHz | ||
+ | |- | ||
+ | !Max in-flight | ||
+ | |2x 160 | ||
+ | |2x 224 | ||
+ | |2x 288 | ||
+ | |2x 320 | ||
+ | |2x 384 | ||
+ | |2x 768 | ||
+ | | | ||
+ | |- | ||
+ | !L0 (Mops entries) | ||
+ | |1536 <ref>{{cite book |title=Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence |url=https://www.anandtech.com/show/15813/arm-cortex-a78-cortex-x1-cpu-ip-diverging }}</ref> | ||
+ | | colspan="2" |3072 | ||
+ | |1536 | ||
+ | |0 | ||
+ | | | ||
+ | | | ||
+ | |- | ||
+ | !L1-I + L1-D | ||
+ | |32+32 KiB | ||
+ | | colspan="2" |64+64 KiB | ||
+ | | colspan="2" |64+64 KiB | ||
+ | |64+64 KiB | ||
+ | | | ||
+ | |- | ||
+ | !L2 | ||
+ | |128–512 KiB | ||
+ | | colspan="3" |0.25–1 MiB | ||
+ | |0.5–2 MiB | ||
+ | |2–3 MiB | ||
+ | | | ||
+ | |- | ||
+ | !L3 | ||
+ | | colspan="2" |0–8 MiB <ref>{{cite book |last=Schor |first=David |date=2020-05-26 |title=Arm Cortex-X1: The First From The Cortex-X Custom Program |url=https://fuse.wikichip.org/news/3543/arm-cortex-x1-the-first-from-the-cortex-x-custom-program/ |website=WikiChip Fuse }}</ref> | ||
+ | | colspan="2" |0–16 MiB | ||
+ | | colspan="2" |0–32 MiB | ||
+ | | | ||
+ | |- | ||
+ | !Decode width | ||
+ | |4 | ||
+ | | colspan="2" |5 | ||
+ | |6 | ||
+ | |10 <ref>{{cite book |date=2023-05-29 |title=Arm Cortex-X4, A720, and A520: 2024 smartphone CPUs deep dive |url=https://www.androidauthority.com/arm-cortex-x4-explained-3328008/ |website=Android Authority}}</ref> | ||
+ | |10 | ||
+ | | | ||
+ | |- | ||
+ | !Dispatch | ||
+ | |6/cycle | ||
+ | | colspan="3" |8/cycle | ||
+ | | colspan="2" |10/cycle | ||
+ | | | ||
+ | |- | ||
+ | |} | ||
== Performance claims == | == Performance claims == | ||
− | Compared to the {{\\|Cortex-A77}}, the X1 is said to be 30% faster in peak performance on [[SPEC CPU2006]]. The improvement comes from both architectural improvements and frequency improvement with the help of process improvement moving from the [[N7|7 nm]] to the [[N5|5 nm node]]. | + | *Compared to the {{\\|Cortex-A77}}, the Cortex-X1 is said to be 30% faster in peak performance on [[SPEC CPU2006]]. |
+ | :The improvement comes from both architectural improvements and frequency improvement with the help | ||
+ | :of process improvement moving from the [[N7|7 nm]] to the [[N5|5 nm node]]. | ||
{| class="wikitable" | {| class="wikitable" | ||
Line 82: | Line 198: | ||
| 1.0x || 1.3x | | 1.0x || 1.3x | ||
|- | |- | ||
− | | 2 | + | | 2.6 GHz || 3.0 GHz |
|- | |- | ||
| [[N7|7 nm (N7)]] || [[N5|5 nm(N5)]] | | [[N7|7 nm (N7)]] || [[N5|5 nm(N5)]] | ||
|- | |- | ||
| colspan="2" | | | colspan="2" | | ||
− | * Cortex-X1 1 MiB L2, 8 MiB L3 cache | + | * '''Cortex-X1''' 1 MiB L2, 8 MiB L3 cache |
* {{\\|Cortex-A77}} 512 KiB L2 , 4 MiB L3 cache | * {{\\|Cortex-A77}} 512 KiB L2 , 4 MiB L3 cache | ||
|} | |} | ||
− | Arm says that, at ISO-process and frequency, the Cortex-X1 achieves 22% higher integer performance ([[SPEC CPU2006]]) over the {{\\|Cortex-A78}} and 30% higher integer performance over the {{\\|Cortex-A77}}. Likewise, due to the doubling of the number of | + | *Arm says that, at ISO-process and frequency, the Cortex-X1 achieves 22% higher integer performance ([[SPEC CPU2006]]) |
+ | :over the {{\\|Cortex-A78}} and 30% higher integer performance over the {{\\|Cortex-A77}}. Likewise, due to the doubling | ||
+ | :of the number of ''NEON'' units, the Cortex-X1 can achieve twice the ML performance as both the {{\\|Cortex-A77|A77}} and {{\\|Cortex-A78|A78}}. | ||
{| class="wikitable" | {| class="wikitable" | ||
Line 103: | Line 221: | ||
| 1.0x || 2.0x (ML performance) | | 1.0x || 2.0x (ML performance) | ||
|- | |- | ||
− | | | + | | 3.0 GHz || 3.0 GHz |
|- | |- | ||
| [[N7|7 nm (N7)]] || [[N5|5 nm(N5)]] | | [[N7|7 nm (N7)]] || [[N5|5 nm(N5)]] | ||
|- | |- | ||
| colspan="2" | | | colspan="2" | | ||
− | * Cortex-X1 1 MiB L2, 8 MiB L3 cache | + | * '''Cortex-X1''' 1 MiB L2, 8 MiB L3 cache |
* {{\\|Cortex-A77}} 512 KiB L2 , 4 MiB L3 cache | * {{\\|Cortex-A77}} 512 KiB L2 , 4 MiB L3 cache | ||
|} | |} | ||
== Overview == | == Overview == | ||
− | The Cortex-X1 is a high-performance | + | *The Cortex-X1 is a high-performance synthesizable core designed by [[Arm]]. It is delivered as Register |
+ | :Transfer Level (RTL) description in Verilog and is designed to be integrated into customer's SoCs. | ||
− | + | *This core supports the {{arm|ARMv8}}.2 extension as well as a number of other partial extensions. | |
+ | :This is the first from [[Arm]]'s [[Cortex]]-X custom program. The X1 is a performance-enhanced | ||
+ | :version of the {{\\|Cortex-A78|A78}}, it therefore uses the {{\\|Cortex-A78|A78}} as the starting point for its modifications. | ||
− | The Cortex-X1 is a fatter version of the {{\\|Cortex- | + | *The Cortex-X1 is built on top of the {{\\|Cortex-A78}}, but enhances it in order to extract additional performance, |
+ | :albeit at a slight reduction in power efficiency and area. To that end, whereas the {{\\|Hercules}} was said to provide | ||
+ | :a 20% sustained performance uplift over the {{\\|Cortex-A77}}, the Cortex-X1 offers up to 30% higher peak performance. | ||
+ | *In other words, whereas the {{\\|Cortex-A78}} is designed for high sustained performance at high performance-efficiency, | ||
+ | :the Cortex-X1 is designed to supplement it with higher peak performance while relaxing the power and area constraints. | ||
+ | *The Cortex-X1 is a fatter version of the {{\\|Cortex-A78}}, relying on bigger buffers and a large out-of-order window | ||
+ | :in order to extract further performance. To that end, the X1 features a 5-way decode, twice as many NEON units, | ||
+ | :and larger overall buffers in order to allow for a bigger out-of-order window with more in-flight operations. | ||
+ | *The Cortex-X1 enlarges the pipeline while still retaining the higher frequency which was introduced in the {{\\|Cortex-A77}}. | ||
+ | *The Cortex-X1 is intended to be combined with a number of {{\\|Cortex-A78}} cores in ''DynamIQ Shared Unit'' (DSU) | ||
+ | :cluster, possibly along with other lower-power cores such as the {{\\|Cortex-A55}}, to more efficiently support | ||
+ | :a wide range of workloads at various performance and power levels beyond what's possible with any one core. | ||
=== DSU Cluster === | === DSU Cluster === | ||
− | The Cortex-X1 provides additional peak performance beyond what the {{\\|Cortex-A78}} can offer. Therefore the X1 is designed to be combined with a number of Cortex-A78 cores in | + | *The Cortex-X1 provides additional peak performance beyond what the {{\\|Cortex-A78}} can offer. |
+ | :Therefore the X1 is designed to be combined with a number of Cortex-A78 cores in ''DynamIQ | ||
+ | :Shared Unit'' (DSU) cluster in order to provide a balance in both power and performance. | ||
+ | |||
+ | *Compared to a quad-core {{\\|Cortex-A77}} cluster on [[N7|7 nm]], a quad-core {{\\|Cortex-A78}} cluster provides | ||
+ | :+20% sustained performance improvement while reducing the silicon area by about 15%. | ||
+ | |||
+ | *When replacing one of those [[big core|big]] {{\\|Cortex-A78}} cores with a single Cortex-X1 core, the cluster | ||
+ | :can now provide a peak single-thread performance of up to 30% versus the {{\\|Cortex-A77}} | ||
+ | :at the cost of 15% additional silicon area (or neutral area-wise from [[N7]] to [[N5]]). | ||
+ | |||
+ | == References == |
Latest revision as of 20:43, 15 April 2025
Cortex-X1 (Hera) µarch | |
General Info | |
Arch Type | CPU |
Designer | ARM Holdings |
Manufacturer | TSMC |
Introduction | May 26, 2020 |
Process | 10 nm, 7 nm, 5 nm |
Core Configs | 1, 2, 4, 6, 8 |
Pipeline | |
Type | Superscalar, Pipelined |
OoOE | Yes |
Speculative | Yes |
Reg Renaming | Yes |
Stages | 13 |
Decode | 5-way |
Instructions | |
ISA | ARMv8.2 |
Extensions | FPU, NEON |
Cache | |
L1I Cache | 64 KiB/core 4-way set associative |
L1D Cache | 64 KiB/core 4-way set associative |
L2 Cache | 1 MiB/core 8-way set associative |
L3 Cache | 8 MiB/cluster 16-way set associative |
Cores | |
Core Names | Cortex-X1 |
Succession | |
Contemporary | |
Cortex-A78 (Hercules) |
Cortex-X1 (codename Hera) is a performance-enhanced version of the Cortex-A78 (Hercules), a low-power high-performance ARM microarchitecture designed by Arm for the mobile market. The Cortex-X1 was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable IP core and is licensed to other semiconductor companies to be implemented in their own chips.
The Cortex-X1, which implements the ARMv8.2 ISA, is a higher performance core that is designed to be combined with the Cortex-A78 in a DynamIQ big.LITTLE combination in order to provide even higher single-thread performance. This core, along with the Cortex-A78, is often combined with a number of lower-power cores (e.g. Cortex-A55) in order to achieve better energy efficiency.
Cortex-X
Year | Cortex-X Core | Cortex-A Core |
---|---|---|
2020 | Cortex-X1 (Hera), Cortex-X1C (Hera-C) | Cortex-A78 (Hercules), Cortex-A78C (Hera Prime) |
2021 | Cortex-X2 (Matterhorn-ELP) | Cortex-A710 (Matterhorn), Cortex-A510 (Klein) |
2022 | Cortex-X3 (Makalu-ELP) | Cortex-A715 (Makalu) |
2023 | Cortex-X4 (Hunter-ELP) | Cortex-A720 (Hunter), Cortex-A520 (Hayes) |
2024 | Cortex-X925 (Blackhawk) | Cortex-A720AE (Hunter-AE), Cortex-A725 (Chaberton) |
2025 | Cortex-X930 (Travis) | Cortex-A730 (Gelas), Cortex-A530 (Nevis) |
Process Technology
Although the Cortex-X1 may be fabricated on various process nodes, it has been primarily designed for the 10 nm, 7 nm, and 5 nm process nodes, with performance, power, and area numbers mainly targeting the 5-nanometer node.
Architecture
Key changes from Cortex-A78
See also: Cortex-A78 § Key changes from Cortex-A77
The Cortex-X1 is a custom performance-enhanced variant of the Cortex-A78; it therefore inherits most of the changes that were made to the Cortex-A78 relative to the Cortex-A77.
- Higher performance (see § Performance claims)
  - Arm self-reported around 30% higher performance over the Cortex-A77 (compared to +20% with the Cortex-A78)
  - 2.0x machine learning performance
- Silicon area
  - 15% more silicon area (on N5)
- Front-end
  - 1.25x wider decode (5-way, up from 4-way)
  - 1.33x wider decoded cache bandwidth (8 MOPs/cycle, up from 6 MOPs/cycle)
- Memory subsystem
  - Only a 64 KiB L1I cache option (previously 32 or 64 KiB)
  - Only a 64 KiB L1D cache option (previously 32 or 64 KiB)
  - Up to 1 MiB L2 cache option (up from 512 KiB)
  - Up to 8 MiB L3 cache option (up from 4 MiB)
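The multipliers quoted in the list above follow directly from the before and after figures. A minimal Python sketch that recomputes them is shown here; the dictionary layout and the `scaling` helper are illustrative names, not anything taken from Arm documentation.

```python
from fractions import Fraction

# (Cortex-A78 value, Cortex-X1 value), taken from the list above.
# The dictionary layout and names are illustrative only.
KEY_CHANGES = {
    "decode width (instructions/cycle)": (4, 5),
    "decoded cache bandwidth (MOPs/cycle)": (6, 8),
    "maximum L2 cache (KiB)": (512, 1024),
    "maximum L3 cache (MiB)": (4, 8),
}


def scaling(old: int, new: int) -> Fraction:
    """Return the exact new/old ratio."""
    return Fraction(new, old)


for name, (a78, x1) in KEY_CHANGES.items():
    print(f"{name}: {a78} -> {x1} ({float(scaling(a78, x1)):.2f}x)")
```

Using exact fractions avoids rounding surprises: 8/6 is stored as 4/3 and only rounded to 1.33 when printed.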
Comparison
- "Prime" core
Architecture | Cortex-A78 | Cortex-X1 | Cortex-X2 | Cortex-X3 | Cortex-X4 | Cortex-X925 | Cortex-X930 |
---|---|---|---|---|---|---|---|
Code name | Hercules | Hera | Matterhorn-ELP | Makalu-ELP | Hunter-ELP | Blackhawk | Travis |
ISA | ARMv8.2-A | ARMv8.2-A | ARMv9.0-A | ARMv9.0-A | ARMv9.2-A | ARMv9.2-A | ARMv9.2-A |
Peak clock speed | ~3.0 GHz | ~3.0 GHz | ~3.0 GHz | ~3.3 GHz | ~3.4 GHz | ~3.8 GHz | ~4.2 GHz |
Max in-flight | 2x 160 | 2x 224 | 2x 288 | 2x 320 | 2x 384 | 2x 768 | |
L0 (MOPs entries) | 1536 [1] | 3072 | 3072 | 1536 | 0 | | |
L1-I + L1-D | 32+32 KiB | 64+64 KiB | 64+64 KiB | 64+64 KiB | 64+64 KiB | 64+64 KiB | |
L2 | 128–512 KiB | 0.25–1 MiB | 0.25–1 MiB | 0.25–1 MiB | 0.5–2 MiB | 2–3 MiB | |
L3 | 0–8 MiB [2] | 0–8 MiB [2] | 0–16 MiB | 0–16 MiB | 0–32 MiB | 0–32 MiB | |
Decode width | 4 | 5 | 5 | 6 | 10 [3] | 10 | |
Dispatch | 6/cycle | 8/cycle | 8/cycle | 8/cycle | 10/cycle | 10/cycle | |
Performance claims
Compared to the Cortex-A77, the Cortex-X1 is said to be 30% faster in peak performance on SPEC CPU2006. The gain comes from both architectural improvements and a higher clock frequency, the latter helped by the move from the 7 nm to the 5 nm process node.
Performance | |
---|---|
Cortex-A77 | Cortex-X1 |
1.0x | 1.3x |
2.6 GHz | 3.0 GHz |
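7 nm (N7) | 5 nm (N5) |
Configuration: Cortex-X1 with 1 MiB L2 and 8 MiB L3 cache; Cortex-A77 with 512 KiB L2 and 4 MiB L3 cache |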
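As a rough reading of the table above, the 1.3x figure can be split into a clock-frequency part and a remaining per-clock part. The sketch below assumes that performance scales as work per clock times clock frequency, which is a simplification introduced here (it ignores memory-bound behaviour and the differing cache configurations); `per_clock_gain` is a hypothetical helper name.

```python
def per_clock_gain(total_speedup: float, f_old_ghz: float, f_new_ghz: float) -> float:
    """Part of a speedup not explained by clock frequency.

    Assumes performance = (work per clock) x (clock frequency), which
    ignores memory-bound behaviour and differing cache configurations.
    """
    if f_old_ghz <= 0 or f_new_ghz <= 0:
        raise ValueError("clock frequencies must be positive")
    return total_speedup / (f_new_ghz / f_old_ghz)


# Cortex-A77 at 2.6 GHz versus Cortex-X1 at 3.0 GHz, 1.3x overall (table above).
frequency_gain = 3.0 / 2.6
residual_gain = per_clock_gain(1.3, 2.6, 3.0)
print(f"frequency: {frequency_gain:.2f}x, per clock: {residual_gain:.2f}x")
```

Under that assumption the clock accounts for about 1.15x and the remainder is about 1.13x. This is only a first-order split of a vendor-reported figure, not a measurement.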
Arm says that, at iso-process and iso-frequency, the Cortex-X1 achieves 22% higher integer performance (SPEC CPU2006) than the Cortex-A78 and 30% higher integer performance than the Cortex-A77. Likewise, because the number of NEON units is doubled, the Cortex-X1 can achieve twice the ML performance of both the A77 and A78.
Performance @ ISO-process/frequency | |
---|---|
Cortex-A77 | Cortex-X1 |
1.0x | 1.3x (integer performance) |
1.0x | 2.0x (ML performance) |
3.0 GHz | 3.0 GHz |
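7 nm (N7) | 5 nm (N5) |
Configuration: Cortex-X1 with 1 MiB L2 and 8 MiB L3 cache; Cortex-A77 with 512 KiB L2 and 4 MiB L3 cache |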
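The two integer figures quoted for the Cortex-X1 (22% over the Cortex-A78 and 30% over the Cortex-A77) imply a gap between the A78 and A77 themselves. The sketch below assumes both figures come from the same benchmark and conditions and therefore compose multiplicatively; that assumption, and the `implied_ratio` helper, are introduced here for illustration.

```python
def implied_ratio(x1_over_a77: float, x1_over_a78: float) -> float:
    """Return the A78-over-A77 ratio implied by two X1 speedup figures.

    Assumes both speedups were measured under the same conditions, so
    (X1 / A77) = (X1 / A78) * (A78 / A77).
    """
    if x1_over_a78 <= 0:
        raise ValueError("speedup must be positive")
    return x1_over_a77 / x1_over_a78


a78_over_a77 = implied_ratio(1.30, 1.22)
print(f"implied Cortex-A78 over Cortex-A77: {a78_over_a77:.3f}x")
```

The result is roughly 1.07x, smaller than the +20% sustained figure quoted for the Cortex-A78 elsewhere in this article. That is to be expected if the +20% figure also includes process and frequency gains, which an iso-process, iso-frequency comparison deliberately excludes.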
Overview
- The Cortex-X1 is a high-performance synthesizable core designed by Arm. It is delivered as a Register Transfer Level (RTL) description in Verilog and is designed to be integrated into customers' SoCs.
- This core supports the ARMv8.2 extension as well as a number of other partial extensions. It is the first core from Arm's Cortex-X custom program. The X1 is a performance-enhanced version of the A78 and therefore uses the A78 as the starting point for its modifications.
- The Cortex-X1 is built on top of the Cortex-A78 but enhances it in order to extract additional performance, albeit at some cost in power efficiency and area. Whereas the Cortex-A78 (Hercules) was said to provide a 20% sustained performance uplift over the Cortex-A77, the Cortex-X1 offers up to 30% higher peak performance.
- In other words, whereas the Cortex-A78 is designed for high sustained performance at high performance-efficiency, the Cortex-X1 is designed to supplement it with higher peak performance while relaxing the power and area constraints.
- The Cortex-X1 is a fatter version of the Cortex-A78, relying on bigger buffers and a larger out-of-order window in order to extract further performance. To that end, the X1 features a 5-way decode, twice as many NEON units, and larger overall buffers that allow more in-flight operations.
- The Cortex-X1 enlarges the pipeline while still retaining the higher frequency which was introduced in the Cortex-A77.
- The Cortex-X1 is intended to be combined with a number of Cortex-A78 cores in a DynamIQ Shared Unit (DSU) cluster, possibly along with other lower-power cores such as the Cortex-A55, to more efficiently support a wide range of workloads at performance and power levels beyond what is possible with any one core.
DSU Cluster
- The Cortex-X1 provides additional peak performance beyond what the Cortex-A78 can offer. The X1 is therefore designed to be combined with a number of Cortex-A78 cores in a DynamIQ Shared Unit (DSU) cluster in order to balance power and performance.
- Compared to a quad-core Cortex-A77 cluster on 7 nm, a quad-core Cortex-A78 cluster provides a 20% sustained performance improvement while reducing the silicon area by about 15%.
- When one of those big Cortex-A78 cores is replaced with a single Cortex-X1 core, the cluster can provide a peak single-thread performance improvement of up to 30% over the Cortex-A77 at the cost of 15% additional silicon area (or roughly area-neutral going from N7 to N5).
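The "area-neutral" remark can be checked with simple arithmetic. The sketch below assumes that the 15% increase for the Cortex-X1 cluster is measured relative to the quad-core Cortex-A78 cluster; the text does not state the baseline explicitly, so treat this as one plausible reading. `cluster_area` is a hypothetical helper, and areas are normalised to the Cortex-A77 cluster on N7.

```python
def cluster_area(baseline: float, a78_reduction: float, x1_increase: float) -> tuple[float, float]:
    """Return (A78 cluster area, X1 + A78 cluster area) relative to a baseline.

    Assumes the X1 increase applies on top of the A78 cluster area,
    which is an interpretation of the text, not a stated fact.
    """
    a78_cluster = baseline * (1.0 - a78_reduction)
    x1_cluster = a78_cluster * (1.0 + x1_increase)
    return a78_cluster, x1_cluster


# Quad-core Cortex-A77 cluster on N7 normalised to 1.0.
a78_area, x1_area = cluster_area(1.0, 0.15, 0.15)
print(f"4x A78 cluster: {a78_area:.4f}, 1x X1 + 3x A78 cluster: {x1_area:.4f}")
```

Under this reading the mixed cluster ends up at about 0.98 of the original Cortex-A77 cluster area, which is consistent with describing the change as close to area-neutral.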
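References
1. "Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence". https://www.anandtech.com/show/15813/arm-cortex-a78-cortex-x1-cpu-ip-diverging
2. Schor, David (2020-05-26). "Arm Cortex-X1: The First From The Cortex-X Custom Program". WikiChip Fuse. https://fuse.wikichip.org/news/3543/arm-cortex-x1-the-first-from-the-cortex-x-custom-program/
3. "Arm Cortex-X4, A720, and A520: 2024 smartphone CPUs deep dive" (2023-05-29). Android Authority. https://www.androidauthority.com/arm-cortex-x4-explained-3328008/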