From WikiChip
Cortex-X1 - Microarchitectures - ARM
Revision as of 22:45, 30 June 2020 by David (talk | contribs) (Performance claims)

Cortex-X1 µarch

General Info
  Arch Type: CPU
  Designer: ARM Holdings
  Manufacturer: TSMC
  Introduction: May 26, 2020
  Process: 10 nm, 7 nm, 5 nm
  Core Configs: 1, 2, 4, 6, 8
Pipeline
  Type: Superscalar, Pipelined
  OoOE: Yes
  Speculative: Yes
  Reg Renaming: Yes
  Stages: 13
  Decode: 5-way
Instructions
  ISA: ARMv8.2
  Extensions: FPU, NEON
Cache
  L1I Cache: 32-64 KiB/core, 4-way set associative
  L1D Cache: 32-64 KiB/core, 4-way set associative
  L2 Cache: 128-512 KiB/core, 8-way set associative
  L3 Cache: 0-4 MiB/Cluster, 16-way set associative
Contemporary
  Cortex-A78

Cortex-X1 (codename Hera) is a performance-enhanced version of the Cortex-A78, a low-power high-performance ARM microarchitecture designed by Arm for the mobile market. The Cortex-X1 was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable IP core and is sold to other semiconductor companies to be implemented in their own chips.

The Cortex-X1, which implements the ARMv8.2 ISA, is a higher-performance core designed to be combined with the Cortex-A78 in a DynamIQ big.LITTLE configuration in order to provide even higher single-thread performance. This core, along with the Cortex-A78, is often combined with a number of lower-power cores (e.g. Cortex-A55) in order to achieve a better balance of energy and performance.

Process Technology

Although the Cortex-X1 may be fabricated on various process nodes, it has been primarily designed for the 10 nm, 7 nm, and 5 nm process nodes with performance, power and area numbers mainly targeting the 5-nanometer node.

Architecture

Key changes from Cortex-A78

See also: Cortex-A78 § Key changes from Cortex-A77

The Cortex-X1 is a custom performance-enhanced variant of the A78 and therefore inherits most of the changes that the A78 made relative to the A77.

  • Higher performance (See § Performance claims)
    • Arm self-reported around 30% higher performance than the A77 (compared to +20% for the A78)
      • 2.0x machine learning (ML) performance
  • Silicon area
    • 15% more silicon area (on N5)
  • Front-end
    • 1.25x wider decode (5-way, up from 4-way)
    • 1.33x wider decoded cache bandwidth (8 MOPs/cycle, up from 6 MOPs/cycle)
  • Memory subsystem
    • Only 64 KiB L1I cache option (from 32-64 KiB)
    • Only 64 KiB L1D cache option (from 32-64 KiB)
    • Up to 1 MiB L2 cache option (from 512 KiB)
    • Up to 8 MiB L3 cache option (from 4 MiB)
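
The front-end ratios quoted in the list follow directly from the widths. A minimal sketch in Python that recomputes them; the dictionary and function names are illustrative only and are not taken from any Arm documentation:

```python
from fractions import Fraction

# (Cortex-A78 value, Cortex-X1 value) pairs taken from the list above.
FRONT_END = {
    "decode width (instructions/cycle)": (4, 5),
    "decoded cache bandwidth (MOPs/cycle)": (6, 8),
}


def widening(old, new):
    """Return new/old as an exact fraction."""
    return Fraction(new, old)


for name, (a78, x1) in FRONT_END.items():
    ratio = widening(a78, x1)
    print(f"{name}: {a78} -> {x1} = {float(ratio):.2f}x")
```

The printed ratios are 1.25x for decode and 1.33x for the decoded cache bandwidth, matching the figures in the list.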

Performance claims

Compared to the Cortex-A77, the X1 is said to be 30% faster in peak performance on SPEC CPU2006. The gain comes from both microarchitectural improvements and a higher clock frequency, the latter enabled by the move from the 7 nm to the 5 nm process node.

Performance

                          Cortex-A77    Cortex-X1
  Relative performance    1.0x          1.3x
  Frequency               2,600 MHz     3,000 MHz
  Process                 7 nm (N7)     5 nm (N5)

  • Cortex-X1: 1 MiB L2, 8 MiB L3 cache
  • Cortex-A77: 512 KiB L2, 4 MiB L3 cache

Arm says that, at the same process and frequency (ISO-process and ISO-frequency), the Cortex-X1 achieves 22% higher integer performance (SPEC CPU2006) than the Cortex-A78 and 30% higher integer performance than the Cortex-A77. Likewise, because the number of NEON units is doubled, the Cortex-X1 can achieve twice the ML performance of both the A77 and the A78.

Performance @ ISO-process/frequency

                          Cortex-A77    Cortex-X1
  Integer performance     1.0x          1.3x
  ML performance          1.0x          2.0x
  Frequency               3,000 MHz     3,000 MHz
  Process                 7 nm (N7)     5 nm (N5)

  • Cortex-X1: 1 MiB L2, 8 MiB L3 cache
  • Cortex-A77: 512 KiB L2, 4 MiB L3 cache
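
The two ISO-process/frequency integer figures also imply a rough A78-over-A77 uplift. The sketch below is only back-of-the-envelope arithmetic: it assumes both claims refer to the same benchmark and configuration and that the ratios compose multiplicatively. The function name is hypothetical.

```python
def implied_ratio(x1_over_a77, x1_over_a78):
    """Ratio A78/A77 implied by two X1 comparisons made under the same conditions."""
    return x1_over_a77 / x1_over_a78


# Figures quoted above: X1 is +30% over the A77 and +22% over the A78.
a78_over_a77 = implied_ratio(1.30, 1.22)
print(f"Implied A78 vs. A77 integer uplift: {(a78_over_a77 - 1) * 100:.1f}%")
```

Under these assumptions the implied uplift of the A78 over the A77 at equal process and frequency is about 6.6%.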

Overview

The Cortex-X1 is a high-performance synthesizable core designed by Arm. It is delivered as a Register Transfer Level (RTL) description in Verilog and is designed to be integrated into customers' SoCs. This core supports the ARMv8.2 extension as well as a number of other partial extensions. It is the first core from Arm's Cortex-X custom program. The X1 is a performance-enhanced version of the Cortex-A78 and therefore uses the A78 as the starting point for its modifications.

The Cortex-X1 is built on top of the Cortex-A78, but enhances it in order to extract additional performance, albeit at the cost of slightly lower power efficiency and a larger area. To that end, whereas the A78 (codename Hercules) was said to provide a 20% sustained performance uplift over the A77, the Cortex-X1 offers up to 30% higher peak performance. In other words, whereas the A78 is designed for high sustained performance at high power efficiency, the Cortex-X1 is designed to supplement it with higher peak performance while relaxing the power and area constraints.

The Cortex-X1 is a wider version of the A78, relying on bigger buffers and a larger out-of-order window in order to extract further performance. To that end, the X1 features a 5-way decode, twice as many NEON units, and larger overall buffers in order to allow for a bigger out-of-order window with more in-flight operations. The X1 enlarges the pipeline while still retaining the higher frequency that was introduced in the A77. The X1 is intended to be combined with a number of A78 cores in a DynamIQ Shared Unit (DSU) cluster, possibly along with other lower-power cores such as the Cortex-A55, to more efficiently support a wide range of workloads at various performance and power levels beyond what is possible with any one core.

DSU Cluster

The Cortex-X1 provides additional peak performance beyond what the Cortex-A78 can offer. The X1 is therefore designed to be combined with a number of Cortex-A78 cores in a DynamIQ Shared Unit (DSU) cluster in order to balance power and performance. Compared to a quad-core A77 cluster on 7 nm, a quad-core A78 cluster provides a 20% sustained performance improvement while reducing the silicon area by about 15%. When one of those big A78 cores is replaced with a single Cortex-X1 core, the cluster can provide a peak single-thread performance improvement of up to 30% versus the A77 at the cost of 15% additional silicon area (that is, roughly area-neutral going from N7 to N5).
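
The area trade-off can be checked with simple arithmetic. The sketch below assumes that the 15% additional area is measured relative to the quad-core A78 cluster on N5, which the text does not state explicitly; all variable names are illustrative.

```python
# Baseline: quad-core Cortex-A77 cluster on N7, normalized to 1.0.
A77_CLUSTER_AREA_N7 = 1.00

# Quad-core A78 cluster on N5: about 15% smaller than the baseline.
a78_cluster_area = A77_CLUSTER_AREA_N7 * (1 - 0.15)

# One A78 replaced by an X1: assumed to cost 15% on top of the A78 cluster.
x1_cluster_area = a78_cluster_area * (1 + 0.15)

print(f"A78 cluster (N5): {a78_cluster_area:.2f}x of the A77 cluster (N7)")
print(f"1x X1 + 3x A78 cluster (N5): {x1_cluster_area:.2f}x of the A77 cluster (N7)")
```

Under that assumption the X1-based cluster lands at about 0.98x of the original A77 cluster area, which is consistent with the area-neutral description.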
