Latest revision as of 08:25, 31 March 2022

Cortex-A78 µarch
General Info
Arch Type: CPU
Designer: ARM Holdings
Manufacturer: TSMC
Introduction: May 26, 2020
Process: 10 nm, 7 nm, 5 nm
Core Configs: 1, 2, 4, 6, 8
Pipeline
Type: Superscalar, Pipelined
OoOE: Yes
Speculative: Yes
Reg Renaming: Yes
Stages: 13
Decode: 4-way
Instructions
ISA: ARMv8.2
Extensions: FPU, NEON
Cache
L1I Cache: 32-64 KiB/core, 4-way set associative
L1D Cache: 32-64 KiB/core, 4-way set associative
L2 Cache: 128-512 KiB/core, 8-way set associative
L3 Cache: 0-4 MiB/cluster, 16-way set associative
Succession
Contemporary: Cortex-X1

Cortex-A78 (codenamed Hercules) is a low-power, high-performance ARM microarchitecture designed by Arm for the mobile market and the successor to the Cortex-A77. Hercules was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable IP core and is sold to other semiconductor companies to be implemented in their own chips.

The Cortex-A78, which implements the ARMv8.2 ISA, is a high performance core which is often combined with a number of low(er) power cores (e.g. Cortex-A55) in a DynamIQ big.LITTLE configuration to achieve better energy/performance. The A78 may also be mixed with a high-performance Cortex-X1 core in order to provide certain workloads with an additional boost in single-core performance.

History

Arm client roadmap with Hercules.

Development of the Cortex-A78 started in 2015. Arm formally announced Hercules on May 26, 2020.

Process Technology

Although the Cortex-A78 may be fabricated on various process nodes, it has been primarily designed for the 10 nm, 7 nm, and 5 nm process nodes with performance, power and area numbers mainly targeting the 5-nanometer node.

Compiler support

Compiler       Arch-Specific       Arch-Favorable
Arm Compiler   -mcpu=cortex-a78    -mtune=cortex-a78
GCC            -mcpu=cortex-a78    -mtune=cortex-a78
LLVM           -mcpu=cortex-a78    -mtune=cortex-a78

If the Cortex-A78 is coupled with the Cortex-A55 in a big.LITTLE system, GCC also supports the following option:

Compiler   Tune
GCC        -mtune=cortex-a78.cortex-a55

Architecture

Key changes from Cortex-A77

  • Higher performance
    • Arm self-reported around 20% higher performance on SPEC CPU2006 at iso-power (at 1 W/core)
      • 1.15x higher frequency (due to N5 from N7)
      • 7% IPC improvement for integer performance (at iso-process/frequency)
  • Front-end
    • Branch-prediction
      • 2x bandwidth (2 taken branches/cycle, up from 1)
      • Improved accuracy
      • Predictor structures were optimized for better power/area
        • Certain structures were reduced in size
    • Additional instruction fusion cases
    • Renaming/ordering
      • Register renaming unit
        • Internal structures optimized
        • Register check-pointing scheme has been optimized
      • Physical Register File
        • New packing scheme (improves data density)
      • ROB
        • New packing scheme (improves instruction density)
      • Buffer sizes shrunk
      • Execution units
        • Instruction schedulers were optimized for better power efficiency
        • New IMUL integer unit (2 integer multiplications/cycle, up from 1)
    • Improved prefetcher
      • New data prefetcher engines
        • New stride patterns
        • New irregular access pattern detection
  • Memory subsystem
    • 1.5x load issue bandwidth (up to 3 LD/cycle, up from 2)
    • 2x store-data bandwidth (32B/cycle, up from 16B/cycle)
    • 2x L2-L1 bandwidth (64B/cycle, up from 32B/cycle)
    • New 32 KiB L1I cache option (from 64 KiB only)
    • New 32 KiB L1D cache option (from 64 KiB only)
    • Improved prefetcher
      • Earlier prefetching for L1 cache misses

This list is incomplete; you can help by expanding it.

Block Diagram

Typical SoC

Individual Core

Memory Hierarchy

The Cortex-A78 has private L1I, L1D, and L2 caches.

  • Cache
    • L0 MOP Cache
      • 1536-entry
    • L1I Cache
      • 32 KiB OR 64 KiB, 4-way set associative
      • 64-byte cache lines
      • Optional parity protection
      • Write-back
    • L1D Cache
      • 32 KiB OR 64 KiB, 4-way set associative
      • 64-byte cache lines
      • 4-cycle fastest load-to-use latency
      • Optional ECC protection per 32 bits
      • Write-back
    • L2 Cache
      • 256 KiB OR 512 KiB (2 banks)
      • 8-way set associative
      • 9-cycle fastest load-to-use latency
      • optional ECC protection per 64 bits
      • Modified Exclusive Shared Invalid (MESI) coherency
      • Strictly inclusive of the L1 data cache & non-inclusive of the L1 instruction cache
      • Write-back
    • L3 Cache
      • 0 MiB to 4 MiB, 16-way set associative
      • 26-31 cycles load-to-use
      • Shared by all the cores in the cluster

The A78 TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

  • TLBs
    • ITLB
      • 4 KiB, 16 KiB, 64 KiB, 2 MiB, and 32 MiB page sizes
      • 48-entry fully associative
    • DTLB
      • 48-entry fully associative
      • 4 KiB, 16 KiB, 64 KiB, 2 MiB, and 512 MiB page sizes
    • STLB
      • 1280-entry 5-way set associative

Supported Instructions

  • ARMv8
    • A64, A32, and T32
    • Everything up to Armv8.2-A
    • Reliability, Availability, and Serviceability (RAS) extension
    • Statistical Profiling Extension (SPE)
    • Load acquire (LDAPR) instructions extension (from Armv8.3-A)
    • Dot Product instructions extension (from Armv8.4-A)
    • Traps for EL0 and EL1 cache controls
    • PSTATE Speculative Store Bypass Safe (SSBS) bit
    • Speculation barriers (CSDB, SSBB, PSSBB) instructions extension (from Armv8.5-A)

Performance claims

Compared to the Cortex-A77, the A78 is said to be 20% faster in sustained performance on SPEC CPU2006. The improvement comes from a mixture of architectural changes and the frequency and power-efficiency gains of moving from the 7 nm node to the 5 nm node. Likewise, Arm says the A78 can achieve the same level of performance (30 SPECint2006) as the A77 at 50% less power.

                 Performance                 Energy
           Cortex-A77   Cortex-A78    Cortex-A77   Cortex-A78
Ratio      1.0x         1.2x          1.0x         0.5x
Frequency  2,600 MHz    3,000 MHz     2,300 MHz    2,100 MHz
Process    7 nm (N7)    5 nm (N5)     7 nm (N7)    5 nm (N5)
Power      1 W          1 W           ? W          ? W

Arm reports the following iso-comparison numbers over the Cortex-A77. Numbers are based on measured estimates of SPECint*_base2006.

ISO-Comparison (A77 vs A78)
Performance  Power  Area
+7%          -4%    -5%
  • Cortex-A78: 32 KiB L1 / 512 KiB L2 (2020)
  • Cortex-A77: 64 KiB L1 / 512 KiB L2 (2019)

Overview

The Cortex-A78, formerly Hercules, is a high-performance synthesizable core designed by Arm as the successor to the Cortex-A77. It is delivered as a Register Transfer Level (RTL) description in Verilog and is designed to be integrated into customers' SoCs. This core supports the ARMv8.2 ISA as well as a number of partial extensions from later revisions, including the RAS, SPE, LDAPR, and Dot Product extensions.

The Cortex-A78 builds on the extensive design work that was done on the A76 and A77 but enhances it in order to improve its power efficiency. Arm says that both the performance-efficiency and area-efficiency of the core were improved over Deimos. To that end, Arm reports about a 20% sustained performance improvement over Deimos, gained through both architectural improvements and transistor improvements from the migration from the 7-nanometer node to the 5-nanometer node.

The A78 is a 6-way (predecode) 4-way (decode) superscalar out-of-order processor with a 12-wide execution engine and private level 1 and level 2 caches. It is designed to be implemented inside a DynamIQ Shared Unit (DSU) cluster along with other cores. The DSU cluster supports up to eight cores in any combination (e.g., with little cores such as the Cortex-A55, or simply more Cortex-A78s). Additionally, this core may also be combined with the Cortex-X1 in order to achieve higher single-thread performance.

DSU Cluster

The Cortex-A78 is designed to be integrated into a DynamIQ Shared Unit (DSU) cluster with up to eight cores. Up to four Cortex-A78s may be clustered together. The cluster may also include up to four additional little cores such as the Cortex-A55 in a big.LITTLE configuration. Additionally, one or more of the A78 cores may be swapped out for a Cortex-X1 core in order to achieve even higher performance. Compared to a quad-core A77 cluster on 7 nm, a quad-core A78 cluster on 5 nm provides a +20% sustained performance improvement while reducing the silicon area by about 15%.

Core

The Cortex-A78 succeeds the Cortex-A77. Whereas the A76 and A77 targeted the 7-nanometer node, the A78 primarily targets the 5-nanometer node. The A78 revisits many of the changes that were made in the A76 and A77 in order to evaluate their performance-efficiency tradeoffs. The primary focus of the A78 is to maximize performance efficiency. To that end, buffers that provided only a slight performance improvement at a disproportionate power cost were trimmed down, while components that offered a better return of performance per unit of power were improved further.

Pipeline

The Cortex-A78 is a complex, 4-way superscalar out-of-order processor with a 10-issue back end. The pipeline is 13 stages deep with a best-case 10-cycle branch misprediction penalty. It has a private level 1 instruction cache and a private level 1 data cache, both of which can be configured as 32 KiB or 64 KiB, along with a private level 2 cache that is configurable as either 256 KiB (1 bank) or 512 KiB (2 banks).

Front-end

Each cycle, up to 32 bytes are fetched from the L1 instruction cache. The instruction fetch works in tandem with the branch predictor in order to ensure the instruction stream is constantly ready to be fetched. Additionally, there is a return stack which stores the address and instruction set state (AArch32/R14 or AArch64/X30) on branches. On a return (e.g., ret on AArch64), the return stack will pop.

Keeping the instruction stream fed is the task of the branch prediction unit and the prefetchers. Like Deimos, the branch prediction unit on Hercules is decoupled from the instruction fetch, allowing it to run ahead of and in parallel with the instruction fetch to hide branch prediction latency. Arm says it has further improved the conditional branch prediction accuracy in this core. The branch predictor has an 8K-entry branch target buffer, and the instruction window on Hercules remains at 64 bytes/cycle, allowing it to run ahead of the instruction stream. As on the A77, the BPU comprises three stages in order to reduce latency, with a 64-entry micro-BTB and a 64-entry nano-BTB. Arm says that certain structures were re-balanced on the A78 but did not disclose any of the finer details. The prefetchers on the A78 have been improved, but the exact details were not disclosed. One of the new additions in Hercules is the doubling of the BPU bandwidth to support up to two taken-branch predictions per cycle.

Unlike Deimos, which had an L1 instruction cache with a fixed capacity of 64 KiB, Hercules introduces a second configuration that halves it to 32 KiB for SoC designs that can benefit from further power and area savings. The L1I cache is virtually indexed, physically tagged (VIPT), but behaves as a physically indexed, physically tagged (PIPT) 4-way set-associative cache. The L1I$ supports optional parity protection and implements a pseudo-LRU cache replacement policy.

The instruction cache has a 256-bit read interface from the L2 cache; each cycle, up to 32 bytes may be transferred to the L1I cache from the shared L2 cache.

From the instruction fetch, up to four 32-bit instructions are sent to the decode queue (DQ) each cycle. For narrower 16-bit instructions (i.e., Thumb), this means up to eight instructions can be queued. The A78 features a 4-way decoder. Each cycle, up to four instructions may be decoded into relatively semi-complex macro-operations (MOPs); on average, there are 6% more MOPs than instructions. In total, two cycles are involved in this operation: one for alignment and one for decode.

Back-end

The Hercules back-end handles the execution of operations out of order. The design is largely inherited from the Cortex-A76 and A77 but has been further optimized for higher power efficiency.

Renaming & Allocation

From the front-end, up to six macro-operations may be sent each cycle to be renamed. Previously, the A77 had the capacity to handle up to 160 instructions in flight. The reorder buffer on Hercules was slightly shrunk, though the exact size was not disclosed. The macro-operations are broken down into their µOP constituents and are scheduled for execution. In the prior generation, roughly 20% more µOPs were generated from the MOPs; Hercules is said to introduce a number of additional instruction-fusion cases, slightly reducing this number. From here, µOPs are sent to the instruction issue, which controls when they can be dispatched to the execution pipelines. µOPs are queued in three independent unified issue queues for integer, floating point, and memory.

Execution Units

The Hercules back-end issue is 13-wide, one more than the A77 (or 8% wider). This allows for up to thirteen µOPs to issue each cycle: 11 µOPs to the execution units and two to the store-data ports. Arm says that, from a power-efficiency standpoint, considerable improvements were made to the instruction schedulers on Hercules. The execution units on Hercules are grouped into three clusters: integer, advanced SIMD, and memory.

Hercules maintains the same six pipelines in the integer cluster as Deimos but adds a second integer multiply (IMUL) unit, allowing for up to two integer multiplications per cycle. In total, there are three simple ALUs that perform arithmetic and logical data-processing operations. The new integer multiply unit sits on one of the simple ALU ports, and a fourth port supports complex arithmetic (e.g., MAC, DIV).

The floating-point cluster is unchanged from Deimos. There are two ASIMD/FP execution pipelines. As with the A76/A77, both ASIMD pipelines on the A78 are 128 bits wide, capable of 2 double-precision, 4 single-precision, 8 half-precision, or 16 8-bit integer operations. Those pipelines can also execute the cryptographic instructions if the extension is supported (not offered by default; it requires an additional license from Arm).

Memory subsystem

The memory subsystem was improved on the A78. Whereas the A77 had two generic address-generation units (AGUs), each capable of supporting both loads and stores, Hercules adds a new dedicated load AGU, increasing the load bandwidth by 50%. In other words, the Cortex-A78 is capable of performing a load or a store on two ports (in any combination, e.g., LD+ST or ST+ST) and another load on a third port. Along with those changes, Arm doubled the store-data bandwidth from 16B/cycle to 32B/cycle.

Like the instruction cache, the level 1 data cache on Hercules was made configurable, allowing for either 32 KiB or 64 KiB, with optional ECC protection per 32 bits. It is virtually indexed, physically tagged (VIPT), but behaves as a physically indexed, physically tagged (PIPT) 4-way set-associative cache. The L1D cache implements a pseudo-LRU cache replacement policy. It features a 4-cycle best-case load-to-use latency with two read ports and one write port, meaning it can do two 16B loads/cycle and one 32B store/cycle. From the L1, the A78 supports up to 20 outstanding non-prefetch misses. Previously, the A77 had an 85-entry load buffer and a 90-entry store buffer; Arm says the functionality of those two buffers is now distributed across several structures. Hercules also improved the data prefetchers: Arm says it introduced a number of new prefetch engines, covering new stride patterns and new irregular access patterns.

The A78 can be configured with either 128, 256, or 512 KiB of level 2 cache, with 512 KiB implemented as two 256 KiB banks. It implements a dynamic biased replacement policy and is ECC-protected per 64 bits. The L2 is strictly inclusive of the L1 data cache and non-inclusive of the L1 instruction cache. There is a 256-bit write interface to the L2 and a 256-bit read interface from the L2 cache. The fastest load-to-use latency is 9 cycles. The L2 can support up to 46 outstanding misses to the L3, which is located in the DSU itself. The L3, which is shared by all the cores in the DynamIQ cluster, is configurable in size from 2 MiB to 4 MiB, with a load-to-use latency ranging from 26 to 31 cycles. As with the L2, up to 32 bytes may be transferred between the L2 and the L3 in each direction per cycle. Up to 94 outstanding misses are supported from the L3 to main memory.

In addition to controlling memory accesses, ordering, and cache policies, the MMU is also responsible for the translation of virtual addresses to physical addresses on the system. This is done through a set of virtual-to-physical address mappings and attributes that are held in translation tables. The physical address size here is 40 bits. Hercules incorporates a dedicated L1 TLB for the instruction cache and another for the data cache. Both the ITLB and the DTLB are 48 entries deep and fully associative. On a memory access, the A78 first performs a lookup in them. If there is a miss in the L1 TLBs, the MMU performs a lookup for the requested entry in the second-level TLB.

There is a unified level 2 TLB comprising 1280 entries organized as 5-way set associative, which is shared by both instructions and data. The STLB handles misses from the instruction and data L1 TLBs. Typically, STLB accesses take three cycles; however, longer latencies are possible when a different block or page size mapping is used. If there is a miss in the L2 TLB, the MMU resorts to a hardware translation table walk. Up to four TLB misses (i.e., translation table walks) can be performed in parallel. The STLB will stall if there are six successive misses. During table walks, the STLB can still perform up to two TLB lookups.

Each TLB entry stores either a global indicator or an address space identifier (ASID), allowing context switching without TLB invalidation, as well as a virtual machine identifier (VMID), which allows VM switching by the hypervisor without TLB invalidation.

All Cortex-A78 Processors

 List of Cortex-A78-based Processors
            Main processor                                    Integrated Graphics
 Model  Family  Launched  Process  Arch  Cores  Frequency     GPU  Frequency
 Count: 0

Bibliography

  • Arm Tech Day, 2020.
  • Arm, personal communication, 2020.

Documents

This section is empty; you can help add the missing info by editing this page.