
{{armh title|Cortex-A78|arch}}
{{microarchitecture
|atype=CPU
|name=Cortex-A78
|designer=ARM Holdings
|manufacturer=TSMC
|introduction=May 26, 2020
|process=10 nm
|process 2=7 nm
|process 3=5 nm
|cores=1
|cores 2=2
|cores 3=4
|cores 4=6
|cores 5=8
|type=Superscalar
|type 2=Pipelined
|oooe=Yes
|speculative=Yes
|renaming=Yes
|stages=13
|decode=4-way
|isa=ARMv8.2
|feature=Hardware virtualization
|extension=FPU
|extension 2=NEON
|l1i=32-64 KiB
|l1i per=core
|l1i desc=4-way set associative
|l1d=32-64 KiB
|l1d per=core
|l1d desc=4-way set associative
|l2=128-512 KiB
|l2 per=core
|l2 desc=8-way set associative
|l3=0-4 MiB
|l3 per=Cluster
|l3 desc=16-way set associative
|predecessor=Cortex-A77
|predecessor link=arm holdings/microarchitectures/cortex-a77
|successor=Matterhorn
|successor link=arm holdings/microarchitectures/matterhorn
|contemporary=Cortex-X1
|contemporary link=arm holdings/microarchitectures/cortex-x1
}}
'''Cortex-A78''' (codename '''Hercules''') is the successor to the {{armh|Cortex-A77|l=arch}}, a low-power high-performance [[ARM]] [[microarchitecture]] designed by [[Arm]] for the mobile market. Hercules was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable [[IP core]] and is sold to other semiconductor companies to be implemented in their own chips.

The Cortex-A78, which implements the {{arm|ARMv8.2}} ISA, is a high performance core which is often combined with a number of low(er) power cores (e.g. {{\\|Cortex-A55}}) in a {{armh|DynamIQ big.LITTLE}} configuration to achieve better energy/performance. The A78 may also be mixed with a high-performance {{\\|Cortex-X1}} core in order to provide certain workloads with an additional boost in single-core performance.

== History ==
[[File:arm deimos roadmap.png|right|thumb|Arm client roadmap with Hercules.]]
Development of the Cortex-A78 started in 2015. [[Arm]] formally announced Hercules on May 26, 2020.

== Process Technology ==
Although the Cortex-A78 may be fabricated on various [[process nodes]], it has been primarily designed for the [[10 nm]], [[7 nm]], and [[5 nm]] process nodes with performance, power and area numbers mainly targeting the [[5-nanometer node]].

== Compiler support ==
{| class="wikitable"
|-
! Compiler !! Arch-Specific || Arch-Favorable
|-
| [[Arm Compiler]] || <code>-mcpu=cortex-a78</code> || <code>-mtune=cortex-a78</code>
|-
| [[GCC]] || <code>-mcpu=cortex-a78</code> || <code>-mtune=cortex-a78</code>
|-
| [[LLVM]] || <code>-mcpu=cortex-a78</code> || <code>-mtune=cortex-a78</code>
|}

If the Cortex-A78 is coupled with the {{\\|Cortex-A55}} in a [[big.LITTLE]] system, GCC also supports the following option:

{| class="wikitable"
|-
! Compiler !! Tune
|-
| [[GCC]] || <code>-mtune=cortex-a78.cortex-a55</code>
|}

== Architecture ==
=== Key changes from {{\\|Cortex-A77}} ===
* Higher performance
** [[Arm]] self-reported around 20% higher performance on [[SPEC CPU2006]] at iso-power (at 1 W/core)
*** 1.15x higher frequency (due to [[N5]] from [[N7]])
*** 7% IPC improvement for integer performance (at iso-process/frequency)
* Front-end
** [[Branch-prediction]]
*** 2x bandwidth (2 taken branches/cycle, up from 1)
*** Improved accuracy
*** Predictor structures were optimized for better power/area
**** Certain structures were reduced in size
** Additional instruction fusion cases
** Renaming/ordering
*** Register renaming unit
**** Internal structures optimized
**** Register check-pointing scheme has been optimized
*** Physical Register File
**** New packing scheme (improves data density)
*** ROB
**** New packing scheme (improves instruction density)
**** Buffer size shrunk
*** Execution units
**** Instruction schedulers were optimized for better power efficiency
**** New IMUL integer unit (2 integer multiplications/cycle, up from 1)
** Improved prefetcher
*** New data prefetcher engines
**** New stride patterns
**** New irregular access pattern detection
* Memory subsystem
** 1.5x load issue bandwidth (up to 3 LD/cycle, up from 2)
** 2x store-data bandwidth (32B/cycle, up from 16B/cycle)
** 2x L2-L1 bandwidth (64B/cycle, up from 32B/cycle)
** New 32 KiB [[L1I cache]] option (from 64 KiB only)
** New 32 KiB [[L1D cache]] option (from 64 KiB only)
** Improved prefetcher
*** Earlier prefetching for L1 cache misses
{{expand list}}

=== Block Diagram ===
==== Typical SoC ====
:[[File:cortex-a78 soc block diagram.svg|450px]]

==== Individual Core ====
:[[File:cortex-a78 block diagram.svg|700px]]

=== Memory Hierarchy ===
The Cortex-A78 has private L1I, L1D, and L2 caches.

* Cache
** L0 MOP Cache
*** 1536-entry
** L1I Cache
*** 32 KiB OR 64 KiB, 4-way set associative
*** 64-byte cache lines
*** Optional parity protection
*** Write-back
** L1D Cache
*** 32 KiB OR 64 KiB, 4-way set associative
*** 64-byte cache lines
*** 4-cycle fastest load-to-use latency
*** Optional ECC protection per 32 bits
*** Write-back
** L2 Cache
*** 256 KiB OR 512 KiB (2 banks)
*** 8-way set associative
*** 9-cycle fastest load-to-use latency
*** Optional ECC protection per 64 bits
*** [[Modified Exclusive Shared Invalid]] (MESI) coherency
*** Strictly inclusive of the L1 data cache & non-inclusive of the L1 instruction cache
*** Write-back
** L3 Cache
*** 0 MiB to 4 MiB, 16-way set associative
*** 26-31 cycles load-to-use
*** Shared by all the cores in the cluster
**** located in the {{armh|DynamIQ Shared Unit}} (DSU)

The A78 TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another one for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

* TLBs
** ITLB
*** 4 KiB, 16 KiB, 64 KiB, 2 MiB, and 32 MiB page sizes
*** 48-entry fully associative
** DTLB
*** 48-entry fully associative
*** 4 KiB, 16 KiB, 64 KiB, 2 MiB, and 512 MiB page sizes
** STLB
*** 1280-entry 5-way set associative

== Performance claims ==
Compared to the {{\\|Cortex-A77}}, the A78 is said to be 20% faster in sustained performance on [[SPEC CPU2006]]. The improvement comes from a mixture of architectural improvements and the frequency and power-efficiency gains of moving from the [[N7|7 nm]] to the [[N5|5 nm node]]. Likewise, Arm says the A78 can achieve the same level of performance (30 SPECint2006) as the {{\\|Cortex-A77|A77}} at 50% less power.

{| class="wikitable"
|-
! colspan="2" | Performance !! rowspan="6" | &nbsp; !! colspan="2" | Energy
|-
| {{\\|Cortex-A77}} || Cortex-A78 || {{\\|Cortex-A77}} || Cortex-A78
|-
| 1.0x || 1.2x || 1.0x || 0.5x
|-
| 2,600 MHz || 3,000 MHz || 2,300 MHz || 2,100 MHz
|-
| [[N7|7 nm (N7)]] || [[N5|5 nm (N5)]] || [[N7|7 nm (N7)]] || [[N5|5 nm (N5)]]
|-
| 1 W || 1 W || ? W || ? W
|}

Arm reports the following ISO-comparison numbers over the {{\\|Cortex-A77}}. The numbers are based on measured estimates of SPECint*_base2006.

{| class="wikitable tc1 tc2 tc3"
|-
! colspan="3" | ISO-Comparison ({{\\|Cortex-A77|A77}} vs A78)
|-
! Performance !! Power !! Area
|-
| +7% || -4% || -5%
|-
| colspan="3" | 
* Cortex-A78 32 KiB / 512 KiB (2020)
* Cortex-A77 64 KiB / 512 KiB (2019)
|}
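
As a rough sanity check on how the quoted figures relate, the sketch below multiplies the frequency ratio from the performance table (3,000 MHz vs 2,600 MHz) by the 7% iso-process, iso-frequency gain. This is a naive model that assumes performance scales linearly with frequency, which real workloads rarely do, so it should be read as an upper-bound estimate and not as Arm's methodology.

```python
# Figures quoted in this section.
A77_FREQ_MHZ = 2600
A78_FREQ_MHZ = 3000
IPC_GAIN = 1.07  # +7% at iso-process and iso-frequency

# Frequency uplift attributed to the move from N7 to N5.
freq_ratio = A78_FREQ_MHZ / A77_FREQ_MHZ

# Naive combined speedup: assumes perfect scaling with clock frequency.
naive_speedup = freq_ratio * IPC_GAIN

print(f"frequency ratio:        {freq_ratio:.3f}")
print(f"naive combined speedup: {naive_speedup:.3f}")
```

The naive product lands slightly above the 1.2x that Arm reports, which is consistent with performance not scaling perfectly with frequency.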

== Overview ==
The Cortex-A78, formerly Hercules, is a high-performance [[synthesizable core]] designed by [[Arm]] as the successor to the {{\\|Cortex-A77}}. It is delivered as a Register Transfer Level (RTL) description in Verilog and is designed to be integrated into customers' SoCs. This core supports the {{arm|ARMv8.2}} extension as well as a number of other partial extensions.

The Cortex-A78 is built on the extensive design work that was done on the {{\\|A76}} and {{\\|A77}} but enhances it in order to improve its power efficiency. Arm says that both the performance-efficiency and area-efficiency of the core were improved over {{\\|Deimos}}. To that end, Arm reports about a 20% performance improvement over {{\\|Deimos}} gained through both architectural improvements and transistor improvements due to the migration from the [[N7|7-nanometer node]] to the [[N5|5-nanometer node]].

The A78 is a 6-way [[superscalar]] [[out-of-order]] processor with a 12-wide execution engine and private level 1 and level 2 caches. It is designed to be implemented inside the [[DynamIQ Shared Unit]] (DSU) cluster along with other cores. The DSU cluster supports up to [[eight cores]] of any combination (e.g., with [[little cores]] such as the {{\\|Cortex-A55}} or simply more Cortex-A78s). Additionally, this core may also be combined with the {{\\|Cortex-X1}} in order to achieve higher single-thread performance.

=== DSU Cluster ===
The Cortex-A78 is designed to be integrated into a [[DynamIQ Shared Unit]] (DSU) cluster with up to [[eight cores]] of any combination (e.g., with [[little cores]] such as the {{\\|Cortex-A55}} or simply more Cortex-A78s). Compared to a quad-core {{\\|Cortex-A77|A77}} cluster on [[N7|7 nm]], a quad-core {{\\|Cortex-A78|A78}} cluster on [[N5|5 nm]] provides a 20% sustained performance improvement while reducing the silicon area by about 15%. Additionally, one or more of the A78 cores [[arm_holdings/microarchitectures/cortex-x1#DSU Cluster|may be swapped out]] for a {{\\|Cortex-X1}} core in order to achieve even higher performance.

== Core ==
The Cortex-A78 succeeds the {{\\|Cortex-A77}}. Whereas the {{\\|A76}} and {{\\|A77}} targeted the [[N7|7-nanometer node]], the A78 primarily targets the [[N5|5 nm]] node. The A78 revisits many of the changes that were made in the {{\\|A76}} and {{\\|A77}} in order to evaluate their performance-efficiency tradeoffs. The primary focus of the A78 is to maximize performance efficiency. To that end, bigger buffers that provided only a slight performance improvement at a large power cost were trimmed down, while components that offered a better performance-to-power ratio were enhanced.

=== Pipeline ===
The Cortex-A78 is a complex, 4-way [[superscalar]] [[out-of-order]] processor with a 10-issue back end. The pipeline is 13 stages with a 10-cycle branch misprediction penalty best-case. It has a private [[level 1]] [[instruction cache]] and a private [[level 1]] [[data cache]], both of which can be configured as 32 KiB or 64 KiB, along with a private [[level 2 cache]] that is configurable as either 256 KiB (1 bank) or 512 KiB (2 banks).

==== Front-end ====
Each cycle, up to 32 bytes are fetched from the [[L1 instruction cache]]. The instruction fetch works in tandem with the branch predictor in order to ensure the instruction stream is constantly ready to be fetched. Additionally, there is a return stack which stores the address and instruction set state ({{arm|AArch32}}/R14 or {{arm|AArch64}}/X30) on branches. On a return (e.g., <code>ret</code> on {{arm|AArch64}}), the return stack will pop.
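
The push-on-call, pop-on-return behavior described above can be illustrated with a toy model. The sketch below is not the A78's implementation: the class name, the depth of 16 entries, and the drop-oldest overflow policy are all assumptions made for illustration, since none of those details are disclosed here.

```python
class ReturnStack:
    """Toy return-address stack: push on a call, pop on a return."""

    def __init__(self, depth=16):  # depth is hypothetical, not an A78 figure
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        # Assumed overflow policy: discard the oldest entry when full.
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_address)

    def on_return(self):
        # The predicted target is the most recently pushed address.
        # None means the stack is empty and no prediction is available.
        return self.entries.pop() if self.entries else None


rs = ReturnStack()
rs.on_call(0x1000)   # outer call
rs.on_call(0x2000)   # nested call
print(hex(rs.on_return()))  # innermost return is predicted first
print(hex(rs.on_return()))
```

Nested calls return in last-in, first-out order, which is why a simple stack predicts return targets well.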

Keeping the [[instruction stream]] fed is the task of the [[branch prediction unit]] and the [[prefetchers]]. Like {{\\|Deimos}}, the branch prediction unit on Hercules is decoupled from the [[instruction fetch]], allowing it to run ahead and in parallel with the instruction fetch to hide branch prediction latency. Arm says it has further improved the conditional branch prediction accuracy in this core. The branch predictor has an 8K-entry [[branch target buffer]] and the instruction window size on Hercules remains at 64 bytes/cycle, allowing it to run ahead of the instruction stream. The BPU on the {{\\|A77}} comprises three stages in order to reduce latency with a 64-entry micro-BTB and a smaller 64-entry nano BTB. Arm says that certain structures were re-balanced on the A78 but did not disclose any of the finer details. The prefetchers on the A78 have been improved, but the exact details were not disclosed. One of the new additions in Hercules is the doubling of the BPU bandwidth by supporting up to 2 taken-branch predictions per cycle.

Unlike {{\\|Deimos}}, which had an L1 instruction cache with a fixed capacity of 64 KiB, Hercules introduces a second configuration that halves that to 32 KiB for SoC designs that benefit from lower power and area. The L1I cache is [[virtually indexed, physically tagged]] (VIPT), which behaves as a [[physically indexed, physically tagged]] (PIPT) 4-way set-associative cache. The L1I$ supports optional parity protection and implements a [[pseudo-LRU]] [[cache replacement]] policy.
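
The geometry of the two L1 configurations follows from the capacity, the 4-way associativity, and the 64-byte line size. The sketch below works it out and also compares the size of one way against the smallest (4 KiB) page. When a way is larger than a page, some index bits come from the virtual page number, which is the usual reason a VIPT cache needs extra handling to behave like a PIPT one. How the A78 does this is not described here, so the code only shows the arithmetic.

```python
LINE_BYTES = 64
WAYS = 4
PAGE_BYTES = 4 * 1024  # smallest supported page size

def l1_geometry(capacity_kib):
    """Return (number of sets, bytes per way) for a given L1 capacity."""
    capacity = capacity_kib * 1024
    sets = capacity // (LINE_BYTES * WAYS)
    way_bytes = capacity // WAYS
    return sets, way_bytes

for kib in (32, 64):
    sets, way_bytes = l1_geometry(kib)
    # True when index bits extend above the 4 KiB page offset.
    index_exceeds_page = way_bytes > PAGE_BYTES
    print(f"{kib} KiB: {sets} sets, {way_bytes} B per way, "
          f"index beyond page offset: {index_exceeds_page}")
```

Both configurations have ways larger than a 4 KiB page (8 KiB and 16 KiB respectively).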

The instruction cache has a 256-bit read interface from the L2 cache. Each cycle up to 16 bytes may be transferred to the L1I cache from the private L2 cache.

From the instruction fetch, up to four 32-bit instructions are sent to the decode queue (DQ) each cycle. For narrower 16-bit instructions (i.e., {{arm|Thumb}}), this means up to eight instructions get queued. The A78 features a 4-way decode. Each cycle, up to four instructions may be decoded into semi-complex [[macro-operations]] (MOPs). There are on average 6% more MOPs than instructions. In total, two cycles are involved in this operation - one for alignment and one for decode.

==== Back-end ====
The Hercules back-end handles the execution of out-of-order operations. The design is largely inherited from the {{\\|Cortex-A76}} and the {{\\|A77}} but has been further optimized for higher power-efficiency.

===== Renaming & Allocation =====
From the front-end, up to six [[macro-operations]] may be sent each cycle to be renamed. Previously, in the {{\\|A77}}, there was a capacity to handle up to 160 instructions in-flight. The reorder buffer on Hercules was slightly shrunk; however, the exact size was not disclosed. The [[macro-operations]] are broken down into their [[µOP]] constituents and are scheduled for execution. In the prior generation, roughly 20% more µOPs were generated from the MOPs. Hercules is said to introduce a number of additional instruction fusion cases, slightly reducing this number. From here, µOPs are sent to the instruction issue which controls when they can be dispatched to the execution pipelines. µOPs are queued in three independent unified issue queues for integer, floating point, and memory.
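
The two expansion ratios quoted for decode and rename can be chained to estimate how many µOPs an average instruction turns into. The sketch below does this with the 6% instruction-to-MOP figure and the prior-generation 20% MOP-to-µOP figure. Since Hercules is said to reduce the latter slightly by an undisclosed amount, the result is only an approximate upper estimate for the A78.

```python
# Average expansion ratios quoted in the text.
MOPS_PER_INSTRUCTION = 1.06  # about 6% more MOPs than instructions
UOPS_PER_MOP = 1.20          # prior-generation figure; slightly lower on Hercules

# Chaining both stages gives the average µOPs generated per instruction.
uops_per_instruction = MOPS_PER_INSTRUCTION * UOPS_PER_MOP

print(f"average µOPs per instruction: {uops_per_instruction:.3f}")
```

Under these assumptions, an average instruction expands to roughly 1.27 µOPs.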

===== Execution Units =====
The Hercules back-end issue is 13-wide, one more than the {{\\|A77}} (or 8% wider). This allows for up to thirteen µOPs to execute each cycle - 11 µOPs to the execution units and 2 store-data ports. Arm says that from a power-efficiency standpoint, considerable improvements were made to the instruction schedulers on Hercules. The execution units on Hercules are grouped into three clusters: integer, {{arm|advanced SIMD}}, and memory.

Hercules maintains the same six pipelines in the integer cluster as {{\\|Deimos}}. Hercules added a second integer multiply (IMUL) unit, allowing for up to two integer multiplies per cycle. In total, there are three simple ALUs that perform arithmetic and logical data processing operations. There is the new integer multiply unit on one of the simple ALU ports and a fourth port which has support for complex arithmetic (e.g. MAC, DIV).

The floating-point cluster is unchanged from {{\\|Deimos}}. There are two {{arm|ASIMD}}/FP execution pipelines. As with {{\\|A76}}/{{\\|A77}}, the ASIMD on the A78 are both 128-bit wide capable of 2 [[double-precision]] operations, 4 single-precision, 8 half-precision, or 16 8-bit integer operations. Those pipelines can also execute the cryptographic instructions if the extension is supported (not offered by default and requires an additional license from [[Arm]]).
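
The per-pipeline operation counts above follow directly from dividing the 128-bit vector width by the element width. A minimal sketch of that calculation (the dictionary names are made up for illustration):

```python
VECTOR_BITS = 128  # width of each ASIMD/FP pipeline

ELEMENT_BITS = {
    "double-precision": 64,
    "single-precision": 32,
    "half-precision": 16,
    "8-bit integer": 8,
}

# Number of elements processed per 128-bit operation.
lanes = {name: VECTOR_BITS // bits for name, bits in ELEMENT_BITS.items()}

for name, count in lanes.items():
    print(f"{name}: {count} elements per operation")
```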

===== Memory subsystem =====
The memory subsystem was improved on the A78. Whereas the {{\\|A77}} had two generic [[address-generation unit|address-generation units]] - each capable of supporting both loads and stores, Hercules adds a new dedicated load AGU, increasing the load bandwidth by 50%. In other words, the Cortex-A78 is capable of performing either a load or a store on 2 ports (any combination, e.g., LD+ST or ST+ST) and another load on a third port. Along with those changes, Arm doubled the store-data bandwidth from 16B/cycle to 32B/cycle.
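
The port arrangement described above (two generic load/store ports plus one load-only port) determines which mixes of loads and stores can issue in a single cycle. The sketch below enumerates them. It models only the port constraint and ignores every other limit (cache ports, store-data bandwidth, scheduling), so it is an illustration and not a performance model.

```python
from itertools import product

# Two generic AGU ports accept a load, a store, or stay idle ("-").
GENERIC_PORT = ("LD", "ST", "-")
# The third AGU port accepts loads only.
LOAD_ONLY_PORT = ("LD", "-")

# Collect every distinct (loads, stores) mix that can issue in one cycle.
combos = set()
for p0, p1, p2 in product(GENERIC_PORT, GENERIC_PORT, LOAD_ONLY_PORT):
    ops = (p0, p1, p2)
    combos.add((ops.count("LD"), ops.count("ST")))

max_loads = max(loads for loads, _ in combos)
max_stores = max(stores for _, stores in combos)

print("possible (loads, stores) per cycle:", sorted(combos))
print("max loads/cycle:", max_loads, "| max stores/cycle:", max_stores)
```

The enumeration confirms the 3 loads/cycle peak (up from 2) while stores remain capped at 2 per cycle.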

Like the instruction cache, the [[level 1 data cache]] on Hercules was also made configurable, allowing for either 32 KiB or 64 KiB and with an optional ECC protection per 32 bits. It is [[virtually indexed, physically tagged]] which behaves as a [[physically indexed, physically tagged]] 4-way set-associative cache. The L1D cache implements a [[pseudo-LRU]] [[cache replacement]] policy. It features a 4-cycle fastest load-to-use latency with two read ports and one write port, meaning it can do two 16B loads/cycle and one 32B store/cycle. From the L1, the A78 supports up to 20 outstanding non-prefetch misses. Previously, the {{\\|A77}} had an 85-entry load buffer and a 90-entry store buffer. Arm says the functionality of those two buffers is now distributed across several structures. Hercules improved the data prefetchers. Arm says Hercules introduced a number of new prefetch engines, covering some new stride patterns and new irregular access patterns.

The A78 can be configured with either 128, 256 or 512 KiB of [[level 2 cache]], with the 512 KiB configuration built from two 256 KiB banks. It implements a [[dynamic biased replacement]] policy and is ECC protected per 64 bits. The L2 is strictly inclusive of the L1 data cache and non-inclusive of the L1 instruction cache. There is a 256-bit write interface to the L2 and a 256-bit read interface from the L2 cache. The fastest load-to-use latency is 9 cycles. The L2 can support up to 46 outstanding misses to the L3, which is located in the {{armh|DSU}} itself. The L3, which is shared by all the cores in the {{armh|DynamIQ big.LITTLE}} cluster, is configurable in size ranging from 2 MiB to 4 MiB with load-to-use latency ranging from 26 to 31 cycles. As with the L2, up to 2 x 32 bytes may be transferred between the L2 and the L3 cache. Up to 94 outstanding misses are supported from the L3 to main memory.
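
To put the 256-bit L2 interfaces in perspective, the sketch below converts the interface width to bytes per cycle and then to a peak bandwidth at a given clock. The 3,000 MHz clock is borrowed from the performance-claims table purely as an illustrative value. Actual frequencies depend on the implementation, and sustained bandwidth will be lower than this theoretical peak.

```python
def bytes_per_cycle(interface_bits):
    """Convert an interface width in bits to bytes moved per cycle."""
    return interface_bits // 8

L2_INTERFACE_BITS = 256
CLOCK_HZ = 3_000_000_000  # illustrative clock, not a guaranteed A78 frequency

# Theoretical peak for one direction (read or write) of the L2 interface.
peak_bytes_per_second = bytes_per_cycle(L2_INTERFACE_BITS) * CLOCK_HZ

print(f"{bytes_per_cycle(L2_INTERFACE_BITS)} B/cycle")
print(f"{peak_bytes_per_second / 1e9:.0f} GB/s peak at {CLOCK_HZ / 1e9:.1f} GHz")
```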

In addition to controlling memory accesses, ordering, and [[cache policies]], the MMU is also responsible for the translation of virtual addresses to physical addresses on the system. This is done through a set of virtual-to-physical address mappings and attributes that are held in translation tables. The physical address size here is 40 bits. Hercules incorporates a dedicated L1 TLB for the instruction cache and another one for the data cache. Both the ITLB and the DTLB are 48-entry deep and are fully associative. On a memory access operation, the A78 will first perform a lookup in the L1 TLBs. If there is a miss in the L1 TLBs, the MMU will perform a lookup for the requested entry in the second-level TLB.
 
There is a unified level 2 TLB comprising 1280 entries organized as 5-way set associative which is shared by both instruction and data. The STLB handles misses from the instruction and data L1 TLBs. Typically, STLB accesses take three cycles; however, longer latencies are possible when a different block or page size mapping is used. If there is a miss in the L2 TLB, the MMU will resort to a hardware translation table walk. Up to four TLB misses (i.e., translation table walks) can be handled in parallel. The STLB will stall if there are six successive misses. During table walks, the STLB can still perform up to two TLB lookups.
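
The lookup order described above (L1 TLB, then the unified L2 TLB, then a hardware table walk) can be sketched as a toy model. The class below is illustrative only: it ignores capacity, associativity, replacement, and latency, and the policy of refilling both TLB levels after a walk is an assumption, not a documented A78 behavior.

```python
class ToyMMU:
    """Toy two-level TLB lookup backed by a table walk."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page -> physical page
        self.l1_tlb = {}              # stands in for the ITLB or DTLB
        self.l2_tlb = {}              # stands in for the unified STLB

    def translate(self, vpage):
        # 1. First-level TLB lookup.
        if vpage in self.l1_tlb:
            return self.l1_tlb[vpage], "L1 TLB hit"
        # 2. Second-level TLB lookup; refill the L1 TLB on a hit.
        if vpage in self.l2_tlb:
            ppage = self.l2_tlb[vpage]
            self.l1_tlb[vpage] = ppage
            return ppage, "L2 TLB hit"
        # 3. Miss in both levels: walk the translation tables.
        ppage = self.page_table[vpage]
        self.l2_tlb[vpage] = ppage  # assumed fill policy
        self.l1_tlb[vpage] = ppage
        return ppage, "table walk"


mmu = ToyMMU({0x10: 0xA0})
print(mmu.translate(0x10))  # first access walks the tables
print(mmu.translate(0x10))  # repeat access hits in the L1 TLB
```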

The TLB entries store one or both of the global indicator and an address space identifier (ASID), allowing context switching without TLB invalidation as well as a virtual machine identifier (VMID) which allows for VM switching by the hypervisor without TLB invalidation.
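
The effect of tagging entries with an ASID and a VMID can be shown with a small model: entries belonging to different processes or virtual machines coexist in the same structure, so switching context only changes which tags a lookup matches against. The class below is a hypothetical sketch. Its name, the dictionary layout, and the lookup order are assumptions for illustration and say nothing about how the A78 stores or matches its entries.

```python
class TaggedTLB:
    """Toy TLB whose entries carry a VMID and either an ASID or a global flag."""

    def __init__(self):
        # Key: (vmid, asid or None for global entries, virtual page).
        self.entries = {}

    def insert(self, vmid, asid, vpage, ppage, is_global=False):
        key = (vmid, None if is_global else asid, vpage)
        self.entries[key] = ppage

    def lookup(self, vmid, asid, vpage):
        # Match an ASID-specific entry first, then a global one.
        for key in ((vmid, asid, vpage), (vmid, None, vpage)):
            if key in self.entries:
                return self.entries[key]
        return None  # miss


tlb = TaggedTLB()
tlb.insert(vmid=1, asid=7, vpage=0x10, ppage=0xA0)
tlb.insert(vmid=1, asid=8, vpage=0x10, ppage=0xB0)

# Switching between ASID 7 and ASID 8 needs no invalidation:
print(hex(tlb.lookup(1, 7, 0x10)), hex(tlb.lookup(1, 8, 0x10)))
```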

== All Cortex-A78 Processors ==
<!-- NOTE: 
           This table is generated automatically from the data in the actual articles.
           If a microprocessor is missing from the list, an appropriate article for it needs to be
           created and tagged accordingly.

           Missing a chip? please dump its name here: https://en.wikichip.org/wiki/WikiChip:wanted_chips
-->
{{comp table start}}
<table class="comptable sortable tc4 tc6 tc9">
{{comp table header|main|8:List of Cortex-A78-based Processors}}
{{comp table header|main|6:Main processor|2:Integrated Graphics}}
{{comp table header|cols|Family|Launched|Process|Arch|Cores|%Frequency|GPU|%Frequency}}
{{#ask: [[Category:all microprocessor models]] [[microarchitecture::Cortex-A78]]
 |?full page name
 |?model number
 |?family
 |?first launched
 |?process
 |?microarchitecture
 |?core count
 |?base frequency#GHz
 |?integrated gpu
 |?integrated gpu base frequency
 |format=template
 |template=proc table 3
 |userparam=10
 |mainlabel=-
 |valuesep=,
}}
{{comp table count|ask=[[Category:all microprocessor models]] [[microarchitecture::Cortex-A78]]}}
</table>
{{comp table end}}

== Bibliography ==
* Arm Tech Day, 2020.
* Arm. ''personal communication''. 2020.

== Documents ==
{{empty section}}