From WikiChip
Difference between revisions of "arm holdings/microarchitectures/cortex-a76"
< arm holdings

(Memory Hierarchy)
(3->4 way is 1.33x, not 1.25x)
 
(31 intermediate revisions by 3 users not shown)
Line 6: Line 6:
 
|manufacturer=TSMC
 
|manufacturer=TSMC
 
|introduction=May 31, 2018
 
|introduction=May 31, 2018
|process=7 nm
+
|process=12 nm
 +
|process 2=7 nm
 +
|process 3=5 nm
 
|cores=1
 
|cores=1
 
|cores 2=2
 
|cores 2=2
 
|cores 3=4
 
|cores 3=4
 +
|cores 4=6
 +
|cores 5=8
 +
|type=Superscalar
 +
|type 2=Pipelined
 
|oooe=Yes
 
|oooe=Yes
 
|speculative=Yes
 
|speculative=Yes
Line 33: Line 39:
 
|predecessor=Cortex-A75
 
|predecessor=Cortex-A75
 
|predecessor link=arm holdings/microarchitectures/cortex-a75
 
|predecessor link=arm holdings/microarchitectures/cortex-a75
|successor=Deimos
+
|successor=Cortex-A77
|successor link=arm holdings/microarchitectures/deimos
+
|successor link=arm holdings/microarchitectures/cortex-a77
|contemporary=Ares
 
|contemporary link=arm holdings/microarchitectures/ares
 
 
}}
 
}}
'''Cortex-A76''' (codename '''Enyo''') is the successor to the {{armh|Cortex-A75|l=arch}}, a low-power high-performance [[ARM]] [[microarchitecture]] designed by [[ARM Holdings]] for the mobile market. This microarchitecture is designed as a synthesizable [[IP core]] and is sold to other semiconductor companies to be implemented in their own chips. The Cortex-A76, which implemented the {{arm|ARMv8.2}} ISA, is the a performant core which is often combined with a number of lower power cores (e.g. {{\\|Cortex-A55}}) in a {{armh|DynamIQ big.LITTLE}} configuration to achieve better energy/performance.
+
'''Cortex-A76''' (codename '''Enyo''') is the successor to the {{armh|Cortex-A75|l=arch}}, a low-power high-performance [[ARM]] [[microarchitecture]] designed by [[ARM Holdings]] for the mobile market. Enyo was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable [[IP core]] and is sold to other semiconductor companies to be implemented in their own chips. The Cortex-A76, which implemented the {{arm|ARMv8.2}} ISA, is the a performant core which is often combined with a number of lower power cores (e.g. {{\\|Cortex-A55}}) in a {{armh|DynamIQ big.LITTLE}} configuration to achieve better energy/performance.
  
 
== History ==
 
== History ==
Development of the Cortex-A76 started in 2013. [[Arm]] formally announced Enyo during Arm Tech Day on May 31 2018.
+
[[File:arm deimos roadmap.png|right|thumb|Cortex-A76 and future cores roadmap.]]
 +
Development of the Cortex-A76 started in 2013. [[Arm]] formally announced Enyo during Computex on May 31 2018.
  
 
== Process Technology ==
 
== Process Technology ==
Line 58: Line 63:
 
|}
 
|}
  
If the Cortex-A76 is coupled with the {{\\|Cortex-A55}} in a big.LITTLE system, GCC also supports the following option:
+
If the Cortex-A76 is coupled with the {{\\|Cortex-A55}} in a [[big.LITTLE]] system, GCC also supports the following option:
  
 
{| class="wikitable"
 
{| class="wikitable"
Line 69: Line 74:
 
== Architecture ==
 
== Architecture ==
 
=== Key changes from {{\\|Cortex-A75}} ===
 
=== Key changes from {{\\|Cortex-A75}} ===
 +
* Significant [[IPC]] uplift ([[Arm]] self-reported around 20% IPC on [[SPEC CPU2006]]/[[SPEC CPU2017]] int)
 +
* Front-end
 +
** [[Branch-prediction]]
 +
*** Improved accuracy
 +
*** Decoupled from the instruction fetch
 +
*** Runahead 32B/cycle
 +
* Execution engine
 +
** 1.33x wider decode (4-way, up from 3-way)
 +
** 1.33x wider rename/commit (4-way, up from 3-way)
 +
** 1.6x wider dispatch (8 µOPs/cycle, up from 5)
 +
 +
{{expand list}}
 +
 
=== Block Diagram ===
 
=== Block Diagram ===
 
==== Typical SoC ====
 
==== Typical SoC ====
Line 81: Line 99:
 
*** 64 KiB, 4-way set associative
 
*** 64 KiB, 4-way set associative
 
*** 64-byte cache lines
 
*** 64-byte cache lines
*** optional parity protection
+
*** Optional parity protection
 +
*** Write-back
 
** L1D Cache
 
** L1D Cache
 
*** 64 KiB, 4-way set associative
 
*** 64 KiB, 4-way set associative
 
*** 64-byte cache lines
 
*** 64-byte cache lines
 
*** 4-cycle fastest load-to-use latency
 
*** 4-cycle fastest load-to-use latency
*** optional ECC protection per 32 bits
+
*** Optional ECC protection per 32 bits
 +
*** Write-back
 
** L2 Cache
 
** L2 Cache
 
*** 256 KiB OR 512 KiB (2 banks)
 
*** 256 KiB OR 512 KiB (2 banks)
Line 93: Line 113:
 
*** optional ECC protection per 64 bits
 
*** optional ECC protection per 64 bits
 
*** [[Modified Exclusive Shared Invalid]] (MESI) coherency
 
*** [[Modified Exclusive Shared Invalid]] (MESI) coherency
 +
*** Strictly inclusive of the L1 data cache & non-inclusive of the L1 instruction cache
 +
*** Write-back
 
** L3 Cache
 
** L3 Cache
 
*** 2 MiB to 4 MiB, 16-way set associative
 
*** 2 MiB to 4 MiB, 16-way set associative
Line 115: Line 137:
  
 
== Core ==
 
== Core ==
The Cortex-A76 succeeds the {{\\|Cortex-A75}}. It is designed to take advantage of the [[7 nm]] node in order to deliver up to 35% higher performance and up to 40% lower power (compared to the A75 on the [[10 nm]] node). It's worth noting that the A76 brings higher performance at a slight hit to the area by going wider. On the [[7 nm process]], the Cortex-A76 targets frequencies of 3 GHz and higher.
+
[[File:a76 perf claims.png|thumb|right|Cortex-A76 on [[7nm]] compared to the {{\\|Cortex-A75}} on [[10nm]].]]
 +
The Cortex-A76 succeeds the {{\\|Cortex-A75}}. It is designed to take advantage of the [[7 nm]] node in order to deliver up to 40% higher performance at the same power level (measured at 750 mW/core), or alternatively, up to 50% lower power for the same performance compared to the {{\\|Cortex-A75}} on the [[10 nm]] node. This is achieved through a combination of both microarchitectural improvements as well as [[process technology]] advantages. It's worth noting that the A76 brings higher performance at a slight hit to the area by going wider. On the [[7 nm process]], the Cortex-A76 targets frequencies of 3 GHz and higher.
  
 
=== Pipeline ===
 
=== Pipeline ===
The Cortex-A76 is a complex, 4-way superscalar out-of-order processor with an 8-issue back end. The pipeline is 13 stages with an 11-cycle branch misprediction penalty. It has a 64 KiB [[level 1]] [[instruction cache]] and a 64 KiB [[level 1]] [[data cache]]along with a private [[level 2 cache]] that is configurable as either 256 KiB (1 bank) or 512 KiB (2 banks)
+
The Cortex-A76 is a complex, 4-way superscalar out-of-order processor with an 8-issue back end. The pipeline is 13 stages with an 11-cycle branch misprediction penalty. It has a 64 KiB [[level 1]] [[instruction cache]] and a 64 KiB [[level 1]] [[data cache]] along with a private [[level 2 cache]] that is configurable as either 256 KiB (1 bank) or 512 KiB (2 banks).
  
 
==== Front-end ====
 
==== Front-end ====
Each cycle, up to 16 bytes are fetched from the [[L1 instruction cache]]. The instruction fetch works in tandem with the branch predictor in order to ensure the instruction stream is ready to be fetched. Additionally, there is a return stack which stores the address and instruction set state ({{arm|AArch32}}/R14 or {{arm|AArch64}}/X30) on branches. On a return (e.g., <code>ret</code> on {{arm|AArch64}}), the return stack will pop. The BPU operates on 32-byte instruction windows, twice the fetch size. The Cortex-A76 has a fixed 64 KiB L1I cache. It is [[Virtually Indexed, Physically Tagged]] (VIPT), which behaves as a [[Physically Indexed, Physically Tagged]] (PIPT) 4-way set-associative cache. The L1I$ supports optional parity protection and implements a [[pseudo-LRU]] [[cache replacement]] policy. The instruction cache has a 256-bit read interface from the L2 cache. Each cycle up to 32 bytes may be transferred to the L1I cache from the shared L2 cache.
+
Each cycle, up to 16 bytes are fetched from the [[L1 instruction cache]]. The instruction fetch works in tandem with the branch predictor in order to ensure the instruction stream is constantly ready to be fetched. Additionally, there is a return stack which stores the address and instruction set state ({{arm|AArch32}}/R14 or {{arm|AArch64}}/X30) on branches. On a return (e.g., <code>ret</code> on {{arm|AArch64}}), the return stack will pop.
  
From the instruction fetch, up to four 32-bit instructions are sent to the decode queue (DQ) each cycle. For narrower 16-bit instructions (i.e., {{arm|Thumb}}), this means up to eight instructions get queued. The A76 features a 4-way decode. Up to four instructions may be decoded into [[macro-operations]] each cycle.
+
Keeping the instruction stream feed is the task of the branch prediction unit. The branch prediction unit on the A76 is decoupled from the instruction fetch, allowing it to run ahead and in parallel with the instruction fetch to hide branch prediction latency. To that end, it now operates on 32-byte instruction windows, twice the fetch size. The main [[branch target buffer]] on the A76 is 6K-entries deep. The BPU comprises three stages in order to reduce latency with a 64-entry micro-BTB and a smaller 16-entry BTB.
 +
 
 +
The Cortex-A76 has a fixed 64 KiB L1I cache. It is [[virtually indexed, physically tagged]] (VIPT), which behaves as a [[physically indexed, physically tagged]] (PIPT) 4-way set-associative cache. The L1I$ supports optional parity protection and implements a [[pseudo-LRU]] [[cache replacement]] policy. The instruction cache has a 256-bit read interface from the L2 cache. Each cycle up to 32 bytes may be transferred to the L1I cache from the shared L2 cache.
 +
 
 +
From the instruction fetch, up to four 32-bit instructions are sent to the decode queue (DQ) each cycle. For narrower 16-bit instructions (i.e., {{arm|Thumb}}), this means up to eight instructions get queued. The A76 features a 4-way decode. Each cycle, up to four instructions may be decoded into a relatively semi-complex [[macro-operations]] (MOPs). There are on average 6% more MOPs than instructions. In total two cycles are involved in this operation - one for alignment and one for decode.
  
 
==== Back-end ====
 
==== Back-end ====
Line 129: Line 156:
  
 
===== Renaming & Allocation =====
 
===== Renaming & Allocation =====
From the front-end, up to four [[macro-operations]] may be sent each cycle to be renamed. The ROB has a capacity of up to 128 instructions in flight. [[Micro-operations]] are broken down into their [[µOP]] constituents and are scheduled for execution. Roughly 20% more µOPs are generated from the MOPs. From here, µOPs are sent to the instruction issue which controls when they can be dispatched to the execution pipelines. µOPs are queued in eight independent issue queues (120 entries in total).
+
From the front-end, up to four [[macro-operations]] may be sent each cycle to be renamed. The A76 has a capacity to handle up to 128 instructions in flight, a number that has no increased for a long time (the {{\\|Cortex-A72}} and even the {{\\|Cortex-A57}} had an out-of-order window of 128 instructions). [[Micro-operations]] are broken down into their [[µOP]] constituents and are scheduled for execution. Roughly 20% more µOPs are generated from the MOPs. From here, µOPs are sent to the instruction issue which controls when they can be dispatched to the execution pipelines. µOPs are queued in eight independent issue queues (120 entries in total).
  
 
===== Execution Units =====
 
===== Execution Units =====
 
The A76 issue is 8-wide, allow for up to eight µOPs to execute each cycle. The execution units can be grouped into three categories: integer, advanced SIMD, and memory.
 
The A76 issue is 8-wide, allow for up to eight µOPs to execute each cycle. The execution units can be grouped into three categories: integer, advanced SIMD, and memory.
  
There are four pipelines in the integer cluster - three for general math operations and a dedicate branch ALU. All three ports have a simple ALU. Those perform arithmetic and logical data processing operations. The third port has support for complex arithmetic (e.g. MAC, DIV).
+
There are four pipelines in the integer cluster - three for general math operations and a dedicate branch ALU. All three ports have a simple ALU. Those perform arithmetic and logical data processing operations. The third port has support for complex arithmetic (e.g. MAC, DIV). Each port is served by an independent 16-entry issue queue.
 +
 
 +
There are two {{arm|ASIMD}}/FP execution pipelines. Those are served by two 16-entry issue queues of their own. In the {{\\|Cortex-A75}}, each of the pipelines were 64-bit wide, on the A76, they were doubled to 128-bit. This means each pipeline is capable of 2 [[double-precision]] operations, 4 single-precision, 8 half-precision, or 16 8-bit integer operations. On the A76, those pipelines can also execute the cryptographic instructions if the extension is supported (not offered by default and requires an additional license from [[Arm]]).
 +
 
 +
Beyond the throughput improvements, the A76 improved some critical execution latencies.
  
There are two {{arm|ASIMD}}/FP execution pipelines. In the {{\\|Cortex-A75}}, each of the pipelines were 64-bit wide, on the A76, they were doubled to 128-bit. This means each pipeline is capable of 2 [[double-precision]] operations, 4 single-precision, 8 half-precision, or 16 8-bit integer operations. On the A76, those pipelines can also execute the cryptographic instructions if the extension is supported (not offered by default and requires an additional license from [[Arm]]).
+
:[[File:a76 latency comprison.svg|thumb|left|700px|Generational improvements in execution latencies.]]
 +
 
 +
{{clear}}
  
 
===== Memory subsystem =====
 
===== Memory subsystem =====
The A76 includes two ports with an [[address-generation unit]] on each. The [[level 1 data cache]] is fixed at 64 KiB and can have an optional ECC protection per 32 bits. It is [[virtually indexed, physically tagged]] which behaves as a [[physically indexed, physically tagged]] 4-way set-associative cache. The L1D cache implements a [[pseudo-LRU]] [[cache replacement]] policy. It features a 4-cycle fastest load-to-use latency with two read ports and one write port meaning it can do two 16B loads/cycle and one 32B store/cycle. From the L1, the A76 supports up to 20 outstanding non-prefetch misses. The load buffer is 68 entries deep while the store buffer is 72-entry deep. In total, the A76 can have 140 simultaneous memory operations in-flight which is actually 25% more than the A76 instruction window.
+
The A76 includes two ports with an [[address-generation unit]] on each - each supporting both loads and stores. The [[level 1 data cache]] is fixed at 64 KiB and can have an optional ECC protection per 32 bits. It is [[virtually indexed, physically tagged]] which behaves as a [[physically indexed, physically tagged]] 4-way set-associative cache. The L1D cache implements a [[pseudo-LRU]] [[cache replacement]] policy. It features a 4-cycle fastest load-to-use latency with two read ports and one write port meaning it can do two 16B loads/cycle and one 32B store/cycle. From the L1, the A76 supports up to 20 outstanding non-prefetch misses. The load buffer is 68 entries deep while the store buffer is 72-entry deep. In total, the A76 can have 140 simultaneous memory operations in-flight which is actually 25% more than the A76 instruction window.
  
The A76 can be configured with either 128, 256 or 512 KiB of [[level 2 cache]] and is ECC protected per 64 bits. There is a 256-bit write interface to the L2 and a 256-bit read interface from the L2 cache. The fastest load-to-use latency is 9 cycles. The L2 can support up to 46 outstanding misses to the L3 which is located in the {{armh|DSU}} itself. The L3, which is shared by all the cores in the {{armh|DynamIQ big.LITTLE}} and is configurable in size ranging from 2 MiB to 4 MiB with load-to-use ranging from 26 to 31 cycles. As with the L2, up to 32 bytes may be transferred from or to the L2 from the L3 cache. Up to 94 outstanding misses are supported from the L3 to main memory.
+
The A76 can be configured with either 128, 256 or 512 KiB of [[level 2 cache]]. It implements a [[dynamic biased replacement]] policy and is ECC protected per 64 bits. The L2 is strictly inclusive of the L1 data cache and non-inclusive of the L1 instruction cache. There is a 256-bit write interface to the L2 and a 256-bit read interface from the L2 cache. The fastest load-to-use latency is 9 cycles. The L2 can support up to 46 outstanding misses to the L3 which is located in the {{armh|DSU}} itself. The L3, which is shared by all the cores in the {{armh|DynamIQ big.LITTLE}} and is configurable in size ranging from 2 MiB to 4 MiB with load-to-use ranging from 26 to 31 cycles. As with the L2, up to two 32 bytes may be transferred from or to the L2 from the L3 cache. Up to 94 outstanding misses are supported from the L3 to main memory.
  
 
In addition to controlling memory accesses, ordering, and [[cache policies]], the MMU is also responsible for the translation of virtual addresses to physical addresses on the system. This is done through a set of virtual-to-physical address mappings and attributes that are held in translation tables. The physical address size here is 40 bits. The Cortex-A76 incorporates a dedicated L1 TLB for instruction cache and another one for the data cache. Both the ITLB and the DTLB are 48-entry deep and are fully associative. On a memory access operation, the A76 will first perform lookup in there. If there is a miss in the L1 TLBs, the MMU will perform a lookup for the requested entry in the second-level TLB.
 
In addition to controlling memory accesses, ordering, and [[cache policies]], the MMU is also responsible for the translation of virtual addresses to physical addresses on the system. This is done through a set of virtual-to-physical address mappings and attributes that are held in translation tables. The physical address size here is 40 bits. The Cortex-A76 incorporates a dedicated L1 TLB for instruction cache and another one for the data cache. Both the ITLB and the DTLB are 48-entry deep and are fully associative. On a memory access operation, the A76 will first perform lookup in there. If there is a miss in the L1 TLBs, the MMU will perform a lookup for the requested entry in the second-level TLB.
Line 148: Line 181:
  
 
The TLB entries store one or both of the global indicator and an address space identifier (ASID), allowing context switching without TLB invalidation as well as a virtual machine identifier (VMID) which allows for VM switching by the hypervisor without TLB invalidation.
 
The TLB entries store one or both of the global indicator and an address space identifier (ASID), allowing context switching without TLB invalidation as well as a virtual machine identifier (VMID) which allows for VM switching by the hypervisor without TLB invalidation.
 +
 +
== All Cortex-A76 Processors ==
 +
<!-- NOTE:
 +
          This table is generated automatically from the data in the actual articles.
 +
          If a microprocessor is missing from the list, an appropriate article for it needs to be
 +
          created and tagged accordingly.
 +
 +
          Missing a chip? please dump its name here: https://en.wikichip.org/wiki/WikiChip:wanted_chips
 +
-->
 +
{{comp table start}}
 +
<table class="comptable sortable tc4 tc6 tc9">
 +
{{comp table header|main|8:List of Cortex-A76-based Processors}}
 +
{{comp table header|main|6:Main processor|2:Integrated Graphics}}
 +
{{comp table header|cols|Family|Launched|Process|Arch|Cores|%Frequency|GPU|%Frequency}}
 +
{{#ask: [[Category:all microprocessor models]] [[microarchitecture::Cortex-A76]]
 +
|?full page name
 +
|?model number
 +
|?family
 +
|?first launched
 +
|?process
 +
|?microarchitecture
 +
|?core count
 +
|?base frequency#GHz
 +
|?integrated gpu
 +
|?integrated gpu base frequency
 +
|format=template
 +
|template=proc table 3
 +
|userparam=10
 +
|mainlabel=-
 +
|valuesep=,
 +
}}
 +
{{comp table count|ask=[[Category:all microprocessor models]] [[microarchitecture::Cortex-A76]]}}
 +
</table>
 +
{{comp table end}}
  
 
== Bibliography ==
 
== Bibliography ==
 
* Arm Tech Day, 2018
 
* Arm Tech Day, 2018

Latest revision as of 17:15, 17 October 2020

Edit Values
Cortex-A76 µarch
General Info
Arch TypeCPU
DesignerARM Holdings
ManufacturerTSMC
IntroductionMay 31, 2018
Process12 nm, 7 nm, 5 nm
Core Configs1, 2, 4, 6, 8
Pipeline
TypeSuperscalar, Pipelined
OoOEYes
SpeculativeYes
Reg RenamingYes
Stages13
Decode4-way
Instructions
ISAARMv8.2
ExtensionsFPU, NEON
Cache
L1I Cache64 KiB/core
4-way set associative
L1D Cache64 KiB/core
4-way set associative
L2 Cache128-512 KiB/core
8-way set associative
L3 Cache0-4 MiB/Cluster
16-way set associative
Succession

Cortex-A76 (codename Enyo) is the successor to the Cortex-A75, a low-power high-performance ARM microarchitecture designed by ARM Holdings for the mobile market. Enyo was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable IP core and is sold to other semiconductor companies to be implemented in their own chips. The Cortex-A76, which implemented the ARMv8.2 ISA, is the a performant core which is often combined with a number of lower power cores (e.g. Cortex-A55) in a DynamIQ big.LITTLE configuration to achieve better energy/performance.

History[edit]

Cortex-A76 and future cores roadmap.

Development of the Cortex-A76 started in 2013. Arm formally announced Enyo during Computex on May 31 2018.

Process Technology[edit]

Though the Cortex-A76 may be fabricated on various different process nodes, it has been primarily designed for the 12 nm, 7 nm, and 5 nm process nodes.

Compiler support[edit]

Compiler Arch-Specific Arch-Favorable
Arm Compiler -mcpu=cortex-a76 -mtune=cortex-a76
GCC -mcpu=cortex-a76 -mtune=cortex-a76
LLVM -march=? -mtune=?

If the Cortex-A76 is coupled with the Cortex-A55 in a big.LITTLE system, GCC also supports the following option:

Compiler Tune
GCC -mtune=cortex-a76.cortex-a55

Architecture[edit]

Key changes from Cortex-A75[edit]

  • Significant IPC uplift (Arm self-reported around 20% IPC on SPEC CPU2006/SPEC CPU2017 int)
  • Front-end
    • Branch-prediction
      • Improved accuracy
      • Decoupled from the instruction fetch
      • Runahead 32B/cycle
  • Execution engine
    • 1.33x wider decode (4-way, up from 3-way)
    • 1.33x wider rename/commit (4-way, up from 3-way)
    • 1.6x wider dispatch (8 µOPs/cycle, up from 5)

This list is incomplete; you can help by expanding it.

Block Diagram[edit]

Typical SoC[edit]

cortex-a76 soc block diagram.svg

Individual Core[edit]

cortex-a76 block diagram.svg

Memory Hierarchy[edit]

The Cortex-A76 has a private L1I, L1D, and L2 cache.

  • Cache
    • L1I Cache
      • 64 KiB, 4-way set associative
      • 64-byte cache lines
      • Optional parity protection
      • Write-back
    • L1D Cache
      • 64 KiB, 4-way set associative
      • 64-byte cache lines
      • 4-cycle fastest load-to-use latency
      • Optional ECC protection per 32 bits
      • Write-back
    • L2 Cache
      • 256 KiB OR 512 KiB (2 banks)
      • 8-way set associative
      • 9-cycle fastest load-to-use latency
      • optional ECC protection per 64 bits
      • Modified Exclusive Shared Invalid (MESI) coherency
      • Strictly inclusive of the L1 data cache & non-inclusive of the L1 instruction cache
      • Write-back
    • L3 Cache
      • 2 MiB to 4 MiB, 16-way set associative
      • 26-31 cycles load-to-use
      • Shared by all the cores in the cluster

The A76 TLB consists of dedicated L1 TLB for instruction cache (ITLB) and another one for data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

  • TLBs
    • ITLB
      • 4 KiB, 16 KiB, 64 KiB, 2 MiB, and 32 MiB page sizes
      • 48-entry fully associative
    • DTLB
      • 48-entry fully associative
      • 4 KiB, 16 KiB, 64 KiB, 2 MiB, and 512 MiB page sizes
    • STLB
      • 1280-entry 5-way set associative

Overview[edit]

The Cortex-A76 is a high-performance synthesizable core designed by Arm as the successor to the Cortex-A75. It is delivered as Register Transfer Level (RTL) description in Verilog and is designed. This core supports the ARMv8.2 extension as well as a number of other partial extensions. The A76 is a 4-way superscalar out-of-order processor with a private level 1 and level 2 caches. It is designed to be implemented inside the DynamIQ Shared Unit (DSU) cluster along with other cores. The DSU cluster supports up to eight cores of any combination (e.g., with little cores such as the Cortex-A55 or other just more Cortex-A76).

Core[edit]

Cortex-A76 on 7nm compared to the Cortex-A75 on 10nm.

The Cortex-A76 succeeds the Cortex-A75. It is designed to take advantage of the 7 nm node in order to deliver up to 40% higher performance at the same power level (measured at 750 mW/core), or alternatively, up to 50% lower power for the same performance compared to the Cortex-A75 on the 10 nm node. This is achieved through a combination of both microarchitectural improvements as well as process technology advantages. It's worth noting that the A76 brings higher performance at a slight hit to the area by going wider. On the 7 nm process, the Cortex-A76 targets frequencies of 3 GHz and higher.

Pipeline[edit]

The Cortex-A76 is a complex, 4-way superscalar out-of-order processor with an 8-issue back end. The pipeline is 13 stages with an 11-cycle branch misprediction penalty. It has a 64 KiB level 1 instruction cache and a 64 KiB level 1 data cache along with a private level 2 cache that is configurable as either 256 KiB (1 bank) or 512 KiB (2 banks).

Front-end[edit]

Each cycle, up to 16 bytes are fetched from the L1 instruction cache. The instruction fetch works in tandem with the branch predictor in order to ensure the instruction stream is constantly ready to be fetched. Additionally, there is a return stack which stores the address and instruction set state (AArch32/R14 or AArch64/X30) on branches. On a return (e.g., ret on AArch64), the return stack will pop.

Keeping the instruction stream feed is the task of the branch prediction unit. The branch prediction unit on the A76 is decoupled from the instruction fetch, allowing it to run ahead and in parallel with the instruction fetch to hide branch prediction latency. To that end, it now operates on 32-byte instruction windows, twice the fetch size. The main branch target buffer on the A76 is 6K-entries deep. The BPU comprises three stages in order to reduce latency with a 64-entry micro-BTB and a smaller 16-entry BTB.

The Cortex-A76 has a fixed 64 KiB L1I cache. It is virtually indexed, physically tagged (VIPT), which behaves as a physically indexed, physically tagged (PIPT) 4-way set-associative cache. The L1I$ supports optional parity protection and implements a pseudo-LRU cache replacement policy. The instruction cache has a 256-bit read interface from the L2 cache. Each cycle up to 32 bytes may be transferred to the L1I cache from the shared L2 cache.

From the instruction fetch, up to four 32-bit instructions are sent to the decode queue (DQ) each cycle. For narrower 16-bit instructions (i.e., Thumb), this means up to eight instructions get queued. The A76 features a 4-way decode. Each cycle, up to four instructions may be decoded into a relatively semi-complex macro-operations (MOPs). There are on average 6% more MOPs than instructions. In total two cycles are involved in this operation - one for alignment and one for decode.

Back-end[edit]

The Cortex-A76 back-end handles the execution of out-of-order operations. The design is largely inherited from the Cortex-A75 but has been adjusted for higher throughput.

Renaming & Allocation[edit]

From the front-end, up to four macro-operations may be sent each cycle to be renamed. The A76 has a capacity to handle up to 128 instructions in flight, a number that has no increased for a long time (the Cortex-A72 and even the Cortex-A57 had an out-of-order window of 128 instructions). Micro-operations are broken down into their µOP constituents and are scheduled for execution. Roughly 20% more µOPs are generated from the MOPs. From here, µOPs are sent to the instruction issue which controls when they can be dispatched to the execution pipelines. µOPs are queued in eight independent issue queues (120 entries in total).

Execution Units[edit]

The A76 issue is 8-wide, allow for up to eight µOPs to execute each cycle. The execution units can be grouped into three categories: integer, advanced SIMD, and memory.

There are four pipelines in the integer cluster - three for general math operations and a dedicate branch ALU. All three ports have a simple ALU. Those perform arithmetic and logical data processing operations. The third port has support for complex arithmetic (e.g. MAC, DIV). Each port is served by an independent 16-entry issue queue.

There are two ASIMD/FP execution pipelines. Those are served by two 16-entry issue queues of their own. In the Cortex-A75, each of the pipelines were 64-bit wide, on the A76, they were doubled to 128-bit. This means each pipeline is capable of 2 double-precision operations, 4 single-precision, 8 half-precision, or 16 8-bit integer operations. On the A76, those pipelines can also execute the cryptographic instructions if the extension is supported (not offered by default and requires an additional license from Arm).

Beyond the throughput improvements, the A76 improved some critical execution latencies.

Generational improvements in execution latencies.
Memory subsystem[edit]

The A76 includes two ports with an address-generation unit on each - each supporting both loads and stores. The level 1 data cache is fixed at 64 KiB and can have an optional ECC protection per 32 bits. It is virtually indexed, physically tagged which behaves as a physically indexed, physically tagged 4-way set-associative cache. The L1D cache implements a pseudo-LRU cache replacement policy. It features a 4-cycle fastest load-to-use latency with two read ports and one write port meaning it can do two 16B loads/cycle and one 32B store/cycle. From the L1, the A76 supports up to 20 outstanding non-prefetch misses. The load buffer is 68 entries deep while the store buffer is 72-entry deep. In total, the A76 can have 140 simultaneous memory operations in-flight which is actually 25% more than the A76 instruction window.

The A76 can be configured with either 128, 256 or 512 KiB of level 2 cache. It implements a dynamic biased replacement policy and is ECC protected per 64 bits. The L2 is strictly inclusive of the L1 data cache and non-inclusive of the L1 instruction cache. There is a 256-bit write interface to the L2 and a 256-bit read interface from the L2 cache. The fastest load-to-use latency is 9 cycles. The L2 can support up to 46 outstanding misses to the L3 which is located in the DSU itself. The L3, which is shared by all the cores in the DynamIQ big.LITTLE and is configurable in size ranging from 2 MiB to 4 MiB with load-to-use ranging from 26 to 31 cycles. As with the L2, up to two 32 bytes may be transferred from or to the L2 from the L3 cache. Up to 94 outstanding misses are supported from the L3 to main memory.

In addition to controlling memory accesses, ordering, and cache policies, the MMU is also responsible for the translation of virtual addresses to physical addresses on the system. This is done through a set of virtual-to-physical address mappings and attributes that are held in translation tables. The physical address size here is 40 bits. The Cortex-A76 incorporates a dedicated L1 TLB for instruction cache and another one for the data cache. Both the ITLB and the DTLB are 48-entry deep and are fully associative. On a memory access operation, the A76 will first perform lookup in there. If there is a miss in the L1 TLBs, the MMU will perform a lookup for the requested entry in the second-level TLB.

There is a unified level 2 TLB comprising of 1280 entries organized as 5-way set associative which is shared by both instruction and data. The STLB handles misses from the instruction and data L1 TLBs. Typically, STLB accesses take three cycles, however, longer latencies are possible when a different block or page size mapping is used. If there is a miss in the L2 TLB, the MMU will resort to a hardware translation table walk. Up to four TLB misses (i.e., translations table walks) can be performed in parallel. The STLB will stall if there are six successive misses. During table walks, the STLB can still perform up to two TLB lookups.

The TLB entries store one or both of the global indicator and an address space identifier (ASID), allowing context switching without TLB invalidation as well as a virtual machine identifier (VMID) which allows for VM switching by the hypervisor without TLB invalidation.

All Cortex-A76 Processors[edit]

 List of Cortex-A76-based Processors
 Main processorIntegrated Graphics
ModelFamilyLaunchedProcessArchCoresFrequencyGPUFrequency
800DimensityMarch 20207 nm
0.007 μm
7.0e-6 mm
Cortex-A76, Cortex-A5582 GHz
2,000 MHz
2,000,000 kHz
Mali-G57
990Exynos20207 nm
0.007 μm
7.0e-6 mm
Mongoose 5, Cortex-A76, Cortex-A5583.016 GHz
3,016 MHz
3,016,000 kHz
, 2.6 GHz
2,600 MHz
2,600,000 kHz
, 2.106 GHz
2,106 MHz
2,106,000 kHz
Mali-G77832 MHz
0.832 GHz
832,000 KHz
V9Exynos Auto3 January 20188 nm
0.008 μm
8.0e-6 mm
Cortex-A7682.1 GHz
2,100 MHz
2,100,000 kHz
Mali-G76
810Kirin21 June 20197 nm
0.007 μm
7.0e-6 mm
Cortex-A76, Cortex-A5582.27 GHz
2,270 MHz
2,270,000 kHz
, 1.88 GHz
1,880 MHz
1,880,000 kHz
Mali-G52850 MHz
0.85 GHz
850,000 KHz
980Kirin31 August 20187 nm
0.007 μm
7.0e-6 mm
Cortex-A76, Cortex-A5582.6 GHz
2,600 MHz
2,600,000 kHz
, 1.92 GHz
1,920 MHz
1,920,000 kHz
, 1.8 GHz
1,800 MHz
1,800,000 kHz
Mali-G76720 MHz
0.72 GHz
720,000 KHz
990 4GKirin6 September 20197 nm
0.007 μm
7.0e-6 mm
Cortex-A76, Cortex-A5582.86 GHz
2,860 MHz
2,860,000 kHz
, 1.86 GHz
1,860 MHz
1,860,000 kHz
, 2.088 GHz
2,088 MHz
2,088,000 kHz
Mali-G76600 MHz
0.6 GHz
600,000 KHz
990 5GKirin6 September 2019Cortex-A76, Cortex-A5582.86 GHz
2,860 MHz
2,860,000 kHz
, 2.36 GHz
2,360 MHz
2,360,000 kHz
, 1.95 GHz
1,950 MHz
1,950,000 kHz
Mali-G76600 MHz
0.6 GHz
600,000 KHz
SDM675Snapdragon 60022 October 201811 nm
0.011 μm
1.1e-5 mm
Cortex-A76, Cortex-A5582 GHz
2,000 MHz
2,000,000 kHz
, 1.7 GHz
1,700 MHz
1,700,000 kHz
Adreno 612
Snapdragon 720GSnapdragon 70020 January 20208 nm
0.008 μm
8.0e-6 mm
Cortex-A55, Cortex-A7681.8 GHz
1,800 MHz
1,800,000 kHz
, 2.3 GHz
2,300 MHz
2,300,000 kHz
Adreno 618500 MHz
0.5 GHz
500,000 KHz
SDM730Snapdragon 7009 April 20198 nm
0.008 μm
8.0e-6 mm
Cortex-A55, Cortex-A7681.8 GHz
1,800 MHz
1,800,000 kHz
, 2.2 GHz
2,200 MHz
2,200,000 kHz
Adreno 618500 MHz
0.5 GHz
500,000 KHz
SDM730GSnapdragon 7009 April 20198 nm
0.008 μm
8.0e-6 mm
Cortex-A76, Cortex-A5582.2 GHz
2,200 MHz
2,200,000 kHz
, 1.8 GHz
1,800 MHz
1,800,000 kHz
Adreno 618575 MHz
0.575 GHz
575,000 KHz
SDM855Snapdragon 800March 20197 nm
0.007 μm
7.0e-6 mm
Cortex-A55, Cortex-A7681.8 GHz
1,800 MHz
1,800,000 kHz
, 2.42 GHz
2,420 MHz
2,420,000 kHz
, 2.84 GHz
2,840 MHz
2,840,000 kHz
Adreno 640 GPU257 MHz
0.257 GHz
257,000 KHz
SDM855ACSnapdragon 800March 20197 nm
0.007 μm
7.0e-6 mm
Cortex-A76, Cortex-A5581.8 GHz
1,800 MHz
1,800,000 kHz
, 2.42 GHz
2,420 MHz
2,420,000 kHz
, 2.96 GHz
2,960 MHz
2,960,000 kHz
Adreno 640 GPU250 MHz
0.25 GHz
250,000 KHz
8cxSnapdragon 800March 20197 nm
0.007 μm
7.0e-6 mm
Cortex-A76, Cortex-A5581.8 GHz
1,800 MHz
1,800,000 kHz
, 2.84 GHz
2,840 MHz
2,840,000 kHz
, 3.02 GHz
3,020 MHz
3,020,000 kHz
Adreno 680 GPU
Count: 14

Bibliography[edit]

  • Arm Tech Day, 2018
codenameCortex-A76 +
core count1 +, 2 +, 4 +, 6 + and 8 +
designerARM Holdings +
first launchedMay 31, 2018 +
full page namearm holdings/microarchitectures/cortex-a76 +
instance ofmicroarchitecture +
instruction set architectureARMv8.2 +
manufacturerTSMC +
microarchitecture typeCPU +
nameCortex-A76 +
pipeline stages13 +
process12 nm (0.012 μm, 1.2e-5 mm) +, 7 nm (0.007 μm, 7.0e-6 mm) + and 5 nm (0.005 μm, 5.0e-6 mm) +