From WikiChip
Editing arm holdings/microarchitectures/cortex-a76
Latest revision | Your text | ||
Line 6: | Line 6: | ||
|manufacturer=TSMC | |manufacturer=TSMC | ||
|introduction=May 31, 2018 | |introduction=May 31, 2018 | ||
− | |process | + | |process=7 nm |
− | |||
− | |||
|cores=1 | |cores=1 | ||
|cores 2=2 | |cores 2=2 | ||
|cores 3=4 | |cores 3=4 | ||
− | |||
− | |||
− | |||
− | |||
|oooe=Yes | |oooe=Yes | ||
|speculative=Yes | |speculative=Yes | ||
Line 39: | Line 33: | ||
|predecessor=Cortex-A75 | |predecessor=Cortex-A75 | ||
|predecessor link=arm holdings/microarchitectures/cortex-a75 | |predecessor link=arm holdings/microarchitectures/cortex-a75 | ||
− | |successor= | + | |successor=Deimos |
− | |successor link=arm holdings/microarchitectures/ | + | |successor link=arm holdings/microarchitectures/deimos |
+ | |contemporary=Ares | ||
+ | |contemporary link=arm holdings/microarchitectures/ares | ||
}} | }} | ||
− | '''Cortex-A76''' (codename '''Enyo''') is the successor to the {{armh|Cortex-A75|l=arch}}, a low-power high-performance [[ARM]] [[microarchitecture]] designed by [[ARM Holdings]] for the mobile market | + | '''Cortex-A76''' (codename '''Enyo''') is the successor to the {{armh|Cortex-A75|l=arch}}, a low-power high-performance [[ARM]] [[microarchitecture]] designed by [[ARM Holdings]] for the mobile market. This microarchitecture is designed as a synthesizable [[IP core]] and is sold to other semiconductor companies to be implemented in their own chips. The Cortex-A76, which implements the {{arm|ARMv8.2}} ISA, is a performant core that is often combined with a number of lower-power cores (e.g. {{\\|Cortex-A55}}) in a {{armh|DynamIQ big.LITTLE}} configuration to achieve a better energy/performance trade-off. |
== History == | == History == | ||
− | + | Development of the Cortex-A76 started in 2013. [[Arm]] formally announced Enyo during Arm Tech Day on May 31, 2018. | |
− | Development of the Cortex-A76 started in 2013. [[Arm]] formally announced Enyo during | ||
== Process Technology == | == Process Technology == | ||
Line 63: | Line 58: | ||
|} | |} | ||
− | If the Cortex-A76 is coupled with the {{\\|Cortex-A55}} in a | + | If the Cortex-A76 is coupled with the {{\\|Cortex-A55}} in a big.LITTLE system, GCC also supports the following option: |
{| class="wikitable" | {| class="wikitable" | ||
Line 74: | Line 69: | ||
== Architecture == | == Architecture == | ||
=== Key changes from {{\\|Cortex-A75}} === | === Key changes from {{\\|Cortex-A75}} === | ||
=== Block Diagram === | === Block Diagram === | ||
==== Typical SoC ==== | ==== Typical SoC ==== | ||
Line 99: | Line 81: | ||
*** 64 KiB, 4-way set associative | *** 64 KiB, 4-way set associative | ||
*** 64-byte cache lines | *** 64-byte cache lines | ||
− | *** | + | *** optional parity protection |
− | |||
** L1D Cache | ** L1D Cache | ||
*** 64 KiB, 4-way set associative | *** 64 KiB, 4-way set associative | ||
*** 64-byte cache lines | *** 64-byte cache lines | ||
*** 4-cycle fastest load-to-use latency | *** 4-cycle fastest load-to-use latency | ||
− | *** | + | *** optional ECC protection per 32 bits |
− | |||
** L2 Cache | ** L2 Cache | ||
*** 256 KiB OR 512 KiB (2 banks) | *** 256 KiB OR 512 KiB (2 banks) | ||
Line 113: | Line 93: | ||
*** optional ECC protection per 64 bits | *** optional ECC protection per 64 bits | ||
*** [[Modified Exclusive Shared Invalid]] (MESI) coherency | *** [[Modified Exclusive Shared Invalid]] (MESI) coherency | ||
− | |||
− | |||
** L3 Cache | ** L3 Cache | ||
*** 2 MiB to 4 MiB, 16-way set associative | *** 2 MiB to 4 MiB, 16-way set associative | ||
Line 137: | Line 115: | ||
== Core == | == Core == | ||
− | + | The Cortex-A76 succeeds the {{\\|Cortex-A75}}. It is designed to take advantage of the [[7 nm]] node in order to deliver up to 35% higher performance and up to 40% lower power (compared to the A75 on the [[10 nm]] node). It's worth noting that the A76 brings higher performance at a slight hit to the area by going wider. On the [[7 nm process]], the Cortex-A76 targets frequencies of 3 GHz and higher. | |
− | The Cortex-A76 succeeds the {{\\|Cortex-A75}}. It is designed to take advantage of the [[7 nm]] node in order to deliver up to | ||
=== Pipeline === | === Pipeline === | ||
− | The Cortex-A76 is a complex, 4-way superscalar out-of-order processor with an 8-issue back end. The pipeline is 13 stages with an 11-cycle branch misprediction penalty. It has a 64 KiB [[level 1]] [[instruction cache]] and a 64 KiB [[level 1]] [[data cache]] along with a private [[level 2 cache]] that is configurable as either 256 KiB (1 bank) or 512 KiB (2 banks) | + | The Cortex-A76 is a complex, 4-way superscalar out-of-order processor with an 8-issue back end. The pipeline is 13 stages with an 11-cycle branch misprediction penalty. It has a 64 KiB [[level 1]] [[instruction cache]] and a 64 KiB [[level 1]] [[data cache]] along with a private [[level 2 cache]] that is configurable as either 256 KiB (1 bank) or 512 KiB (2 banks). |
==== Front-end ==== | ==== Front-end ==== | ||
− | Each cycle, up to 16 bytes are fetched from the [[L1 instruction cache]]. The instruction fetch works in tandem with the branch predictor in order to ensure the instruction stream is | + | Each cycle, up to 16 bytes are fetched from the [[L1 instruction cache]]. The instruction fetch works in tandem with the branch predictor in order to ensure the instruction stream is ready to be fetched. Additionally, there is a return stack which stores the address and instruction set state ({{arm|AArch32}}/R14 or {{arm|AArch64}}/X30) on branches. On a return (e.g., <code>ret</code> on {{arm|AArch64}}), the return stack will pop. The branch prediction unit (BPU) operates on 32-byte instruction windows, twice the fetch size. The Cortex-A76 has a fixed 64 KiB L1I cache. It is [[Virtually Indexed, Physically Tagged]] (VIPT), which behaves as a [[Physically Indexed, Physically Tagged]] (PIPT) 4-way set-associative cache. The L1I$ supports optional parity protection and implements a [[pseudo-LRU]] [[cache replacement]] policy. The instruction cache has a 256-bit read interface from the L2 cache. Each cycle, up to 32 bytes may be transferred to the L1I cache from the private L2 cache. |
− | + | From the instruction fetch, up to four 32-bit instructions are sent to the decode queue (DQ) each cycle. For narrower 16-bit instructions (i.e., {{arm|Thumb}}), this means up to eight instructions get queued. The A76 features a 4-way decode. Up to four instructions may be decoded into [[macro-operations]] each cycle. | |
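The front-end figures above can be cross-checked with a little arithmetic. The sketch below derives the L1I geometry from the quoted size, associativity and line length, and counts how many instructions fit in one 16-byte fetch. The 4 KiB page size is an assumption made for illustration (the article does not name a translation granule), and the helper names are hypothetical.

```python
# Back-of-the-envelope front-end arithmetic from the figures quoted above.
# ASSUMPTION: 4 KiB pages; the article does not state a translation granule.

def cache_geometry(size_bytes, ways, line_bytes, page_bytes=4096):
    """Return set count, bytes per way, and how many index bits lie above the page offset."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1      # log2 of the line size
    index_bits = sets.bit_length() - 1             # log2 of the set count
    page_offset_bits = page_bytes.bit_length() - 1
    virtual_index_bits = max(0, offset_bits + index_bits - page_offset_bits)
    return {
        "sets": sets,
        "way_bytes": size_bytes // ways,
        "index_bits": index_bits,
        "virtual_index_bits": virtual_index_bits,
    }

def instructions_per_fetch(fetch_bytes, instr_bytes):
    """Whole instructions that fit in one fetch block."""
    return fetch_bytes // instr_bytes

l1i = cache_geometry(64 * 1024, ways=4, line_bytes=64)
print(l1i)
print(instructions_per_fetch(16, 4), "x 32-bit or",
      instructions_per_fetch(16, 2), "x 16-bit instructions per fetch")
```

Under the assumed 4 KiB pages, each 16 KiB way is larger than a page, so two index bits come from the virtual page number. That is why a VIPT cache of this shape needs hardware alias handling in order to behave like a PIPT cache.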
− | |||
− | |||
− | |||
− | From the instruction fetch, up to four 32-bit instructions are sent to the decode queue (DQ) each cycle. For narrower 16-bit instructions (i.e., {{arm|Thumb}}), this means up to eight instructions get queued. The A76 features a 4-way decode. | ||
==== Back-end ==== | ==== Back-end ==== | ||
Line 156: | Line 129: | ||
===== Renaming & Allocation ===== | ===== Renaming & Allocation ===== | ||
− | From the front-end, up to four [[macro-operations]] may be sent each cycle to be renamed. The | + | From the front-end, up to four [[macro-operations]] may be sent each cycle to be renamed. The ROB has a capacity of up to 128 instructions in flight. [[Macro-operations]] are broken down into their [[µOP]] constituents and are scheduled for execution. Roughly 20% more µOPs are generated from the MOPs. From here, µOPs are sent to the instruction issue, which controls when they can be dispatched to the execution pipelines. µOPs are queued in eight independent issue queues (120 entries in total). |
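To put the rename and issue figures side by side, here is a small sketch of the average µOP rate implied by a 4-wide rename stage and the roughly 20% MOP-to-µOP expansion quoted above. It is a steady-state average only; the constant and function names are illustrative and do not model real dispatch rules.

```python
# Average µOP throughput implied by the rename/issue figures in the text.
# This is a steady-state estimate, not a model of actual dispatch constraints.

RENAME_WIDTH_MOPS = 4      # MOPs renamed per cycle (from the text)
UOP_EXPANSION = 1.20       # "roughly 20% more µOPs" than MOPs (from the text)
ISSUE_WIDTH_UOPS = 8       # µOPs that may issue per cycle (from the text)

def avg_uops_per_cycle(mops=RENAME_WIDTH_MOPS, expansion=UOP_EXPANSION):
    """Average µOPs produced per cycle when rename is fully fed."""
    return mops * expansion

def issue_utilisation(uops_per_cycle, issue_width=ISSUE_WIDTH_UOPS):
    """Fraction of issue slots that the average µOP stream would occupy."""
    return uops_per_cycle / issue_width

uops = avg_uops_per_cycle()
print(f"average µOPs/cycle: {uops:.1f}")
print(f"issue slot utilisation: {issue_utilisation(uops):.0%}")
```

On these numbers the 8-wide issue stage has clear headroom over the average µOP stream, which lets it drain bursts that have built up in the issue queues.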
===== Execution Units ===== | ===== Execution Units ===== | ||
The A76 issue is 8-wide, allowing for up to eight µOPs to execute each cycle. The execution units can be grouped into three categories: integer, advanced SIMD, and memory. | The A76 issue is 8-wide, allowing for up to eight µOPs to execute each cycle. The execution units can be grouped into three categories: integer, advanced SIMD, and memory. | ||
− | There are four pipelines in the integer cluster - three for general math operations and a dedicated branch ALU. All three ports have a simple ALU. Those perform arithmetic and logical data processing operations. The third port has support for complex arithmetic (e.g. MAC, DIV) | + | There are four pipelines in the integer cluster - three for general math operations and a dedicated branch ALU. All three ports have a simple ALU. Those perform arithmetic and logical data processing operations. The third port has support for complex arithmetic (e.g. MAC, DIV). |
− | |||
− | |||
− | |||
− | |||
− | + | There are two {{arm|ASIMD}}/FP execution pipelines. In the {{\\|Cortex-A75}}, each of the pipelines was 64 bits wide; on the A76, they were doubled to 128 bits. This means each pipeline is capable of 2 [[double-precision]] operations, 4 single-precision, 8 half-precision, or 16 8-bit integer operations. On the A76, those pipelines can also execute the cryptographic instructions if the extension is supported (not offered by default and requires an additional license from [[Arm]]). | |
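The per-pipeline operation counts above follow directly from dividing the 128-bit datapath by the element width. A minimal sketch of that arithmetic is below; the names are illustrative, and the two-pipe totals are peak lane counts that ignore any issue restrictions.

```python
# Lane counts for a 128-bit ASIMD/FP pipeline, as described in the text.

PIPE_WIDTH_BITS = 128   # per-pipeline datapath width on the A76 (from the text)
NUM_PIPES = 2           # ASIMD/FP pipelines (from the text)

ELEMENT_BITS = {
    "double-precision": 64,
    "single-precision": 32,
    "half-precision": 16,
    "8-bit integer": 8,
}

def lanes(element_bits, pipe_bits=PIPE_WIDTH_BITS):
    """Number of elements of the given width that fit in one pipeline."""
    return pipe_bits // element_bits

for name, bits in ELEMENT_BITS.items():
    per_pipe = lanes(bits)
    print(f"{name:>17}: {per_pipe:2d} per pipe, {per_pipe * NUM_PIPES:2d} across both pipes")
```

Calling `lanes(bits, pipe_bits=64)` gives the corresponding figures for the 64-bit pipelines of the Cortex-A75, which are half as many in every case.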
− | |||
− | {{ | ||
===== Memory subsystem ===== | ===== Memory subsystem ===== | ||
− | The A76 includes two ports with an [[address-generation unit]] on each | + | The A76 includes two ports with an [[address-generation unit]] on each. The [[level 1 data cache]] is fixed at 64 KiB and can optionally be ECC protected per 32 bits. It is [[virtually indexed, physically tagged]], which behaves as a [[physically indexed, physically tagged]] 4-way set-associative cache. The L1D cache implements a [[pseudo-LRU]] [[cache replacement]] policy. It features a 4-cycle fastest load-to-use latency with two read ports and one write port, meaning it can do two 16B loads/cycle and one 32B store/cycle. From the L1, the A76 supports up to 20 outstanding non-prefetch misses. The load buffer is 68 entries deep while the store buffer is 72 entries deep. In total, the A76 can have 140 simultaneous memory operations in flight, which is actually 25% more than the A76 instruction window. |
− | The A76 can be configured with either 128, 256 or 512 KiB of [[level 2 cache]] | + | The A76 can be configured with either 128, 256 or 512 KiB of [[level 2 cache]], which is ECC protected per 64 bits. There is a 256-bit write interface to the L2 and a 256-bit read interface from the L2 cache. The fastest load-to-use latency is 9 cycles. The L2 can support up to 46 outstanding misses to the L3, which is located in the {{armh|DSU}} itself. The L3 is shared by all the cores in the {{armh|DynamIQ big.LITTLE}} cluster and is configurable in size from 2 MiB to 4 MiB, with load-to-use latency ranging from 26 to 31 cycles. As with the L2, up to 32 bytes may be transferred between the L2 and the L3 cache each cycle. Up to 94 outstanding misses are supported from the L3 to main memory. |
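The load/store figures above can be turned into peak per-cycle and per-second numbers. The sketch below does that, using 3 GHz purely as an example clock (the article gives it as a frequency target, not a guaranteed speed); all names are illustrative.

```python
# Peak L1D bandwidth and in-flight memory operations from the figures in the text.
# ASSUMPTION: 3.0 GHz is used only as an example clock; real parts vary.

LOAD_PORTS, BYTES_PER_LOAD = 2, 16     # two 16-byte loads per cycle
STORE_PORTS, BYTES_PER_STORE = 1, 32   # one 32-byte store per cycle
LOAD_BUFFER_ENTRIES = 68
STORE_BUFFER_ENTRIES = 72

def peak_bytes_per_cycle():
    """Return (load_bytes, store_bytes) the L1D can move in one cycle."""
    return LOAD_PORTS * BYTES_PER_LOAD, STORE_PORTS * BYTES_PER_STORE

def in_flight_memory_ops():
    """Total loads plus stores that can be tracked at once."""
    return LOAD_BUFFER_ENTRIES + STORE_BUFFER_ENTRIES

def gb_per_second(bytes_per_cycle, clock_ghz):
    """Bytes per cycle times cycles per nanosecond gives GB/s."""
    return bytes_per_cycle * clock_ghz

loads, stores = peak_bytes_per_cycle()
print(f"peak load:  {loads} B/cycle = {gb_per_second(loads, 3.0):.0f} GB/s at 3 GHz")
print(f"peak store: {stores} B/cycle = {gb_per_second(stores, 3.0):.0f} GB/s at 3 GHz")
print(f"in-flight memory operations: {in_flight_memory_ops()}")
```

These are upper bounds for accesses that hit in the L1D; sustained throughput depends on hit rates and on the outstanding-miss limits quoted above.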
In addition to controlling memory accesses, ordering, and [[cache policies]], the MMU is also responsible for the translation of virtual addresses to physical addresses on the system. This is done through a set of virtual-to-physical address mappings and attributes that are held in translation tables. The physical address size here is 40 bits. The Cortex-A76 incorporates a dedicated L1 TLB for the instruction cache and another one for the data cache. Both the ITLB and the DTLB are 48 entries deep and are fully associative. On a memory access operation, the A76 will first perform a lookup in the appropriate L1 TLB. If there is a miss in the L1 TLBs, the MMU will perform a lookup for the requested entry in the second-level TLB. | In addition to controlling memory accesses, ordering, and [[cache policies]], the MMU is also responsible for the translation of virtual addresses to physical addresses on the system. This is done through a set of virtual-to-physical address mappings and attributes that are held in translation tables. The physical address size here is 40 bits. The Cortex-A76 incorporates a dedicated L1 TLB for the instruction cache and another one for the data cache. Both the ITLB and the DTLB are 48 entries deep and are fully associative. On a memory access operation, the A76 will first perform a lookup in the appropriate L1 TLB. If there is a miss in the L1 TLBs, the MMU will perform a lookup for the requested entry in the second-level TLB. | ||
Line 181: | Line 148: | ||
The TLB entries store one or both of the global indicator and an address space identifier (ASID), allowing context switching without TLB invalidation as well as a virtual machine identifier (VMID) which allows for VM switching by the hypervisor without TLB invalidation. | The TLB entries store one or both of the global indicator and an address space identifier (ASID), allowing context switching without TLB invalidation as well as a virtual machine identifier (VMID) which allows for VM switching by the hypervisor without TLB invalidation. | ||
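The effect of tagging entries with an ASID and VMID can be illustrated with a toy model. The sketch below is a software analogy only, not a description of the hardware: the class and field names are hypothetical, 4 KiB pages are assumed, and FIFO eviction is used for simplicity because the article does not state the TLB replacement policy.

```python
# Toy fully associative TLB whose entries are tagged with VMID and ASID.
# ASSUMPTIONS: 4 KiB pages and FIFO eviction (neither is stated in the article).
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional

PAGE_SHIFT = 12  # assumed 4 KiB pages

@dataclass(frozen=True)
class Tag:
    vmid: int
    asid: Optional[int]  # None marks a global entry that matches any ASID
    vpn: int

class ToyTLB:
    def __init__(self, entries=48):
        self.entries = entries
        self.table = OrderedDict()  # Tag -> physical frame number

    def insert(self, vmid, asid, vaddr, paddr, is_global=False):
        tag = Tag(vmid, None if is_global else asid, vaddr >> PAGE_SHIFT)
        if tag not in self.table and len(self.table) >= self.entries:
            self.table.popitem(last=False)  # evict the oldest entry (assumed FIFO)
        self.table[tag] = paddr >> PAGE_SHIFT

    def lookup(self, vmid, asid, vaddr):
        """Return the physical address on a hit, or None on a miss."""
        vpn = vaddr >> PAGE_SHIFT
        for tag in (Tag(vmid, asid, vpn), Tag(vmid, None, vpn)):
            pfn = self.table.get(tag)
            if pfn is not None:
                return (pfn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
        return None  # a real core would now consult the second-level TLB

tlb = ToyTLB()
tlb.insert(vmid=0, asid=1, vaddr=0x4000_1234, paddr=0x8000_0000)
print(hex(tlb.lookup(0, 1, 0x4000_1ABC)))  # same process: hit
print(tlb.lookup(0, 2, 0x4000_1ABC))       # other ASID: miss, entry is not flushed
```

Because the ASID and VMID are part of the tag, switching to another process or virtual machine simply stops matching the old entries; they stay resident and hit again when the original context resumes.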
== Bibliography == | == Bibliography == | ||
* Arm Tech Day, 2018 | * Arm Tech Day, 2018 |
Facts about "Cortex-A76 - Microarchitectures - ARM"
codename | Cortex-A76 + |
core count | 1 +, 2 +, 4 +, 6 + and 8 + |
designer | ARM Holdings + |
first launched | May 31, 2018 + |
full page name | arm holdings/microarchitectures/cortex-a76 + |
instance of | microarchitecture + |
instruction set architecture | ARMv8.2 + |
manufacturer | TSMC + |
microarchitecture type | CPU + |
name | Cortex-A76 + |
pipeline stages | 13 + |
process | 12 nm (0.012 μm, 1.2e-5 mm) +, 7 nm (0.007 μm, 7.0e-6 mm) + and 5 nm (0.005 μm, 5.0e-6 mm) + |