From WikiChip
Difference between revisions of "arm holdings/microarchitectures/cortex-a510"
< arm holdings

m (Reverted edits by 78.87.72.230 (talk) to last revision by 79.69.139.32)
(Undo revision 100381 by 79.69.139.32 (talk))
Line 2: Line 2:
 
{| id="mw-blocked-text" style="border:2px solid #000; width:100%; margin:0 auto 6px auto; padding-top:10px;"
 
{| id="mw-blocked-text" style="border:2px solid #000; width:100%; margin:0 auto 6px auto; padding-top:10px;"
 
|
 
|
<div style="width:100%; margin:auto; text-align:center;"><div style="font-size:140%;">'''You are currently unable to edit pages on Wikichip. The reason provided below:'''</div>
+
<div style="width:100%; margin:auto; text-align:center;"><div style="font-size:140%;">'''You are currently unable to edit pages on Wikibooks.'''</div>
 
'''''You can still read pages''', but you cannot edit or create them.''
 
'''''You can still read pages''', but you cannot edit or create them.''
 
</div>
 
</div>

Revision as of 16:23, 18 September 2022

You are currently unable to edit pages on Wikibooks.

You can still read pages, but you cannot edit or create them.

Editing from $7 has been disabled by $1 for the following reason(s):

$2

This block has been set to expire: $6.

Even if blocked, you will usually still be able to edit your user discussion page and may also be able to contact other editors and administrators by e-mail.

What does this mean?

What is a block?

A block is a measure used to protect Wikibooks from possible improper use or modification in breach of editorial policies. Blocks can apply to a user account, an IP address, or a range of IP addresses that is deemed responsible for or related to problematic activity. You may be an innocent victim of collateral damage, whereby a block of some other activity has accidentally caused your account to be inaccessible. The reason for your block may be found above.

Most common causes

  • Your IP address matches that used by another blocked account. Ask for further information and/or request unblocking.
  • Your account or a connected IP has been used in a problematic way against Wikibooks' policies, or your username was unsuitable and you need to choose another. The reason should be in the box above.

What do I do now?

  • If instructions are given above in the block message, follow them.
  • If you have never edited Wikibooks before and/or do not have an account, consider creating one.
  • If you wish to appeal the block, or you believe you have been blocked by mistake, please see the following section.

Appealing

The current block will automatically expire $6. You have many routes to appeal the block or its duration. Firstly, if you are a registered user and have a valid e-mail address confirmed, you can contact $1 via e-mail. This feature can be removed if abused.

Alternatively, you may appeal the block by requesting that another administrator review the block. To do so, add

{{unblock|your reason here}}

to the bottom of your user discussion page to request unblocking. You must state a reason for this, and the block can then be discussed.

If $7 is not blocked, your IP address ($3) or range may have been blocked. Please check here. If this is the case, please copy and paste the following text to the bottom of your user talk page.

{{unblock-ip|1=$3|2=<nowiki>$2</nowiki>|3=$1}}

Please note: While this block may be upsetting or unexpected, abuse of appeal processes, repeatedly using the unblock template when denied, personal attacks, or impolite conduct may lead to the removal of your ability to edit your discussion page.

Additional information

Online Chat

You may also use the freenode IRC network to discuss your block with administrators available at the time. If you have an IRC client, the channel can be found at #wikibooks, or you may use the web interface (channel: #wikibooks). A list of administrators may be found here.

E-mailing us

Due to the high volume of e-mail we receive, you are more likely to get a quick response if you try requesting an unblock via your discussion page first.

Before you e-mail us, please make sure that your situation has not already been addressed by the above sections. Your discussion page may also contain further details related to your block or IP address, and we strongly advise that you review it before contacting us.

When sending us an e-mail, please copy, paste, and fill out one of these forms.

To request unblocking:

  • IP address: http://en.wikibooks.org/wiki/User_talk:$3
  • Blocking admin: $4
  • Block reason: $2
  • Block originally applied to: $7
  • Block ID number: $5
  • Your account name (if you have one):
  • An explanation of why your block is unfair:

To request an account:

Note: your preferred username must not be listed as already taken here.

Even though e-mails sent to the address are viewable only by administrators and trusted volunteers, under no circumstances should you e-mail us your account's password or reveal it to anyone else even if asked to. When requesting an account, a temporary password will be generated and e-mailed to you. Please change the password on your new account as soon as possible. You may contact us at wikibooksTemplate:@wikimedia.org.

IMPORTANT: Please do not e-mail us without providing the information requested above. We will always require that information to help you. Any request to make edits to books on your behalf will be disregarded. Correspondence containing legal threats, personal attacks, or rudeness is likely to be ignored and may result in the extension of pre-existing blocks.



ka

Header text Header text Header text Header text Header text
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Example Example Example Example Example
Edit Values
Cortex-A510 µarch
General Info
Arch TypeCPU
DesignerARM Holdings
ManufacturerTSMC, Samsung, GlobalFoundries, SMIC
IntroductionMay 25, 2021
Process7 nm, 6 nm, 5 nm
Core Configs1, 2
Pipeline
TypeIn-order
OoOENo
SpeculativeYes
Reg RenamingNo
Decode3-way
Instructions
ISAARMv9.0
ExtensionsFPU, NEON, SVE, SVE2, TrustZone
Cache
L1I Cache32-64 KiB/core
4-way set associative
L1D Cache32-64 KiB/core
4-way set associative
L2 Cache0-512 KiB/cluster
4-way set associative
Succession

Cortex-A510 (codename Klein) is an ultra-high efficiency microarchitecture designed by ARM Holdings as a successor to the Cortex-A55. The Cortex-A510, which implements the ARMv9.0 ISA, is typically found in smartphone and other embedded devices. Often A510 cores are combined with higher performance processors (e.g. based on Cortex-A710) in DynamIQ big.LITTLE configuration to achieve better energy/performance.

Note that this microarchitecture is designed as a synthesizable IP core and is sold to other semiconductor companies to be implemented in their own chips.

Process Technology

The Cortex-A510 was primarily designed to take advantage of TSMC's 7 nm, 6 nm, 5 nm as well as Samsung's 7 nm and 5 nm.

Architecture

The Cortex-A510 is a brand new ground-up CPU design. It borrows advanced processor components from Arm's high-performance cores - such as the branch prediction and prefetchers - to extract high performance from a traditional in-order core design. The Cortex-A512 is also designed to seamlessly integrate along with higher-performance cores through Arm's DynamIQ big.LITTLE technology.

Key changes from Cortex-A55

  • Brand new ground-up design
    • Higher performance (Arm claims: +35% IPC (SPECint 2006) / +50% IPC (SPECfp 2006)
    • Lower power (Arm claims: -20% energy @ iso-performance / +10% performance @ iso-power)
    • Core Complex with a merged core architecture
      • Two independent cores
      • One shared vector unit
        • Configurable 64b or 128b pipe
    • Front-End
      • Wider fetch (128b/cycle, up from 64b)
      • Wider decoder (3-way, up from 2-way)
      • New branch predictors
      • New prefetchers
    • Back-End
      • In-order
    • Memory Subsystem
      • Larger L1 (32-64 KiB)
      • Larger L2 (0-512 KiB, up from 0-256 KiB)
        • 2x bandwidth L2->L3
      • Wider loads (128b/cycle, up from 64b/cycle)
      • 2x loads (2 lds/cycle, up from 1/cycle)
    • New ISA Support
      • ARMv9.0 ISA
      • SVE, SVE2 support

Block Diagram

Core Complex

cortex-a510 block diagram.svg

Memory Hierarchy

The Cortex-A510 has a private L1I, L1D, and cluster-wide L2 cache.

  • Cache
    • L1I Cache
      • Private to core
      • 32 KiB OR 64 KiB, 4-way set associative
      • 64-byte cache lines
      • Virtually-indexed, physically-tagged (VIPT) behaving as physically-indexed, physically-tagged (PIPT)
      • Single Error Detect (SED) parity cache protection
      • Pseudo-random cache replacement policy
    • L1D Cache
      • Private to core
      • 32KB or 64KB, 4-way set associative
      • Virtually-Indexed, Physically-Tagged (VIPT) behaving as Physically-Indexed, Physically-Tagged (PIPT)
      • Error Correcting Code (ECC) cache protection
      • 64-byte cache lines
      • Pseudo-random cache replacement policy
    • L2 Cache
      • Private to complex
      • 128 KiB OR 192 KiB OR 256 KiB OR 384 KiB OR 512 KiB, 8-way set associative
      • 64-byte cache lines
      • Can be configured as 1-2 slices
        • Slice includes: data RAMs, L2 tags, L2 replacement RAM, and L1 duplicate tag RAMs
        • Slice can be configured as single/dual partitions for up to two concurrent accesses to different L2 ways


The Cortex-A510 features an instruction TLB (ITLB) and data TLB (DTLB) which are private to each core and an L2 TLB that is private to the core complex.

  • TLBs
    • ITLB
      • 16-entries
      • fully associative
      • TLB hits return the PA to the instruction cache
    • DTLB
      • 16-entries
      • fully associative
      • TLB hits return the PA to the data cache
    • L2 TLB
      • 8-way set associative
      • Shared by both cores in the complex

Overview

cortex-a510-complex-overview.svg

The Cortex-A510 is Arm's successor to the Cortex-A55 which was introduced four years earlier. Designed to be ultra-low-power and versatile, this core can be used as a standalone CPU in low-power SoCs or serve the efficient core as part of a DynamIQ big.LITTLE architecture using the DSU-110. To maintain high efficiency, the Cortex-A510 remains an in-order architecture. However, by borrowing high-performance components such as state-of-the-art branch predictors and prefetchers, the Cortex-A510 enjoys significantly higher performance over its predecessor through higher effective instruction stream throughput. The Cortex-A55 is the first small core from Arm to feature the Armv9 ISA along with the Scalable Vector Extension (SVE) and SVE2 extensions.

The Cortex-A510 introduces the concept of a Core Complex along with a merged core architecture. A core complex tightly integrates two Cortex-A510 cores, sharing a single common level 2 cache and vector processing unit (VPU). Like any other Arm IP, the Cortex-A510 complex can be instantiated within a standard DSU cluster as any other core would. The only difference is that you are dealing with two cores at once in a single instance. Because the effective utilization of the vector unit on the small cores is quite low, by implementing a single vector unit for two A510 cores, the silicon area is maintained relatively low while still offering good peak performance when needed.

Core

a510-fetch.svg

The Cortex-A510 is a new ground-up design. Compared to its predecessor, the Cortex-A55, the A510 is wider, but the underlying architecture remains an in-order pipeline. The choice to remain with an in-order pipeline design was driven by the overall simplicity of the design which maximizes power efficiency and area usage. Each Cortex-A510 core integrates its own private level 1 instruction cache and private level 1 data cache.


The Cortex-A510 features a 10-stage pipeline.


a510-pipeline.svg

Front-end

The front-end of the Cortex-A510 fetches instructions from the private level 1 instruction cache and streams them to decode. The L1I$ on the A510 can be configured as 32 KiB or 64 KiB, both are 4-way set associative.

Fetch & Decode

a510-decode.svg

The A510 speculatively fetches instructions streams. The A510 can predict conditional, unconditional, A32/T32 branch switches, and indirect procedure call/return branches. Each cycle, the A510 can fetch up to 128b (16B) from the instruction cache; twice the fetch bandwidth of the Cortex-A55. This allows it to deliver up to four instructions to decode.

Because the A510 is an in-order design, the performance of the core largely depends on the ability of the hardware to correctly predict future instruction stream patterns and paths. Advanced branch predictors along with aggressive prefetchers that were taken from Arm's big cores (i.e., Cortex-A710) were integrated into the A510 in order to allow for higher effective instruction stream bandwidth.

From the fetch stream, up to three instructions are sent to decode. The Cortex-A510 features a 3-wide decode, 50% wider than the Cortex-A55.

Back-end

a510-eus.svg

Each cycle, up to three instructions are decoded and sent to the back-end for execution. To match fetch and decode width, the A510 features a 3-wide issue and execute. The A510 is 50% wider than its predecessor, the Cortex-A55 which was 2-wide.

Each cycle, up to three μOPs may be sent to execution. The A510 features a 10-wide pipe. Compared to the Cortex-A55, Arm added a third ALU pipeline to increase integer workload bandwidth. With a 3-wide issue, the A510 can issue up to three ALU instructions to the three pipelines each cycle. There is a separate pipe for complex ALU operations as well as a separate ALU for performing branch operations.

Due to the nature of an in-order pipeline, Arm says a lot of work went into the forwarding capabilities of the design and the multi-issue capabilities of the execution engines. Special care was taken to maintain the execution bandwidth up in more cases while being less sensitive to specific instruction scheduling. Due to the inter-instruction dependencies nature of an in-order pipeline, poorly scheduled code can create backups and significantly reduce the effective execution throughput. Arm stated that in the Cortex-A510, a lot of attention has gone into maximizing this concern while keeping everything in order for maximum efficiency.

Memory subsystem

a510-mem.svg

The Cortex-A510 has been significantly improved over the A55. Like the Cortex-A55, the Cortex-A510 maintains two memory pipelines. Whereas the A55 could perform one load on one pipe and one store on the other, the Cortex-A510 is capable of executing loads on both pipes. In other words, the A510 supports up to two loads per cycle or - alternatively - one load and one store per cycle. Arm also doubled the width of the data loads on the A510. Data loads are 128b-wide now, double the 64b-wide loads of the Cortex-A55. Because the A510 supports up to two loads per cycle or 2x128b loads, the peak L1 cache bandwidth is now 4x compared to the A55.

The Cortex-A510 features a private L1 data cache which can be configured with a capacity of 32 KiB or 64 KiB. Both configurations are organized as a 4-way set associative. The bandwidth from the other caches was also doubled – the L2 and L3 bandwidths are doubled that of the Cortex-A55.

Data prefetching on the A510 also received special attention. Arm stated that technologies that were originally developed for the Cortex-X2 were carried over to the Cortex-A510 design. There are multiple complementary data prefetching algorithms implemented on the Cortex-A510.

Core Complex

a510-vpu.svg

That Cortex-A510 is integrated as part of a Complex. Normally, each complex would incorporate two Cortex-A510 cores. Both cores share a complex-private L2 cache and a complex-private vector processing unit. The rationale for a complex is the ability for the A510 to fully support the Armv9.0 ISA including the Scalable Vector Extension (SVE) and SVE2. With a single shared vector processing unit, the A510 can support peak vector performance when needed while conserving silicon area due to lower utilization when compared to the big cores. While the two-core complex is the default and encouraged configuration, customers can opt to use it in a single-core complex, allowing for a dedicated VPU and L2 cache but at higher silicon area.

Vector Processing Unit (VPU)

A major component of the core complex is the vector processing unit (VPU) which is private to the complex and is shared by both Cortex-A510's. Because typical vector workloads will usually be assigned to the big cores, the VPU utilization on the small cores is expected to be relatively low. To maximize silicon area usage while providing sufficient vector performance, a single VPU instance is shared by both cores.

Once a μOP is issued to the VPU from one of the cores it is sent to the arbitration unit. Arbitration is completely hardware-based with execution sharing and utilization being entirely transparent to software. There is no user or OS-level software intervention for the arbitration of vector unit accesses between the cores. Furthermore, from a software perspective, each core appears to have dedicated access to the vector unit. Scheduling is very fine-grained between the cores to make sure there is minimal overhead caused by the sharing. There is no penalty for sharing or using the VPU. Arm says that they see very minimal performance overhead due to the sharing of the VPU. In their own internal testing, on the SPECfp 2006 benchmark, they see around a 1% performance degradation if both cores are executing FP/vector instructions at the same time as they compete to use the vector datapath.

All scalar and floating-point NEON and SVE2 instructions are executed through the vector datapath. There is support for optional crypto as well. This datapath itself is configurable. Possible configurations include having two 64b or two 128b data pipes - twice the peak bandwidth delta between the two configurations. A configuration of 2x128b data pipes will offer twice the peak compute of the Cortex-55. Depending on the configuration, designers have the flexibility in deciding between area and performance depending on their application and workload characterization. In the 2x128b configuration, the Cortex-A510 complex can achieve 16 FP16 FLOPS/clk. Note that regardless of the vector unit size or configuration, for SVE, there is a 128b vector register file.

Performance claims

At the peak operating point of the Cortex-A55, the Cortex-A510 is capable of reaching the same performance at 20% less energy. Alternatively, within the same power budget, the A510 achieves 10% higher performance. Compared to the Cortex-A55, Arm says that the Cortex-A510 has around 35% higher IPC on SPECint 2006. On the SPECfp 2006 benchmark, Arm claims to see an even higher performance improvement at around 50%.

cortex-a510 perf claims.png DVFS of power vs performance at ISO-process on SPECint 2006 for both the Cortex-A55 and the new Cortex-A510.

Die

Advanced implementation die floorplan.

cortex-a510 die plot.jpg

All Cortex-A510 Processors

 List of Cortex-A510-based Processors
 Main processorIntegrated Graphics
ModelFamilyLaunchedProcessArchCoresFrequencyGPUFrequency
Count: 0

Bibliography

  • Arm Tech Day, 2021
  • Arm. personal communication. 2021.

Documents

codenameCortex-A510 +
core count1 + and 2 +
designerARM Holdings +
first launchedMay 25, 2021 +
full page namearm holdings/microarchitectures/cortex-a510 +
instance ofmicroarchitecture +
instruction set architectureARMv9.0 +
manufacturerTSMC +, Samsung +, GlobalFoundries + and SMIC +
microarchitecture typeCPU +
nameCortex-A510 +
process7 nm (0.007 μm, 7.0e-6 mm) +, 6 nm (0.006 μm, 6.0e-6 mm) + and 5 nm (0.005 μm, 5.0e-6 mm) +