
{{armh title|Cortex-M55|arch}}
{{microarchitecture
|atype=CPU
|name=Cortex-M55
|designer=ARM Holdings
|manufacturer=TSMC
|introduction=February 10, 2020
|process=55 nm
|process 2=45 nm
|process 3=32 nm
|process 4=28 nm
|process 5=22 nm
|process 6=16 nm
|process 7=10 nm
|cores=1
|cores 2=2
|cores 3=4
|type=Scalar
|type 2=Pipelined
|oooe=No
|speculative=No
|renaming=No
|stages min=4
|stages max=5
|decode=1-2-way
|isa=ARMv8.1-M
|feature=Hardware virtualization
|extension=FPU
|extension 2=Helium
|l1i=0-64 KiB
|l1i per=core
|l1i desc=2-way set associative
|l1d=0-64 KiB
|l1d per=core
|l1d desc=4-way set associative
|predecessor=Cortex-M4
|predecessor link=microarchitecture/arm_holdings/microarchitectures/cortex-m4
|predecessor 2=Cortex-M7
|predecessor 2 link=microarchitecture/arm_holdings/microarchitectures/cortex-m7
|process 8=7 nm
|process 9=5 nm
}}
'''Cortex-M55''' is an ultra-low-power [[ARM]] [[microarchitecture]] designed by [[ARM Holdings]] for microcontrollers and embedded subsystems. This microarchitecture is designed as a synthesizable [[IP core]] and is sold to other semiconductor companies to be implemented in their own chips. The Cortex-M55, which implements the {{arm|ARMv8.1-M}} ISA, is often found in microcontrollers, low-power chips, and in the embedded subsystems of more powerful chips.

== History ==
The Cortex-M55 was officially launched on February 10, 2020. Support for {{arm|custom instructions}} will be added in 2021.

== Process Technology ==
The Cortex-M55 is designed to be fabricated on various [[process nodes]], ranging from very mature nodes such as [[130 nm]] to leading-edge [[7 nm]] and [[5 nm]] nodes.

== Compiler support ==
{| class="wikitable"
|-
! Compiler !! Arch-Specific !! Arch-Favorable
|-
| [[Arm Compiler]] || <code>-mcpu=cortex-m55</code> || <code>-mtune=cortex-m55</code>
|-
| [[GCC]] || <code>-mcpu=cortex-m55</code> || <code>-mtune=cortex-m55</code>
|-
| [[LLVM]] || <code>-mcpu=cortex-m55</code> || <code>-mtune=cortex-m55</code>
|}

== Architecture ==
=== Block Diagram ===
:[[File:cortex-m55 block diagram.svg|850px]]

=== Memory Hierarchy ===
The Cortex-M55 has a private L1I, L1D, I-TCM, and D-TCM. All four are configurable in size.

* Cache
** L1I Cache
*** 0 - 64 KiB
*** 2-way set associative
*** Optional ECC support
** L1D Cache
*** 0 - 64 KiB
*** 4-way set associative
*** Supports both [[write-back]] (WB) and [[write-through]] (WT)
*** Optional ECC support
* [[tightly-coupled memory|TCM]]
** I-TCM
*** 0 - 16 MiB
*** Supports wait-states
*** Optional ECC support
** D-TCM
*** 0 - 16 MiB
*** Supports wait-states
*** Optional ECC support
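
As a rough illustration of how the configurable cache sizes above map onto sets, the sketch below computes the set count of a set-associative cache. The 32-byte line size and the helper name <code>cache_sets</code> are assumptions made for this example and are not figures taken from this article.

```python
def cache_sets(size_kib: int, ways: int, line_bytes: int = 32) -> int:
    """Number of sets in a set-associative cache.

    line_bytes=32 is an ASSUMED line size for illustration only.
    A size of 0 KiB models a core configured without that cache.
    """
    size_bytes = size_kib * 1024
    if size_bytes % (ways * line_bytes) != 0:
        raise ValueError("size must be a multiple of ways * line_bytes")
    return size_bytes // (ways * line_bytes)


# Largest configurations listed above: 64 KiB 2-way L1I, 64 KiB 4-way L1D.
l1i_sets = cache_sets(64, ways=2)
l1d_sets = cache_sets(64, ways=4)
```

Under the assumed line size, the same capacity yields half as many sets in the 4-way data cache as in the 2-way instruction cache.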

== Overview ==
[[File:cortex-m55 general block.png|thumb|right|Cortex-M55]]
The Cortex-M55 is a synthesizable ultra-low-power core designed by [[Arm]] for an array of applications such as microcontrollers and embedded subsystems doing background work on more performant SoCs. Architecturally, the Cortex-M55 is the successor to the {{\\|Cortex-M7}} and the {{\\|Cortex-M4}}. In raw performance it is slightly behind the M7, but it makes up for this with new technologies such as its {{arm|Helium|new vector extension}}. The Cortex-M55 is said to deliver 1.6 [[Dhrystone DMIPS/MHz]] and 4.2 [[CoreMark/MHz]], which is about 25% higher than the {{\\|Cortex-M4|M4}} but about 20% lower than the {{\\|Cortex-M7|M7}}. In terms of frequency, the M55 is said to reach clock speeds up to 15% higher than the {{\\|Cortex-M4|M4}}.

In addition to supporting the {{arm|ARMv8.1-M}} [[ISA]], the M55 introduces a number of upgrades and features, most of which are optional and configurable, including support for the {{armh|coprocessor interface}}, {{arm|Helium}} vector extension, and {{arm|custom instructions}}. The architecture has additional optional support for MPUs, {{armh|TrustZone}}, and tightly coupled memory (TCM).

=== Configuration ===
From a programming model (ISA) point of view, the Cortex-M55 supports five major configurations. The FPU can be included without {{arm|Helium}}. {{arm|Helium}} support for fixed-point (integer) vector data types can be implemented without the FPU, while support for floating-point vector data types requires the FPU.

{| class="wikitable"
! Configuration !! Base (Integer) !! FPU (FP16, FP32, FP64) !! Helium (Int8, Int16, Int32) !! Helium (FP16, FP32)
|-
| 1 || {{tchk|yes|Included}} || {{tchk|no|-}} || {{tchk|no|-}} || {{tchk|no|-}}
|-
| 2 || {{tchk|yes|Included}} || {{tchk|yes|Included}} || {{tchk|no|-}} || {{tchk|no|-}}
|-
| 3 || {{tchk|yes|Included}} || {{tchk|no|-}} || {{tchk|yes|Included}} || {{tchk|no|-}}
|-
| 4 || {{tchk|yes|Included}} || {{tchk|yes|Included}} || {{tchk|yes|Included}} || {{tchk|no|-}}
|-
| 5 || {{tchk|yes|Included}} || {{tchk|yes|Included}} || {{tchk|yes|Included}} || {{tchk|yes|Included}}
|}
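
The five rows of the table can be reproduced from a single dependency rule. The sketch below enumerates every combination of the three optional features and keeps those that satisfy it. The function and variable names are hypothetical, and the requirement that floating-point Helium also implies integer Helium is read off the table rather than stated in the text.

```python
from itertools import product


def is_valid(fpu: bool, helium_int: bool, helium_fp: bool) -> bool:
    """Return True if the feature combination matches a row of the table.

    Rule (assumed from the table): floating-point Helium requires both
    the scalar FPU and integer Helium. Everything else is independent.
    """
    if helium_fp and not (fpu and helium_int):
        return False
    return True


# Each tuple is (fpu, helium_int, helium_fp); the integer base is always present.
valid_configs = [c for c in product([False, True], repeat=3) if is_valid(*c)]

for fpu, h_int, h_fp in valid_configs:
    print(f"FPU={fpu!s:5}  Helium-int={h_int!s:5}  Helium-FP={h_fp!s:5}")
```

Of the eight possible combinations, the three that enable floating-point Helium without its prerequisites are rejected, which leaves the five configurations listed.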

== Pipeline ==
The Cortex-M55 is a 4-stage [[in-order]] [[scalar pipeline]] design. The design consists of a main pipeline, which is always present, and an extended processing unit. The main pipeline is the typical integer pipeline designed to support the full {{arm|ARMv8.1-M}} ISA. The extended processing unit is optional and is only present when the core implements the FPU or the {{arm|Helium}} extensions. When the extended processing unit is present, that part of the pipeline is extended by an additional stage (for a total of 5 stages). Keeping it as a separate pipeline allows it to go into a retention state or be entirely powered down when not in use.


:[[File:cortex-m55 pipeline.svg|700px]]

=== Fetch & Decode ===
The M55 features a configurable private [[instruction cache]]. It is optional, but when present, it can be configured up to 64 KiB, organized as 2-way set associative. ECC support is also optional. Each cycle, four bytes are fetched from the instruction cache. The instructions are then pre-parsed and sent to decode. Since {{arm|ARMv8}} supports a limited subset of {{arm|T16}}, when two adjacent instructions are both 16 bits wide, the two instructions may be sent to decode together and decoded simultaneously. However, since the dual-issue capabilities are very limited, Arm does not classify the design as superscalar (unlike the {{\\|Cortex-M7}}).
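
The pairing condition can be illustrated with the standard Thumb encoding rule: a halfword whose top five bits are <code>0b11101</code>, <code>0b11110</code>, or <code>0b11111</code> is the first half of a 32-bit instruction, and any other halfword is a complete 16-bit instruction. The sketch below classifies one 4-byte fetch on that basis. It is a simplified model that assumes the fetch starts on an instruction boundary; the function names are hypothetical, and the core's actual dual-issue restrictions are not modeled.

```python
def is_32bit_first_halfword(hw: int) -> bool:
    """Thumb rule: top five bits 0b11101/0b11110/0b11111 start a 32-bit instruction."""
    return (hw >> 11) in (0b11101, 0b11110, 0b11111)


def fetch_word_kind(word: int) -> str:
    """Classify one 4-byte fetch (little-endian, assumed to start on an
    instruction boundary)."""
    low = word & 0xFFFF            # halfword at the lower address
    high = (word >> 16) & 0xFFFF   # halfword at the higher address
    if is_32bit_first_halfword(low):
        return "one 32-bit instruction"
    if is_32bit_first_halfword(high):
        return "16-bit instruction + first half of a 32-bit instruction"
    return "two 16-bit instructions (pairing candidate)"


# 0xBF00 is the 16-bit NOP, 0x4770 is BX LR, 0xF3AF 0x8000 is the 32-bit NOP.W.
print(fetch_word_kind(0x4770BF00))
print(fetch_word_kind(0x8000F3AF))
print(fetch_word_kind(0xF3AFBF00))
```

Only the last category, two complete 16-bit instructions in a single fetch, is a candidate for being decoded together.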

=== Extended processing pipeline ===
From decode, the FPU and {{arm|Helium}} instructions are routed to a separate pipeline. To save power, that pipeline may go into a low-power retention state or be powered down when not in use. The extended processing pipeline is present if either the FPU or the {{arm|Helium}} extension is implemented. The FPU is based on the Arm {{arm|FPv5|FPv5 architecture}}. It is a fully IEEE-754 compliant FPU with support for [[half-precision]], [[single-precision]], and [[double-precision]] scalar [[floating-point]] data formats. Half-precision floating-point operations can be processed at twice the per-cycle throughput of single-precision operations.

When the {{arm|Helium}} extension is present, it reuses the FPU registers as vector registers, each 128 bits wide. Internally, the vector unit is implemented with a 64-bit data path. This is twice as wide as prior Cortex-M designs but half the width of the ISA operations, so each vector operation takes two clock cycles to complete. The architecture permits the execution cycles of consecutive instructions to overlap, and the Cortex-M55 takes advantage of this: when a memory access is followed by a data processing operation (or vice versa), the two can be carried out in parallel.
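
The effect of this overlap on throughput can be estimated with some simple arithmetic. The sketch below is an idealized model and not a cycle-accurate description of the core: it assumes vector instructions strictly alternate between memory access and data processing, with no stalls, so that the second cycle of each instruction overlaps the first cycle of the next. The function names are hypothetical.

```python
VECTOR_BITS = 128    # Helium vector register width (from the text)
DATAPATH_BITS = 64   # internal vector data path width (from the text)

# Each 128-bit operation needs two passes through the 64-bit data path.
CYCLES_PER_OP = VECTOR_BITS // DATAPATH_BITS


def cycles_without_overlap(n_ops: int) -> int:
    """Every vector operation runs to completion before the next starts."""
    return n_ops * CYCLES_PER_OP


def cycles_with_overlap(n_ops: int) -> int:
    """Idealized overlap: operations alternate between memory access and
    data processing, so a new one can start every cycle."""
    if n_ops == 0:
        return 0
    return n_ops + (CYCLES_PER_OP - 1)


# Example: 4 vector loads interleaved with 4 vector multiply-accumulates.
n = 8
print(cycles_without_overlap(n), cycles_with_overlap(n))
```

Under these assumptions, a well-interleaved loop approaches one vector instruction per cycle even though each instruction individually occupies the data path for two cycles.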

=== Memory subsystem ===
The Cortex-M55 has a private [[data cache]]. It is optional and configurable up to 64 KiB in capacity, organized as [[4-way set associative]]. The [[L1D$]] supports both [[write-back]] and [[write-through]] policies, with optional ECC support.

The Cortex-M55 features a fairly complex memory subsystem. The two main parts are the [[tightly-coupled memory]] (TCM) and the cache subsystem. Both are optional and both are configurable in size. The TCM is optimized for real-time applications with highly deterministic behavior, while the cache subsystem is designed for complex memory hierarchies, hiding higher latencies. The main bus interface to the Cortex-M55 from the rest of the system is the 64-bit [[AMBA 5 AXI]]. The interface supports multiple outstanding memory transfers as well as burst transfers and can operate at the core frequency or at a divided clock frequency.

The TCM subsystem is somewhat similar to the one found in the {{\\|Cortex-M7}} but comes with a few noticeable differences. The TCM consists of an instruction TCM (I-TCM) and a data TCM (D-TCM). The instruction TCM is optional and configurable from 0 to 16 MiB with optional ECC support. It is 32 bits wide, allowing up to 4 bytes per cycle to be transferred from the I-TCM to either the [[instruction fetch]] or the [[instruction memory]]. The D-TCM is also optional and is configurable from 0 to 16 MiB in capacity with optional ECC support. Whereas the {{\\|Cortex-M7}} featured two 32-bit data TCM interfaces, the M55 features four 32-bit data TCM interfaces, which are interleaved using address bits [3:2]. The data TCM interfaces collectively have an aggregate bandwidth of 128 bits per cycle. However, since the Helium vector extension has a 64-bit data path, software execution can only generate 64 bits of data transfers per cycle. The rest of the bandwidth can be used for other purposes such as direct memory access (DMA) operations, transferring data to and from the TCM while the core is reading data. Accesses to TCM memory banks are prioritized for software execution, so an access by the DMA controller is stalled while the software is reading from the same TCM bank.
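
The bank interleaving described above can be sketched in a few lines. The example maps a byte address to one of the four D-TCM interfaces using address bits [3:2] and shows which banks a given access touches. The function names and the example addresses are hypothetical, and arbitration is reduced to a simple check for whether two accesses share a bank.

```python
def dtcm_bank(addr: int) -> int:
    """Select one of the four 32-bit D-TCM interfaces from address bits [3:2]."""
    return (addr >> 2) & 0b11


def banks_touched(addr: int, size_bytes: int) -> set:
    """Set of banks covered by an access of size_bytes starting at addr."""
    return {dtcm_bank(a) for a in range(addr, addr + size_bytes)}


# A 64-bit core access and a 64-bit DMA access to the adjacent doubleword.
core_banks = banks_touched(0x2000_0000, 8)
dma_banks = banks_touched(0x2000_0008, 8)

# In this simplified model the DMA access stalls only if it shares a bank
# with the concurrent core access.
dma_stalls = bool(core_banks & dma_banks)
print(core_banks, dma_banks, dma_stalls)
```

Because consecutive words rotate through the four banks, a 64-bit core access occupies only two of them in any cycle, leaving the other two available to a DMA transfer that targets different banks.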

There are a number of additional interfaces, including a 32-bit AHB peripheral interface for legacy AHB peripherals. There is also a 32-bit debug AHB5 slave interface, which gives the Debug Access Port (DAP) access to the memory system.

== All Cortex-M55 chips ==
<!-- NOTE: 
           This table is generated automatically from the data in the actual articles.
           If a microprocessor is missing from the list, an appropriate article for it needs to be
           created and tagged accordingly.

           Missing a chip? please dump its name here: https://en.wikichip.org/wiki/WikiChip:wanted_chips
-->
{{comp table start}}
<table class="comptable sortable">
{{comp table header|main|4:List of Cortex-M55-based Processors}}
{{comp table header|cols|Launched|Cores|%Frequency}}
{{#ask: [[Category:microprocessor models by arm holdings]] [[microarchitecture::Cortex-M55]]
 |?full page name
 |?model number
 |?first launched
 |?core count
 |?base frequency#GHz
 |format=template
 |template=proc table 3
 |userparam=5
 |mainlabel=-
}}
{{comp table count|ask=[[Category:microprocessor models by arm holdings]] [[microarchitecture::Cortex-M55]]}}
</table>
{{comp table end}}

== Bibliography ==
* {{bib|personal|February 2020|Arm}}