From WikiChip
Difference between revisions of "samsung/microarchitectures/m4"

(All M4 Processors: Show cTDP with frequency.)
 
(29 intermediate revisions by 7 users not shown)
Line 1: Line 1:
{{samsung title|Mongoose 4 (M4)|arch}}
+
{{samsung title|Exynos M4|arch}}
 
{{microarchitecture
 
{{microarchitecture
 
|atype=CPU
 
|atype=CPU
|name=Mongoose 4
+
|name=Cheetah
 
|designer=Samsung
 
|designer=Samsung
 
|manufacturer=Samsung
 
|manufacturer=Samsung
 
|introduction=2019
 
|introduction=2019
|process=7 nm
+
|process=8 nm
|isa=ARMv8
+
|cores=4
 +
|type=Superscalar
 +
|type 2=Superpipeline
 +
|oooe=Yes
 +
|speculative=Yes
 +
|renaming=Yes
 +
|stages=16
 +
|decode=6-way
 +
|isa=ARMv8.2
 +
|l1i=64 KiB
 +
|l1i per=core
 +
|l1i desc=4-way set associative
 +
|l1d=64 KiB
 +
|l1d per=core
 +
|l1d desc=8-way set associative
 +
|l2=512 KiB
 +
|l2 per=core
 +
|l2 desc=8-way set associative
 +
|l3=2 MiB
 +
|l3 per=cluster
 +
|l3 desc=16-way set associative
 
|predecessor=M3
 
|predecessor=M3
 
|predecessor link=samsung/microarchitectures/m3
 
|predecessor link=samsung/microarchitectures/m3
Line 13: Line 33:
 
|successor link=samsung/microarchitectures/m5
 
|successor link=samsung/microarchitectures/m5
 
}}
 
}}
'''Mongoose 4''' ('''M4''') is the successor to the {{\\|Mongoose 3}}, a planned [[10 nm]]/[[8 nm]](?) [[ARM]] microarchitecture designed by [[Samsung]] for their consumer electronics.
+
'''Exynos M4''' ('''Cheetah''') is an [[8 nm]] [[ARM]] microarchitecture designed by [[Samsung]] for its consumer electronics. It is the successor to the {{\\|M3}}.
  
 +
== Process Technology ==
 +
The M4 is fabricated on Samsung's [[8 nm process]] (8LPP).
 +
 +
== Compiler support ==
 +
{| class="wikitable"
 +
|-
 +
! Compiler !! Arch-Specific || Arch-Favorable
 +
|-
 +
| [[GCC]] || <code>-mcpu=exynos-m4</code> || <code>-mtune=exynos-m4</code>
 +
|-
 +
| [[LLVM]] || <code>-mcpu=exynos-m4</code> || <code>-mtune=exynos-m4</code>
 +
|}
  
{{future information}}
 
  
 
== Architecture ==
 
== Architecture ==
{{empty section}}
+
The M4 is an incremental microarchitecture that brought a die shrink and minor enhancements.
 +
 
 
=== Key changes from {{\\|Mongoose 3|M3}} ===
 
=== Key changes from {{\\|Mongoose 3|M3}} ===
{{empty section}}
+
* [[8 nm process]] (from [[10 nm]])
 +
* [[ARMv8.2]] (from [[ARMv8]])
 +
** Support for full FP16 scalar extension
 +
** Support for integer dot product extension
 +
* Front end
 +
** Larger [[instruction queue]] (48 entries, up from 40)
 +
* Back end
 +
** LSU execution units reorganized
 +
** Floating-point execution units reorganized
 +
{{expand list}}
 +
 
 +
=== Block Diagram ===
 +
==== Individual Core ====
 +
 
 +
[[File:mongoose 4 block diagram.svg|900px]]
 +
 
 +
=== Memory Hierarchy ===
 +
* Cache
 +
** L1I Cache
 +
*** 64 KiB, 4-way set associative
 +
**** 128 B line size
 +
**** per core
 +
*** Parity-protected
 +
** L1D Cache
 +
*** 64 KiB, 8-way set associative
 +
**** 64 B line size
 +
**** per core
 +
*** 4 cycles for fastest load-to-use
 +
*** 32 B/cycle load bandwidth
 +
*** 16 B/cycle store bandwidth
 +
** L2 Cache
 +
*** 512 KiB, 8-way set associative
 +
*** Inclusive of L1
 +
*** 12 cycles latency
 +
*** 32 B/cycle bandwidth
 +
** L3 Cache
 +
*** 2 MiB, 16-way set associative
 +
**** 1 MiB slice/core
 +
*** Exclusive of L2
 +
*** ~37-cycle typical (NUCA)
 +
** BIU
 +
*** 80 outstanding transactions
 +
 
 +
The M4 TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).
 +
 
 +
* TLBs
 +
** ITLB
 +
*** 512-entry
 +
** DTLB
 +
*** 32-entry
 +
*** 512-entry Mid-level DTLB
 +
** STLB
 +
*** 4,096-entry
 +
*** Per core
 +
 
 +
* BPU
 +
** 4K-entry main BTB
 +
** 128-entry µBTB
 +
** 64-entry return stack
 +
** 16K-entry L2 BTB
 +
 
 +
== Core ==
 +
The core of the M4 is largely the same as {{\\|M3}}. A number of buffers have been enlarged and some of the execution units have been reorganized.
 +
 
 +
=== Execution engine ===
 +
==== Floating-point cluster ====
 +
The floating-point execution units on the M4 have been reorganized, and three new units were added: a second FP square root unit, a second vector multiplication unit, and a new horizontal vector arithmetic unit.
 +
 
 +
:[[File:m4 fp eu pipes changes.svg|thumb|left|600px|Floating-point pipe changes.]]
 +
 
 +
{{clear}}
 +
 
 +
==== Memory subsystem ====
 +
[[File:m4 data cache.svg|thumb|left]]
 +
Samsung also enhanced the M4 memory subsystem. The M3 had three AGUs: two dedicated load [[AGUs]] and a single dedicated store [[AGU]]. In the M4, Samsung changed one of the dedicated load [[AGU]]s into a generic AGU capable of handling both loads and stores. As a result, loads and stores can each be scheduled on two ports: loads on the load port or the generic port, and stores on the store port or the generic port.
 +
 
 +
{{clear}}
 +
 
 +
== All M4 Processors ==
 +
<!-- NOTE:
 +
          This table is generated automatically from the data in the actual articles.
 +
          If a microprocessor is missing from the list, an appropriate article for it needs to be
 +
          created and tagged accordingly.
 +
 
 +
          Missing a chip? please dump its name here: https://en.wikichip.org/wiki/WikiChip:wanted_chips
 +
-->
 +
{{comp table start}}
 +
<table class="comptable sortable tc5 tc6 tc7">
 +
{{comp table header|main|12:List of M4-based Processors}}
 +
{{comp table header|main|5:Main processor|2:Integrated Graphics|{{abbr|TDP}}|2:TDP down|2:TDP up}}
 +
{{comp table header|cols|Family|Launched|Arch|Cores|%Frequency|GPU|%Frequency|P|P|Frequ.|P|Frequ.}}
 +
{{#ask: [[Category:microprocessor models by samsung]] [[microarchitecture::M4]]
 +
|?full page name
 +
|?model number
 +
|?family
 +
|?first launched
 +
|?microarchitecture
 +
|?core count
 +
|?base frequency#GHz
 +
|?integrated gpu
 +
|?integrated gpu base frequency
 +
|?tdp
 +
|?tdp down
 +
|?tdp down frequency#GHz
 +
|?tdp up
 +
|?tdp up frequency#GHz
 +
|format=template
 +
|template=proc table 3
 +
|userparam=14
 +
|mainlabel=-
 +
|valuesep=,
 +
}}
 +
{{comp table count|ask=[[Category:microprocessor models by samsung]] [[microarchitecture::M4]]}}
 +
</table>
 +
{{comp table end}}
 +
 
 +
== Bibliography ==
 +
* LLVM: lib/Target/AArch64/AArch64SchedExynosM4.td

Latest revision as of 13:43, 16 March 2023

Cheetah µarch
General Info
Arch Type: CPU
Designer: Samsung
Manufacturer: Samsung
Introduction: 2019
Process: 8 nm
Core Configs: 4
Pipeline
Type: Superscalar, Superpipeline
OoOE: Yes
Speculative: Yes
Reg Renaming: Yes
Stages: 16
Decode: 6-way
Instructions
ISA: ARMv8.2
Cache
L1I Cache: 64 KiB/core, 4-way set associative
L1D Cache: 64 KiB/core, 8-way set associative
L2 Cache: 512 KiB/core, 8-way set associative
L3 Cache: 2 MiB/cluster, 16-way set associative
Succession
Predecessor: M3
Successor: M5

Exynos M4 (Cheetah) is an 8 nm ARM microarchitecture designed by Samsung for its consumer electronics. It is the successor to the M3.

Process Technology

The M4 is fabricated on Samsung's 8 nm process (8LPP).

Compiler support

Compiler Arch-Specific Arch-Favorable
GCC -mcpu=exynos-m4 -mtune=exynos-m4
LLVM -mcpu=exynos-m4 -mtune=exynos-m4


Architecture

The M4 is an incremental microarchitecture that brought a die shrink and minor enhancements.

Key changes from M3

  • 8 nm process (from 10 nm)
  • ARMv8.2 (from ARMv8)
    • Support for full FP16 scalar extension
    • Support for integer dot product extension
  • Front end
    • Larger instruction queue (48 entries, up from 40)
  • Back end
    • LSU execution units reorganized
    • Floating-point execution units reorganized

This list is incomplete; you can help by expanding it.

Block Diagram

Individual Core

mongoose 4 block diagram.svg

Memory Hierarchy

  • Cache
    • L1I Cache
      • 64 KiB, 4-way set associative
        • 128 B line size
        • per core
      • Parity-protected
    • L1D Cache
      • 64 KiB, 8-way set associative
        • 64 B line size
        • per core
      • 4 cycles for fastest load-to-use
      • 32 B/cycle load bandwidth
      • 16 B/cycle store bandwidth
    • L2 Cache
      • 512 KiB, 8-way set associative
      • Inclusive of L1
      • 12 cycles latency
      • 32 B/cycle bandwidth
    • L3 Cache
      • 2 MiB, 16-way set associative
        • 1 MiB slice/core
      • Exclusive of L2
      • ~37-cycle typical (NUCA)
    • BIU
      • 80 outstanding transactions
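
The L1 figures above are enough to work out the cache geometry. The following is a minimal sketch, not taken from any Samsung documentation: the `Cache` class and its field names are hypothetical, and it assumes the usual relation sets = capacity / (ways × line size). L2 and L3 are left out because their line sizes are not listed here.

```python
from dataclasses import dataclass

KIB = 1024


@dataclass(frozen=True)
class Cache:
    """Hypothetical helper describing one set-associative cache level."""
    name: str
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def sets(self) -> int:
        # sets = capacity / (associativity * line size)
        return self.size_bytes // (self.ways * self.line_bytes)

    @property
    def offset_bits(self) -> int:
        # Address bits that select a byte within a line.
        return (self.line_bytes - 1).bit_length()

    @property
    def index_bits(self) -> int:
        # Address bits that select a set.
        return (self.sets - 1).bit_length()


# Parameters as listed in the Memory Hierarchy section.
l1i = Cache("L1I", 64 * KIB, ways=4, line_bytes=128)
l1d = Cache("L1D", 64 * KIB, ways=8, line_bytes=64)

for c in (l1i, l1d):
    print(f"{c.name}: {c.sets} sets, "
          f"{c.offset_bits} offset bits, {c.index_bits} index bits")
```

Under these assumptions both L1 caches come out at 128 sets: the L1I has half the associativity of the L1D but twice the line size.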

The M4 TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

  • TLBs
    • ITLB
      • 512-entry
    • DTLB
      • 32-entry
      • 512-entry Mid-level DTLB
    • STLB
      • 4,096-entry
      • Per core
  • BPU
    • 4K-entry main BTB
    • 128-entry µBTB
    • 64-entry return stack
    • 16K-entry L2 BTB
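
To put the TLB entry counts above in perspective, the sketch below estimates how much memory each TLB level could map. It rests on two assumptions that are not stated in this article: a 4 KiB page size (AArch64 also allows other granules), and one page per entry. The `reach_bytes` helper is hypothetical.

```python
KIB = 1024
PAGE_BYTES = 4 * KIB  # assumption: 4 KiB pages, one page per TLB entry

# Entry counts as listed in the TLB section.
tlb_entries = {
    "ITLB": 512,
    "DTLB": 32,
    "Mid-level DTLB": 512,
    "STLB": 4096,
}


def reach_bytes(entries: int, page_bytes: int = PAGE_BYTES) -> int:
    """Memory covered when every entry maps a distinct page."""
    return entries * page_bytes


for name, entries in tlb_entries.items():
    print(f"{name}: {entries} entries cover {reach_bytes(entries) // KIB} KiB")
```

With these assumptions the 4,096-entry STLB covers 16 MiB, and the 32-entry DTLB covers 128 KiB. Larger pages would scale the reach proportionally.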

Core

The core of the M4 is largely the same as M3. A number of buffers have been enlarged and some of the execution units have been reorganized.

Execution engine

Floating-point cluster

The floating-point execution units on the M4 have been reorganized, and three new units were added: a second FP square root unit, a second vector multiplication unit, and a new horizontal vector arithmetic unit.

Floating-point pipe changes.

Memory subsystem

m4 data cache.svg

Samsung also enhanced the M4 memory subsystem. The M3 had three AGUs: two dedicated load AGUs and a single dedicated store AGU. In the M4, Samsung changed one of the dedicated load AGUs into a generic AGU capable of handling both loads and stores. As a result, loads and stores can each be scheduled on two ports: loads on the load port or the generic port, and stores on the store port or the generic port.
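
The effect of the AGU change can be illustrated with a toy model. This is only a sketch: the port lists and the `can_issue` helper are hypothetical, and the model assumes each AGU port accepts at most one µOP per cycle and ignores every other pipeline constraint.

```python
# Each port is the set of µOP kinds its AGU can handle.
M3_PORTS = [{"load"}, {"load"}, {"store"}]
M4_PORTS = [{"load"}, {"load", "store"}, {"store"}]


def can_issue(ports, loads: int, stores: int) -> bool:
    """True if the given mix fits on distinct ports in a single cycle."""
    load_only = sum(1 for p in ports if p == {"load"})
    store_only = sum(1 for p in ports if p == {"store"})
    generic = sum(1 for p in ports if p == {"load", "store"})
    # µOPs that do not fit on dedicated ports must share the generic ones.
    overflow = max(0, loads - load_only) + max(0, stores - store_only)
    return overflow <= generic


for loads, stores in [(2, 1), (1, 2), (2, 2)]:
    print(f"{loads} loads + {stores} stores: "
          f"M3={can_issue(M3_PORTS, loads, stores)}, "
          f"M4={can_issue(M4_PORTS, loads, stores)}")
```

In this model both designs sustain two loads plus one store per cycle, but only the M4 arrangement also accepts one load plus two stores. Neither accepts two loads plus two stores, since there are still only three AGU ports.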

All M4 Processors

List of M4-based Processors

Model: 9825
Family: Exynos
Launched: 2019
Arch: Cortex-A75, Cortex-A55, M4
Cores: 8
Frequency: 2.73 GHz, 2.4 GHz, 1.95 GHz
Integrated GPU: Mali-G76
GPU frequency: 754 MHz
TDP: 5 W
TDP down: 5 W at 2.73 GHz
TDP up: 8 W at 3.016 GHz

Count: 1

Bibliography

  • LLVM: lib/Target/AArch64/AArch64SchedExynosM4.td