From WikiChip

Latest revision as of 13:06, 22 January 2026

Exynos M4 (Cheetah) µarch

General Info
  Arch Type: CPU
  Designer: Samsung
  Manufacturer: Samsung
  Introduction: 2019
  Process: 8 nm
  Core Configs: 4

Pipeline
  Type: Superscalar, Superpipeline
  OoOE: Yes
  Speculative: Yes
  Reg Renaming: Yes
  Stages: 16
  Decode: 6-way

Instructions
  ISA: ARMv8.2

Cache
  L1I Cache: 64 KiB/core, 4-way set associative
  L1D Cache: 64 KiB/core, 8-way set associative
  L2 Cache: 512 KiB/core, 8-way set associative
  L3 Cache: 2 MiB/cluster, 16-way set associative

Succession
  Predecessor: M3 (Meerkat)
  Successor: M5 (Lion)

Exynos M4 (Cheetah), also known as Mongoose 4, is an 8 nm ARM microarchitecture designed by Samsung for their consumer electronics. It is the successor to the Exynos M3 (Meerkat), also known as Mongoose 3.

Process Technology

The M4 is fabricated on Samsung's 8 nm process (8LPP).

Compiler support

Compiler | Arch-Specific | Arch-Favorable
GCC | -mcpu=exynos-m4 | -mtune=exynos-m4
LLVM | -mcpu=exynos-m4 | -mtune=exynos-m4
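
As a rough illustration of how the two columns differ in use, the sketch below assembles a Clang command line for either case. The helper name `clang_command`, the source file name `kernel.c`, the `-O2` level, and the `--target=aarch64-linux-gnu` triple are assumptions made for the example; only the `-mcpu` and `-mtune` values come from the table above. In general, `-mcpu` selects both the instruction set features and the tuning for the named core, while `-mtune` only adjusts tuning and keeps the binary runnable on other cores.

```python
def clang_command(source, arch_specific=True, target="aarch64-linux-gnu"):
    """Build (but do not run) a Clang invocation aimed at the Exynos M4.

    arch_specific=True  -> -mcpu=exynos-m4  (the "Arch-Specific" column)
    arch_specific=False -> -mtune=exynos-m4 (the "Arch-Favorable" column)
    The target triple and -O2 are illustrative assumptions.
    """
    flag = "-mcpu=exynos-m4" if arch_specific else "-mtune=exynos-m4"
    return ["clang", f"--target={target}", "-O2", flag, "-c", source]


# Print both variants so they can be pasted into a shell.
print(" ".join(clang_command("kernel.c")))
print(" ".join(clang_command("kernel.c", arch_specific=False)))
```

The list form can be handed to `subprocess.run` unchanged on a machine where a suitable Clang cross toolchain is installed.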

Architecture

The M4 is an incremental microarchitecture that brought a die shrink and minor enhancements.

Key changes from M3 (Meerkat)

  • 8 nm process (from 10 nm)
  • ARMv8.2 (from ARMv8)
    • Support for full FP16 scalar extension
    • Support for integer dot product extension
  • Front end
    • Larger instruction queue (48 entries, up from 40)
  • Back end
    • LSU execution units reorganized
    • Floating-point execution units reorganized
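
To make the integer dot product item concrete, here is a minimal reference model in Python of what a 128-bit signed dot-product step (the SDOT form in the Arm instruction set) computes. The function names `sdot` and `to_int32` are made up for this sketch, and it models only the arithmetic, not how the M4 executes it.

```python
def to_int32(x):
    """Wrap an integer to signed 32-bit, as a 32-bit accumulator lane would."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def sdot(acc, a, b):
    """Reference model of one 128-bit signed dot-product step.

    acc: four signed 32-bit lanes; a, b: sixteen signed 8-bit elements each.
    Each 32-bit lane accumulates the dot product of its own group of four
    8-bit elements from a and b.
    """
    assert len(acc) == 4 and len(a) == 16 and len(b) == 16
    assert all(-128 <= v <= 127 for v in a + b)
    out = []
    for lane in range(4):
        group = range(4 * lane, 4 * lane + 4)
        total = acc[lane] + sum(a[i] * b[i] for i in group)
        out.append(to_int32(total))
    return out


print(sdot([0, 0, 0, 0], list(range(16)), [1] * 16))  # [6, 22, 38, 54]
```

One such step replaces sixteen multiplies and sixteen adds, which is why the extension matters for 8-bit inference workloads.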


Block Diagram

Individual Core

[Figure: Mongoose 4 block diagram]

Memory Hierarchy

  • Cache
    • L1I Cache
      • 64 KiB, 4-way set associative
        • 128 B line size, per core
      • Parity-protected
    • L1D Cache
      • 64 KiB, 8-way set associative
        • 64 B line size, per core
      • 4 cycles for fastest load-to-use
      • 32 B/cycle load bandwidth
      • 16 B/cycle store bandwidth
    • L2 Cache
      • 512 KiB, 8-way set associative
      • Inclusive of L1
      • 12 cycles latency
      • 32 B/cycle bandwidth
    • L3 Cache
      • 2 MiB, 16-way set associative
        • 1 MiB slice/core
      • Exclusive of L2
      • ~37-cycle typical (NUCA)
    • BIU
      • 80 outstanding transactions
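
The L1 figures above are enough to derive the set count and address bit split of each cache, and a peak bandwidth for a given clock. The sketch below does that arithmetic; the helper names are invented for illustration, and the 2.73 GHz clock is an assumption borrowed from the highest frequency listed for the Exynos 9825, not a statement about any particular M4 core.

```python
from math import log2

KIB = 1024


def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    return sets, int(log2(line_bytes)), int(log2(sets))


def peak_bandwidth_gbs(bytes_per_cycle, clock_ghz):
    """Peak bandwidth in GB/s, i.e. bytes per cycle times cycles per ns."""
    return bytes_per_cycle * clock_ghz


l1i = cache_geometry(64 * KIB, ways=4, line_bytes=128)
l1d = cache_geometry(64 * KIB, ways=8, line_bytes=64)
print(l1i)  # (128, 7, 7)
print(l1d)  # (128, 6, 7)

CLOCK_GHZ = 2.73  # assumption: clock taken from the 9825 table entry
print(peak_bandwidth_gbs(32, CLOCK_GHZ))  # L1D load bandwidth
print(peak_bandwidth_gbs(16, CLOCK_GHZ))  # L1D store bandwidth
```

Both L1 caches end up with 128 sets: the instruction cache has half the ways but twice the line size of the data cache.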

The M4 TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another one for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

  • TLBs
    • ITLB
      • 512-entry
    • DTLB
      • 32-entry
      • 512-entry Mid-level DTLB
    • STLB
      • 4,096-entry, per core
  • BPU
    • 4K-entry main BTB
    • 128-entry µBTB
    • 64-entry return stack
    • 16K-entry L2 BTB
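
The TLB sizes listed above can be turned into a rough "reach", meaning the amount of memory each level can map without a miss. The sketch assumes 4 KiB pages and one page per entry; neither is stated on this page, and larger page sizes would scale the results accordingly. The names `TLB_ENTRIES` and `tlb_reach_kib` are illustrative only.

```python
PAGE_BYTES = 4 * 1024  # assumption: 4 KiB pages, one page per TLB entry

# Entry counts as listed for the M4.
TLB_ENTRIES = {
    "ITLB": 512,
    "DTLB": 32,
    "Mid-level DTLB": 512,
    "STLB": 4096,
}


def tlb_reach_kib(entries, page_bytes=PAGE_BYTES):
    """Memory covered by a TLB level, in KiB."""
    return entries * page_bytes // 1024


for name, entries in TLB_ENTRIES.items():
    print(f"{name}: {tlb_reach_kib(entries)} KiB")
```

Under these assumptions the 4,096-entry STLB covers 16 MiB, which is eight times the 2 MiB L3 cache.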

Core

The core of the M4 is largely the same as that of the M3. A number of buffers have been enlarged and some of the execution units have been reorganized.

Execution engine

Floating-point cluster

The execution units on the M4 have been reorganized. In total, three new units were added: a second FP square root unit, a second vector multiplication unit, and a new horizontal vector arithmetic unit.

[Figure: Floating-point pipe changes.]

Memory subsystem

[Figure: M4 data cache]

Samsung also enhanced the M4 memory subsystem. The M3 had three AGUs: two dedicated load AGUs and a single dedicated store AGU. In the M4, Samsung changed one of the dedicated load AGUs into a generic AGU capable of handling both loads and stores. In other words, on the M4 load µOPs and store µOPs can each be scheduled on two ports.
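
The AGU change can be sketched as a small capability model: each AGU is the set of µOP kinds it accepts, and a group of µOPs can issue together only if each one can be given its own capable AGU. This is an illustrative model of the port arrangement described above, not of the real scheduler; other limits such as cache bandwidth are ignored, and the names `M3_AGUS`, `M4_AGUS`, `ports_for`, and `can_issue` are invented for the example.

```python
from itertools import permutations

# Each AGU is modelled as the set of µOP kinds it can handle.
M3_AGUS = [{"load"}, {"load"}, {"store"}]
M4_AGUS = [{"load"}, {"load", "store"}, {"store"}]


def ports_for(agus, op):
    """Number of AGUs able to handle the given µOP kind."""
    return sum(1 for capable in agus if op in capable)


def can_issue(agus, ops):
    """True if every µOP in ops can be assigned to a distinct capable AGU."""
    if len(ops) > len(agus):
        return False
    return any(
        all(op in agu for op, agu in zip(ops, chosen))
        for chosen in permutations(agus, len(ops))
    )


print(ports_for(M3_AGUS, "store"), ports_for(M4_AGUS, "store"))  # 1 2
print(can_issue(M3_AGUS, ["store", "store"]))  # False
print(can_issue(M4_AGUS, ["store", "store"]))  # True
```

In this model the M4 keeps two load-capable ports while gaining a second store-capable port, at the cost that the generic AGU can serve only one of the two roles in any given cycle.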

All M4 Processors

List of M4-based Processors

Model | Family | Launched | Arch | Cores | Frequency | GPU | GPU Frequency | TDP | TDP down | TDP down Frequency | TDP up | TDP up Frequency
9820 | Exynos | January 2019 | Cortex-A75, Cortex-A55, Exynos M4 | 8 | - | Mali-G76 | - | - | - | - | - | -
9825 | Exynos | 2019 | Cortex-A75, Cortex-A55, Mongoose 4 | 8 | 2.73 GHz, 2.4 GHz, 1.95 GHz | Mali-G76 | 754 MHz | 5 W | 5 W | 2.73 GHz | 8 W | 3.016 GHz

Count: 2

Bibliography

  • LLVM: lib/Target/AArch64/AArch64SchedExynosM4.td