Difference between revisions of "samsung/microarchitectures/m4"

	Exynos M4 (Cheetah) µarch
	General Info
Arch Type	CPU
Designer	Samsung
Manufacturer	Samsung
Introduction	2019
Process	8 nm
Core Configs	4
	Pipeline
Type	Superscalar, Superpipeline
OoOE	Yes
Speculative	Yes
Reg Renaming	Yes
Stages	16
Decode	6-way
	Instructions
ISA	ARMv8.2
	Cache
L1I Cache	64 KiB/core; 4-way set associative
L1D Cache	64 KiB/core; 8-way set associative
L2 Cache	512 KiB/core; 8-way set associative
L3 Cache	2 MiB/cluster; 16-way set associative
	Succession
Predecessor	M3 (Meerkat)
Successor	M5 (Lion)

Latest revision as of 13:06, 22 January 2026

Exynos M4 (Cheetah), also known as Mongoose 4, is an 8 nm ARM microarchitecture designed by Samsung for their consumer electronics. It is the successor to the Exynos M3 (Meerkat), also known as Mongoose 3.

Process Technology

The M4 is fabricated on Samsung's 8 nm process (8LPP).

Compiler support

Compiler	Arch-Specific	Arch-Favorable
GCC	`-mcpu=exynos-m4`	`-mtune=exynos-m4`
LLVM	`-mcpu=exynos-m4`	`-mtune=exynos-m4`

Architecture

The M4 is an incremental microarchitecture that brought a die shrink and minor enhancements.

Key changes from M3 (Meerkat)

8 nm process (from 10 nm)
ARMv8.2 (from ARMv8)
- Support for full FP16 scalar extension
- Support for integer dot product extension
Front end
- Larger instruction queue (48 entries, up from 40)
Back end
- LSU execution units reorganized
- Floating-point execution units reorganized

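As an illustration of what the integer dot product extension computes, here is a minimal Python sketch of the arithmetic. It assumes the extension refers to Arm's signed dot product (SDOT) operation, where each 32-bit accumulator lane gains the sum of four signed 8-bit products. The function names are hypothetical and the code models only the arithmetic, not the instruction encoding or the M4 pipeline.

```python
def to_signed(value, bits):
    """Wrap an integer to a two's-complement signed value of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value


def sdot_lane(acc, a_bytes, b_bytes):
    """One 32-bit lane: accumulator plus the sum of four signed 8-bit products."""
    if len(a_bytes) != 4 or len(b_bytes) != 4:
        raise ValueError("each lane consumes exactly four 8-bit elements")
    products = sum(to_signed(a, 8) * to_signed(b, 8)
                   for a, b in zip(a_bytes, b_bytes))
    return to_signed(acc + products, 32)


def sdot(acc, a, b):
    """Apply sdot_lane to every lane; a and b hold four bytes per accumulator lane."""
    if len(a) != 4 * len(acc) or len(b) != 4 * len(acc):
        raise ValueError("a and b must hold four elements per accumulator lane")
    return [sdot_lane(acc[i], a[4 * i:4 * i + 4], b[4 * i:4 * i + 4])
            for i in range(len(acc))]


# Four lanes, as in a 128-bit vector register (an assumption for this example).
result = sdot([0, 10, 0, 0],
              [1, 2, 3, 4] * 4,
              [1, 1, 1, 1, -1, -1, -1, -1, 2, 2, 2, 2, 0, 0, 0, 0])
print(result)
```

The unsigned variant would follow the same shape with unsigned 8-bit wrapping in place of the signed conversion.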

Block Diagram

Individual Core

Memory Hierarchy

Cache
- L1I Cache
  - 64 KiB, 4-way set associative
    - 128 B line size, per core
  - Parity-protected
- L1D Cache
  - 64 KiB, 8-way set associative
    - 64 B line size, per core
  - 4 cycles for fastest load-to-use
  - 32 B/cycle load bandwidth
  - 16 B/cycle store bandwidth
- L2 Cache
  - 512 KiB, 8-way set associative
  - Inclusive of L1
  - 12 cycles latency
  - 32 B/cycle bandwidth
- L3 Cache
  - 2 MiB, 16-way set associative
    - 1 MiB slice/core
  - Exclusive of L2
  - ~37 cycles typical latency (NUCA)
- BIU
  - 80 outstanding transactions
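The L1 geometry above can be turned into set counts and address-bit widths with the usual relation sets = size / (ways × line size). The sketch below is only that arithmetic applied to the listed L1I and L1D figures. The helper names are hypothetical, and nothing here describes how the M4 actually indexes its caches.

```python
KIB = 1024


def num_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache of the given geometry."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size must be a multiple of ways * line size")
    return sets


def address_bits(size_bytes, ways, line_bytes):
    """Return (offset_bits, index_bits) for power-of-two geometries."""
    sets = num_sets(size_bytes, ways, line_bytes)
    if sets & (sets - 1) or line_bytes & (line_bytes - 1):
        raise ValueError("sets and line size must be powers of two")
    return line_bytes.bit_length() - 1, sets.bit_length() - 1


# Figures taken from the list above.
l1i_sets = num_sets(64 * KIB, ways=4, line_bytes=128)
l1d_sets = num_sets(64 * KIB, ways=8, line_bytes=64)
print("L1I sets:", l1i_sets, "bits:", address_bits(64 * KIB, 4, 128))
print("L1D sets:", l1d_sets, "bits:", address_bits(64 * KIB, 8, 64))
```

Both L1 caches work out to 128 sets, but the L1I's longer 128 B lines mean each way covers 16 KiB of address space, compared with 8 KiB per way for the L1D.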

The M4 TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

TLBs
- ITLB
  - 512-entry
- DTLB
  - 32-entry
  - 512-entry Mid-level DTLB
- STLB
  - 4,096-entry, per core
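To give a feel for these entry counts, TLB reach (the amount of memory mapped without a miss) is simply entries × page size. The sketch below assumes 4 KiB pages, which is an assumption for illustration and not something the source states. Larger page sizes would scale the reach proportionally.

```python
KIB = 1024
PAGE_BYTES = 4 * KIB  # assumed page size; not stated in the source

# Entry counts taken from the list above.
TLB_ENTRIES = {
    "ITLB": 512,
    "DTLB": 32,
    "Mid-level DTLB": 512,
    "STLB": 4096,
}


def tlb_reach(entries, page_bytes=PAGE_BYTES):
    """Bytes of address space covered when every entry maps one page."""
    return entries * page_bytes


for name, entries in TLB_ENTRIES.items():
    print(f"{name}: {tlb_reach(entries) // KIB} KiB")
```

Under that assumption the STLB covers 16 MiB, well beyond the 2 MiB L3, while the 32-entry DTLB covers only 128 KiB.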

BPU
- 4K-entry main BTB
- 128-entry µBTB
- 64-entry return stack
- 16K-entry L2 BTB

Core

The core of the M4 is largely the same as M3. A number of buffers have been enlarged and some of the execution units have been reorganized.

Execution engine

Floating-point cluster

The execution units on the M4 have been reorganized. In total, three new units were added: a second FP square root unit, a second vector multiplication unit, and a new horizontal vector arithmetic unit.

Floating-point pipe changes.

Memory subsystem

Samsung also made an enhancement to the M4 memory subsystem. In the M3, there were three AGUs: two dedicated Load AGUs and a single dedicated Store AGU. In the M4, Samsung changed one of the dedicated Load AGUs into a generic AGU capable of handling both loads and stores. In other words, loads and stores can each be scheduled on two ports in the M4, whereas store µOPs on the M3 were limited to a single port.
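The effect of the generic AGU can be illustrated with a toy port model. The sketch below is a simplification under stated assumptions: it counts only address-generation ports, treats every µOP as independent, and ignores cache bandwidth and all other pipeline limits. The function and the two configuration dictionaries are hypothetical names built from the AGU counts described above.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def min_issue_cycles(loads, stores, load_ports, store_ports, generic_ports):
    """Lower bound on cycles needed to issue the given load/store µOP counts.

    Loads may use load-only or generic ports; stores may use store-only or
    generic ports. Each port accepts one µOP per cycle in this toy model.
    """
    if loads and load_ports + generic_ports == 0:
        raise ValueError("no port can accept loads")
    if stores and store_ports + generic_ports == 0:
        raise ValueError("no port can accept stores")
    bounds = [0]
    if loads:
        bounds.append(ceil_div(loads, load_ports + generic_ports))
    if stores:
        bounds.append(ceil_div(stores, store_ports + generic_ports))
    if loads or stores:
        total_ports = load_ports + store_ports + generic_ports
        bounds.append(ceil_div(loads + stores, total_ports))
    return max(bounds)


# AGU mixes as described in the text above.
M3_AGUS = dict(load_ports=2, store_ports=1, generic_ports=0)
M4_AGUS = dict(load_ports=1, store_ports=1, generic_ports=1)

# A store-heavy burst: 2 loads and 6 stores.
print("M3:", min_issue_cycles(2, 6, **M3_AGUS))
print("M4:", min_issue_cycles(2, 6, **M4_AGUS))
```

In this model a store-heavy burst benefits from the change, while a load-only burst does not, since both configurations offer two ports that can accept loads.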

All M4 Processors

List of M4-based Processors
Model	Family	Launched	Arch	Cores	CPU Frequency	GPU	GPU Frequency	TDP	TDP down	TDP down Frequency	TDP up	TDP up Frequency
9820	Exynos	January 2019	Cortex-A75, Cortex-A55, Exynos M4	8		Mali-G76
9825	Exynos	2019	Cortex-A75, Cortex-A55, Mongoose 4	8	2.73 GHz, 2.4 GHz, 1.95 GHz	Mali-G76	754 MHz	5 W	5 W	2.73 GHz	8 W	3.016 GHz
Count: 2

Bibliography

LLVM: lib/Target/AArch64/AArch64SchedExynosM4.td

@@ Line 2: / Line 2: @@
 {{microarchitecture
 |atype=CPU
-|name=Mongoose 4
+|name=Exynos M4 (Cheetah)
 |designer=Samsung
 |manufacturer=Samsung
@@ Line 25: / Line 25: @@
 |l2 per=core
 |l2 desc=8-way set associative
-|l3=4 MiB
+|l3=2 MiB
 |l3 per=cluster
 |l3 desc=16-way set associative
-|predecessor=M3
+|predecessor=M3 (Meerkat)
 |predecessor link=samsung/microarchitectures/m3
-|successor=M5
+|successor=M5 (Lion)
 |successor link=samsung/microarchitectures/m5
 }}
-'''Exynos Mongoose 4''' ('''M4''') is the successor to the {{\\|Mongoose 3}}, an [[8 nm]] [[ARM]] microarchitecture designed by [[Samsung]] for their consumer electronics.
+'''Exynos M4''' ('''Cheetah''') <aka ''{{\\|Mongoose 4}}'' > is the successor to the [[Exynos]] {{\\|M3}} (Meerkat) <aka ''{{\\|Mongoose 3}}'' >, an [[8 nm]] [[ARM]] microarchitecture designed by [[Samsung]] for their consumer electronics.
 == Process Technology ==
@@ Line 47: / Line 47: @@
 | [[LLVM]] || <code>-mcpu=exynos-m4</code> || <code>-mtune=exynos-m4</code>
 |}
 == Architecture ==
 The M4 is an incremental microarchitecture that brought a die shrink and minor enhancements.
-{{future information}}
+=== Key changes from {{\\|Mongoose 3|M3}} (Meerkat) ===
-=== Key changes from {{\\|Mongoose 3|M3}} ===
 * [[8 nm process]] (from [[10 nm]])
 * [[ARMv8.2]] (from [[ARMv8]])
 ** Support for full FP16 scalar extension
-** Suppot for integer dot product extension
+** Support for integer dot product extension
 * Front end
 ** Larger [[instruction queue]] (48 entries, up from 40)
 * Back end
-** LSU executiion units reorganized
+** LSU execution units reorganized
 ** Floating-point execution units reorganized
 {{expand list}}
@@ Line 68: / Line 65: @@
 === Block Diagram ===
 ==== Individual Core ====
 [[File:mongoose 4 block diagram.svg|900px]]
 === Memory Hierarchy ===
-{{empty section}}
+{| border="0" cellpadding="5" width="100%"
+|-
+|width="50%" valign="top" align="left"|
+* Cache
+** L1I Caches
+*** 64 KiB, 4-way set associative
+**** 128 B line size, per core
+*** Parity-protected
+** L1D Cache
+*** 64 KiB, 8-way set associative
+**** 64 B line size, per core
+*** 4 cycles for fastest load-to-use
+*** 32 B/cycle load bandwidth
+*** 16 B/cycle store bandwidth
+** L2 Cache
+*** 512 KiB, 8-way set associative
+*** Inclusive of L1
+*** 12 cycles latency
+*** 32 B/cycle bandwidth
+** L3 Cache
+*** 2 MiB, 16-way set associative
+**** 1 MiB slice/core
+*** Exclusive of L2
+*** ~37-cycle typical (NUCA)
+** BIU
+*** 80 outstanding transactions
+|width="50%" valign="top" align="left"|
+The M4 TLB consists of dedicated L1 TLB for instruction <br>cache (ITLB) and another one for data cache (DTLB). <br>Additionally, there is a unified L2 TLB (STLB).
+* TLBs
+** ITLB
+*** 512-entry
+** DTLB
+*** 32-entry
+*** 512-entry Mid-level DTLB
+** STLB
+*** 4,096-entry, per core
+* BPU
+** 4K-entry main BTB
+** 128-entry µBTB
+** 64-entry return stack
+** 16K-entry L2 BTB
+|}
 == Core ==
@@ Line 78: / Line 119: @@
 === Execution engine ===
 ==== Floating-point cluster ====
-The execution units on the M4 have been reorganized. In total, three new units were also added  - a second FP square root unit, a second vector multiplication unit, and a new horizontal vector arithmetic unit.
+The execution units on the M4 have been reorganized. In total, three new units were also added - a second FP square root unit,
+:a second vector multiplication unit, and a new horizontal vector arithmetic unit.
 :[[File:m4 fp eu pipes changes.svg|thumb|left|600px|Floating-point pipe changes.]]
@@ Line 85: / Line 127: @@
 ==== Memory subsystem ====
-[[File:m4 data cache.svg|thumb|left]]
+[[File:m4 data cache.svg|thumb|right]]
-A minor enhancement was made to the M4 memory subsystem. In the M3, there were three AGUs - two dedicated Load [[AGUs]] and a single dedicated Store [[AUG]]. In the M4, Samsung changed one of the dedicated Load [[AGU]]s into a generic AGU capable of handling both loads and stores. In other words, the M4 can now schedule both load and store µOPs on two ports.
+[[Samsung]] also made an enhancement to the M4 memory subsystem. In the {{\\|M3}}, there were three AGUs - two dedicated ''Load AGUs'' and a single dedicated ''Store AGU''. In the M4, Samsung changed one of the dedicated ''Load AGUs'' into a generic AGU capable of handling both loads and stores. In other words, the M4 can now schedule both load and store µOPs on two ports.
 {{clear}}
-== All M3 Processors ==
+== All M4 Processors ==
 <!-- NOTE:
             This table is generated automatically from the data in the actual articles.
@@ Line 100: / Line 142: @@
 {{comp table start}}
 <table class="comptable sortable tc5 tc6 tc7">
-{{comp table header|main|7:List of M4-based Processors}}
+{{comp table header|main|12:List of M4-based Processors}}
-{{comp table header|main|5:Main processor|2:Integrated Graphics}}
+{{comp table header|main|5:Main processor|2:Integrated Graphics|{{abbr|TDP}}|2:TDP down|2:TDP up}}
-{{comp table header|cols|Family|Launched|Arch|Cores|%Frequency|GPU|%Frequency}}
+{{comp table header|cols|Family|Launched|Arch|Cores|%Frequency|GPU|%Frequency|P|P|Frequ.|P|Frequ.}}
-{{#ask: [[Category:microprocessor models by samsung]] [[microarchitecture::Mongoose 4]]
+{{#ask: [[Category:microprocessor models by samsung]] [[microarchitecture::~*M4*||Mongoose 4||Exynos 4]]
   |?full page name
   |?model number
@@ Line 113: / Line 155: @@
   |?integrated gpu
   |?integrated gpu base frequency
+ |?tdp
+ |?tdp down
+ |?tdp down frequency#GHz
+ |?tdp up
+ |?tdp up frequency#GHz
   |format=template
   |template=proc table 3
-  |userparam=9
+  |userparam=14
   |mainlabel=-
   |valuesep=,
 }}
-{{comp table count|ask=[[Category:microprocessor models by samsung]] [[microarchitecture::Mongoose 4]]}}
+{{comp table count|ask=[[Category:microprocessor models by samsung]] [[microarchitecture::~*M4*||Mongoose 4||Exynos 4]]}}
 </table>
 {{comp table end}}

codename	Exynos M4 (Cheetah)
core count	4
designer	Samsung
first launched	2019
full page name	samsung/microarchitectures/m4
instance of	microarchitecture
instruction set architecture	ARMv8.2
manufacturer	Samsung
microarchitecture type	CPU
name	Exynos M4 (Cheetah)
pipeline stages	16
process	8 nm (0.008 μm, 8.0e-6 mm)
