From WikiChip
Difference between revisions of "samsung/microarchitectures/m4"

(Architecture)
(28 intermediate revisions by 5 users not shown)
Line 1: Line 1:
{{samsung title|Mongoose 4 (M4)|arch}}
+
{{samsung title|Exynos M4|arch}}
 
{{microarchitecture
 
{{microarchitecture
 
|atype=CPU
 
|atype=CPU
|name=Mongoose 4
+
|name=Cheetah
 
|designer=Samsung
 
|designer=Samsung
 
|manufacturer=Samsung
 
|manufacturer=Samsung
 
|introduction=2019
 
|introduction=2019
|process=7 nm
+
|process=8 nm
|isa=ARMv8
+
|cores=4
|predecessor=Mongoose 3
+
|type=Superscalar
|predecessor link=samsung/microarchitectures/mongoose_3
+
|type 2=Superpipeline
|successor=Mongoose 5
+
|oooe=Yes
|successor link=samsung/microarchitectures/mongoose_5
+
|speculative=Yes
 +
|renaming=Yes
 +
|stages=16
 +
|decode=6-way
 +
|isa=ARMv8.2
 +
|l1i=64 KiB
 +
|l1i per=core
 +
|l1i desc=4-way set associative
 +
|l1d=64 KiB
 +
|l1d per=core
 +
|l1d desc=8-way set associative
 +
|l2=512 KiB
 +
|l2 per=core
 +
|l2 desc=8-way set associative
 +
|l3=2 MiB
 +
|l3 per=cluster
 +
|l3 desc=16-way set associative
 +
|predecessor=M3
 +
|predecessor link=samsung/microarchitectures/m3
 +
|successor=M5
 +
|successor link=samsung/microarchitectures/m5
 
}}
 
}}
'''Mongoose 4''' ('''M4''') is the successor to the {{\\|Mongoose 3}}, a planned [[7 nm]]/[[10 nm]](?) [[ARM]] microarchitecture designed by [[Samsung]] for their consumer electronics.
+
'''Exynos M4''' ('''Cheetah''') is an [[8 nm]] [[ARM]] microarchitecture designed by [[Samsung]] for its consumer electronics and the successor to the {{\\|M3}}.
  
 +
== Process Technology ==
 +
The M4 is fabricated on Samsung's [[8 nm process]] (8LPP).
 +
 +
== Compiler support ==
 +
{| class="wikitable"
 +
|-
 +
! Compiler !! Arch-Specific || Arch-Favorable
 +
|-
 +
| [[GCC]] || <code>-mcpu=exynos-m4</code> || <code>-mtune=exynos-m4</code>
 +
|-
 +
| [[LLVM]] || <code>-mcpu=exynos-m4</code> || <code>-mtune=exynos-m4</code>
 +
|}
  
{{future information}}
 
  
 
== Architecture ==
 
== Architecture ==
{{empty section}}
+
The M4 is an incremental microarchitecture that brought a die shrink and minor enhancements.
 +
 
 
=== Key changes from {{\\|Mongoose 3|M3}} ===
 
=== Key changes from {{\\|Mongoose 3|M3}} ===
{{empty section}}
+
* [[8 nm process]] (from [[10 nm]])
 +
* [[ARMv8.2]] (from [[ARMv8]])
 +
** Support for full FP16 scalar extension
 +
** Support for integer dot product extension
 +
* Front end
 +
** Larger [[instruction queue]] (48 entries, up from 40)
 +
* Back end
 +
** LSU execution units reorganized
 +
** Floating-point execution units reorganized
 +
{{expand list}}
 +
 
 +
=== Block Diagram ===
 +
==== Individual Core ====
 +
 
 +
[[File:mongoose 4 block diagram.svg|900px]]
 +
 
 +
=== Memory Hierarchy ===
 +
* Cache
 +
** L1I Cache
 +
*** 64 KiB, 4-way set associative
 +
**** 128 B line size
 +
**** per core
 +
*** Parity-protected
 +
** L1D Cache
 +
*** 64 KiB, 8-way set associative
 +
**** 64 B line size
 +
**** per core
 +
*** 4 cycles for fastest load-to-use
 +
*** 32 B/cycle load bandwidth
 +
*** 16 B/cycle store bandwidth
 +
** L2 Cache
 +
*** 512 KiB, 8-way set associative
 +
*** Inclusive of L1
 +
*** 12 cycles latency
 +
*** 32 B/cycle bandwidth
 +
** L3 Cache
 +
*** 2 MiB, 16-way set associative
 +
**** 1 MiB slice/core
 +
*** Exclusive of L2
 +
*** ~37-cycle typical (NUCA)
 +
** BIU
 +
*** 80 outstanding transactions
 +
 
 +
The M4 TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).
 +
 
 +
* TLBs
 +
** ITLB
 +
*** 512-entry
 +
** DTLB
 +
*** 32-entry
 +
*** 512-entry Mid-level DTLB
 +
** STLB
 +
*** 4,096-entry
 +
*** Per core
 +
 
 +
* BPU
 +
** 4K-entry main BTB
 +
** 128-entry µBTB
 +
** 64-entry return stack
 +
** 16K-entry L2 BTB
 +
 
 +
== Core ==
 +
The core of the M4 is largely the same as that of the {{\\|M3}}. A number of buffers have been enlarged and some of the execution units have been reorganized.
 +
 
 +
=== Execution engine ===
 +
==== Floating-point cluster ====
 +
The floating-point execution units on the M4 have been reorganized. Three new units were also added: a second FP square root unit, a second vector multiplication unit, and a new horizontal vector arithmetic unit.
 +
 
 +
:[[File:m4 fp eu pipes changes.svg|thumb|left|600px|Floating-point pipe changes.]]
 +
 
 +
{{clear}}
 +
 
 +
==== Memory subsystem ====
 +
[[File:m4 data cache.svg|thumb|left]]
 +
Samsung also enhanced the M4 memory subsystem. The M3 had three AGUs: two dedicated load [[AGUs]] and a single dedicated store [[AGU]]. In the M4, Samsung changed one of the dedicated load [[AGU]]s into a generic AGU capable of handling both loads and stores. As a result, the M4 can issue load µOPs on two ports and store µOPs on two ports, with the generic port shared between them.
 +
 
 +
{{clear}}
 +
 
 +
== All M4 Processors ==
 +
<!-- NOTE:
 +
          This table is generated automatically from the data in the actual articles.
 +
          If a microprocessor is missing from the list, an appropriate article for it needs to be
 +
          created and tagged accordingly.
 +
 
 +
          Missing a chip? please dump its name here: https://en.wikichip.org/wiki/WikiChip:wanted_chips
 +
-->
 +
{{comp table start}}
 +
<table class="comptable sortable tc5 tc6 tc7">
 +
{{comp table header|main|7:List of M4-based Processors}}
 +
{{comp table header|main|5:Main processor|2:Integrated Graphics}}
 +
{{comp table header|cols|Family|Launched|Arch|Cores|%Frequency|GPU|%Frequency}}
 +
{{#ask: [[Category:microprocessor models by samsung]] [[microarchitecture::M4]]
 +
|?full page name
 +
|?model number
 +
|?family
 +
|?first launched
 +
|?microarchitecture
 +
|?core count
 +
|?base frequency#GHz
 +
|?integrated gpu
 +
|?integrated gpu base frequency
 +
|format=template
 +
|template=proc table 3
 +
|userparam=9
 +
|mainlabel=-
 +
|valuesep=,
 +
}}
 +
{{comp table count|ask=[[Category:microprocessor models by samsung]] [[microarchitecture::M4]]}}
 +
</table>
 +
{{comp table end}}
 +
 
 +
== Bibliography ==
 +
* LLVM: lib/Target/AArch64/AArch64SchedExynosM4.td

Revision as of 15:27, 10 August 2020

Cheetah µarch
General Info
  Arch Type: CPU
  Designer: Samsung
  Manufacturer: Samsung
  Introduction: 2019
  Process: 8 nm
  Core Configs: 4
Pipeline
  Type: Superscalar, Superpipeline
  OoOE: Yes
  Speculative: Yes
  Reg Renaming: Yes
  Stages: 16
  Decode: 6-way
Instructions
  ISA: ARMv8.2
Cache
  L1I Cache: 64 KiB/core, 4-way set associative
  L1D Cache: 64 KiB/core, 8-way set associative
  L2 Cache: 512 KiB/core, 8-way set associative
  L3 Cache: 2 MiB/cluster, 16-way set associative
Succession
  Predecessor: M3
  Successor: M5

Exynos M4 (Cheetah) is an 8 nm ARM microarchitecture designed by Samsung for its consumer electronics and the successor to the M3.

Process Technology

The M4 is fabricated on Samsung's 8 nm process (8LPP).

Compiler support

Compiler | Arch-Specific | Arch-Favorable
GCC | -mcpu=exynos-m4 | -mtune=exynos-m4
LLVM | -mcpu=exynos-m4 | -mtune=exynos-m4


Architecture

The M4 is an incremental microarchitecture that brought a die shrink and minor enhancements.

Key changes from M3

  • 8 nm process (from 10 nm)
  • ARMv8.2 (from ARMv8)
    • Support for full FP16 scalar extension
    • Support for integer dot product extension
  • Front end
  • Back end
    • LSU execution units reorganized
    • Floating-point execution units reorganized

This list is incomplete; you can help by expanding it.
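
To illustrate what the integer dot product extension listed above provides: the Armv8.2 dot product instructions (SDOT/UDOT) accumulate, for each 32-bit lane, the sum of four 8-bit products. The Python sketch below models a single signed lane only. The names `to_int32` and `sdot_lane` are hypothetical, and the model is a simplified illustration of the instruction semantics, not a description of how the M4 implements them.

```python
def to_int32(x):
    """Wrap an integer to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def sdot_lane(acc, a, b):
    """Model one 32-bit lane of a signed 8-bit dot product with accumulate.

    acc  -- signed 32-bit accumulator value
    a, b -- sequences of exactly four signed 8-bit values (-128..127)
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("each lane takes exactly four 8-bit elements")
    for v in list(a) + list(b):
        if not -128 <= v <= 127:
            raise ValueError("elements must fit in a signed 8-bit value")
    # Four widening multiplies, summed and added to the accumulator.
    return to_int32(acc + sum(x * y for x, y in zip(a, b)))


print(sdot_lane(10, [1, 2, 3, 4], [5, 6, 7, 8]))  # 10 + 5 + 12 + 21 + 32 = 80
```

A full 128-bit vector instruction would apply the same operation to four such lanes at once, which is why the extension is attractive for 8-bit inference workloads.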

Block Diagram

Individual Core

mongoose 4 block diagram.svg

Memory Hierarchy

  • Cache
    • L1I Cache
      • 64 KiB, 4-way set associative
        • 128 B line size
        • per core
      • Parity-protected
    • L1D Cache
      • 64 KiB, 8-way set associative
        • 64 B line size
        • per core
      • 4 cycles for fastest load-to-use
      • 32 B/cycle load bandwidth
      • 16 B/cycle store bandwidth
    • L2 Cache
      • 512 KiB, 8-way set associative
      • Inclusive of L1
      • 12 cycles latency
      • 32 B/cycle bandwidth
    • L3 Cache
      • 2 MiB, 16-way set associative
        • 1 MiB slice/core
      • Exclusive of L2
      • ~37-cycle typical (NUCA)
    • BIU
      • 80 outstanding transactions
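
The cache parameters above determine two derived quantities that are often useful: the number of sets (capacity divided by associativity times line size) and the peak bandwidth in bytes per second (bytes per cycle times clock). The sketch below works both out for the L1 caches. The helper names are hypothetical, and the 2.73 GHz clock is an assumption: it is the highest frequency listed for the Exynos 9825 in the processor table and is taken here to be the M4 cores' clock.

```python
KIB = 1024


def num_sets(size_bytes, ways, line_bytes):
    """Number of sets = capacity / (associativity * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


def peak_bandwidth_gb_s(bytes_per_cycle, clock_hz):
    """Peak bandwidth in GB/s (10^9 bytes per second) at a given clock."""
    return bytes_per_cycle * clock_hz / 1e9


# L1I: 64 KiB, 4-way, 128 B lines; L1D: 64 KiB, 8-way, 64 B lines.
l1i_sets = num_sets(64 * KIB, ways=4, line_bytes=128)
l1d_sets = num_sets(64 * KIB, ways=8, line_bytes=64)
print(l1i_sets, l1d_sets)  # 128 128

# Assumed clock (see the note above); L1D load path is 32 B/cycle.
ASSUMED_CLOCK_HZ = 2_730_000_000
print(peak_bandwidth_gb_s(32, ASSUMED_CLOCK_HZ))  # about 87.36 GB/s
```

Both L1 caches come out at 128 sets even though their associativity and line size differ. The L2 and L3 are left out because the source does not give their line sizes.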

The M4 TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

  • TLBs
    • ITLB
      • 512-entry
    • DTLB
      • 32-entry
      • 512-entry Mid-level DTLB
    • STLB
      • 4,096-entry
      • Per core
  • BPU
    • 4K-entry main BTB
    • 128-entry µBTB
    • 64-entry return stack
    • 16K-entry L2 BTB
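
The TLB entry counts above can be turned into a rough "reach" figure, meaning the amount of address space covered when every entry maps one page. The sketch below assumes 4 KiB pages, which is one of the translation granules AArch64 supports; the source does not say which page sizes each TLB level can hold, so treat the results as a lower bound under that assumption. The names `TLB_ENTRIES` and `tlb_reach` are hypothetical.

```python
KIB = 1024

# Entry counts as listed in the TLB section above.
TLB_ENTRIES = {
    "ITLB": 512,
    "DTLB": 32,
    "Mid-level DTLB": 512,
    "STLB": 4096,
}


def tlb_reach(entries, page_bytes=4 * KIB):
    """Bytes of address space covered if every entry maps one page.

    page_bytes defaults to 4 KiB, which is an assumption, not a
    documented property of the M4.
    """
    return entries * page_bytes


for name, entries in TLB_ENTRIES.items():
    print(f"{name}: {tlb_reach(entries) // KIB} KiB")
```

Under the 4 KiB assumption the 4,096-entry STLB covers 16 MiB per core, well beyond the 512 KiB L2 and 2 MiB L3 capacities.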

Core

The core of the M4 is largely the same as that of the M3. A number of buffers have been enlarged and some of the execution units have been reorganized.

Execution engine

Floating-point cluster

The floating-point execution units on the M4 have been reorganized. Three new units were also added: a second FP square root unit, a second vector multiplication unit, and a new horizontal vector arithmetic unit.

Floating-point pipe changes.

Memory subsystem

m4 data cache.svg

Samsung also enhanced the M4 memory subsystem. The M3 had three AGUs: two dedicated load AGUs and a single dedicated store AGU. In the M4, Samsung changed one of the dedicated load AGUs into a generic AGU capable of handling both loads and stores. As a result, the M4 can issue load µOPs on two ports and store µOPs on two ports, with the generic port shared between them.
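
The effect of the AGU change can be sketched with an idealized issue model. It assumes each AGU port accepts one µOP per cycle and ignores dependencies, queue capacity, and cache bandwidth, none of which the source quantifies for this purpose. The function names are hypothetical; the port layouts follow the description above (M3: two load ports and one store port; M4: one load port, one store port, and one generic port).

```python
def _ceil_div(n, d):
    """Integer ceiling division for non-negative n and positive d."""
    return (n + d - 1) // d


def m3_issue_cycles(loads, stores):
    """Minimum cycles to issue on two load-only ports and one store-only port."""
    if loads < 0 or stores < 0:
        raise ValueError("counts must be non-negative")
    return max(_ceil_div(loads, 2), stores)


def m4_issue_cycles(loads, stores):
    """Minimum cycles with one load port, one store port, and one generic port.

    Per cycle: at most 2 loads, at most 2 stores, at most 3 µOPs in total.
    """
    if loads < 0 or stores < 0:
        raise ValueError("counts must be non-negative")
    return max(
        _ceil_div(loads, 2),
        _ceil_div(stores, 2),
        _ceil_div(loads + stores, 3),
    )


for loads, stores in [(6, 0), (0, 6), (4, 4)]:
    print(loads, stores, m3_issue_cycles(loads, stores), m4_issue_cycles(loads, stores))
```

In this model a load-only stream is unchanged (6 loads take 3 cycles on both), while store-heavy streams benefit: 6 stores drop from 6 cycles to 3, and a 4-load, 4-store mix drops from 4 cycles to 3.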

All M4 Processors

List of M4-based Processors

Model | Family | Launched | Arch | Cores | CPU Frequency | GPU | GPU Frequency
9825 | Exynos | 2019 | Cortex-A75, Cortex-A55, M4 | 8 | 2.73 GHz, 2.4 GHz, 1.95 GHz | Mali-G76 | 754 MHz

Count: 1

Bibliography

  • LLVM: lib/Target/AArch64/AArch64SchedExynosM4.td