From WikiChip

Revision as of 10:42, 15 February 2020

Cheetah µarch
General Info
Arch Type: CPU
Designer: Samsung
Manufacturer: Samsung
Introduction: 2019
Process: 8 nm
Core Configs: 4
Pipeline
Type: Superscalar, Superpipeline
OoOE: Yes
Speculative: Yes
Reg Renaming: Yes
Stages: 16
Decode: 6-way
Instructions
ISA: ARMv8.2
Cache
L1I Cache: 64 KiB/core, 4-way set associative
L1D Cache: 64 KiB/core, 8-way set associative
L2 Cache: 512 KiB/core, 8-way set associative
L3 Cache: 2 MiB/cluster, 16-way set associative
Succession

Exynos M4 (Cheetah) is an 8 nm ARM microarchitecture designed by Samsung for its consumer electronics and the successor to the M3.

Process Technology

The M4 is fabricated on Samsung's 8 nm process (8LPP).

Compiler support

Compiler | Arch-Specific | Arch-Favorable
GCC | -mcpu=exynos-m4 | -mtune=exynos-m4
LLVM | -mcpu=exynos-m4 | -mtune=exynos-m4


Architecture

The M4 is an incremental microarchitecture that brought a die shrink and minor enhancements.

Key changes from M3

  • 8 nm process (from 10 nm)
  • ARMv8.2 (from ARMv8)
    • Support for full FP16 scalar extension
    • Support for integer dot product extension
  • Front end
    • Larger instruction queue (48 entries, up from 40)
  • Back end
    • LSU execution units reorganized
    • Floating-point execution units reorganized

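The ARMv8.2 integer dot product extension adds the UDOT (unsigned) and SDOT (signed) instructions: each 32-bit lane of the destination accumulates the dot product of four 8-bit elements from each source. The Python sketch below models only the arithmetic of the unsigned 128-bit form; the helper name `udot_128` is hypothetical, and nothing here describes how the M4 implements the instruction.

```python
def udot_128(acc, a, b):
    """Model of a 128-bit UDOT: 4 x u32 accumulators, 16 x u8 per source.

    Lane i accumulates a[4i+j] * b[4i+j] for j in 0..3; the result wraps
    modulo 2**32, like a 32-bit register lane.
    """
    if len(acc) != 4 or len(a) != 16 or len(b) != 16:
        raise ValueError("expected 4 accumulators and two 16-byte vectors")
    if any(not 0 <= x <= 0xFF for x in a + b):
        raise ValueError("source elements must be unsigned 8-bit values")
    out = []
    for lane in range(4):
        group = range(4 * lane, 4 * lane + 4)
        partial = sum(a[k] * b[k] for k in group)
        out.append((acc[lane] + partial) & 0xFFFFFFFF)
    return out


# Every byte pair is 255 * 255, so each lane gains 4 * 65025 = 260100.
print(udot_128([0, 0, 0, 1], [255] * 16, [255] * 16))
```

One instruction thus replaces sixteen byte multiplies and the adds that fold them into four 32-bit sums, which is why the extension is attractive for quantized (8-bit) inference kernels.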

Block Diagram

Individual Core

mongoose 4 block diagram.svg

Memory Hierarchy

  • Cache
    • L1I Cache
      • 64 KiB, 4-way set associative
        • 128 B line size
        • per core
      • Parity-protected
    • L1D Cache
      • 64 KiB, 8-way set associative
        • 64 B line size
        • per core
      • 4 cycles for fastest load-to-use
      • 32 B/cycle load bandwidth
      • 16 B/cycle store bandwidth
    • L2 Cache
      • 512 KiB, 8-way set associative
      • Inclusive of L1
      • 12 cycles latency
      • 32 B/cycle bandwidth
    • L3 Cache
      • 2 MiB, 16-way set associative
        • 1 MiB slice/core
      • Exclusive of L2
      • ~37-cycle typical (NUCA)
    • BIU
      • 80 outstanding transactions
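The cache parameters above determine how an address splits into a line offset and a set index. The Python sketch below derives the set count and bit split for the two L1 caches, plus a peak L1D load bandwidth figure. The helper names are hypothetical, the indexing model is the generic textbook one rather than Samsung's documented design, and the 2.73 GHz clock is an assumption taken from the highest frequency listed for the Exynos 9825 in the processor table. L2 and L3 are omitted because their line sizes are not given here.

```python
KIB = 1024


def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Simplified model: power-of-two sets and line size, with the set index
    taken from the address bits just above the line offset.
    """
    sets = size_bytes // (ways * line_bytes)
    if sets == 0 or sets * ways * line_bytes != size_bytes:
        raise ValueError("size must equal sets * ways * line_bytes")
    for n in (sets, line_bytes):
        if n & (n - 1):
            raise ValueError("sets and line size must be powers of two")
    return sets, line_bytes.bit_length() - 1, sets.bit_length() - 1


def peak_bandwidth_gbs(bytes_per_cycle, clock_ghz):
    """Peak bandwidth in GB/s: bytes per cycle times cycles per nanosecond."""
    return bytes_per_cycle * clock_ghz


# Sizes, associativities and line sizes from the list above.
l1_caches = {"L1I": (64 * KIB, 4, 128), "L1D": (64 * KIB, 8, 64)}
for name, params in l1_caches.items():
    sets, offset_bits, index_bits = cache_geometry(*params)
    print(f"{name}: {sets} sets, {offset_bits} offset bits, {index_bits} index bits")

# 32 B/cycle L1D load bandwidth at an assumed 2.73 GHz core clock.
print(f"L1D peak load bandwidth: {peak_bandwidth_gbs(32, 2.73):.2f} GB/s")
```

For the L1I, 64 KiB / (4 ways × 128 B) gives 128 sets, hence 7 offset bits and 7 index bits; the L1D's 64 B lines give 6 offset bits with the same 7 index bits.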

The M4 TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

  • TLBs
    • ITLB
      • 512-entry
    • DTLB
      • 32-entry
      • 512-entry Mid-level DTLB
    • STLB
      • 4,096-entry
      • Per core
  • BPU
    • 4K-entry main BTB
    • 128-entry µBTB
    • 64-entry return stack
    • 16K-entry L2 BTB
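A useful way to read the TLB entry counts is as "reach", the amount of memory a TLB level can map without a miss. The Python sketch below multiplies entries by page size. It assumes every entry maps one 4 KiB page (a common AArch64 granule); the article does not state which page sizes each M4 TLB level caches, and the helper name is hypothetical.

```python
KIB = 1024


def tlb_reach_bytes(entries, page_bytes=4 * KIB):
    """Memory covered by a TLB when every entry maps one page of page_bytes.

    The 4 KiB default is an assumption, not a documented M4 parameter.
    """
    return entries * page_bytes


# Entry counts from the list above.
m4_tlbs = {"ITLB": 512, "L1 DTLB": 32, "mid-level DTLB": 512, "STLB": 4096}
for name, entries in m4_tlbs.items():
    print(f"{name}: {tlb_reach_bytes(entries) // KIB} KiB reach with 4 KiB pages")
```

Under this assumption the 4,096-entry STLB covers 16 MiB, while the 32-entry L1 DTLB covers only 128 KiB; larger pages scale the reach proportionally.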

Core

The M4 core is largely the same as the M3's. A number of buffers have been enlarged (for example, the instruction queue grew from 40 to 48 entries), and some of the execution units have been reorganized.

Execution engine

Floating-point cluster

The floating-point execution units on the M4 have been reorganized, and three new units were added: a second FP square-root unit, a second vector multiplication unit, and a new horizontal vector arithmetic unit.

Floating-point pipe changes.
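The article does not say which instructions the new horizontal unit executes. Assuming it covers across-lane operations such as the AArch64 pairwise add (ADDP) and add-across-vector (ADDV), the Python sketch below models their integer semantics on plain lists. Lane-width overflow and the real hardware datapath are ignored, and the function names are illustrative only.

```python
def addp(a, b):
    """ADDP-style pairwise add: sum adjacent pairs of a, then of b.

    For 4-lane inputs the result is [a0+a1, a2+a3, b0+b1, b2+b3].
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    joined = a + b
    return [joined[i] + joined[i + 1] for i in range(0, len(joined), 2)]


def addv(v):
    """ADDV-style reduction of all lanes, computed as a tree of pairwise adds."""
    if not v or len(v) & (len(v) - 1):
        raise ValueError("lane count must be a power of two")
    while len(v) > 1:
        half = len(v) // 2
        v = addp(v[:half], v[half:])
    return v[0]


print(addp([1, 2, 3, 4], [10, 20, 30, 40]))
print(addv([1, 2, 3, 4, 5, 6, 7, 8]))
```

Unlike lane-wise ("vertical") arithmetic, these operations combine elements within a single vector, so they need cross-lane wiring that ordinary vector adders lack.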

Memory subsystem

m4 data cache.svg

Samsung also made an enhancement to the M4 memory subsystem. The M3 had three AGUs: two dedicated load AGUs and a single dedicated store AGU. In the M4, Samsung changed one of the dedicated load AGUs into a generic AGU capable of handling both loads and stores. As a result, the M4 can schedule load µOPs on two ports and store µOPs on two ports, whereas the M3 had two load ports but only one store port.
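The effect of the AGU change can be checked with a small model that asks whether a given mix of load and store µOPs can all be assigned an address-generation port in the same cycle. This Python sketch uses hypothetical names and considers only AGU port types; it ignores store-data ports, the 16 B/cycle store bandwidth, and every other scheduling constraint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AguConfig:
    load_only: int   # AGUs that accept only load µOPs
    store_only: int  # AGUs that accept only store µOPs
    generic: int     # AGUs that accept either kind

    def fits(self, loads, stores):
        """True if every load and store µOP can get a compatible AGU this cycle.

        Each µOP needs one AGU of a compatible type. By Hall's theorem this
        bipartite assignment exists exactly when all three limits hold.
        """
        return (loads <= self.load_only + self.generic
                and stores <= self.store_only + self.generic
                and loads + stores <= self.load_only + self.store_only + self.generic)


M3 = AguConfig(load_only=2, store_only=1, generic=0)
M4 = AguConfig(load_only=1, store_only=1, generic=1)

for mix in [(2, 1), (1, 2), (0, 2), (3, 0)]:
    print(mix, "M3:", M3.fits(*mix), "M4:", M4.fits(*mix))
```

In this model the M4 accepts every mix the M3 does and additionally accepts two stores per cycle, while the total of three AGU operations per cycle is unchanged.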

All M4 Processors

 List of M4-based Processors
Model | Family | Launched | Arch | Cores | CPU Frequency | GPU | GPU Frequency
9825 | Exynos | 2019 | Cortex-A75, Cortex-A55, M4 | 8 | 2.73 GHz, 2.4 GHz, 1.95 GHz | Mali-G76 | 754 MHz
Count: 1
