From WikiChip
Exynos M4 - Microarchitectures - Samsung
< samsung(Redirected from samsung/cores/mongoose m4)

Edit Values
Cheetah µarch
General Info
Arch TypeCPU
DesignerSamsung
ManufacturerSamsung
Introduction2019
Process8 nm
Core Configs4
Pipeline
TypeSuperscalar, Superpipeline
OoOEYes
SpeculativeYes
Reg RenamingYes
Stages16
Decode6-way
Instructions
ISAARMv8.2
Cache
L1I Cache64 KiB/core
4-way set associative
L1D Cache64 KiB/core
8-way set associative
L2 Cache512 KiB/core
8-way set associative
L3 Cache2 MiB/cluster
16-way set associative
Succession

Exynos M4 (Cheetah) is the successor to the M3, an 8 nm ARM microarchitecture designed by Samsung for their consumer electronics.

Process Technology[edit]

The M4 is fabricated on Samsung's 8 nm process (8LPP).

Compiler support[edit]

Compiler Arch-Specific Arch-Favorable
GCC -mcpu=exynos-m4 -mtune=exynos-m4
LLVM -mcpu=exynos-m4 -mtune=exynos-m4


Architecture[edit]

The M4 is an incremental microarchitecture that brought a die shrink and minor enhancements.

Key changes from M3[edit]

  • 8 nm process (from 10 nm)
  • ARMv8.2 (from ARMv8)
    • Support for full FP16 scalar extension
    • Support for integer dot product extension
  • Front end
  • Back end
    • LSU execution units reorganized
    • Floating-point execution units reorganized

This list is incomplete; you can help by expanding it.

Block Diagram[edit]

Individual Core[edit]

mongoose 4 block diagram.svg

Memory Hierarchy[edit]

  • Cache
    • L1I Caches
      • 64 KiB, 4-way set associative
        • 128 B line size
        • per core
      • Parity-protected
    • L1D Cache
      • 64 KiB, 8-way set associative
        • 64 B line size
        • per core
      • 4 cycles for fastest load-to-use
      • 32 B/cycle load bandwidth
      • 16 B/cycle store bandwidth
    • L2 Cache
      • 512 KiB, 8-way set associative
      • Inclusive of L1
      • 12 cycles latency
      • 32 B/cycle bandwidth
    • L3 Cache
      • 2 MiB, 16-way set associative
        • 1 MiB slice/core
      • Exlusive of L2
      • ~37-cycle typical (NUCA)
    • BIU
      • 80 outstanding transactions

The M3 TLB consists of dedicated L1 TLB for instruction cache (ITLB) and another one for data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

  • TLBs
    • ITLB
      • 512-entry
    • DTLB
      • 32-entry
      • 512-entry Mid-level DTLB
    • STLB
      • 4,096-entry
      • Per core
  • BPU
    • 4K-entry main BTB
    • 128-entry µBTB
    • 64-entry return stack
    • 16K-entry L2 BTB

Core[edit]

The core of the M4 is largely the same as M3. A number of buffers have been enlarged and some of the execution units have been reorganized.

Execution engine[edit]

Floating-point cluster[edit]

The execution units on the M4 have been reorganized. In total, three new units were also added - a second FP square root unit, a second vector multiplication unit, and a new horizontal vector arithmetic unit.

Floating-point pipe changes.

Memory subsystem[edit]

m4 data cache.svg

Samsung also made an enhancement to the M4 memory subsystem. In the M3, there were three AGUs - two dedicated Load AGUs and a single dedicated Store AGU. In the M4, Samsung changed one of the dedicated Load AGUs into a generic AGU capable of handling both loads and stores. In other words, the M4 can now schedule both load and store µOPs on two ports.

All M4 Processors[edit]

 List of M4-based Processors
 Main processorIntegrated GraphicsTDPTDP downTDP up
ModelFamilyLaunchedArchCoresFrequencyGPUFrequencyPPFrequ.PFrequ.
9825Exynos2019Cortex-A75, Cortex-A55, M482.73 GHz
2,730 MHz
2,730,000 kHz
, 2.4 GHz
2,400 MHz
2,400,000 kHz
, 1.95 GHz
1,950 MHz
1,950,000 kHz
Mali-G76754 MHz
0.754 GHz
754,000 KHz
5 W
5,000 mW
0.00671 hp
0.005 kW
5 W
5,000 mW
0.00671 hp
0.005 kW
2.73 GHz
2,730 MHz
2,730,000 kHz
8 W
8,000 mW
0.0107 hp
0.008 kW
3.016 GHz
3,016 MHz
3,016,000 kHz
Count: 1

Bibliography[edit]

  • LLVM: lib/Target/AArch64/AArch64SchedExynosM4.td