From WikiChip
Exynos M4 - Microarchitectures - Samsung
Revision as of 23:58, 13 January 2019 by David (talk | contribs) (Memory subsystem)

Mongoose 4 µarch
General Info
Arch Type: CPU
Designer: Samsung
Manufacturer: Samsung
Introduction: 2018
Process: 8 nm
Instructions
ISA: ARMv8

Exynos Mongoose 4 (M4) is an 8 nm ARM microarchitecture designed by Samsung for their consumer electronics. It is the successor to the Mongoose 3.

Process Technology

The M4 is fabricated on Samsung's 8 nm process (8LPP).

Compiler support

Compiler   Arch-Specific      Arch-Favorable
GCC        -mcpu=exynos-m4    -mtune=exynos-m4
LLVM       -mcpu=exynos-m4    -mtune=exynos-m4
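
As a rough illustration of how the two columns are used, the sketch below assembles a compiler command line from the flags in the table. The helper name `m4_command`, the `-O2` level, and the file names are hypothetical; only `-mcpu=exynos-m4` and `-mtune=exynos-m4` come from the table. It assumes the compiler driver already targets AArch64 and does not check whether a given compiler release accepts these values. In general, `-mcpu` selects both the instruction set features and the tuning for a core, whereas `-mtune` only adjusts tuning and leaves the baseline ISA alone.

```python
# Flags taken from the compiler support table; everything else is illustrative.
M4_FLAGS = {
    "arch-specific": "-mcpu=exynos-m4",    # ISA features + tuning for the M4
    "arch-favorable": "-mtune=exynos-m4",  # tuning only, baseline ISA unchanged
}


def m4_command(compiler, source, output, mode="arch-favorable"):
    """Return an argv list that compiles `source` with one of the M4 flags.

    `compiler` is whatever AArch64-targeting driver is installed
    (an assumption; nothing here verifies it exists).
    """
    if mode not in M4_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    return [compiler, "-O2", M4_FLAGS[mode], source, "-o", output]


# Example: build the command without running it.
print(" ".join(m4_command("clang", "kernel.c", "kernel", "arch-specific")))
```

The list form can be passed directly to `subprocess.run` if the toolchain is actually available.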


Architecture

The M4 is an incremental microarchitecture that brought a die shrink and minor enhancements.

Key changes from M3

  • 8 nm process (from 10 nm)
  • ARMv8.2 (from ARMv8)
    • Support for full FP16 scalar extension
    • Support for integer dot product extension
  • Front end
  • Back end
    • LSU reorganized
    • Floating-point execution units reorganized
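
To give a sense of what the integer dot product extension listed above computes, here is a minimal Python model of a signed 8-bit dot product that accumulates into 32-bit lanes, following the general shape of the ARMv8.2 SDOT operation. The function names are hypothetical and the model says nothing about how the M4 implements the instruction or how fast it runs.

```python
def to_int8(x):
    """Wrap an integer into the signed 8-bit range."""
    x &= 0xFF
    return x - 0x100 if x & 0x80 else x


def to_int32(x):
    """Wrap an integer into the signed 32-bit range."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def sdot(acc, a, b):
    """Model a signed dot product with accumulate.

    acc  : list of 32-bit accumulator lanes
    a, b : lists of signed bytes, four per accumulator lane
    Each output lane is acc[i] plus the sum of four byte products.
    """
    if len(a) != len(b) or len(a) != 4 * len(acc):
        raise ValueError("a and b need four bytes per accumulator lane")
    out = []
    for lane, base in enumerate(acc):
        total = base
        for k in range(4):
            total += to_int8(a[4 * lane + k]) * to_int8(b[4 * lane + k])
        out.append(to_int32(total))
    return out


# Two lanes: 0 + (1*5 + 2*6 + 3*7 + 4*8) and 10 + (-1 - 2 - 3 - 4)
print(sdot([0, 10], [1, 2, 3, 4, -1, -2, -3, -4], [5, 6, 7, 8, 1, 1, 1, 1]))  # [70, 0]
```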


Block Diagram

Individual Core

[Figure: Mongoose 4 block diagram]

Memory Hierarchy

  • Cache
    • L1I Cache
      • 64 KiB, 4-way set associative
        • 128 B line size
        • per core
      • Parity-protected
    • L1D Cache
      • 64 KiB, 8-way set associative
        • 64 B line size
        • per core
      • 4 cycles for fastest load-to-use
      • 32 B/cycle load bandwidth
      • 16 B/cycle store bandwidth
    • L2 Cache
      • 512 KiB, 8-way set associative
      • Inclusive of L1
      • 12 cycles latency
      • 32 B/cycle bandwidth
    • L3 Cache
      • 4 MiB, 16-way set associative
        • 1 MiB slice/core
      • Exclusive of L2
      • ~37-cycle typical (NUCA)
    • BIU
      • 80 outstanding transactions
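
The L1 geometry above fixes the number of sets in each cache, since sets = size / (ways × line size). The short sketch below works this out for the two L1 caches, using only the sizes, associativities, and line sizes listed. The function names are hypothetical, and the L2 and L3 are left out because no line size is given for them.

```python
KIB = 1024


def num_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    lines, remainder = divmod(size_bytes, line_bytes)
    if remainder or lines % ways:
        raise ValueError("size must be a multiple of ways * line size")
    return lines // ways


def log2_exact(n):
    """log2 of a power of two (used for index and offset bit counts)."""
    if n <= 0 or n & (n - 1):
        raise ValueError("not a power of two")
    return n.bit_length() - 1


# Figures from the list above.
l1i_sets = num_sets(64 * KIB, ways=4, line_bytes=128)
l1d_sets = num_sets(64 * KIB, ways=8, line_bytes=64)

print("L1I:", l1i_sets, "sets,", log2_exact(l1i_sets), "index bits,",
      log2_exact(128), "offset bits")
print("L1D:", l1d_sets, "sets,", log2_exact(l1d_sets), "index bits,",
      log2_exact(64), "offset bits")
```

Both L1 caches come out at 128 sets: the instruction cache has half the ways but twice the line size of the data cache.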

The Mongoose 4 TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another one for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

  • TLBs
    • ITLB
      • 512-entry
    • DTLB
      • 32-entry
      • 512-entry Mid-level DTLB
    • STLB
      • 4,096-entry
      • Per core
  • BPU
    • 4K-entry main BTB
    • 128-entry µBTB
    • 64-entry return stack
    • 16K-entry L2 BTB

Core

The core of the M4 is largely the same as M3. A number of buffers have been enlarged and some of the execution units have been reorganized.

Execution engine

Floating-point cluster

The execution units on the M4 have been reorganized. In total, three new units were added: a second FP square root unit, a second vector multiplication unit, and a new horizontal vector arithmetic unit.

Floating-point pipe changes.

Memory subsystem

[Figure: M4 data cache]

A minor enhancement was made to the M4 memory subsystem. In the M3, there were three AGUs: two dedicated load AGUs and a single dedicated store AGU. In the M4, Samsung changed one of the dedicated load AGUs into a generic AGU capable of handling both loads and stores. In other words, the M4 can now schedule load µOPs on two ports and store µOPs on two ports, with the generic port shared between them.
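
The effect of the generic AGU can be sketched with a toy per-cycle issue model. It counts only AGU ports as described in the paragraph above and ignores everything else (queues, dependencies, cache bandwidth). The names and the choice to give leftover loads priority on the generic port are assumptions made for illustration, not a description of Samsung's scheduler.

```python
# AGU port counts per core, taken from the description above.
PORTS = {
    "M3": {"load": 2, "store": 1, "generic": 0},
    "M4": {"load": 1, "store": 1, "generic": 1},
}


def issue_per_cycle(core, loads, stores):
    """Return (loads_issued, stores_issued) for one cycle.

    `loads` and `stores` are the numbers of ready µOPs of each kind.
    Dedicated ports are filled first; generic ports then take leftover
    loads before leftover stores (an arbitrary priority).
    """
    ports = PORTS[core]
    issued_loads = min(loads, ports["load"])
    issued_stores = min(stores, ports["store"])
    generic = ports["generic"]

    extra_loads = min(loads - issued_loads, generic)
    generic -= extra_loads
    extra_stores = min(stores - issued_stores, generic)

    return issued_loads + extra_loads, issued_stores + extra_stores


# A store-heavy cycle: two ready stores, no loads.
print("M3:", issue_per_cycle("M3", loads=0, stores=2))  # (0, 1)
print("M4:", issue_per_cycle("M4", loads=0, stores=2))  # (0, 2)
```

In this model the total number of AGU ports stays at three, so a cycle with two ready loads and two ready stores still issues three µOPs on either core. The difference shows up only in store-heavy cycles.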

All M4 Processors

 List of M4-based Processors
 Main processor: Model, Family, Launched, Arch, Cores, Frequency
 Integrated Graphics: GPU, Frequency
Count: 0

Bibliography

  • LLVM: lib/Target/AArch64/AArch64SchedExynosM4.td