Editing samsung/microarchitectures/m1

{{samsung title|Mongoose 1 (M1)|arch}}
{{microarchitecture}}
'''Mongoose 1''' ('''M1''') is an [[ARM]] microarchitecture designed by [[Samsung]] for their consumer electronics. This was Samsung's first in-house developed microarchitecture.
@@ Line 1: / Line 1: @@
-{{samsung title|Exynos M1|arch}}
+{{samsung title|Mongoose 1 (M1)|arch}}
-{{microarchitecture
+{{microarchitecture}}
-|atype=CPU
+'''Mongoose 1''' ('''M1''') is an [[ARM]] microarchitecture designed by [[Samsung]] for their consumer electronics. This was Samsung's first in-house developed microarchitecture.
-|name=Mongoose 1
-|designer=Samsung
-|manufacturer=Samsung
-|introduction=November 12, 2015
-|phase-out=2017
-|process=14 nm
-|cores=4
-|oooe=Yes
-|speculative=Yes
-|renaming=Yes
-|stages=14
-|decode=4-way
-|isa=ARMv8
-|l1i=64 KiB
-|l1i per=core
-|l1i desc=4-way set associative
-|l1d=32 KiB
-|l1d per=core
-|l1d desc=8-way set associative
-|l2=2 MiB
-|l2 per=cluster
-|l2 desc=16-way set associative
-|successor=M2
-|successor link=samsung/microarchitectures/m2
-}}
-'''Exynos Mongoose 1''' ('''M1''') is an [[ARM]] microarchitecture designed by [[Samsung]] for their consumer electronics. This was Samsung's first in-house developed high-performance low-power [[ARM]] microarchitecture.
-== History ==
-The Mongoose 1 (M1) microarchitecture was Samsung's first in-house design done entirely from scratch. A design team was assembled and, in roughly three years, went from requirements to [[tape-out]]. The design was done at Samsung's Austin R&D Center (SARC), which was founded in [[2010]] for the sole purpose of developing high-performance, low-power, complex CPU and system IPs. A large portion of the design team consists of ex-[[AMD]] Austin engineers as well as ex-[[IBM]]ers. The Mongoose 1 design was led by chief architect Brad Burgess. Previously, Burgess was an AMD fellow and the chief architect of [[AMD]]'s {{amd|Bobcat|l=arch}} microarchitecture, a low-power [[x86]] design.
-== Process Technology ==
-M1 was fabricated on Samsung's [[14 nm process]].
-== Compiler support ==
-{| class="wikitable"
-|-
-! Compiler !! Arch-Specific !! Arch-Favorable
-|-
-| [[GCC]] || <code>-mcpu=exynos-m1</code> || <code>-mtune=exynos-m1</code>
-|-
-| [[LLVM]] || <code>-mcpu=exynos-m1</code> || <code>-mtune=exynos-m1</code>
-|}
-== Architecture ==
-The M1 is Samsung's first in-house design from scratch.
-* ARM v8.0
-** {{arm|AArch64}} (A64)
-** {{arm|AArch32}} (A32)
-*** {{arm|Thumb}} support
-* 2.6 GHz clock frequency
-** 2.3 GHz for multi-core workloads
-* Sub 3-watt/core
-* [[14 nm process]] ([[FinFET]])
-* Core
-** Advanced [[branch predictor]]
-** 4-way instruction decode
-*** Most instructions map to a single µOP, with a few exceptions
-** 4-way µOP dispatch and retire
-** [[Out-of-order]] execution
-*** Out-of-order loads and stores
-** Multistride/multistream prefetcher
-** Low-latency and low-power caches
-=== Block Diagram ===
-==== Core Cluster Overview ====
-[[File:mongoose 1 soc block diagram.svg|500px]]
-==== Individual Core ====
-[[File:mongoose 1 block diagram.svg|900px]]
-=== Memory Hierarchy ===
-* Cache
-** L1I Cache
-*** 64 KiB, 4-way set associative
-**** 128 B line size
-**** per core
-*** Parity-protected
-** L1D Cache
-*** 32 KiB, 8-way set associative
-**** 64 B line size
-**** per core
-*** 4 cycles for fastest load-to-use
-*** 16 B/cycle load bandwidth
-*** 16 B/cycle store bandwidth
-** L2 Cache
-*** 2 MiB, 16-way set associative
-**** 4x banks (512 KiB each)
-*** Inclusive of L1
-*** 22 cycles latency
-*** 16 B/cycle/CPU bandwidth
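The cache parameters listed above imply a set count for each L1 cache. The following is a minimal sketch of that arithmetic, using only the sizes, associativities, and line sizes given in the list; the helper name <code>cache_sets</code> is hypothetical and chosen for illustration. The L2 is left out because its line size is not stated here.

```python
KIB = 1024


def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Return the number of sets in a set-associative cache."""
    total_lines = size_bytes // line_bytes  # how many lines the cache holds
    return total_lines // ways              # each set holds `ways` lines


# L1I: 64 KiB, 4-way, 128 B lines (values from the list above)
l1i_sets = cache_sets(64 * KIB, ways=4, line_bytes=128)

# L1D: 32 KiB, 8-way, 64 B lines (values from the list above)
l1d_sets = cache_sets(32 * KIB, ways=8, line_bytes=64)

print(f"L1I sets: {l1i_sets}, L1D sets: {l1d_sets}")
```

Under these figures the L1I works out to 128 sets and the L1D to 64 sets.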
-The Mongoose 1 has a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).
-* TLBs
-** ITLB
-*** 256-entry
-** DTLB
-*** 32-entry
-** STLB
-*** 1,024-entry
-*** Per core
-* BPU
-** 4K-entry main BTB
-** 64-entry µBTB
-** 64-entry return stack
-** 8K-entry L2 BTB
-== Overview ==
-Mongoose 1 was a brand-new architecture, designed from the ground up, that implemented the {{arm|ARMv8}} ISA. The architecture supports both {{arm|AArch64}} and {{arm|AArch32}}, including {{arm|Thumb}} mode. It is a [[quad-core]] design that was intended to be paired with another low-power [[arm holdings|ARM]] [[IP core]].
-== Core ==
-=== Pipeline ===
-The M1 has a [[speculative execution|speculatively executing]] [[out-of-order]] 4-wide pipeline.
-: [[File:mongoose 1 pipeline.svg|900px]]
-There are two pipeline stages for the branch predictor for generating addresses. There are three cycles for [[instruction fetch|fetching instructions]] from the [[instruction cache]] and delivering them to the [[instruction queue]]. There are three [[instruction decode|decode]] stages, two [[register renaming|renaming stages]], and a single [[instruction dispatch|dispatch]] stage.
-Both pipes for the execution stage go through a scheduling cycle followed by a register read cycle. Instruction execution may take a cycle or more depending on the [[ARM]] instruction being executed. There is a single cycle for the [[write back]] and [[forwarding]]. In the case of a load operation, there is an additional translation tag cycle, data cycle, and an alignment and write back cycle.
-The last row (in blue) is the 22-cycle latency path for loading data from the [[level 2 cache]].
-==== Front-end ====
-[[File:mongoose 1 fetch.svg|right|250px]]
-The front-end is responsible for fetching the correct stream of bytes from the cache based on the traffic flow predictions by the [[branch predictor]] and feeding the decoders, which decode the instructions into simple signals. The goal of the front-end is ultimately to keep the back-end busy executing as many instructions as possible.
-===== Fetch & pre-decoding =====
-With the help of the [[branch predictor]], the instructions should already be found in the [[level 1 instruction cache]]. The L1I cache is 64 KiB, 4-way [[set associative]] and has its own [[iTLB]] consisting of 256 entries. Up to 24 bytes are read from it each cycle into the [[instruction queue]], which allows the M1 to hide very short [[branch bubbles]]. The [[instruction queue]] is a slightly more complex component than a simple buffer. The byte stream gets split up into the [[ARM]] instructions it is made of, including dealing with the various misaligned ARM instructions such as in the case of {{arm|thumb|thumb mode}}.
-====== Branch Predictor ======
-The M1 has a [[perceptron branch predictor]] (Samsung advertises it as a "neural network" predictor) with a couple of [[perceptrons]] which can handle two branches per cycle. The unit is capable of indirect predictions as well as loop and stream predictions when it detects those traffic patterns. The unit has a 64-entry µBTB that can be used for caching small tight loops and hot kernels. Additionally, there is a 4K-entry main BTB which covers the entire footprint of the instruction cache and allows for higher accuracy. There is a 64-entry call [[return stack]].
-The branch predictor feeds a decoupled instruction address queue which in turn feeds the [[instruction cache]].
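For readers unfamiliar with the general technique, the following is a minimal sketch of a textbook perceptron branch predictor. It is not Samsung's implementation, whose details are not described here. The class name, table size, history length, and the unbounded (non-saturating) weights are all assumptions made for illustration; the training threshold uses the heuristic from the published perceptron predictor literature.

```python
class PerceptronPredictor:
    """Textbook perceptron predictor (illustrative; all sizes are hypothetical)."""

    def __init__(self, num_perceptrons: int = 64, history_len: int = 16):
        # Training threshold heuristic from the perceptron predictor literature.
        self.theta = int(1.93 * history_len + 14)
        # One weight vector per table entry; index 0 is the bias weight.
        self.weights = [[0] * (history_len + 1) for _ in range(num_perceptrons)]
        # Global history: +1 for taken, -1 for not taken (oldest first).
        self.history = [-1] * history_len

    def _output(self, pc: int) -> int:
        w = self.weights[pc % len(self.weights)]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc: int) -> bool:
        """Predict taken when the dot product is non-negative."""
        return self._output(pc) >= 0

    def update(self, pc: int, taken: bool) -> None:
        """Train on a misprediction or a low-confidence output, then shift history."""
        y = self._output(pc)
        t = 1 if taken else -1
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self.weights[pc % len(self.weights)]
            w[0] += t
            for i, h in enumerate(self.history):
                w[i + 1] += t * h
        self.history = self.history[1:] + [t]


# Demo: a branch that strictly alternates between taken and not taken.
predictor = PerceptronPredictor()
hits = []
for i in range(400):
    taken = (i % 2 == 0)
    hits.append(predictor.predict(0x40) == taken)
    predictor.update(0x40, taken)

late_accuracy = sum(hits[-100:]) / 100
print(f"accuracy over the last 100 branches: {late_accuracy:.2f}")
```

An alternating branch is linearly separable in the history bits, so a single perceptron can learn it after a short warm-up. A hardware design would use small saturating weights rather than unbounded integers.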
-===== Decoding =====
-[[File:mongoose 1 decode.svg|right|350px]]
-From the [[instruction queue]] the instructions are sent to decode. Decode is a 4-way decoder which can handle both the [[ARM]] {{arm|AArch64}} and {{arm|AArch32}} instructions. Up to four µOPs are decoded and sent to the [[re-order buffer]].
-====== Micro-Sequencer ======
-For some complex ARM instructions, such as the {{arm|ARMv7}} load/store multiple instructions, which result in multiple µOPs being emitted, M1 has a side micro-sequencer that gets invoked and emits the appropriate µOPs.
-==== Execution engine ====
-===== Renaming & Allocation =====
-[[File:mongoose 1 rob.svg|right|400px]]
-As with the [[instruction decode|decode]], up to four µOPs can be renamed each cycle. For some special cases, such as in {{arm|ARMv7}} where four single-precision registers can alias into a single quad register or a pair of doubles, M1 has special logic in the rename area to handle the renaming of those [[floating point]] µOPs. In the case of a branch misprediction, M1 can perform a fast map recovery as its recovery mechanism.
-M1 has a 96-entry [[reorder buffer]] which feeds the [[dispatch queue]] 4 µOPs per cycle.
-===== Dispatch =====
-From the dispatch queue, up to 7 µOPs may be issued to the integer cluster and up to 2 µOPs may be issued to the floating point cluster.
-===== Integer cluster =====
-[[File:mongoose 1 integer scheduler.svg|right|500px]]
-The [[dispatch queue]] feeds the execution units. For integers, there is a 96-entry integer [[physical register file]] (roughly 35-36 of the entries are architected), which means data movement is not necessary. In the integer execution cluster, up to 7 µOPs per cycle may be dispatched to the [[schedulers]]. The schedulers are distributed across the various pipes. In total, the integer schedulers have 58 entries, and those entries are distributed in mixed sizes across the 7 schedulers.
-The first pipe has a branch resolution unit. The next three pipes have integer ALUs, with only the first of them also capable of multiplication and division. It's worth pointing out that while all three pipes can execute the typical integer ALU µOPs (i.e., two-source µOPs), only ALUC (complex ALU) can also execute three-source µOPs. This includes some of the {{arm|ARMv7}} special predicate forms. Generally speaking, most of the simple classes of instructions (e.g., a normal add) should take a single cycle, while the more complex operations (e.g., an add with a [[barrel shift]] or rotate) take a cycle or two more.
-The last three pipes are for the [[load address]] adder, [[store address]] adder, and store data.
-===== Floating-point cluster =====
-[[File:mongoose 1 fp scheduler.svg|right|200px]]
-On the floating point cluster side, up to 2 µOPs may be issued to a unified scheduler which consists of 32 entries. There is a 96-entry floating point (vector) [[physical register file]] (roughly 35-36 architected). There are two pipes and both pipes have an integer SIMD unit ({{arm|NEON}}). The first pipe features a 5-cycle [[FMAC]] and a 4-cycle multiplier while the second pipe incorporates a 3-cycle [[floating-point adder]]. Additionally, the first pipe also includes a [[cryptography]] unit and a floating point conversion unit while the second pipe incorporates a floating point store unit.
-===== Retirement =====
-Once execution is complete, µOPs may retire at a rate of up to 4 µOPs per cycle.
-==== Memory subsystem ====
-[[File:mongoose 1 data cache.svg|left|150px]]
-The M1 has a 32 KiB 8-way [[set associative]] level 1 [[data cache]] which is [[ECC]] protected. [[Loads]] and [[stores]] are done fully [[out of order]] with all the typical forwarding and light prediction that prevents thrashing. The M1 is capable of a single 128-bit load and a single 128-bit store each cycle. Note that both operations can be done in the same cycle. The level 1 data cache has a 4-cycle load latency and can support 8 outstanding misses to the [[L2]] hierarchy.
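As a rough illustration of what one 128-bit load and one 128-bit store per cycle amount to, the sketch below converts the per-cycle width into a theoretical peak bandwidth at the 2.6 GHz and 2.3 GHz clock frequencies quoted in this article. These are upper bounds derived purely from arithmetic, not measured figures, and the helper name <code>peak_bandwidth_gb_s</code> is hypothetical.

```python
def peak_bandwidth_gb_s(bits_per_cycle: int, freq_mhz: int) -> float:
    """Theoretical peak bandwidth in GB/s (10**9 bytes per second)."""
    bytes_per_cycle = bits_per_cycle // 8
    return bytes_per_cycle * freq_mhz * 1_000_000 / 1e9


# One 128-bit load per cycle at the single-core clock of 2.6 GHz.
load_peak = peak_bandwidth_gb_s(128, 2600)

# Loads and stores can proceed in the same cycle, so the combined peak doubles.
combined_peak = 2 * load_peak

# The same load path at the 2.3 GHz multi-core clock.
load_peak_multicore = peak_bandwidth_gb_s(128, 2300)

print(f"load: {load_peak:.1f} GB/s, load+store: {combined_peak:.1f} GB/s, "
      f"load at 2.3 GHz: {load_peak_multicore:.1f} GB/s")
```

Sustained bandwidth will be lower, since it depends on hit rates, the 8 outstanding misses, and the L2.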
-It's worth noting that the [[dTLB]] has only 32 entries, which is considerably smaller than the 256-entry [[iTLB]]. Brad Burgess, Chief CPU Architect for the M1, explained that the reason for this is that the front-end is designed with a lot more room in mind as far as handling a larger TLB capacity natively in its pipeline. It's also physically laid out much further away on the floor plan. This allows the L2 TLB to service the dTLB more aggressively.
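To put the entry counts in perspective, the sketch below computes the "reach" of each TLB, meaning the amount of memory its entries can map at once. It assumes 4 KiB pages, which is one of the translation granules ARMv8 supports; the page size actually used depends on the operating system, and larger pages would scale the results accordingly. The helper name <code>tlb_reach_bytes</code> is hypothetical.

```python
KIB = 1024
PAGE_SIZE = 4 * KIB  # assumption: 4 KiB pages


def tlb_reach_bytes(entries: int, page_size: int = PAGE_SIZE) -> int:
    """Memory covered when every entry maps one page of `page_size` bytes."""
    return entries * page_size


# Entry counts as given in this article.
tlbs = {"ITLB": 256, "DTLB": 32, "STLB": 1024}

reach_kib = {name: tlb_reach_bytes(n) // KIB for name, n in tlbs.items()}
for name, kib in reach_kib.items():
    print(f"{name}: {kib} KiB")
```

With 4 KiB pages, the 32-entry dTLB covers 128 KiB, the iTLB covers 1 MiB, and the 1,024-entry L2 TLB covers 4 MiB.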
-The M1 has a [[multi-stride]] prefetcher which allows it to detect patterns and start the fetching request ahead of execution. While Samsung hasn't gone into too much detail, they noted that there are also some stream/copy optimizations which accelerate certain observable traffic patterns.
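Since Samsung has not detailed the prefetcher, the following is only a generic sketch of how stride detection is commonly done, not a description of the M1's hardware. Each load instruction (identified by a hypothetical <code>pc</code>) gets its own table entry, so several independent strides can be tracked at once. The class names, confidence threshold, and prefetch degree are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class StreamEntry:
    last_addr: int
    stride: int = 0
    confidence: int = 0


class StridePrefetcher:
    """Generic per-PC stride detector (illustrative; parameters are hypothetical)."""

    def __init__(self, threshold: int = 2, degree: int = 2):
        self.table = {}             # pc -> StreamEntry
        self.threshold = threshold  # repeats needed before prefetching
        self.degree = degree        # how many addresses to prefetch ahead

    def access(self, pc: int, addr: int) -> list:
        """Record a load and return the addresses that would be prefetched."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = StreamEntry(last_addr=addr)
            return []
        stride = addr - entry.last_addr
        if stride != 0 and stride == entry.stride:
            entry.confidence += 1   # same stride seen again
        else:
            entry.stride = stride   # new pattern: start over
            entry.confidence = 0
        entry.last_addr = addr
        if entry.confidence >= self.threshold:
            return [addr + entry.stride * k for k in range(1, self.degree + 1)]
        return []


# Demo: one load walking through memory in 64-byte steps.
prefetcher = StridePrefetcher()
issued = [prefetcher.access(pc=0x400, addr=a)
          for a in (0x1000, 0x1040, 0x1080, 0x10C0)]
print(issued)
```

In this demo the first three accesses only establish the stride; the fourth triggers prefetches for the next two addresses in the sequence.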
-== Core cluster ==
-[[File:mongoose 1 cluster cache.svg|right|300px]]
-The M1 consists of four cores and a shared [[level 2]] cache. The L2 is 2 MiB, 16-way [[set associative]], ECC protected, and organized into four banks. Because the L2 is inclusive of the L1, filtering is done for the L1 first. The bank storing the data depends on the address; the exact organization was not detailed. Each bank can return up to 16 bytes per cycle to the cores and has a 22-cycle latency.
-{{clear}}
-== Die ==
-=== Core Floorplan ===
-:[[File:mongoose 1 core floorplan.png|500px]]
-:[[File:mongoose 1 core floorplan (annotated).png|500px]]
-=== Core Cluster Floorplan ===
-:[[File:mongoose 1 core cluster.png|class=wikichip_ogimage|500px]]
-:[[File:mongoose 1 core cluster (annotated).png|500px]]
-== All M1 Processors ==
-<!-- NOTE:
-           This table is generated automatically from the data in the actual articles.
-           If a microprocessor is missing from the list, an appropriate article for it needs to be
-           created and tagged accordingly.
-           Missing a chip? please dump its name here: https://en.wikichip.org/wiki/WikiChip:wanted_chips
--->
-{{comp table start}}
-<table class="comptable sortable tc5 tc6 tc7">
-{{comp table header|main|8:List of M1-based Processors}}
-{{comp table header|main|6:Main processor|2:Integrated Graphics}}
-{{comp table header|cols|Family|Launched|Arch|Cores|%Frequency|%Turbo|GPU|%Frequency}}
-{{#ask: [[Category:microprocessor models by samsung]] [[microarchitecture::Mongoose 1]]
- |?full page name
- |?model number
- |?family
- |?first launched
- |?microarchitecture
- |?core count
- |?base frequency#GHz
- |?turbo frequency (1 core)#GHz
- |?integrated gpu
- |?integrated gpu base frequency
- |format=template
- |template=proc table 3
- |userparam=10
- |mainlabel=-
- |valuesep=,
-}}
-{{comp table count|ask=[[Category:microprocessor models by samsung]] [[microarchitecture::Mongoose 1]]}}
-</table>
-{{comp table end}}
-== Bibliography ==
-* Burgess, Brad. "Samsung Exynos M1 processor." Hot Chips 28 Symposium (HCS), 2016 IEEE. IEEE, 2016.
-* LLVM: lib/Target/AArch64/AArch64SchedExynosM1.td
codename	Mongoose 1
core count	4
designer	Samsung
first launched	November 12, 2015
full page name	samsung/microarchitectures/m1
instance of	microarchitecture
instruction set architecture	ARMv8
manufacturer	Samsung
microarchitecture type	CPU
name	Mongoose 1
phase-out	2017
pipeline stages	14
process	14 nm (0.014 μm, 1.4e-5 mm)