From WikiChip
Editing samsung/microarchitectures/m3
Latest revision | Your text | ||
Line 1: | Line 1: | ||
− | {{samsung title| | + | {{samsung title|Mongoose 3 (M3)|arch}} |
{{microarchitecture | {{microarchitecture | ||
|atype=CPU | |atype=CPU | ||
− | |name= | + | |name=Mongoose 3 |
|designer=Samsung | |designer=Samsung | ||
|manufacturer=Samsung | |manufacturer=Samsung | ||
Line 8: | Line 8: | ||
|process=10 nm | |process=10 nm | ||
|cores=4 | |cores=4 | ||
− | |||
− | |||
|oooe=Yes | |oooe=Yes | ||
|speculative=Yes | |speculative=Yes | ||
|renaming=Yes | |renaming=Yes | ||
− | |||
|decode=6-way | |decode=6-way | ||
|isa=ARMv8 | |isa=ARMv8 | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
|predecessor=M2 | |predecessor=M2 | ||
|predecessor link=samsung/microarchitectures/m2 | |predecessor link=samsung/microarchitectures/m2 | ||
Line 33: | Line 18: | ||
|successor link=samsung/microarchitectures/m4 | |successor link=samsung/microarchitectures/m4 | ||
}} | }} | ||
− | ''' | + | '''Mongoose 3''' ('''M3''') is the successor to the {{\\|Mongoose 2}}, a [[10 nm]] [[ARM]] microarchitecture designed by [[Samsung]] for their consumer electronics. |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
== Process Technology == | == Process Technology == | ||
Line 92: | Line 69: | ||
****** crypto EU, simple vector EU, vector shuffle/shift/mul, new FP store, new FP conversion | ****** crypto EU, simple vector EU, vector shuffle/shift/mul, new FP store, new FP conversion | ||
** Memory subsystem | ** Memory subsystem | ||
− | *** | + | *** 4x Larger L2 BTB (4k-entry, up from 1k-entry) |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
*** New L3 Cache | *** New L3 Cache | ||
− | **** 4 MiB | + | **** 4 MiB |
*** Double bandwidth (32B (2x16B)/cycle from 16B/cycle) | *** Double bandwidth (32B (2x16B)/cycle from 16B/cycle) | ||
**** fast paired 128-bit loads and stores | **** fast paired 128-bit loads and stores | ||
* branch misprediction penalty increased (16 cycles, from 14) | * branch misprediction penalty increased (16 cycles, from 14) | ||
+ | |||
+ | {{expand list}} | ||
=== Block Diagram === | === Block Diagram === | ||
− | |||
− | |||
==== Individual Core ==== | ==== Individual Core ==== | ||
[[File:mongoose 3 block diagram.svg|900px]] | [[File:mongoose 3 block diagram.svg|900px]] | ||
=== Memory Hierarchy === | === Memory Hierarchy === | ||
− | + | {{empty section}} | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
== Overview == | == Overview == | ||
Line 162: | Line 90: | ||
== Core == | == Core == | ||
The M3 is an [[out-of-order]] microprocessor with a 6-way decode and a 12-way dispatch. | The M3 is an [[out-of-order]] microprocessor with a 6-way decode and a 12-way dispatch. | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
=== Front end === | === Front end === | ||
Line 179: | Line 96: | ||
==== Fetch & pre-decoding ==== | ==== Fetch & pre-decoding ==== | ||
− | With the help of the [[branch predictor]], the instructions should already be found in the [[level 1 instruction cache]]. The L1I cache is 64 KiB, 4-way [[set associative]]. Samsung kept the L1I cache the same as prior generations. The L1I cache has its own [[iTLB]] consisting of 512 entries, double the prior generation. A large change in the M3 is the instruction fetch bandwidth. Previously, up to 24 bytes could be read each cycle into the [[instruction queue]]. In the M3, 48 bytes (up to 12 [[ARM]] instructions) are now read each cycle into the [[instruction queue]], which allows the core to hide very short [[branch bubbles]] and deliver a large number of instructions to be decoded by a larger decoder | + | With the help of the [[branch predictor]], the instructions should already be found in the [[level 1 instruction cache]]. The L1I cache is 64 KiB, 4-way [[set associative]]. Samsung kept the L1I cache the same as prior generations. The L1I cache has its own [[iTLB]] consisting of 512 entries, double the prior generation. A large change in the M3 is the instruction fetch bandwidth. Previously, up to 24 bytes could be read each cycle into the [[instruction queue]]. In the M3, 48 bytes (up to 12 [[ARM]] instructions) are now read each cycle into the [[instruction queue]], which allows the core to hide very short [[branch bubbles]] and deliver a large number of instructions to be decoded by a larger decoder. The [[instruction queue]] is a slightly more complex component than a simple buffer. The byte stream gets split up into the [[ARM]] instructions it is made of, including dealing with the various misaligned ARM instructions such as in the case of {{arm|thumb|thumb mode}}. If the queue is filled to capacity, the fetch is gated for a cycle in order to allow the queue to drain naturally. |
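The relationship between fetch width, decode width and fetch gating described above can be illustrated with a toy model. This is a minimal sketch, not a description of Samsung's implementation: the queue capacity (<code>QUEUE_CAPACITY</code>), the rule that fetch is gated whenever a full fetch group would not fit, and the function name <code>simulate_fetch_queue</code> are all assumptions made for illustration. Only the 48-byte fetch and the 6-way decode come from the text, and fixed 4-byte instructions are assumed.

```python
# Toy model of the fetch -> instruction queue -> decode path.
# From the article: 48 bytes fetched per cycle, 6-way decode.
# Assumed for illustration: fixed 4-byte instructions, a 40-entry queue,
# and gating whenever a whole fetch group would not fit.
INSTRUCTION_BYTES = 4
FETCH_WIDTH = 48 // INSTRUCTION_BYTES  # up to 12 instructions per cycle
DECODE_WIDTH = 6
QUEUE_CAPACITY = 40  # hypothetical, not a documented M3 figure


def simulate_fetch_queue(cycles, capacity=QUEUE_CAPACITY):
    """Return (final queue occupancy, number of cycles fetch was gated)."""
    occupancy = 0
    gated_cycles = 0
    for _ in range(cycles):
        # The decoder drains up to DECODE_WIDTH instructions first.
        occupancy -= min(occupancy, DECODE_WIDTH)
        if occupancy + FETCH_WIDTH > capacity:
            # No room for a full fetch group: gate fetch for this cycle.
            gated_cycles += 1
        else:
            occupancy += FETCH_WIDTH
    return occupancy, gated_cycles


occupancy, gated = simulate_fetch_queue(10)
print(f"after 10 cycles: {occupancy} queued, fetch gated {gated} times")
```

Because fetch can supply twice as many instructions as decode consumes, the queue fills within a few cycles in this model, and the surplus is what lets the front end absorb short branch bubbles without starving the decoder.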
===== Branch Predictor ===== | ===== Branch Predictor ===== | ||
Line 189: | Line 106: | ||
The branch predictor feeds a decoupled instruction address queue which in turn feeds the instruction cache. | The branch predictor feeds a decoupled instruction address queue which in turn feeds the instruction cache. | ||
− | ==== Decoding ==== | + | ===== Decoding ===== |
[[File:m3 decode.svg|thumb|right|M3 features a 6-way decoder.]] | [[File:m3 decode.svg|thumb|right|M3 features a 6-way decoder.]] | ||
From the [[instruction queue]] the instructions are sent to decode. The decode unit on the M3 was increased significantly to 6-way (from 4), allowing up to six instructions to be decoded each cycle. The decoder can handle both the [[ARM]] {{arm|AArch64}} and {{arm|AArch32}} instructions. All in all, up to six µOPs are decoded and sent to the [[re-order buffer]] each cycle. One of the new features on the M3 is the introduction of some initial [[fusion idioms]] support which allows the [[decoder]] to [[decode]] two instructions and, if they meet certain criteria, [[µOP fusion|fuse]] them into a single µOP which remains that way for the remainder of the pipeline, freeing up various resources that would otherwise require two entries. | From the [[instruction queue]] the instructions are sent to decode. The decode unit on the M3 was increased significantly to 6-way (from 4), allowing up to six instructions to be decoded each cycle. The decoder can handle both the [[ARM]] {{arm|AArch64}} and {{arm|AArch32}} instructions. All in all, up to six µOPs are decoded and sent to the [[re-order buffer]] each cycle. One of the new features on the M3 is the introduction of some initial [[fusion idioms]] support which allows the [[decoder]] to [[decode]] two instructions and, if they meet certain criteria, [[µOP fusion|fuse]] them into a single µOP which remains that way for the remainder of the pipeline, freeing up various resources that would otherwise require two entries. | ||
Line 195: | Line 112: | ||
For some complex ARM instructions, such as the {{arm|ARMv7}} load/store multiple instructions, which result in multiple µOPs being emitted, the M3 has a side micro-sequencer that is invoked to emit the appropriate µOPs. | For some complex ARM instructions, such as the {{arm|ARMv7}} load/store multiple instructions, which result in multiple µOPs being emitted, the M3 has a side micro-sequencer that is invoked to emit the appropriate µOPs. | ||
− | === Execution engine === | + | ==== Execution engine ==== |
− | ==== Renaming & Allocation ==== | + | ===== Renaming & Allocation ===== |
[[File:m3 rob.svg|thumb|right|M3 ROB|450px]] | [[File:m3 rob.svg|thumb|right|M3 ROB|450px]] | ||
As with the [[instruction decode|decode]], up to six µOPs can be renamed each cycle. This is up from four µOPs in all the prior generations. For some special cases such as in {{arm|ARMv7}} where four single-precision registers can alias into a single quad register or a pair of doubles, the M3 has special logic in the rename area to handle the renaming of those [[floating point]] µOPs. In the case of a branch misprediction, the M3 can perform a fast map recovery as its recovery mechanism. | As with the [[instruction decode|decode]], up to six µOPs can be renamed each cycle. This is up from four µOPs in all the prior generations. For some special cases such as in {{arm|ARMv7}} where four single-precision registers can alias into a single quad register or a pair of doubles, the M3 has special logic in the rename area to handle the renaming of those [[floating point]] µOPs. In the case of a branch misprediction, the M3 can perform a fast map recovery as its recovery mechanism. | ||
Line 202: | Line 119: | ||
Matching the wider pipeline, the [[re-order buffer]] has been substantially increased in size to 228 µOPs that can be in flight. The ROB feeds the [[dispatch queue]] at the rate of up to 6 µOPs per cycle. | Matching the wider pipeline, the [[re-order buffer]] has been substantially increased in size to 228 µOPs that can be in flight. The ROB feeds the [[dispatch queue]] at the rate of up to 6 µOPs per cycle. | ||
− | ==== Dispatch ==== | + | ===== Dispatch ===== |
From the dispatch queue, up to 9 µOPs may be issued to the integer cluster and up to 3 µOPs may be issued to the floating point cluster. This is a large change from the M1 and M2 where up to 7 and 2 micro-operations could be sent to the integer and floating point clusters respectively. | From the dispatch queue, up to 9 µOPs may be issued to the integer cluster and up to 3 µOPs may be issued to the floating point cluster. This is a large change from the M1 and M2 where up to 7 and 2 micro-operations could be sent to the integer and floating point clusters respectively. | ||
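The per-cluster figures above add up to the 12-way dispatch quoted for the core. The short sketch below simply records the numbers from the text and sums them; the <code>DispatchWidth</code> class and its field names are illustrative and do not correspond to any Samsung or LLVM data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchWidth:
    """Maximum µOPs issued per cycle to each execution cluster."""
    integer: int
    floating_point: int

    @property
    def total(self) -> int:
        # Overall dispatch width is the sum of both clusters.
        return self.integer + self.floating_point


# Figures as stated in the text for each generation.
M1_M2 = DispatchWidth(integer=7, floating_point=2)
M3 = DispatchWidth(integer=9, floating_point=3)

print(f"M1/M2: {M1_M2.total}-way dispatch, M3: {M3.total}-way dispatch")
```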
− | ==== Integer cluster ==== | + | ===== Integer cluster ===== |
[[File:m3 integer scheduler.svg|thumb|right|500px]] | [[File:m3 integer scheduler.svg|thumb|right|500px]] | ||
− | The [[dispatch queue]] feeds the execution units. The M3 doubles the [[physical register file]]s. For integers, there is now a 192-entry integer [[physical register file]] (roughly 35-36 of them are architected) which means data movement is not necessary. In the integer execution cluster, up to 9 µOPs per cycle may be dispatched to the [[schedulers]]. The schedulers are distributed across the various pipes. M3 more than doubled the depth of those schedulers. In total | + | The [[dispatch queue]] feeds the execution units. The M3 doubles the [[physical register file]]s. For integers, there is now a 192-entry integer [[physical register file]] (roughly 35-36 of them are architected) which means data movement is not necessary. In the integer execution cluster, up to 9 µOPs per cycle may be dispatched to the [[schedulers]]. The schedulers are distributed across the various pipes. M3 more than doubled the depth of those schedulers. In total, the integer schedulers now have 126 entries and those entries are distributed in mixed sizes across the 9 schedulers. |
− | For the first pipe the M3, like its predecessors, has a [[branch resolution]] unit. The next four pipes have integer ALUs. Whereas in the prior design there was one complex ALU pipe and three simple [[ALU]] pipes, in the M3 the newly added pipe is also a [[complex ALU]]. In other words, while all four pipes are capable of | + | For the first pipe the M3, like its predecessors, has a [[branch resolution]] unit. The next four pipes have integer ALUs. Whereas in the prior design there was one complex ALU pipe and three simple [[ALU]] pipes, in the M3 the newly added pipe is also a [[complex ALU]]. In other words, while all four pipes are capable of executing the typical integer ALU µOPs (i.e., two-source µOPs), only the ALUCs (complex ALUs) can also execute three-source µOPs. This includes some of the {{arm|ARMv7}} special predicate forms. Generally speaking, most of the simple classes of instructions (e.g., normal add) should take a single cycle while the more complex operations (e.g., add with [[barrel shift]] rotate) would take a cycle or two more. Compared to the M1, Samsung was able to reduce the latency for some of the shift+add and similar µOPs down from two cycles to just one. For a few special cases, Samsung was also able to reduce the latency down to zero cycles. |
For the integer [[divider]] unit, the M3 implements a radix-16 (4 bits/cycle) divider, halving the latency in the iterative portion from the prior generation which implemented a radix-4 (2 bits/cycle) divider unit. | For the integer [[divider]] unit, the M3 implements a radix-16 (4 bits/cycle) divider, halving the latency in the iterative portion from the prior generation which implemented a radix-4 (2 bits/cycle) divider unit. | ||
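The halving follows directly from the number of quotient bits retired per cycle. The sketch below is illustrative arithmetic only: it assumes a divider that always iterates over the full operand width, with no early termination and no fixed setup or rounding cycles, none of which is stated in the source. The function name <code>iterative_cycles</code> is hypothetical.

```python
def iterative_cycles(operand_bits: int, radix: int) -> int:
    """Cycles spent in the iterative portion of a radix-2^k divider.

    Assumes the full operand width is always processed (no early exit);
    real dividers typically finish sooner for small quotients.
    """
    if radix < 2 or radix & (radix - 1) != 0:
        raise ValueError("radix must be a power of two greater than 1")
    bits_per_cycle = radix.bit_length() - 1  # radix 4 -> 2, radix 16 -> 4
    # Ceiling division: a partial final group still costs a whole cycle.
    return -(-operand_bits // bits_per_cycle)


for bits in (32, 64):
    radix4 = iterative_cycles(bits, 4)
    radix16 = iterative_cycles(bits, 16)
    print(f"{bits}-bit divide: radix-4 {radix4} cycles, radix-16 {radix16} cycles")
```

Under these assumptions a 64-bit divide needs 32 iterations at radix 4 and 16 at radix 16, which matches the halving described above.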
Line 215: | Line 132: | ||
The last four pipes are for the [[AGU]]s and store data execution units (discussed later). | The last four pipes are for the [[AGU]]s and store data execution units (discussed later). | ||
− | ==== Floating-point cluster ==== | + | ===== Floating-point cluster ===== |
− | [[File:m3 fp scheduler.svg|thumb|right| | + | [[File:m3 fp scheduler.svg|thumb|right|500px]] |
− | On the floating point cluster side, the M3 adds another FP pipe and nearly doubles the scheduler. Here, up to 3 µOPs may be issued to a unified scheduler which consists of 62 entries. Like the integer side, the FP PRF has also doubled in capacity with a 192-entry floating point (vector) [[physical register file]] (roughly 35-36 architected). There are three pipes and all three pipes have an integer SIMD unit ({{arm|NEON}}). | + | On the floating point cluster side, the M3 adds another FP pipe and nearly doubles the scheduler. Here, up to 3 µOPs may be issued to a unified scheduler which consists of 62 entries. Like the integer side, the FP PRF has also doubled in capacity with a 192-entry floating point (vector) [[physical register file]] (roughly 35-36 architected). There are three pipes and all three pipes have an integer SIMD unit ({{arm|NEON}}). The first pipe features a 4-cycle [[FMAC]] and a 3-cycle multiplier while the second pipe incorporates a 2-cycle [[floating-point adder]]. In all three units, Samsung reduced the latency by one cycle (from 5, 4, and 3 cycle latencies respectively). Overall, all three pipes are fuller and more capable. All three pipes have FMAC and FADD units, doubling the [[FLOPS]] of the prior design. Additionally, there is a second pipe with a [[cryptography]] floating point conversion unit. There are now two pipes that incorporate a floating point store unit. |
− | |||
− | The first pipe features a 4-cycle [[FMAC]] and a 3-cycle multiplier while the second pipe incorporates a 2-cycle [[floating-point adder]]. In all three units, Samsung reduced the latency by one cycle (from 5, 4, and 3 cycle latencies respectively). Overall, all three pipes are fuller and more capable. All three pipes have FMAC and FADD units, doubling the [[FLOPS]] of the prior design. Additionally, there is a second pipe with a [[cryptography]] floating point conversion unit. | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | ==== Retirement ==== | + | ===== Retirement ===== |
Once execution is complete, µOPs may retire at a rate of up to 6 µOPs per cycle. | Once execution is complete, µOPs may retire at a rate of up to 6 µOPs per cycle. | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
== All M3 Processors == | == All M3 Processors == | ||
Line 281: | Line 150: | ||
{{comp table start}} | {{comp table start}} | ||
<table class="comptable sortable tc5 tc6 tc7"> | <table class="comptable sortable tc5 tc6 tc7"> | ||
− | {{comp table header|main| | + | {{comp table header|main|8:List of M3-based Processors}} |
− | {{comp table header|main| | + | {{comp table header|main|6:Main processor|2:Integrated Graphics}} |
− | {{comp table header|cols|Family|Launched|Arch|Cores|%Frequency|GPU|%Frequency}} | + | {{comp table header|cols|Family|Launched|Arch|Cores|%Frequency|%Turbo|GPU|%Frequency}} |
{{#ask: [[Category:microprocessor models by samsung]] [[microarchitecture::Mongoose 3]] | {{#ask: [[Category:microprocessor models by samsung]] [[microarchitecture::Mongoose 3]] | ||
|?full page name | |?full page name | ||
Line 292: | Line 161: | ||
|?core count | |?core count | ||
|?base frequency#GHz | |?base frequency#GHz | ||
+ | |?turbo frequency (1 core)#GHz | ||
|?integrated gpu | |?integrated gpu | ||
|?integrated gpu base frequency | |?integrated gpu base frequency | ||
|format=template | |format=template | ||
|template=proc table 3 | |template=proc table 3 | ||
− | |userparam= | + | |userparam=10 |
|mainlabel=- | |mainlabel=- | ||
|valuesep=, | |valuesep=, | ||
Line 303: | Line 173: | ||
</table> | </table> | ||
{{comp table end}} | {{comp table end}} | ||
+ | |||
== Bibliography == | == Bibliography == | ||
− | * {{ | + | * {{hcbib|30}} |
* LLVM: lib/Target/AArch64/AArch64SchedExynosM3.td | * LLVM: lib/Target/AArch64/AArch64SchedExynosM3.td |
Facts about "Exynos M3 - Microarchitectures - Samsung"
codename | Meerkat + |
core count | 4 + |
designer | Samsung + |
first launched | 2018 + |
full page name | samsung/microarchitectures/m3 + |
instance of | microarchitecture + |
instruction set architecture | ARMv8 + |
manufacturer | Samsung + |
microarchitecture type | CPU + |
name | Meerkat + |
pipeline stages | 16 + |
process | 10 nm (0.01 μm, 1.0e-5 mm) + |