{{microarchitecture
|atype=CPU
|name=Meerkat
|designer=Samsung
|manufacturer=Samsung
|successor link=samsung/microarchitectures/m4
}}
'''Exynos M3''' ('''Meerkat''') is a [[10 nm]] [[ARM]] microarchitecture designed by [[Samsung]] for their consumer electronics. It is the successor to the {{\\|Mongoose 2}}.
  
 
== History ==
=== Block Diagram ===
==== Core Cluster Overview ====
 
[[File:m3 soc block diagram.svg|700px]]
 
 
==== Individual Core ====
 
[[File:mongoose 3 block diagram.svg|900px]]
=== Memory Hierarchy ===
 
* Cache
** L1I Cache
*** 64 KiB, 4-way set associative
**** 128 B line size
*** 80 outstanding transactions
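From those parameters the rest of the L1I geometry follows directly. The following C sketch works through the arithmetic; the helper name is illustrative, not something from Samsung's documentation:

<source lang="c">
#include <stdio.h>

/* Set count of a set-associative cache from its geometry. */
static unsigned cache_sets(unsigned size_bytes, unsigned ways, unsigned line_bytes)
{
    return size_bytes / (ways * line_bytes);
}

int main(void)
{
    /* L1I figures from the list above: 64 KiB, 4-way, 128 B lines. */
    unsigned sets = cache_sets(64 * 1024, 4, 128);
    printf("L1I: %u sets, %u KiB per way\n", sets, sets * 128 / 1024);
    /* 128 sets x 128 B = 16 KiB per way, i.e. larger than a 4 KiB page,
       so the cache index needs bits beyond the page offset. */
    return 0;
}
</source>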
  
The M3 TLB hierarchy consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).

* TLBs
==== Fetch & pre-decoding ====
With the help of the [[branch predictor]], the instructions should already be found in the [[level 1 instruction cache]]. The L1I cache is 64 KiB, 4-way [[set associative]]; Samsung kept it the same size as in prior generations. The L1I cache has its own [[iTLB]] consisting of 512 entries, double the prior generation. A large change in the M3 is the instruction fetch bandwidth. Previously, up to 24 bytes could be read each cycle into the [[instruction queue]]. In the M3, 48 bytes (up to 12 [[ARM]] instructions) are read each cycle into the [[instruction queue]], which allows the core to hide very short [[branch bubbles]] and deliver a large number of instructions to the wider decode stage. With up to 12 instructions fetched at a time, the M3 effectively fetches at twice the decode rate, usually more than enough to reach the next branch. The [[instruction queue]] is a slightly more complex component than a simple buffer. The byte stream gets split up into the individual [[ARM]] instructions it is made of, including dealing with the various misaligned ARM instructions such as in the case of {{arm|thumb|thumb mode}}. If the queue is filled to capacity, the fetch is clock gated for a cycle or two in order to allow the queue to naturally drain.
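The queue's fill-and-drain behavior can be pictured as a bounded FIFO with backpressure on fetch. The C sketch below is a conceptual model only: the queue capacity is an illustrative assumption (not a disclosed figure), while the 48-byte fetch width and the 2:1 fetch-to-decode ratio come from the description above:

<source lang="c">
#include <stdbool.h>
#include <string.h>

#define IQ_CAPACITY 128  /* assumed capacity, for illustration only */
#define FETCH_BYTES  48  /* M3 fetch width per cycle */
#define DECODE_BYTES 24  /* half the fetch rate, per the 2:1 ratio above */

struct instr_queue {
    unsigned char buf[IQ_CAPACITY];
    unsigned used;                 /* bytes currently buffered */
};

/* One fetch cycle: deliver up to 48 B, unless the queue lacks room, in
   which case fetch is gated this cycle so that decode can drain it. */
static bool fetch_cycle(struct instr_queue *iq, const unsigned char *bytes)
{
    if (iq->used + FETCH_BYTES > IQ_CAPACITY)
        return false;              /* clock gated: no fetch this cycle */
    memcpy(iq->buf + iq->used, bytes, FETCH_BYTES);
    iq->used += FETCH_BYTES;
    return true;
}

/* One decode cycle: consume up to 24 B from the head of the queue. */
static void decode_cycle(struct instr_queue *iq)
{
    unsigned take = iq->used < DECODE_BYTES ? iq->used : DECODE_BYTES;
    memmove(iq->buf, iq->buf + take, iq->used - take);
    iq->used -= take;
}
</source>

Since fetch fills at twice the rate decode drains, the queue fills up after a few straight-line cycles and fetch idles, which is the drain behavior described above.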
  
 
===== Branch Predictor =====
 
In the prior generations, the M1 was capable of a single 128-bit load and a single 128-bit store each cycle. The M3 supports two 128-bit loads and one 128-bit store per cycle, and all three operations can be done in the same cycle. With the [[floating-point]] stores in [[ARM]], the M3 can match load and store bandwidth in many copy scenarios. Despite doubling the cache size, the level 1 data cache still maintains a 4-cycle load latency and can support 12 outstanding misses to the [[L2]] hierarchy. Additionally, the M3 LSU schedulers are larger and the store buffer was doubled in capacity.
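The 4-cycle load-to-use latency is the sort of figure a dependent-load microbenchmark exposes. Below is a generic C sketch of the standard pointer-chasing technique (not a Samsung tool); because every load's address depends on the previous load's result, the loop runs at one load latency per iteration once the working set sits in the L1D:

<source lang="c">
#include <stddef.h>

static void *slots[1024];          /* 8 KiB of pointers: L1D-resident */

/* Link the slots into a simple ring. */
static void init_chain(void)
{
    for (size_t i = 0; i < 1023; i++)
        slots[i] = &slots[i + 1];
    slots[1023] = &slots[0];
}

/* Walk the ring: serialized, latency-bound loads. */
static void *chase(void *p, long iters)
{
    while (iters-- > 0)
        p = *(void **)p;           /* next address comes from this load */
    return p;                      /* return p so the loop isn't elided */
}
</source>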
  
The M3 has a [[multi-stride]] prefetcher which allows it to detect access patterns and start fetch requests ahead of execution. In the M3, Samsung also added new stream and copy optimizations which accelerate certain observable traffic patterns. For cases such as <code>{{c|memcpy|memcpy()}}</code>, where you want an even number of loads and stores, the M3 can do 2x128-bit loads and stores. In the case of load-pair quads, operations are cracked into two 128-bit loads. On the store side, the single store [[AGU]] can be used for both pairs (calculating a single address) and the store data unit can be doubled up. New patterns were also added to the prefetcher in order to address additional scenarios.
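A plain 128-bit copy loop shows the even load/store mix being described. This is a generic C sketch using NEON intrinsics, not Samsung-provided code; compilers typically emit the two adjacent 128-bit accesses as the load-pair/store-pair forms that the M3 cracks and coalesces as above:

<source lang="c">
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Copy 32 B per iteration: two 128-bit loads and two 128-bit stores.
   Sizes not a multiple of 32 would need a scalar tail (omitted). */
static void copy_q2(uint8_t *restrict dst, const uint8_t *restrict src, size_t n)
{
    for (size_t i = 0; i + 32 <= n; i += 32) {
        uint8x16_t a = vld1q_u8(src + i);        /* likely fused into LDP */
        uint8x16_t b = vld1q_u8(src + i + 16);
        vst1q_u8(dst + i, a);                    /* likely fused into STP */
        vst1q_u8(dst + i + 16, b);
    }
}
</source>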
  
 
As with the M1, the [[dTLB]] is still 32 entries, which remains considerably smaller than the 512-entry [[iTLB]]. The reason is similar to that of the M1: the front-end is designed with much more room for handling a larger TLB capacity natively in its pipeline, and it is also physically laid out much further away on the floor plan. This allows the L2 TLB to service the dTLB more aggressively. The TLB hierarchy on the M3 has been slightly remapped. In addition to the 32-entry primary dTLB, there is a new mid-level dTLB that is 512 entries deep. The second-level STLB has also quadrupled in size, from 1K entries in the prior generation to 4K on the M3.
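In terms of reach, those entry counts translate directly into covered address space. A quick calculation, assuming 4 KiB pages (an assumption here; larger page sizes scale the reach proportionally):

<source lang="c">
#include <stdio.h>

int main(void)
{
    const unsigned page = 4096;  /* assuming 4 KiB pages */
    printf("L1 dTLB reach:        %u KiB\n", 32 * page / 1024);           /* 128 KiB */
    printf("mid-level dTLB reach: %u MiB\n", 512 * page / (1024 * 1024)); /* 2 MiB   */
    printf("STLB reach:           %u MiB\n", 4096 * page / (1024 * 1024)); /* 16 MiB */
    return 0;
}
</source>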
:[[File:m3 core cluster floorplan.png|class=wikichip_ogimage|800px]]
== Bibliography ==
* {{bib|hc|30|Samsung}}
* LLVM: lib/Target/AArch64/AArch64SchedExynosM3.td
