The MLP microarchitecture is sold as [[synthesizable]] [[RTL]] IP in various SKUs under the {{armh|Ethos|Ethos family}}. Those SKUs come preconfigured with a fixed number of compute engines (CEs) and fixed cache sizes, designed to meet certain performance levels and power envelopes.
  
The MLP is fully statically scheduled by the compiler, which takes a given [[neural network]] and maps it to a command stream. The toolchain also performs a number of additional optimizations ahead of time, including compression (weights and feature maps are loaded into memory and into the MLP SRAM banks already compressed) and tiling. The command stream includes the necessary [[DMA]] operations, such as block fetches, along with the accompanying compute operations. At a very high level, the MLP itself comprises a [[DMA engine]], a network control unit (NCU), and a configurable number of compute engines (CEs). At runtime, the host processor loads the command stream onto the control unit, which parses the stream and executes its operations by controlling the various functional blocks. The DMA engine can talk to external memory while being aware of the various supported neural network layouts, allowing it to handle strides and other predictable NN memory access patterns and to fetch data for compute ahead of time. The compute engines are the workhorse of the system. Each CE comprises two functional blocks: the MAC Compute Engine (MCE) and the Programmable Layer Engine (PLE). The MCE performs fixed-function multiply-accumulate operations efficiently on 8-bit integers, while the PLE is a more flexible programmable processor that supports vector operations and can implement more complex or less common operations. This design relies on careful co-design with the compiler, but it yields simpler hardware and more deterministic performance characteristics.
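As a rough illustration of this execution model (not Arm's actual command-stream format; the command kinds, fields, and encodings below are hypothetical), a statically scheduled stream can be thought of as a flat array of tagged commands that a simple control-unit loop dispatches in order:

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

/* Hypothetical command kinds -- illustrative only. */
enum cmd_kind { CMD_DMA_FETCH, CMD_MCE_CONV, CMD_PLE_OP, CMD_END };

struct cmd {
    enum cmd_kind kind;
    uint32_t arg0;   /* e.g., external address or layer id */
    uint32_t arg1;   /* e.g., SRAM bank offset             */
    uint32_t len;    /* e.g., block length in bytes        */
};

/* The "NCU": walk the precompiled stream and dispatch each command
 * to the matching functional block, strictly in order. */
static void run_stream(const struct cmd *stream)
{
    for (const struct cmd *c = stream; c->kind != CMD_END; ++c) {
        switch (c->kind) {
        case CMD_DMA_FETCH:
            printf("DMA: fetch %u bytes from %#x into SRAM offset %#x\n",
                   (unsigned)c->len, (unsigned)c->arg0, (unsigned)c->arg1);
            break;
        case CMD_MCE_CONV:
            printf("MCE: 8-bit multiply-accumulate pass, layer %u\n",
                   (unsigned)c->arg0);
            break;
        case CMD_PLE_OP:
            printf("PLE: post-processing op %u\n", (unsigned)c->arg0);
            break;
        default:
            break;
        }
    }
}

int main(void)
{
    /* A compiler-produced stream: fetch a block, convolve, post-process. */
    const struct cmd stream[] = {
        { CMD_DMA_FETCH, 0x80000000u, 0x0000u, 4096u },
        { CMD_MCE_CONV,  0u, 0u, 0u },
        { CMD_PLE_OP,    1u, 0u, 0u },
        { CMD_END,       0u, 0u, 0u },
    };
    run_stream(stream);
    return 0;
}
</syntaxhighlight>

Because every ordering decision was made ahead of time by the compiler, the runtime loop contains no scheduling logic at all, which is what makes the performance characteristics deterministic.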
  
 
== Compute Engine (CE) ==

=== Programmable Layer Engine (PLE) ===
[[File:mlp ple overview.svg|right|thumb|400px|PLE]]
From the MCE, 8-bit results are fed into the PLE vector register file. The programmable layer engine (PLE) is designed for post-processing as well as for the flexibility to implement custom functionality. Once values arrive at the register file, the CPU kicks off the vector engine to perform the appropriate operations on the register file. The PLE features an Arm {{armh|Cortex|Cortex-M}} CPU along with a 16-lane vector engine supporting vector and neural network extensions, allowing it to carry out non-convolution operations more efficiently.
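To illustrate the kind of elementwise, non-convolution work the PLE is suited to, the sketch below applies a ReLU across one 16-lane register of 8-bit values. The lane count matches the vector engine described above; the type names and the scalar loop are illustrative only (on real hardware this would map to vector instructions rather than a loop):

<syntaxhighlight lang="c">
#include <stdint.h>

#define PLE_LANES 16  /* matches the 16-lane vector engine */

/* Hypothetical view of one PLE vector register: 16 x 8-bit lanes. */
typedef struct { int8_t lane[PLE_LANES]; } vreg16x8;

/* ReLU across all lanes -- typical post-processing after an MCE pass. */
static vreg16x8 ple_relu(vreg16x8 v)
{
    for (int i = 0; i < PLE_LANES; ++i)
        if (v.lane[i] < 0)
            v.lane[i] = 0;
    return v;
}

int main(void)
{
    vreg16x8 r = {{ -3, 7, -1, 0, 5, -9, 2, 4, -6, 8, 1, -2, 3, -7, 6, -4 }};
    r = ple_relu(r);   /* all negative lanes are clamped to zero */
    return r.lane[0];  /* 0 after ReLU */
}
</syntaxhighlight>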
 
 
 
Following processing, the final results are DMA'ed back to the main SRAM bank. It is also worth noting that the data need not come from the MCE alone; for various operations, the PLE can fetch its data directly from the main SRAM bank.
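A minimal sketch of these data paths, assuming a flat, byte-addressed SRAM model; the function names and memcpy-based transfers are hypothetical stand-ins for the hardware's actual DMA and register-file movement:

<syntaxhighlight lang="c">
#include <stdint.h>
#include <string.h>

#define PLE_LANES 16

static uint8_t main_sram[64 * 1024];  /* stand-in for the main SRAM bank */

/* Path 1: results forwarded from the MCE into the vector register file. */
static void ple_load_from_mce(int8_t dst[PLE_LANES],
                              const int8_t mce_out[PLE_LANES])
{
    memcpy(dst, mce_out, PLE_LANES);
}

/* Path 2: the PLE fetches operands directly from the main SRAM bank,
 * bypassing the MCE, for operations that are not convolutions. */
static void ple_load_from_sram(int8_t dst[PLE_LANES], size_t offset)
{
    memcpy(dst, &main_sram[offset], PLE_LANES);
}

/* After processing, final results are written (DMA'ed) back to SRAM. */
static void ple_store_to_sram(size_t offset, const int8_t src[PLE_LANES])
{
    memcpy(&main_sram[offset], src, PLE_LANES);
}

int main(void)
{
    int8_t mce_out[PLE_LANES] = {0};
    int8_t regs[PLE_LANES];

    ple_load_from_mce(regs, mce_out);  /* path 1: forwarded MCE results */
    ple_load_from_sram(regs, 0x100);   /* path 2: direct SRAM fetch     */
    ple_store_to_sram(0x200, regs);    /* write-back of final results   */
    return 0;
}
</syntaxhighlight>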
 
  
 
== Die ==

{| class="wikitable"
|+ Facts
|-
! Codename
| MLP
|-
! Designer
| Arm Holdings
|-
! First launched
| 2018
|-
! Manufacturer
| TSMC, Samsung, UMC
|-
! Process
| 16 nm, 7 nm
|-
! Processing element count
| 4, 8, 12, 16
|}