== Overview ==

NVDLA is a microarchitecture designed by [[Nvidia]] for the acceleration of deep learning workloads. Since the original implementation targeted Nvidia's own {{nvidia|Xavier}} SoC, the architecture is specifically optimized for [[convolutional neural networks]] (CNNs), as the main workloads involve images and video, although other [[neural networks|networks]] are also supported. NVDLA primarily targets edge devices, IoT applications, and other low-power inference designs.
  
 
At a high level, NVDLA stores both the input activations and the weights in a convolutional [[buffer]]. Both are fed into a convolution core, which consists of a large array of [[multiply-accumulate]] units. The final result is sent to a post-processing unit, which writes it back to memory. The processing elements are encapsulated by control logic as well as a memory interface ([[DMA]]).
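As a rough illustration of this dataflow, the sketch below models a single pass in NumPy. It is not NVDLA's RTL or programming interface; all names, shapes, and the post-processing step (bias add plus ReLU) are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative model of the NVDLA dataflow: activations and weights are
# staged in an on-chip convolution buffer, a MAC array produces the sums,
# and a post-processing step writes the result back to memory.
def conv_pass(activations, weights, bias):
    # "Convolution buffer": both operands staged on-chip before compute.
    conv_buffer = {"act": activations, "wt": weights}

    C, H, W = conv_buffer["act"].shape    # input channels, height, width
    K, _, R, S = conv_buffer["wt"].shape  # kernels, channels, kernel height/width
    out = np.zeros((K, H - R + 1, W - S + 1))

    # MAC array: every output point is one large multiply-accumulate reduction.
    for k in range(K):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[k, y, x] = np.sum(
                    conv_buffer["act"][:, y:y + R, x:x + S] * conv_buffer["wt"][k]
                )

    # Post-processing unit (here: bias + ReLU), then "write back to memory".
    return np.maximum(out + bias[:, None, None], 0.0)

result = conv_pass(np.random.rand(8, 16, 16), np.random.rand(4, 8, 3, 3), np.zeros(4))
</syntaxhighlight>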
=== Convolution Core ===
 
For the convolution core, there is usually one input activation along with a set of kernels. As with the memory interface, the number of pixels taken from the input is parameterizable, along with the number of kernels. Typically, a strip of 16-32 outputs is calculated at a time. In order to save power, the weights of the MACs remain constant for a number of cycles, which also helps reduce data transfers.
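The weight-stationary behavior can be sketched as follows (illustrative Python with an assumed strip size, not the hardware pipeline): the kernel weights are loaded once and held fixed while a whole strip of outputs is computed, so the weight inputs of the MACs do not toggle between cycles.

<syntaxhighlight lang="python">
import numpy as np

STRIP = 16  # outputs computed per pass; typically 16-32 on NVDLA

# 1D stand-in for the convolution core's strip processing: weights stay
# constant across the strip (saving toggling power), and each weight fetch
# is amortized over STRIP outputs (reducing data transfers).
def conv1d_strips(activation, kernel):
    n_out = len(activation) - len(kernel) + 1
    out = np.zeros(n_out)
    for base in range(0, n_out, STRIP):
        w = kernel  # loaded once, reused for every output in the strip
        for i in range(base, min(base + STRIP, n_out)):
            out[i] = np.dot(activation[i:i + len(w)], w)
    return out

out = conv1d_strips(np.random.rand(64), np.random.rand(5))
</syntaxhighlight>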
 
=== TCM Reuse ===
 
Across layers in neural networks, the output of one layer can be consumed directly by the next if sufficient TCM/CVSRAM is allocated for it. Layers can run back to back as long as there is enough TCM for the tensors to reuse the buffers. When memory runs short, the data is written to DRAM and the computation is executed in tiles.
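A rough sketch of that scheduling decision is below; the buffer size, names, and tile math are assumptions for illustration, not NVDLA's actual compiler logic.

<syntaxhighlight lang="python">
TCM_BYTES = 256 * 1024  # assumed on-chip TCM/CVSRAM allocation

# Fuse layers back to back while the intermediate tensor fits on-chip;
# otherwise spill it to DRAM and execute the layer in tiles.
def plan_layer(output_bytes):
    if output_bytes <= TCM_BYTES:
        return "keep in TCM; next layer consumes it directly"
    n_tiles = -(-output_bytes // TCM_BYTES)  # ceiling division
    return f"write to DRAM; execute in {n_tiles} tiles"

print(plan_layer(128 * 1024))   # fits: layers run back to back
print(plan_layer(1024 * 1024))  # too big: spill to DRAM, 4 tiles
</syntaxhighlight>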
 
 
== Configuration ==
 
NVDLA comes in two main configurations: large and small. The configurations provide a balanced tradeoff between area, performance, and power. Generally, the small configuration omits most of the advanced features. It's worth noting that {{nvidia|Xavier}} uses the large configuration.
 
 
{| class="wikitable"
|-
! Small Config !! Large Config
|-
| 8-bit data path || 16-bit data path
|-
| Int8 || Int8, Int16, FP16
|-
| 1 RAM Interface || 2 RAM Interfaces
|-
| - || Programmable Control<br>(auto sequencing)
|-
| - || Weight compression
|}
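The tradeoff in the table can also be read as two parameter sets. The field names below are invented for this sketch and are not NVDLA's actual build-time parameters:

<syntaxhighlight lang="python">
# Illustrative parameter sets mirroring the table above; real NVDLA
# instances are configured through build-time hardware parameters.
NVDLA_SMALL = {
    "datapath_bits": 8,
    "data_types": ["int8"],
    "ram_interfaces": 1,
    "programmable_control": False,
    "weight_compression": False,
}

NVDLA_LARGE = {
    "datapath_bits": 16,
    "data_types": ["int8", "int16", "fp16"],
    "ram_interfaces": 2,
    "programmable_control": True,   # auto sequencing
    "weight_compression": True,
}
</syntaxhighlight>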
 
 
=== Small Configuration ASIC ===
 
For the small configuration on [[TSMC]] [[16 nm process]] at 1 GHz:
 
 
{| class="wikitable"
|-
! rowspan="2" | INT8 MACs<br>(# instances) !! rowspan="2" | Conv. Buffer<br>(KB) !! rowspan="2" | Area<br>(mm²) !! rowspan="2" | Memory BW<br>(GB/s) !! colspan="3" | ResNet50
|-
! Perf<br>(frames/s) !! Power<br>(mW) !! Power Eff.<br>(DL TOPS/W)
|-
| 2048 || 512 || 3.3 || 20 || 269 || 388 || 5.4
|-
| 1024 || 256 || 1.8 || 15 || 153 || 185 || 6.3
|-
| 512 || 256 || 1.4 || 10 || 93 || 107 || 6.8
|-
| 256 || 256 || 1.0 || 5 || 46 || 64 || 5.6
|-
| 128 || 256 || 0.84 || 2 || 20 || 41 || 3.8
|-
| 64 || 128 || 0.55 || 1 || 7.3 || 28 || 2.0
|}
 
 
* Note: Area is synthesis area plus internal RAMs and does not account for layout inefficiencies. Power is for the DLA including internal RAMs, excluding the SoC and external RAMs. Calibrated to Xavier silicon (NVIDIA flows, libraries, RAM compilers, etc.). DL TOPS = # convolutional MAC operations × 2.
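Using that definition, the peak rate of each row follows directly from the MAC count and the stated 1 GHz clock (a quick check, assuming one MAC operation per instance per cycle):

<syntaxhighlight lang="python">
CLOCK_HZ = 1e9  # stated operating frequency

# Peak DL TOPS = MAC instances x 2 ops per MAC x clock, per the note above.
def peak_dl_tops(mac_instances):
    return mac_instances * 2 * CLOCK_HZ / 1e12

print(peak_dl_tops(2048))  # 4.096 peak DL TOPS for the largest entry
print(peak_dl_tops(64))    # 0.128 for the smallest
</syntaxhighlight>

These are upper bounds; the ResNet50 columns report measured throughput, which sits below peak on a real workload.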
 
 
=== Large Configuration ASIC ===
 
For the large configuration on [[TSMC]] [[16 nm process]] at 1 GHz:
 
 
{| class="wikitable"
|-
! colspan="2" | Configuration !! !! rowspan="3" | Data Type !! rowspan="3" | Internal RAM Size !! colspan="3" | ResNet50
|-
| INT16/FP16 || 512 MACs ||
! rowspan="2" | Perf<br>(frames/s) !! rowspan="2" | Power<br>(mW) !! rowspan="2" | Power Eff.<br>(DL TOPS/W)
|-
| INT8 || 1024 MACs ||
|-
| Conv Buffer || 256 KB || || INT8 || none || 165 || 267 || 4.8
|-
| Area || 2.4 mm² || || FP16 || none || 59 || 276 || 1.6
|-
| DRAM BW || 15 GB/s || || INT8 || 2 MB || 230 || 348 || 5.1
|-
| TCM R/W BW || 25/25 GB/s || || FP16 || 2 MB || 115 || 475 || 1.9
|}
 
 
* Note: Area and power do not include Tightly Coupled Memory (TCM)
 
  
 
== Bibliography ==

* IEEE Hot Chips 30 Symposium (HCS) 2018.
