Difference between revisions of "movidius/microarchitectures/shave v2.0"

	Edit Values
	SHAVE v2.0 µarch
	General Info
Arch Type	Accelerator
Designer	Movidius
Manufacturer	TSMC
Introduction	2011
	Pipeline
Type	VLIW
	Cache
L1 Cache	1 KiB/core
L2 Cache	128 KiB/chip; 2-way set associative
Side Cache	8-64 MiB SDRAM/chip
	Succession
	SHAVE v1.0 SHAVE v3.0

Revision as of 23:43, 11 March 2018

Streaming Hybrid Architecture Vector Engine v2.0 (SHAVE v2.0) is an accelerator microarchitecture designed by Movidius for their vision processors. SHAVE-based products are branded as the Myriad family of vision processors.

History

The original SHAVE architecture was designed primarily for the acceleration of game physics. Low demand for expensive physics acceleration in smartphones has forced to re-focused on image and vision processing. Their architecture was versatile enough that it allowed for fairly simple modification to target machine vision processing.

Process Technology

Main article: 65 nm lithography process

This microarchitecture was designed for TSMC's 65 nm process.

Architecture

Hybrid RISC-DSP-GPU VLIW architecture
20 GFLOPS computational power
- 180 MHz
- At 300 mW
Predicated execution
Branch delay slots
Tailored to streaming workloads
128-bit vector arithmetic
- 8/16/32-bit integer
- 16/32-bit floating point
Full support for sparse data structures (matrix/array, random access)

Instruction Set

SHAVE supports a mixture of many different types of instructions belonging to a number of different classes of architectures.

RISC style
- Instruction predication
- Large set of integer operations
VLIW style
- Parallel functional units controlled by VLIW instructions
- 8/16/32-bit x 1-4 SIMD int
DSP style
- Zero overhead looping
- Modulo addressing
- Transparent DMA modes
- FFT, Viterbi, etc..
- Parallel comparisons
GPU style
- Streaming operations
- 16/32-bit FP operations
- Texture management unit

Block Diagram

Entire SoC

Individual Core

Overview

Architecturally, SHAVE is organized similar to IBM's CELL architecture. There are independent SHAVE cores, with up to eight in this generation may be chained together. Cores benefit from zero penalty from their two neighbors closest, an intrinsic property of architecture that inherently benefits most code. The chip features an L2 cache that is shared by all the cores as well as an integrated DDR2 memory controller that is connected to an on-package KGD stacked die ranging from 8 to 64 MiB of SDRAM.

A large set of peripherals are attached to the parameter of the chip which communicate with the cores via the AXI bus. Those peripherals include support for two high-resolution cameras (up to 12 megapixel) at high rate and high-resolution LCD controllers. The various peripherals can be software-multiplexed via the limited number of I/O pins in the package. The overall management controller core is a synthesizable SPARC V8 LEON3.

Core

The underlying concept behind the SHAVE core is extracting significant data-level parallelism. Each SHAVE core is a VLIW processor that can operates on native 32-bit integer and 128-bit vector values.

Bandwidth

Each of the registers files in the core incorporate a large number of ports. That, along with wide buses allow for very high sustainable throughput to be achieved at very low clock frequencies (180 MHz).

Bandwidth
Parameter	VRF	SRF	IRF	LSU	IDC	L1	ISB	L2	SDRAM
Clk	180 MHz
Bytes	16	4	4	8	16	8	16	16	4
Ports	12	12	17	2	1	1	2	1	2
Bandwidth	34.56	8.64	12.24	2.88	2.88	1.44	5.76	2.88	1.44
SHAVE cores	8
Total BW	276.48	69.12	97.92	23.04	23.04	11.52	46.08
Total	547.2							2.88	1.44

Sparse Data Acceleration

The SHAVE cores support sparse data operations with the load-store unit using eight 4-bit fields to generate the address.

Performance claims

Movidius reported very high performance numbers for their chip. Fabricated on a 65 nm process and operating at 180 MHz and consuming 300 milliwatt, the full chip is capable of doing 300 GOPS or just over 1 TOPS per watt for 8-bit arithmetic for their integer arithmetic operations and 60 GOPS for 8-bit vector operations.

Package

Movidius packaged those chips in an 8x8 mm BGA package with 225 balls. The die is then bumpped on top of a custom FR-4 substrate. The SDRAM is then wire bond on top of the Myriad die.

Floorplan

The full chip consists of just two macros - the CMX block and the SHAVE core block. The whole die was place and routed with those two marcros with the rest routed flat at the end.

Myriad Die

TSMC's 65nm Low Power (65LP) process
< 64 mm² die size

References

Some information was obtained directly from Movidius
HotChips 23 (HC23), 2011

@@ Line 7: / Line 7: @@
 |introduction=2011
 |type=VLIW
+|l1=1 KiB
+|l1 per=core
+|l2=128 KiB
+|l2 per=chip
+|l2 desc=2-way set associative
+|side cache=8-64 MiB SDRAM
+|side cache per=chip
+|predecessor=SHAVE v1.0
+|predecessor link=movidius/microarchitectures/shave_v1.0
+|successor=SHAVE v3.0
+|successor link=movidius/microarchitectures/shave_v3.0
 }}
-'''Streaming Hybrid Architecture Vector Engine v2.0''' ('''SHAVE v2.0''') is an accelerator microarchitecture designed by [[Movidius]] for their vision processors. SHAVE is incorporated into Movidius {{movidius|Myriad}} family of vision processors.
+'''Streaming Hybrid Architecture Vector Engine v2.0''' ('''SHAVE v2.0''') is an accelerator microarchitecture designed by [[Movidius]] for their vision processors. SHAVE-based products are branded as the {{movidius|Myriad}} family of vision processors.
+== History ==
+The original SHAVE architecture was designed primarily for the [[hardware acceleration|acceleration]] of game physics. Low demand for expensive physics acceleration in smartphones has forced  to re-focused on image and vision processing. Their architecture was versatile enough that it allowed for fairly simple modification to target machine vision processing.
+== Process Technology ==
+{{main|65 nm lithography process}}
+This microarchitecture was designed for [[TSMC]]'s [[65 nm process]].
 == Architecture ==
 * Hybrid [[RISC]]-[[DSP]]-[[GPU]] [[VLIW]] architecture
+* 20 GFLOPS computational power
+** 180 MHz
+** At 300 mW
 * [[Predicated execution]]
 * [[Branch delay slots]]
@@ Line 26: / Line 47: @@
 ** Instruction predication
 ** Large set of integer operations
-** C-compiler support
 * VLIW style
 ** Parallel functional units controlled by VLIW instructions
@@ Line 42: / Line 62: @@
 === Block Diagram ===
+==== Entire SoC ====
+[[File:shave v2 soc block.svg|800px]]
 ==== Individual Core ====
 [[File:shave v2 block diagram.svg|900px]]
+== Overview ==
+[[File:shave v2 overview.svg|right|300px]]
+Architecturally, SHAVE is organized similar to [[IBM]]'s {{ibm|CBEA|l=arch|CELL}} architecture. There are independent SHAVE cores, with up to eight in this generation may be chained together. Cores benefit from zero penalty from their two neighbors closest, an intrinsic property of architecture that inherently benefits most code. The chip features an [[L2 cache]] that is shared by all the cores as well as an integrated [[DDR2]] [[integrated memory controller|memory controller]] that is connected to an on-package [[known good die|KGD]] [[3d ic|stacked die]] ranging from 8 to 64 [[MiB]] of [[SDRAM]].
+A large set of peripherals are attached to the parameter of the chip which communicate with the cores via the [[AXI bus]]. Those peripherals include support for two high-resolution cameras (up to 12 megapixel) at high rate and high-resolution LCD controllers. The various peripherals can be software-multiplexed via the limited number of I/O pins in the package. The overall management controller core is a [[synthesizable]] [[SPARC]] [[V8]] [[LEON3]].
+== Core ==
+The underlying concept behind the SHAVE core is extracting significant [[data-level parallelism]]. Each SHAVE core is a VLIW processor that can operates on native 32-bit [[integer]] and 128-bit [[vector]] values.
+=== Bandwidth ===
+Each of the registers files in the core incorporate a large number of ports. That, along with wide buses allow for very high sustainable throughput to be achieved at very low clock frequencies (180 MHz).
+:[[File:shave v2 block diagram bandwidth.svg|600px]]
+{| class="wikitable" style="text-align: center;"
+|-
+! colspan="10" | Bandwidth
+|-
+! Parameter !! VRF !! SRF !! IRF !! LSU !! IDC !!  L1 !!  ISB !! L2 !! SDRAM
+|-
+| Clk || colspan="9" | 180 MHz
+|-
+| Bytes || 16 || 4 || 4 || 8 || 16 || 8 || 16  || 16 || 4
+|-
+| Ports || 12 || 12 || 17 || 2 || 1 || 1 || 2 || 1 || 2
+|-
+| Bandwidth || 34.56 || 8.64 || 12.24 || 2.88 || 2.88 || 1.44 || 5.76 || 2.88 ||  1.44
+|-
+| SHAVE cores || colspan="7" | 8 || colspan="2" | &nbsp;
+|-
+| Total BW || 276.48 ||  69.12 ||  97.92 || 23.04 || 23.04 || 11.52 || 46.08
+|-
+| Total || colspan="7" | 547.2 || 2.88 || 1.44
+|}
+=== Sparse Data Acceleration ===
+[[File:movidius shave v2.0 sparse data example.png|right|400px]]
+The SHAVE cores support sparse data operations with the [[load-store unit]] using eight 4-bit fields to generate the address.
+:[[File:movidius shave v2.0 sparse data support.png|700px]]
+== Performance claims ==
+Movidius reported very high performance numbers for their chip. Fabricated on a [[65 nm process]] and operating at 180 MHz and consuming 300 milliwatt, the full chip is capable of doing 300 GOPS or just over 1 TOPS per watt for 8-bit arithmetic for their integer arithmetic operations and 60 GOPS for 8-bit vector operations.
+:[[File:shave v2 performance claims.png|600px]]
+== Package ==
+[[File:myriad 1 bga.png|right|200px]]
+Movidius packaged those chips in an 8x8 mm [[BGA]] package with 225 [[solder ball|balls]]. The die is then bumpped on top of a custom [[FR-4 substrate]]. The SDRAM is then [[wire bond|wire bond]] on top of the Myriad die.
+:[[File:myriad package diagram.svg|600px]]
+== Floorplan ==
+The full chip consists of just two macros - the CMX block and the SHAVE core block. The whole die was place and routed with those two marcros with the rest routed flat at the end.
+:[[File:shave v2 floorplan.svg|400px]]
+== Myriad Die ==
+* [[TSMC]]'s [[65 nm process|65nm Low Power (65LP) process]]
+* < 64 mm² die size
+:[[File:myriad 1 (shave v2.0) die shot.png|600px]]
+:[[File:myriad 1 (shave v2.0) die shot (annotated).png|600px]]
+== References ==
+* Some information was obtained directly from Movidius
+* HotChips 23 (HC23), 2011

codename	SHAVE v2.0 +
designer	Movidius +
first launched	2011 +
full page name	movidius/microarchitectures/shave v2.0 +
instance of	microarchitecture +
manufacturer	TSMC +
name	SHAVE v2.0 +

WikiChip

The Fuse Coverage

Social Media

Companies

Microarchitectures

Technology Nodes

Intel

AMD

ARM

Cavium

Samsung