SHAVE v2.0 - Microarchitectures - Intel Movidius

	Edit Values
	SHAVE v2.0 µarch
	General Info
Arch Type	Accelerator
Designer	Movidius
Manufacturer	TSMC
Introduction	2011
Phase-out	2014
Process	65 nm
	Pipeline
Type	VLLIW
	Instructions
ISA	SHAVE, SPARC v8
	Cache
L1 Cache	1 KiB/core
L2 Cache	128 KiB/chip; 2-way set associative
Side Cache	8-64 MiB SDRAM/chip
	Succession
	SHAVE v1.0 SHAVE v3.0

Streaming Hybrid Architecture Vector Engine v2.0 (SHAVE v2.0) is an accelerator microarchitecture designed by Movidius for their vision processors. SHAVE-based products are branded as the Myriad family of vision processors.

History[edit]

The original SHAVE architecture was designed primarily for the acceleration of game physics. Low demand for expensive physics acceleration in smartphones has forced a re-focus on image and vision processing. Their architecture was versatile enough that it allowed for fairly simple modification to target machine vision processing.

Process Technology[edit]

Main article: 65 nm lithography process

This microarchitecture was designed for TSMC's 65 nm process.

Architecture[edit]

Hybrid RISC-DSP-GPU VLIW architecture
20 GFLOPS computational power
- 180 MHz
- At 300 mW
Predicated execution
Branch delay slots
Tailored to streaming workloads
128-bit vector arithmetic
- 8/16/32-bit integer
- 16/32-bit floating point
Full support for sparse data structures (matrix/array, random access)

Instruction Set[edit]

SHAVE supports a mixture of many different types of instructions belonging to a number of different classes of architectures.

RISC style
- Instruction predication
- Large set of integer operations
VLIW style
- Parallel functional units controlled by VLIW instructions
- 8/16/32-bit x 1-4 SIMD int
DSP style
- Zero overhead looping
- Modulo addressing
- Transparent DMA modes
- FFT, Viterbi, etc..
- Parallel comparisons
GPU style
- Streaming operations
- 16/32-bit FP operations
- Texture management unit

Block Diagram[edit]

Entire SoC[edit]

Individual Core[edit]

Overview[edit]

Architecturally, SHAVE is organized similar to IBM's CELL architecture. There are independent SHAVE cores, with up to eight in this generation may be chained together. Cores benefit from zero penalty from their two neighbors closest, an intrinsic property of architecture that inherently benefits most code. The chip features an L2 cache that is shared by all the cores as well as an integrated DDR2 memory controller that is connected to an on-package KGD stacked die ranging from 8 to 64 MiB of SDRAM.

A large set of peripherals are attached to the parameter of the chip which communicate with the cores via the AXI bus. Those peripherals include support for two high-resolution cameras (up to 12 megapixel) at high rate and high-resolution LCD controllers. The various peripherals can be software-multiplexed via the limited number of I/O pins in the package. The overall management controller core is a synthesizable SPARC V8 LEON3.

Core[edit]

The underlying concept behind the SHAVE core is extracting significant data-level parallelism. Each SHAVE core is a variable-length very long instruction word (VLLIW) processor that can operates on native 32-bit integer and 128-bit vector values. Each core incorporates a number of different types of register files as well as a large set of execution units that make use of the multiple register files with a large number of ports in order to operate on a relatively big set of values at once.

Every SHAVE core resides on an independent power island. Although we were not able to verify this, presumably you can enable and disable cores based on your power requirements through software.

Register Files[edit]

There are three register files: Integer Register File (IRF), Scalar Register File (SRF), Vector Register File (VRF).

The SHAVE's integer register file (IRF) has 17 ports in order enable a large set of parallel data operations each cycle. The IRF has 32 entries of 32-bit integers which are predominantly operated on by the IAU and the SAU. In addition to the IRF is the scalar register file (SRF) which has another set of 32 entries of 32-bit integers. The reason for the second register is to allow the SHAVE to do various three-operand operations and generate predicates for the execution units.

The SHAVE core also has a dedicated vector register file (VRF) which contains 32 entries, each 128-bit wide.

Execution Units[edit]

The three major arithmetic execution units are the vector arithmetic unit (VAU), the scalar arithmetic unit (SAU), and the integer arithmetic unit (IAU). Additionally, every SHAVE core also incorporates a its own texture mapping unit (TMU), allowing for powerful bitmap transformations.

The integer arithmetic unit (IAU) performs all arithmetic instructions that operate on 32-bit integer numbers and access the IRF. The scalar arithmetic unit (SAU) is far more versatile and can perform all integer (8-32 bit) and floating point (HP/FP) operations. The vector arithmetic unit (VAU) supports 128-bit vector operations of all the integer (8-32 bit) and floating point (HP/FP) types. Since those operations can be done at in the same cycle, it's possible to perform combination of the two in order to do various DSP-like operations such as matrix switching.

In addition to those units, there is also a compare-move unit (CMU) which is used to generate predicates. This unit can do things such as three comparison operations per 16-bit/32-bit/8bit entry in the vector register file in parallel with the vector operations which can generate predicates for predicated execution. The CMU effectively interfaces with all the register files and is capable of moving data between them.

The branch and repeat unit (BRU) is another execution unit with the ability to do continuous iterations with zero overhead. This applies to single or multiple VLIW words and can be used for things usch as parameterizable finite impulse response (FIR) filters, allowing for significant instruction fetch bandwidth reduction.

Both the CMU and the BRU can work in conjunction with the predicated execution unit (PEU) and they work with the VAU all in the same cycle, meaning applications such as pixel decision making can be done incredibly fast in the same cycle.

Memory Subsystem[edit]

Each SHAVE core has two load-store units, each is capable of doing a single 64-bit operation per cycle from the SRAM tile.

Each SHAVE core has a local 128 KiB slice of SRAM the LSU operates on Movidius calls the Connection MatriX (CMX) block. The cache is split between instruction and data. The exact amount is software configurable with a granularity of 8 KiB intervals. Each local cache tile is directly linked to two of its closest neighbors (presumably to the cores located to the west and to the east), allowing for zero-penalty accesses from those memory banks as well. For cache tiles located further on do have a slight latency penalty. Movidius noted that most of the software they've tested does almost all of its communication with its neighboring cores, allowing them to take advantage of this.

The entire chip also has a share 128 KiB of L2 cache and a integrated DDR2 memory controller which is connected to an on-package stacked 8-64 MiB of SDRAM.

Bandwidth[edit]

Each of the registers files in the core incorporate a large number of ports. That, along with wide buses allow for very high sustainable throughput to be achieved at very low clock frequencies (180 MHz).

Bandwidth
Parameter	VRF	SRF	IRF	LSU	IDC	L1	ISB	L2	SDRAM
Clk	180 MHz
Bytes	16	4	4	8	16	8	16	16	4
Ports	12	12	17	2	1	1	2	1	2
Bandwidth	34.56	8.64	12.24	2.88	2.88	1.44	5.76	2.88	1.44
SHAVE cores	8
Total BW	276.48	69.12	97.92	23.04	23.04	11.52	46.08
Total	547.2							2.88	1.44

Sparse Data Acceleration[edit]

The SHAVE cores support sparse data operations with the load-store unit using eight 4-bit fields to generate the address.

Inter-core communication[edit]

This architecture does not have support for hardware synchronization primitives, however it does have a light communication system.

Each of the SHAVE cores has a small arbitration messaging queue consisting of four entries of 64-bit words which is only accessibly (read-wise) from that core. Each SHAVE core can send a message to any other core. When this happens the message is pushed onto that core's queue. If that core's queue is full, the sender will stall. This allows for very efficient SHAVE-SHAVE synchronization and communication without using any shared memory techniques.

It's worth noting that the control/management LEON3 SPARC core does not have access to this communication system.

Performance claims[edit]

Movidius reported very high performance numbers for their chip. Fabricated on a 65 nm process and operating at 180 MHz and consuming 300 milliwatt, the full chip is capable of doing 300 GOPS or just over 1 TOPS per watt for 8-bit arithmetic for their integer arithmetic operations and 60 GOPS for 8-bit vector operations.

Package[edit]

Movidius packaged those chips in an 8x8 mm BGA package with 225 balls. The die is then bumped on top of a custom FR-4 substrate. The SDRAM is then wire bond on top of the Myriad die.

Floorplan[edit]

The full chip consists of just two macros - the CMX block and the SHAVE core block. The whole die was place and routed with those two marcros with the rest routed flat at the end.

Myriad Die[edit]

TSMC's 65nm Low Power (65LP) process
< 64 mm² die size

All SHAVE v2.0 Chips[edit]

	List of SHAVE v2.0-based Processors
	Main processor
Model	Price	Launched	Cores	Frequency
MA1100		2009	9	180 MHz 0.18 GHz 180,000 kHz
MA1101		2010	9	180 MHz 0.18 GHz 180,000 kHz
MA1102		2010	9	180 MHz 0.18 GHz 180,000 kHz
MA1110		June 2009	9	180 MHz 0.18 GHz 180,000 kHz
MA1133		February 2011	9	180 MHz 0.18 GHz 180,000 kHz
MA1135		2011	9	180 MHz 0.18 GHz 180,000 kHz
SABRE			9	180 MHz 0.18 GHz 180,000 kHz
Count: 7

References[edit]

Some information was obtained directly from Movidius
HotChips 23 (HC23), 2011

codename	SHAVE v2.0 +
designer	Movidius +
first launched	2011 +
full page name	movidius/microarchitectures/shave v2.0 +
instance of	microarchitecture +
instruction set architecture	SHAVE + and SPARC v8 +
manufacturer	TSMC +
name	SHAVE v2.0 +
phase-out	2014 +
process	65 nm (0.065 μm, 6.5e-5 mm) +

WikiChip

The Fuse Coverage

Social Media

Companies

Microarchitectures

Technology Nodes

Intel

AMD

ARM

Cavium

Samsung