



#### Mike Butts

mike@ambric.com

# 20th Century Reconfigurable Computing



- 1987: 2 μm CMOS

- In 1987, a great new way to spend 0.5 mm² of silicon was:
  - A 4-LUT, a flip-flop, and reconfigurable wires
- But the FPGA was never an ideal computing platform:
  - RTL productivity is not scaling with Moore's Law
  - High-level synthesis has had limited success
  - Developer must be mindful of HW issues such as timing closure
- RC developer must be application expert, SW and HW engineer


# 21st Century Reconfigurable Computing



- 2007: 130 nm CMOS
- What is the best way to use 0.5 mm² of silicon today?
  - 32-bit CPU, several KB of RAM, and reconfigurable buses
- So just fab a chip with CPUs and buses, and throw it over the wall to the programmers? Not likely to succeed!
- Pick a good programming model for reconfigurable computing first. Then build silicon and tools to implement that model.



Copyright © 2003-2007 Ambric, Inc.


## **Ambric Introduction**

- Fabless Semiconductor Company
- Founded in 2003 in Beaverton, Oregon
- Veteran team: 60+ employees and growing
- Production silicon: August 2007
- Product releases (chip, IDE, applications, board): January 2008




# **Ambric Objectives**

- Maximum possible performance and performance/watt for embedded and accelerated applications
  - streaming media, image processing, networking, software radio
  - superior to FPGAs, DSPs, multicores, even approaching ASICs
- Reasonable and reliable application development
  - write software not hardware, with reliable reuse
- Hardware and software scalability to track Moore's Law
  - future silicon processes
  - development productivity




#### A Structural Object Programming Model, Architecture, Chip and Tools for Reconfigurable Computing

- Structural Object Programming Model
- Architecture
- Chip
- Tools
- Applications
- University Program




- What should be in a programming model?
  - What is familiar, productive, scalable to any size and speed?
- Software languages (C, Java, ...) are familiar and productive
  - Array of sequential processors
- Block diagrams are familiar, scalable, encapsulated and hierarchical
  - Reconfigurable interconnect
- Strict encapsulation and hierarchy with standard interfaces enable strong design reuse, which is necessary for scalable development cost.






# **Structural Object Programming Model**

- Objects are software programs running concurrently on an asynchronous array of Ambric processors and memories
- Objects exchange data and control through a structure of self-synchronizing asynchronous Ambric channels
- Objects are mixed and matched hierarchically to create new objects, snapped together through a simple common interface
- Easier development, high performance and scalability




#### **Ambric Channels**



- Chains of Ambric registers form Ambric channels
  - Word-wide, unidirectional, point-to-point, strictly ordered
  - Inter-stage object throttles its channels with the Ambric protocol
    - valid (v) travels downstream, accept (a) travels upstream
  - Fully encapsulated, fully scalable for control and data between objects
- Objects linked through channels are asynchronous to each other
  - Each operates when it can, on its own, according to its channels
  - Objects are synchronized with one another only through channels
- Globally Asynchronous, Locally Synchronous (GALS) clocking
  - Physically scalable, no low-skew long wires
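The flow control described above can be sketched as a small software simulation. This is only an illustrative model, not Ambric's register design: the `Register` class, the `step` and `run` functions, and the one-word-per-stage capacity are assumptions made for the sketch. A word moves one stage forward only when the upstream stage is valid and the downstream stage accepts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Register:
    """One channel stage holding at most one word (hypothetical model)."""
    word: Optional[int] = None

    @property
    def valid(self) -> bool:
        # 'v' signal: this stage is offering a word to the stage downstream.
        return self.word is not None

    @property
    def accept(self) -> bool:
        # 'a' signal: this stage can take a word from the stage upstream.
        return self.word is None


def step(chain: List[Register], source: List[int], sink: List[int],
         sink_ready: bool) -> None:
    """Advance one tick; a word moves only where valid meets accept."""
    # Drain the last stage first, then walk upstream, so each word
    # advances at most one stage per tick.
    last = chain[-1]
    if last.valid and sink_ready:
        sink.append(last.word)
        last.word = None
    for i in range(len(chain) - 2, -1, -1):
        up, down = chain[i], chain[i + 1]
        if up.valid and down.accept:
            down.word, up.word = up.word, None
    # The producer is throttled: it only writes when stage 0 accepts.
    if source and chain[0].accept:
        chain[0].word = source.pop(0)


def run(words, stages=3, stall_ticks=4):
    """Push words through a chain whose consumer stalls for a few ticks."""
    chain = [Register() for _ in range(stages)]
    source, sink = list(words), []
    tick = 0
    while len(sink) < len(words):
        step(chain, source, sink, sink_ready=tick >= stall_ticks)
        tick += 1
    return sink
```

Calling `run([1, 2, 3, 4, 5])` delivers the words in their original order even though the consumer stalls at first. While it stalls, the chain fills and the producer simply waits, with no scheduling beyond the local handshake.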




## **SOPM Realized in Silicon**



- Objects exchange data and control through a structure of Ambric channels
  - Each stage has forward and backward flow control, and buffering
- Standard interface between objects
  - Encapsulation, reuse
- Self-synchronizing on each transfer
  - Asynchronous system
- Channels can be any length or speed: no scheduling, no timing closure
  - Easier on tools, easier to program, easier to debug, reliable


# **Model of Computation: not quite CSP**

#### **Process Domains**

- CSP
  - C.A.R. Hoare, "Communicating Sequential Processes", Communications of the ACM, vol. 21, no. 8, August
- Components are sequential processes that run concurrently
- ✗ Synchronous message passing
- Good for resource management problems
  - Dining Philosophers
  - Hardware bus contention
- Nondeterminism
- Liveness
- Fairness
- Deadlock

Slides from "A Brief Tutorial on Models of Computation", The MESCAL Team, UC Berkeley, Fall 2001

- Ambric MoC was inspired by CSP, but is not quite CSP.
  - Message passing is buffered, not strictly synchronous.



# **Model of Computation: Process Network**

#### **Process Domains**

- PN
  - Kahn-MacQueen Process Network
  - G. Kahn, "The Semantics of a Simple Language for Parallel Programming", Proc. of the IFIP Congress
- Components are sequential processes that run concurrently
- Communication channels are unbounded FIFOs
  - Get operation blocks until data is available
  - Processes cannot poll for data
- Deterministic execution
- Bounded memory with blocking writes
- Good for streaming signal processing applications

Slides from "A Brief Tutorial on Models of Computation", The MESCAL Team, UC Berkeley, Fall 2001

- Ambric MoC is a Process Network with bounded FIFOs.
  - FIFO-like primitive register, streaming RAMs for bigger FIFOs.
  - Channels carry data and control, and strictly preserve sequence.
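A process network with bounded FIFOs can be approximated in ordinary software with threads and blocking queues. The sketch below is an analogy only, not Ambric's tool flow or API: the function names, the `END` marker, and the FIFO depth are assumptions chosen for illustration. Each process is sequential, reads block until data arrives, and writes block while the FIFO is full.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker, sent as a control word


def producer(out_ch, values):
    """Source process: emits each value, then the end marker."""
    for v in values:
        out_ch.put(v)          # blocks while the bounded FIFO is full
    out_ch.put(END)


def scale(in_ch, out_ch, factor):
    """Middle process: multiplies each word and forwards it in order."""
    while True:
        v = in_ch.get()        # blocks until a word is available
        if v is END:
            out_ch.put(END)
            return
        out_ch.put(v * factor)


def collect(in_ch, results):
    """Sink process: gathers words until the end marker arrives."""
    while True:
        v = in_ch.get()
        if v is END:
            return
        results.append(v)


def run_network(values, depth=2):
    """Wire producer -> scale -> collect through two bounded FIFOs."""
    a = queue.Queue(maxsize=depth)
    b = queue.Queue(maxsize=depth)
    results = []
    threads = [
        threading.Thread(target=producer, args=(a, values)),
        threading.Thread(target=scale, args=(a, b, 3)),
        threading.Thread(target=collect, args=(b, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each process only ever blocks on its own channels and never polls, the result of `run_network(range(10))` is the same on every run regardless of how the threads are scheduled.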




# **Am2045 Chip**

- 130nm standard-cell ASIC
- 180 million transistors
- 45 brics, 1.03 teraOPS
- 336 32-bit processors
- 7.1 Mbits distributed SRAM
- 8 μ-engine VLIW accelerators
- High-bandwidth I/O
  - PCI Express
  - DDR2-400 x 2
  - 128 bits GPIO
  - Serial flash, JTAG, μP I/O
- Package
  - 31 x 31 mm
  - 896 balls
  - Flip-chip

[Figure: chip diagram with labels DDR2 Ctlr (x2), CU (Compute Unit), RAM Unit]

## **Performance Metrics**

#### Am2045 @ 300 MHz:

- 1.03 trillion operations per second (8-bit, 16-bit Sum of Abs. Diff.)
  - 60 GMACS (16x16, 32 bit sum)
- 792 Gbps interconnect bisection bandwidth
- 26 Gbps DRAM + 16 Gbps high-speed serial + 13 Gbps parallel

| Kernel             | Instances @ rate each          | Notes                                |
|--------------------|--------------------------------|--------------------------------------|
| 32-tap FIR filters | 168 @ 4.7 Msps<br>5 @ 223 Msps | 16-bit data                          |
| Dot Product        | 168 @ 200 Msps                 | 16 bits in, 32-bit sum               |
| Maximum Value      | 168 @ 343 Msps                 | n=100, 16 bits, 2 wide               |
| Saturators         | 168 @ 600 Msps                 | signed 16 to unsigned 8, 2 wide      |
| Viterbi ACS        | 336 @ 600 Mbps                 | 16-bit                               |
| 1K point FFT       | 84 @ 8.8 Msps                  | complex 16-bit radix-2               |
| AES                | 56 @ 181 Mbps<br>7 @ 1.1 Gbps  | feedback modes<br>non-feedback modes |


# **Ambric Development Boards**

#### Am2045 Software Development Board

- 1 production Am2045 + SDRAM
- PCI Express interface to host
- For rapid software development and application acceleration

#### Am2045 Integrated Development Board

- 1 production Am2045 + SDRAM
- Four 32-bit GPIO connectors, USB
- Stand-alone capable on the benchtop, or in a PCIe slot with PC cover off
- Serial Flash, power connector
- For embedded development








# **Ambric Tool Chain**

- Eclipse IDE (Integrated Design Env.)
  - All tools in the open IDE
- Structure
  - Conceive your application as a structure of objects and the messages they exchange
  - Divide-and-conquer using hierarchy
- Reuse
  - Encapsulated library objects
- Code and Test
  - Write your new objects in Java or Assembler
  - Verify with functional simulation
- Realize on HW
  - Compile each object separately
  - Run mapper-router, configure chip
- Debug, Tune on HW
  - Debug, profile and tune performance

[Figure: tool flow diagram with stages Structure, Library, Compile Each, Map & Route]










# **Library Objects**

- Video Compression
  - Motion Estimation
    - Full-search
    - Hierarchical-search
  - H.264 I-frame Decoder Module
    - Deblocking Filter
    - Inverse Transform
    - Intra-prediction
    - Macroblock assembler
    - Motion Compensation
    - Cache-controller
    - CABAC decode
  - DV Decoder Module
  - DVCPRO-HD modules
    - Variable length codec
    - Forward & Inverse DCT
- Pixel Processing
  - HD Video scaler
  - HD De-interlacer
- Signal Processing
  - FFT radix-2, radix-4
  - FIR, IIR Filters
  - Vector Saturation
  - Maximum value search
  - Dot Product
  - Matrix Math
- Communications
  - Turbo CTC
  - Viterbi
- AES encryption
- Regular expression search


# **Ambric University Program**

- Ambric offers development tools, documentation, hardware, and limited technical support, at no cost, to University Program partners.
- Benefits to University
  - Access to a real massively-parallel embedded-systems architecture
  - Very high performance execution with energy efficiency
  - Easier and faster software-only development effort
  - Real execution target for research tools, languages, etc.
- Benefits to Ambric
  - Real development experience with more application areas
  - Promote innovative tools, methodologies, libraries
  - Get to know the best future graduates
- Current Members as of January 2008
  - U. Washington EE Dept.: Prof. Scott Hauck
  - Portland State U. ECE Dept.: Prof. Dan Hammerstrom
  - Halmstad U. CERES (Sweden): Prof. Bertil Svensson






# www.ambric.com

#### **Publications**

- "Synchronization through Communication in a Massively Parallel Processor Array", Mike Butts, IEEE Micro, Sept/Oct 2007.
- "A Structural Object Programming Model, Architecture, Chip and Tools for Reconfigurable Computing", Mike Butts, Anthony Mark Jones, Paul Wasson, IEEE Symposium on Field-Programmable Custom Computing Machines, April 2007.

#### A Structural Object Programming Model, Architecture, Chip and Tools for Reconfigurable Computing

- Structural Object Programming Model
- Architecture
- Chip
- Tools
- Applications
- University Program
- Additional Background Material


# The Energy Efficiency of Parallelism

- Strict power budgets at all levels limit power scalability
  - 1W handheld, 10W portable, 100W desktop/server
- Minimize energy per operation
  - Lower voltage: slower but far less power
- Get performance back with parallelism
  - This makes the most power-efficient use of silicon area
- Example:
  - One cool processor: 75% speed, 42% power\*
  - Two in parallel: 150% speed, 84% power
- The catch is making parallelism practical. Ambric's programming model opens this door.
  - High performance made scalable
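The figures in the example can be reproduced with a simple scaling rule. The slide's footnote is not available here, so the rule below is an assumption rather than the author's stated derivation: dynamic power goes as C·V²·f, and if supply voltage is lowered in proportion to clock speed, power falls with the cube of speed. The function names are hypothetical.

```python
def relative_power(speed: float) -> float:
    """Relative dynamic power at a given relative clock speed.

    Assumes P ~ C * V^2 * f with voltage scaled in proportion to
    frequency, so P is proportional to speed cubed.
    """
    return speed ** 3


def parallel(speed: float, count: int):
    """Total relative speed and power of `count` processors in parallel.

    Assumes perfect parallel speedup, as the slide's example does.
    """
    return count * speed, count * relative_power(speed)


if __name__ == "__main__":
    one_speed, one_power = parallel(0.75, 1)
    two_speed, two_power = parallel(0.75, 2)
    print(f"one cool processor: {one_speed:.0%} speed, {one_power:.0%} power")
    print(f"two in parallel:    {two_speed:.0%} speed, {two_power:.0%} power")
```

Under this assumption, 0.75³ ≈ 0.42, which matches the 42% quoted for one slowed processor, and doubling it gives about 84% of the original power at 150% of the original speed.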



