Goya is a microarchitecture designed to accelerate inference. Because the target market is the data center, the thermal design point for these chips is relatively high, at around 200 W. Goya relies on PCIe 4.0 to interface with a host processor. Habana's software compiles the models and associated instructions into independent recipes, which can then be sent to the accelerator for execution. The design itself takes a heterogeneous approach comprising a large General Matrix Multiply (GMM) engine, Tensor Processor Cores (TPCs), and a large shared memory pool.
Tensor Processor Cores (TPC)
There are eight TPCs. Each TPC incorporates its own local memory but omits caches. The on-die caches and memory can be either hardware-managed or fully software-managed, allowing the compiler to optimize data residency and reduce data movement. Each TPC is a VLIW DSP design that has been optimized for AI applications, including AI-specific instructions and operations. The TPCs are designed for flexibility and can be programmed in plain C. Each TPC supports mixed-precision operations, including 8-bit, 16-bit, and 32-bit SIMD vector operations for both integer and floating-point formats. This allows the programmer to control the tolerance for accuracy loss on a per-model basis. Goya offers both coarse-grained precision control and fine-grained control down to the tensor level.
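The idea of tensor-level precision control can be illustrated with a small sketch. The code below is plain Python, not Habana's toolchain or the TPC instruction set. It assumes symmetric per-tensor quantization and a simple maximum-absolute-error criterion, neither of which the sources attribute to Goya. The function names (`quantize_symmetric`, `choose_bits`), the tensor names, and the tolerances are all hypothetical. For each tensor, the sketch picks the narrowest integer width whose round-trip error stays within that tensor's tolerance, and falls back to 32-bit otherwise.

```python
# Illustrative only: plain Python, not Habana's software stack or TPC code.
# Assumes symmetric per-tensor quantization; all names are hypothetical.

def quantize_symmetric(values, bits):
    """Map floats to signed integers of the given width, one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 32767 for 16-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in quantized]


def max_abs_error(values, bits):
    """Largest round-trip error introduced by quantizing to the given width."""
    quantized, scale = quantize_symmetric(values, bits)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


def choose_bits(values, tolerance, candidates=(8, 16)):
    """Return the narrowest candidate width within tolerance, else 32."""
    for bits in candidates:
        if max_abs_error(values, bits) <= tolerance:
            return bits
    return 32  # keep the tensor at full 32-bit precision


# Hypothetical model: tensor name -> (values, accuracy-loss tolerance).
model = {
    "conv1.weights": ([0.5, -1.0, 0.25, 0.75], 1e-2),
    "logits.weights": ([0.5, -1.0, 0.25, 0.75], 1e-4),
}

# Fine-grained control: one precision decision per tensor.
plan = {name: choose_bits(vals, tol) for name, (vals, tol) in model.items()}
print(plan)
```

With these made-up values, the loosely constrained tensor fits in 8 bits while the tightly constrained one needs 16. Coarse-grained control would correspond to applying a single tolerance, or a single width, to every tensor in the model.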
- Habana, IEEE Hot Chips 31 Symposium (HCS) 2019
- Habana, AI Hardware Summit 2019
- Habana, Linley Fall Processor Conference 2019