Difference between revisions of "ibm/microarchitectures/power9"

	POWER9 µarch
	General Info
Arch Type	CPU
Designer	IBM
Manufacturer	GlobalFoundries
Introduction	August, 2017
Phase-out	August, 2018
Process	14 nm
Core Configs	24
	Pipeline
Type	Superscalar
Speculative	Yes
Reg Renaming	Yes
Stages	12-16
	Instructions
ISA	Power ISA v3.0
	Cache
L1I Cache	32 KiB/core
L1D Cache	32 KiB/core
L2 Cache	512 KiB/core
L3 Cache	120 MiB/chip
	Succession
	POWER8 POWER10

Revision as of 23:54, 4 February 2017

POWER9 is IBM's successor to POWER8: a 14 nm microarchitecture for Power-based server microprocessors, set to be introduced in the second half of 2017. POWER9-based processors are branded under the POWER9 family.

Process Technology

POWER9 is set to be fabricated on GlobalFoundries' 14 nm FinFET process, the same process that's used by AMD for their Zen microarchitecture.

Compatibility

Initial support for POWER9 started with Linux Kernel 4.8.

Vendor	OS	Version	Notes
IBM	AIX	7.?	Support
IBM	IBM i	?	Support
Linux	Linux	Kernel 4.8	Initial Support
Wind River	VxWorks	VxWorks 7.?	Support
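On a Linux system, one way to check whether the kernel sees a POWER9 part is to look at the `cpu` field of `/proc/cpuinfo`. The sketch below is only an illustration: the helper names are hypothetical, the sample text is made up for the example, and it assumes the kernel reports the model as a string beginning with "POWER9".

```python
def cpu_models(cpuinfo_text):
    """Return the distinct values of the 'cpu' field in /proc/cpuinfo-style text."""
    models = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu":
            model = value.strip()
            if model not in models:
                models.append(model)
    return models


def is_power9(cpuinfo_text):
    # Assumption: the model string starts with "POWER9" on POWER9 hardware.
    return any(m.upper().startswith("POWER9") for m in cpu_models(cpuinfo_text))


# Illustrative sample only; real output varies by kernel and platform.
SAMPLE = """processor\t: 0
cpu\t\t: POWER9, altivec supported
clock\t\t: 2300.000000MHz
"""

print(is_power9(SAMPLE))
```

On a real machine the same function can be fed the contents of `/proc/cpuinfo` instead of the sample string.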

Compiler support

Compiler	CPU	Arch-Favorable
GCC	`-mcpu=power9`	`-mtune=power9`
LLVM	`-mcpu=pwr9`	`-mtune=pwr9`
XL C/C++	`-qarch=pwr9`	`-qtune=pwr9`

Variations

IBM offers POWER9 in two flavors: Scale-Out (SO) and Scale-Up (SU). The Scale-Out variations are designed for traditional datacenter clusters utilizing single- and dual-socket setups. The Scale-Up variations are designed for NUMA servers with four sockets and up, supporting large memory and throughput.

For both Scale-Out and Scale-Up there are two variations: a 12-core SMT8 model and a 24-core SMT4 model. The SMT4 model is optimized for the Linux ecosystem, whereas the SMT8 model is said to be optimized for the PowerVM ecosystem (AIX / IBM i customers). These models support up to 8 channels of DDR4 memory for up to 128 GiB of memory.

	Linux Ecosystem	PowerVM Ecosystem
	24-core / 96 Threads	12-core / 96 Threads
Scale-Out (SO)
Scale-Up (SU)
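Both core types end up with the same number of hardware threads per chip, since halving the core count is offset by doubling the SMT width. A minimal sketch of that arithmetic, using only the figures above (the dictionary layout and function name are illustrative):

```python
# Core count and SMT width per variant, as listed in the table above.
VARIANTS = {
    "SMT4": {"cores": 24, "threads_per_core": 4},
    "SMT8": {"cores": 12, "threads_per_core": 8},
}


def hardware_threads(cores, threads_per_core):
    """Total hardware threads on one chip."""
    return cores * threads_per_core


for name, cfg in VARIANTS.items():
    total = hardware_threads(cfg["cores"], cfg["threads_per_core"])
    print(f"{name}: {cfg['cores']} cores x {cfg['threads_per_core']} threads = {total}")
```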

Architecture

Key changes from POWER8

14 nm process (from 22 nm)
- 17-layer metal stack
- 8,000,000,000 transistors
Support for Power ISA v3.0
Higher single-thread performance
New highly modular architecture
Pipeline
- Shorter pipeline
  - 5 stages eliminated from fetch to compute vs POWER8
  - Roughly 5 stages were also eliminated for fixed-point operations
  - Up to 8 cycles were eliminated for floating-point operations
- Instruction grouping at dispatch has been removed
- Improved hazard avoidance / reduced hazard disruption
Improved branch prediction
Cache
- 120 MiB NUCA L3
  - eDRAM
  - 7 TB/s on-chip bandwidth
Hardware Acceleration
- Enhanced on-chip acceleration
- Nvidia NVLINK 2.0
- CAPI 2.0
I/O Subsystem
- PCIe Gen4
- Local SMP - 16 GT/s per lane interface
- Remote SMP - 25 GT/s per lane interface
  - 48-96 lanes capability
  - IBM's SMP connect for their scale-up systems
  - Also available for the accelerators
Virtualization
- QoS assistance
- New Interrupt architecture
- Workload-optimized frequency
- Hardware enforced trusted execution
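To put the remote SMP figures in perspective, the raw signaling rate of a link can be estimated as lanes times transfers per second. This is a rough sketch under stated assumptions: one bit per transfer per lane, one direction only, and no allowance for encoding or protocol overhead, so usable bandwidth will be lower. The function names are illustrative.

```python
def raw_gbits_per_s(lanes, gt_per_s):
    """Raw signaling rate in Gbit/s, assuming 1 bit per transfer per lane."""
    return lanes * gt_per_s


def raw_gbytes_per_s(lanes, gt_per_s):
    """Same figure in GB/s (8 bits per byte), before any encoding overhead."""
    return raw_gbits_per_s(lanes, gt_per_s) / 8


# The 48- and 96-lane configurations at 25 GT/s per lane from the list above.
for lanes in (48, 96):
    print(lanes, "lanes:", raw_gbits_per_s(lanes, 25), "Gbit/s raw,",
          raw_gbytes_per_s(lanes, 25), "GB/s raw")
```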

Execution Slice Microarchitecture

The Execution Slice Microarchitecture is POWER9's new, refactored modular core design. The same modules are used to build both the SMT4 and SMT8 cores (in theory they could scale to higher thread counts, although that will not happen in this iteration). These modules let IBM address the various processor models and their different configurations, such as bandwidth/lines (from 128- to 64-byte sectors).

A Slice is the basic 64-bit computing block, incorporating a single Vector and Scalar Unit (VSU) coupled with a Load/Store Unit (LSU). The VSU has a heterogeneous mix of computing capabilities, including integer and floating point, supporting both scalar and vector operations. IBM claims this setup allows for higher utilization of resources while providing efficient exchange of data between the individual slices. Two slices coupled together make up the Super-Slice, a 128-bit POWER9 physical design building block. Two super-slices, together with an Instruction Fetch Unit (IFU) and an Instruction Sequencing Unit (ISU), form a single POWER9 SMT4 core. The SMT8 variant is effectively two SMT4 units.

POWER8	P9 SMT8 (4x Super-Slice)	P9 SMT4 (2x Super-Slice)	Super-Slice	Slice
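The slice hierarchy described above can be written down as a small model. This is only a sketch of the composition (slice, super-slice, core); the class and attribute names are hypothetical, and `total_slice_bits` is just the sum of the 64-bit slice widths, not an official IBM figure.

```python
from dataclasses import dataclass

SLICE_BITS = 64               # one Slice is a 64-bit computing block
SLICES_PER_SUPER_SLICE = 2    # two Slices form a 128-bit Super-Slice


@dataclass(frozen=True)
class Core:
    name: str
    super_slices: int

    @property
    def slices(self):
        return self.super_slices * SLICES_PER_SUPER_SLICE

    @property
    def total_slice_bits(self):
        # Illustrative aggregate: combined width of all slices in the core.
        return self.slices * SLICE_BITS


SMT4 = Core("SMT4", super_slices=2)   # two Super-Slices plus IFU and ISU
SMT8 = Core("SMT8", super_slices=4)   # effectively two SMT4 units

for core in (SMT4, SMT8):
    print(core.name, core.slices, "slices,", core.total_slice_bits, "bits")
```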

Pipeline

POWER9's modular design allowed IBM to reduce fetch-to-compute latency by 5 cycles. A similar number of cycles was also cut from fixed-point operations from fetch to retire. An additional 8 cycles were cut from fetch to retire for floating-point instructions. POWER9 further increased fusion and reduced the number of instructions cracked (POWER handles complex instructions by 'cracking' them into two or three simple µOPs). Instruction grouping at dispatch, which was done in POWER8, has also been removed entirely from POWER9.
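As a toy illustration of cracking, the sketch below counts how many µOPs a short instruction sequence expands to when some instructions are cracked into two or three simple operations. The instruction categories and the example mix are hypothetical; they are not taken from IBM documentation and say nothing about which real instructions are cracked.

```python
# Hypothetical categories: how many simple µOPs each kind expands to.
UOPS_PER_KIND = {
    "simple": 1,      # issued as a single operation
    "cracked_2": 2,   # complex instruction cracked into two µOPs
    "cracked_3": 3,   # complex instruction cracked into three µOPs
}


def uop_count(instruction_kinds):
    """Total µOPs produced by a sequence of instruction kinds."""
    return sum(UOPS_PER_KIND[kind] for kind in instruction_kinds)


# Made-up instruction mix, for illustration only.
mix = ["simple", "simple", "cracked_2", "cracked_3"]
print(len(mix), "instructions ->", uop_count(mix), "uops")
```

Fewer cracked instructions in a stream means the µOP count stays closer to the instruction count, which is the effect the paragraph above attributes to POWER9.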

									B0	B1	RES
IF	IC	D1	D2	Crack/Fuse	PD0	PD1	XFER	MAP	VS0	VS1	F2	F3	F4	F5
									LS0	LS1	AGEN	BRD	CA	FMT	CA

SMT4 core

Fetch/Branch	Slices issue VSU & AGEN	VSU Pipe	LSU Slices
32 KiB L1I$; 8 fetch, 6 decode; 1x branch execution	4x scalar-64b / 2x vector-128b; 4x load/store AGEN	4x ALU; 4x FP + FX-MUL + Complex (64b); 2x Permute (128b); 2x Quad Fixed (128b); 2x Fixed Divide (64b); 1x Quad FP & Decimal FP; 1x Cryptography	32 KiB L1D$; Up to 4 DW Load or Store

Die Shot

Tetracosa-Core

GlobalFoundries 14 nm FinFET Process
17-layer metal stack
8,000,000,000 transistors


@@ Line 164: / Line 164: @@
 === Pipeline ===
-{{empty section}}
+POWER9 modular design allowed IBM to reduce fetch-to-compute latency by 5 cycles. Similar number of cycles were also cut from fixed-point operations from [[fetch]] to [[retire]]. Additional 8 cycles were cut from fetch-to-retire for floating point instructions. POWER9 furthered increased fusion and reduced the number of instructions cracked (POWER handles complex instructions by 'cracking' them into two or three simple µOPs). Instruction grouping at dispatch that was done in {{\\|POWER8}} has also been entirely removed from POWER9.
+{| style="overflow-x: scroll; white-space: nowrap; font-size: 1.2em; border-spacing: 10px; border-collapse: separate; "
+| colspan="9" | || B0 || B1 || RES
+|-
+| IF || IC  || D1 || D2 || Crack/Fuse || PD0 || PD1 || XFER || MAP || VS0 || VS1 || F2 || F3 || F4 || F5
+|-
+| colspan="9" | || LS0 || LS1 || AGEN || BRD || CA || FMT || CA
+|}
+==== SMT4 core ====
+[[File:p9smt4core.png|700px]]
+{| class="wikitable"
+! Fetch/Branch || Slices issue VSU & AGEN || VSU Pipe || LSU Slices
+|-
+|
+* 32 KiB L1I$
+* 8 fetch, 6 decode
+* 1x branch execution
+||
+* 4x scalar-64b / 2x vector-128b
+* 4x load/store AGEN
+||
+* 4x [[ALU]]
+* 4x [[FP]] + FX-MUL + Complex (64b)
+* 2x Permute (128b)
+* 2x Quad Fixed (128b)
+* 2x Fixed Divide (64b)
+* 1x Quad FP & Decimal FP
+* 1x Cryptography
+||
+* 32 KiB L1D$
+* Up to 4 DW Load or Store
+|}
 == Die Shot ==

codename	POWER9
core count	4, 8, 12, 16, 20 and 24
designer	IBM
first launched	August 2017
full page name	ibm/microarchitectures/power9
instance of	microarchitecture
instruction set architecture	Power ISA v3.0B
manufacturer	GlobalFoundries
microarchitecture type	CPU
name	POWER9
phase-out	2020
pipeline stages (max)	16
pipeline stages (min)	12
process	14 nm (0.014 μm, 1.4e-5 mm)
