From WikiChip
Difference between revisions of "ibm/microarchitectures/power9"

(Key changes from {{\\|POWER8}})
(Code names)
 
(56 intermediate revisions by 7 users not shown)
Line 1: Line 1:
 
{{ibm title|POWER9|arch}}
 
{{ibm title|POWER9|arch}}
 
{{microarchitecture
 
{{microarchitecture
| atype         = CPU
+
|atype=CPU
| name         = POWER9
+
|name=POWER9
| designer     = IBM
+
|designer=IBM
| manufacturer = GlobalFoundries
+
|manufacturer=GlobalFoundries
| introduction = August, 2017
+
|introduction=August, 2017
| phase-out     = August, 2018
+
|phase-out=2020
| process       = 14 nm
+
|process=14 nm
| cores         = 24
+
|cores=4
| cores 2       =  
+
|cores 2=8
 +
|cores 3=12
 +
|cores 4=16
 +
|cores 5=20
 +
|cores 6=24
 +
|type=Superscalar
 +
|oooe=Yes
 +
|speculative=Yes
 +
|renaming=Yes
 +
|stages min=12
 +
|stages max=16
 +
|isa=Power ISA v3.0B
 +
|l1i=32 KiB
 +
|l1i per=core
 +
|l1i desc=8-way set associative
 +
|l1d=32 KiB
 +
|l1d per=core
 +
|l1d desc=8-way set associative
 +
|l2=512 KiB
 +
|l2 per=core duplex
 +
|l2 desc=8-way set associative
 +
|l3=10 MiB
 +
|l3 per=core duplex
 +
|l3 desc=20-way set associative
 +
|core name=Sforza
 +
|core name 2=Monza
 +
|core name 3=LaGrange
 +
|predecessor=POWER8+
 +
|predecessor link=ibm/microarchitectures/power8+
 +
|successor=POWER10
 +
|successor link=ibm/microarchitectures/power10
 +
}}
 +
'''POWER9''' is [[IBM]]'s successor to {{\\|POWER8}}, a [[14 nm]] microarchitecture for [[Power]]-based server microprocessors first introduced in the 2nd half of [[2017]]. POWER9-based processors are branded under the {{ibm|POWER}} family.
  
| pipeline      = Yes
+
== Code names ==
| type          = Superscalar
+
IBM introduced three flavors of POWER9.
| type 2        =
 
| type N        =
 
| OoOE          = Yes
 
| speculative  = Yes
 
| renaming      = Yes
 
| stages        =  
 
| stages min    = 12
 
| stages max    = 16
 
| issues        =
 
  
| inst          = Yes
+
{| class="wikitable tc1 tc2 tc3 tc4 tc5 tc6 tc7"
| isa          = Power ISA v3.0
+
|-
| isa 2        =  
+
! SoC Codename || SoC Description || Module || Memory Channels || PCIe || {{ibm|XBUS}} || [[OpenCAPI]]
| isa N        =  
+
|-
| feature      =  
+
| rowspan="3" | Nimbus || rowspan="3" | Scale Out
| extension    =
+
| {{ibm|Sforza|l=core}} || 4 || 48 || 1 || {{tchk|no}}
| extension 2  =
+
|-
| extension N  =
+
| {{ibm|Monza|l=core}} || 8 || 34 || 1 || 48
 +
|-
 +
| {{ibm|LaGrange|l=core}} || 8 || 42 || 2 || 16
 +
|-
 +
| Cumulus || Scale Up || ? || {{ibm|Centaur}} || ? || ? || ?
 +
|-
 +
| Axone || Advanced I/O || ? || OMI || 48 || 3 || 48
 +
|}
  
| cache        = Yes
+
== Process Technology ==
| l1i          = 32 KiB
+
POWER9-based microprocessors are fabricated on [[GlobalFoundries]]'s High-Performance [[14 nm process|14 nm]] (14HP) [[FinFET]] [[Silicon-On-Insulator]] (SOI) process. The process was designed by IBM at what used to be their East Fishkill, New York fab which has since been sold to GlobalFoundries.
| l1i per      = core
 
| l1i desc      =
 
| l1d          = 32 KiB
 
| l1d per      = core
 
| l1d desc      =
 
| l2            = 512 KiB
 
| l2 per        = core
 
| l2 desc      =
 
| l3            = 120 MiB
 
| l3 per        = chip
 
| l3 desc      =
 
 
 
| core names      = <!-- Yes if specify -->
 
| core name        =
 
| core name 2      =
 
| core name N      =
 
  
| succession      = Yes
+
== Introduction ==
| predecessor      = POWER8
+
IBM introduced the POWER9 scale out variant of POWER in December 2017. Scale up POWER9 processors were introduced in August 2018. The third variant for high I/O will be introduced in 2019.
| predecessor link = ibm/microarchitectures/power8
 
| successor        = POWER10
 
| successor link  = ibm/microarchitectures/power10
 
}}
 
'''POWER9''' is [[IBM]]'s successor to {{\\|POWER8}}, a [[14 nm]] microarchitecture for [[Power]]-based server microprocessors that is set to be introduced in the 2nd half of [[2017]]. POWER9-based processors are branded under the {{ibm|POWER9}} family.
 
 
 
== Process Technology ==
 
POWER9 is set to be fabricated on [[GlobalFoundries]]' [[14 nm process|14 nm FinFET process]], the same process that's used by [[AMD]] for their {{amd|Zen|l=arch}} microarchitecture.
 
  
 
== Compatibility ==
 
== Compatibility ==
Line 82: Line 88:
 
! Compiler !! CPU !! Arch-Favorable
 
! Compiler !! CPU !! Arch-Favorable
 
|-
 
|-
| [[GCC]] || style="background-color: #ffdad6;" | <code>-mcpu=pwr9</code> || style="background-color: #ffdad6;" | <code>-mtune=pwr9</code>
+
| [[GCC]] || style="background-color: #ffdad6;" | <code>-mcpu=power9</code> || style="background-color: #ffdad6;" | <code>-mtune=power9</code>
 
|-
 
|-
| [[LLVM]] || <code>-mcpu=pwr9</code> || style="background-color: #ffdad6;" | <code>-mtune=pwr9</code>
+
| [[LLVM]] || <code>-mcpu=power9</code> || style="background-color: #ffdad6;" | <code>-mtune=power9</code>
 
|-
 
|-
 
| {{ibm|XL C/C++}} || <code>-mcpu=pwr9</code> || <code>-mtune=pwr9</code>
 
| {{ibm|XL C/C++}} || <code>-mcpu=pwr9</code> || <code>-mtune=pwr9</code>
|}
 
 
== Variations ==
 
IBM offers POWER9 in two flavors: '''Scale-Out''' ('''SO''') and '''Scale-Up''' ('''SU'''). The Scale-Out variations are design for traditional datacenter clusters utilizing [[uniprocessor|single-]] and [[multiprocessor|-dual]] sockets setups. The Scale-Up variations are designed for [[NUMA]] servers with four sockets and up, supporting large memory and throughput.
 
 
For both the Scale-Out and the Scale-Up there are two variations, a [[12-core]] SMT8 model and a [[24-core]] SMT4 model. The SMT4 is optimized for Linux Ecosystem whereas the SMT8 is said to be optimized for the [[PowerVM]] Ecosystem community ({{ibm|AIX}} / {{ibm|IBM i}} customers). Those models support up to 8 channels of [[DDR4]] memory for up to 128 [[GiB]] of memory.
 
 
{| class="wikitable" style="text-align: center;"
 
|-
 
!  !! Linux Ecosystem !! PowerVM Ecosystem
 
|-
 
| || [[24-core]] / 96 Threads || [[12-core]] / 96 Threads
 
|-
 
! Scale-Out (SO)
 
| [[File:p9sosmt4.png|300px]] || [[File:p9sosmt8.png|300px]]
 
|-
 
! Scale-Up (SU)
 
| [[File:p9susmt4.png|300px]] || [[File:p9susmt8.png|300px]]
 
 
|}
 
|}
  
 
== Architecture ==
 
== Architecture ==
=== Key changes from {{\\|POWER8}} ===
+
=== Key changes from {{\\|POWER8}}/{{\\|POWER8+|+}} ===
 
* [[14 nm process]] (from [[22 nm]])
 
* [[14 nm process]] (from [[22 nm]])
 
** 17-layer metal stack
 
** 17-layer metal stack
Line 115: Line 103:
 
* Higher single-thread performance
 
* Higher single-thread performance
 
* New highly modular architecture
 
* New highly modular architecture
* Shorter pipeline
+
* Pipeline
** 5 stages eliminated from fetch to compute vs {{\\|POWER8}}
+
** Shorter pipeline
** Roughly 5 stages were also eliminated for fixed-point operations
+
*** 5 stages eliminated from fetch to compute vs {{\\|POWER8}}
** Up to 8 cycles were eliminated for floating-point operations
+
*** Roughly 5 stages were also eliminated for fixed-point operations
 +
*** Up to 8 cycles were eliminated for floating-point operations
 +
** Instruction grouping at dispatch has been removed
 +
** Improved hazard avoidance / reduced hazard disruption
 +
* Improved branch prediction
 
* Cache
 
* Cache
 
** 120 MiB NUCA L3
 
** 120 MiB NUCA L3
Line 124: Line 116:
 
*** 7 TB/s on-chip bandwidth
 
*** 7 TB/s on-chip bandwidth
 
* Hardware Acceleration
 
* Hardware Acceleration
** Enhanced on-chip acceleration
+
** {{ibm|PowerAXON}}
** [[Nvidia]] [[NVLINK]] 2.0
+
*** Enhanced on-chip acceleration
** CAPI 2.0
+
*** [[Nvidia]] [[NVLink]] 2.0
 +
*** CAPI 2.0
 
* I/O Subsystem
 
* I/O Subsystem
 
** [[PCIe]] Gen4
 
** [[PCIe]] Gen4
 
** Local [[SMP]] - 16 GT/s per lane interface
 
** Local [[SMP]] - 16 GT/s per lane interface
 
** Remote SMP  - 25 GT/s per lane interface
 
** Remote SMP  - 25 GT/s per lane interface
*** 48-96 lanes capability
+
*** 48 PCIe lanes
 
*** IBM's SMP connect for their scale-up systems
 
*** IBM's SMP connect for their scale-up systems
 
*** Also available for the accelerators
 
*** Also available for the accelerators
Line 140: Line 133:
 
** Hardware enforced trusted execution
 
** Hardware enforced trusted execution
  
=== Execution Slice Microarchitecture ===
+
=== Block Diagram ===
'''Execution Slice Microarchitecture''' is POWER9's entirely new refactored core modular design. The same modules were used to build both the SMT4 and SMT8 cores (and in theory scale further to higher thread count although that's not going to happen in this iteration). These modules allow IBM to address the various processor models with support for the different configurations such as bandwidth/lines (from 128 to 64 byte sectors).
+
{{empty section}}
 +
 
 +
=== Memory Hierarchy ===
 +
* Cache
 +
** L1I Cache
 +
*** 32 [[KiB]], 8-way set associative
 +
*** 128-byte lines (broken into four 32-byte sectors)
 +
*** Per SMT4 Core
 +
*** Critical-sector-first reload policy
 +
** L1D Cache
 +
*** 32 KiB, 8-way set associative
 +
*** 128-byte cache line with support for 64-byte sectors
 +
*** Per SMT4 Core
 +
*** Pseudo-LRU replacement policy
 +
** L2 Cache
 +
*** 512 KiB 8-way set associative
 +
*** 128-byte line
 +
*** Per core pair
 +
*** Inclusive of L1I/L1D
 +
** L3 Cache
 +
*** 120 MiB [[eDRAM]]
 +
**** 10 MiB/core pair
 +
*** 12 chunks (regions) of 10 MiB 20-way set associative
 +
*** 7 TB/s on-chip bandwidth
 +
 
 +
== Overview ==
 +
POWER9 succeeds {{\\|POWER8}}, introducing many core enhancements as well as large architectural changes. POWER9 has taken a highly modular design approach, with the same design supporting up to 12 [[physical cores|cores]] with 96 [[logical cores|threads]] (SMT8) or up to 24 cores with 96 threads (SMT4). IBM offers POWER9 as both [[scale up]] and [[scale out]] solutions. In total, there are four targeted chip implementations (24C/SO, 24C/SU, 12C/SO, and 12C/SU).
 +
 
 +
POWER9 comes in two flavors - [[scale out]] (SO) and [[scale up]] (SU). The scale out variations are designed for traditional datacenter clusters utilizing [[uniprocessor|single-socket]] and [[multiprocessor|dual-socket]] setups. The Scale-Up variations are designed for [[NUMA]] servers with four or more sockets, supporting large amounts of memory capacity and throughput.
 +
 
 +
=== Scale out ===
 +
[[File:power9 so overview.svg|right|thumb|Scale-out overview]]
 +
For the scale out there are two variations, a [[12-core]] SMT8 model and a [[24-core]] SMT4 model. The SMT4 is optimized for the Linux ecosystem whereas the SMT8 model is said to be optimized for the [[PowerVM]] ecosystem ({{ibm|AIX}} / {{ibm|IBM i}} customers). Those models support up to 8 channels of [[DDR4]] memory for up to 4 [[TiB]] of DDR4-2667 memory (per socket). Those models offer up to 120 GiB/s of sustained bandwidth.
 +
 
 +
Scale out processors have 48 {{ibm|PowerAXON}} lines (x48) and come with two [[SMP links]].
 +
 
 +
=== Scale up ===
 +
[[File:power9 su overview.svg|right|thumb|Scale-up overview]]
 +
The POWER9 [[scale up]] is designed for IBM's enterprise servers and comes in two variations: a [[12-core]] SMT8 model and a [[24-core]] SMT4 model. The SMT4 model is optimized for the Linux ecosystem, whereas the SMT8 model is said to be optimized for the [[PowerVM]] ecosystem ({{ibm|AIX}} / {{ibm|IBM i}} customers). POWER9 inherits the buffered memory architecture first introduced with {{\\|POWER8}}. It has two memory controllers, each capable of driving four differential memory interface (DMI) channels; each channel has a maximum signaling rate of 9.6 GT/s for a sustained bandwidth of up to 28.8 GB/s. Each DMI channel connects to one dedicated {{ibm|Centaur}} memory buffer chip which, in turn, provides four DDR4 memory channels running at up to 3200 MT/s as well as 16 MiB of L4 cache. All in all, POWER9 scale-up can use eight buffered memory channels to access up to 32 channels of DDR4 memory and provides an additional 128 MiB of level 4 cache.
 +
 
 +
:[[File:power9 memory buff.svg|700px]]
 +
 
 +
Scale up processors have a different set of I/O interfaces: the two memory controllers drive eight memory-agnostic interfaces, and the chip comes with twice as many {{ibm|PowerAXON}} lanes (x96) as the scale-out parts and three [[SMP]] links.
 +
 
 +
=== Slice Design  ===
 +
'''Execution Slice Microarchitecture''' is POWER9's entirely new refactored core modular design. The same modules were used to build both the SMT4 and SMT8 cores (and in theory scale further to higher thread count although that's not offered this iteration). These modules allow IBM to address the various processor models with support for the different configurations such as bandwidth/lines (from 128 to 64 byte sectors).
  
 
A '''Slice''' is the basic 64-bit computing block incorporating a single '''[[Vector and Scalar Unit]]''' ('''VSU''') coupled with '''Load/Store Unit''' ('''LSU'''). VSU has a heterogeneous mix of computing capabilities including [[integer]] and [[floating point]] supporting [[scalar]] and [[vector]] operations. IBM claims this setup allows for higher utilization of resources while providing efficient exchanges of data between the individual slices.  Two slices coupled together make up the '''Super-Slice''', a 128-bit POWER9 physical design building block. Two super-slices together along with an '''Instruction Fetch Unit''' ('''IFU''') and an '''Instruction Sequencing Unit''' ('''ISU''') form a single POWER9 SMT4 core. The SMT8 variant is effectively two SMT4 units.
 
A '''Slice''' is the basic 64-bit computing block incorporating a single '''[[Vector and Scalar Unit]]''' ('''VSU''') coupled with '''Load/Store Unit''' ('''LSU'''). VSU has a heterogeneous mix of computing capabilities including [[integer]] and [[floating point]] supporting [[scalar]] and [[vector]] operations. IBM claims this setup allows for higher utilization of resources while providing efficient exchanges of data between the individual slices.  Two slices coupled together make up the '''Super-Slice''', a 128-bit POWER9 physical design building block. Two super-slices together along with an '''Instruction Fetch Unit''' ('''IFU''') and an '''Instruction Sequencing Unit''' ('''ISU''') form a single POWER9 SMT4 core. The SMT8 variant is effectively two SMT4 units.
Line 158: Line 196:
 
| [[File:p9slice.png|50px]]
 
| [[File:p9slice.png|50px]]
 
|}
 
|}
 +
 +
=== Acceleration Platform (POWERAccel) ===
 +
[[File:p9links.png|250px|right]]
 +
'''POWERAccel''' is the collective name for all the interfaces and acceleration protocols provided by the POWER microarchitecture. POWER9 offers two sets of acceleration attachments: [[PCIe]] Gen4, which offers 48 lanes at 192 GiB/s of duplex bandwidth, and a new 25G link, which offers an additional 48 lanes delivering up to 300 GiB/s of duplex bandwidth. On top of the two physical interfaces sits a set of open standard protocols that are integrated onto those signaling interfaces. The four prominent standards are:
 +
 +
* [[CAPI]] 2.0 - POWER9 introduces CAPI 2.0 over [[PCIe]] which quadruples the bandwidth offered by the original CAPI protocol offered in {{\\|POWER8}}.
 +
* New CAPI - A new interface that runs on top of the POWER9 25G link (300 GiB/s) interface, designed for CPU-Accelerators applications
 +
* [[NVLink]] 2.0 - High bandwidth and integration between the [[GPU]] and CPU.
 +
* On-Chip Acceleration - An array of accelerators offered by the POWER9 architecture itself
 +
** 1x [[GZip]]
 +
** 2x [[842 Compression]]
 +
** 2x [[AES]]/[[SHA]]
  
 
=== Pipeline ===
 
=== Pipeline ===
{{empty section}}
+
POWER9's modular design allowed IBM to reduce fetch-to-compute latency by 5 cycles. A similar number of cycles was also cut from [[fetch]] to [[retire]] for fixed-point operations, and up to 8 cycles were cut from fetch to retire for floating-point instructions. POWER9 further increased fusion and reduced the number of instructions cracked (POWER handles complex instructions by 'cracking' them into two or three simple µOPs). The instruction grouping at dispatch that was done in {{\\|POWER8}} has been entirely removed from POWER9.
 +
 
 +
{| style="overflow-x: scroll; white-space: nowrap; font-size: 1.2em; border-spacing: 10px; border-collapse: separate; "
 +
| colspan="9" | || B0 || B1 || RES
 +
|-
 +
| IF || IC  || D1 || D2 || Crack/Fuse || PD0 || PD1 || XFER || MAP || VS0 || VS1 || F2 || F3 || F4 || F5
 +
|-
 +
| colspan="9" | || LS0 || LS1 || AGEN || BRD || CA || FMT || CA
 +
|}
 +
 
 +
==== SMT4 core ====
 +
[[File:p9smt4core.png|700px]]
 +
 
 +
 
 +
{| class="wikitable"
 +
! Fetch/Branch || Slices issue VSU & AGEN || VSU Pipe || LSU Slices
 +
|-
 +
|
 +
* 32 KiB L1I$
 +
* 8 fetch, 6 decode
 +
* 1x branch execution
 +
||
 +
* 4x scalar-64b / 2x vector-128b
 +
* 4x load/store AGEN
 +
||
 +
* 4x [[ALU]]
 +
* 4x [[FP]] + FX-MUL + Complex (64b)
 +
* 2x Permute (128b)
 +
* 2x Quad Fixed (128b)
 +
* 2x Fixed Divide (64b)
 +
* 1x Quad FP & Decimal FP
 +
* 1x Cryptography
 +
||
 +
* 32 KiB L1D$
 +
* Up to 4 DW Load or Store
 +
|}
 +
 
 +
== Performance Claims ==
 +
IBM claims a range of performance improvements for a wide array of workloads. The graph below (provided by IBM) compares POWER9 performance using POWER8 as a baseline. The graph represents a scale-out model of similar specs at a constant frequency.
 +
 
 +
[[File:p9performance.png|700px]]
 +
 
 +
== Die ==
 +
=== Scale out ===
 +
* GlobalFoundries [[14 nm process|14 nm FinFET on SOI Process]]
 +
* 17-layer metal stack
 +
* 8,000,000,000 transistors
 +
** 15 miles of wire
 +
* 693.37 mm² die size
 +
* 25.228 mm x 27.48416 mm
 +
 
 +
[[File:power9 so die.png|class=wikichip_ogimage|600px]]
 +
 
 +
 
 +
[[File:power9 so die (annotated).png|600px]]
  
== Die Shot ==
+
=== Scale up ===
=== [[Tetracosa-Core]] ===
+
* GlobalFoundries [[14 nm process|14 nm FinFET on SOI Process]]
* GlobalFoundries [[14 nm process|14 nm FinFET Process]]
 
 
* 17-layer metal stack
 
* 17-layer metal stack
 
* 8,000,000,000 transistors
 
* 8,000,000,000 transistors
 +
** 15 miles of wire
 +
* 693.37 mm² die size
 +
* 25.228 mm x 27.48416 mm
 +
 +
[[File:power9 su die.png|600px]]
  
[[File:power9 die shot.jpg|800px]]
 
  
[[File:power9 die shot (annotated).png|800px]]
+
[[File:power9 su die (annotated).png|600px]]
 +
 
 +
== All POWER9 Processors ==
 +
<!-- NOTE:
 +
          This table is generated automatically from the data in the actual articles.
 +
          If a microprocessor is missing from the list, an appropriate article for it needs to be
 +
          created and tagged accordingly.
 +
 
 +
          Missing a chip? please dump its name here: https://en.wikichip.org/wiki/WikiChip:wanted_chips
 +
-->
 +
{{comp table start}}
 +
<table class="comptable sortable tc4 tc5">
 +
{{comp table header|main|9:List of POWER9-based Processors}}
 +
{{comp table header 1|cols=Launched, Codename, Cores, Threads, %L2$, %L3$, %TDP, %Frequency, Turbo}}
 +
{{#ask: [[Category:microprocessor models by ibm]] [[instance of::microprocessor]] [[microarchitecture::POWER9]]
 +
|?full page name
 +
|?model number
 +
|?first launched
 +
|?core name
 +
|?core count
 +
|?thread count
 +
|?l2$ size
 +
|?l3$ size
 +
|?tdp
 +
|?base frequency#GHz
 +
|?turbo frequency#GHz
 +
|format=template
 +
|template=proc table 3
 +
|searchlabel=
 +
|sort=core count
 +
|order=desc
 +
|userparam=11
 +
|mainlabel=-
 +
|limit=100
 +
|valuesep=,
 +
}}
 +
{{comp table count|ask=[[Category:microprocessor models by ibm]] [[instance of::microprocessor]] [[microarchitecture::POWER9]]}}
 +
</table>
 +
{{comp table end}}
 +
 
 +
== Bibliography ==
 +
* {{bib|hc|28|IBM}}
 +
* {{bib|hc|30|IBM}}
  
 
== See also ==
 
== See also ==
* [[Intel]]'s {{intel|Skylake|l=arch}} & {{intel|Kaby Lake|l=arch}}
+
* [[Intel]]'s {{intel|Skylake|l=arch}} & {{intel|Cascade Lake|l=arch}}
 
* [[AMD]]'s {{amd|Zen|l=arch}}
 
* [[AMD]]'s {{amd|Zen|l=arch}}
 
* [[Qualcomm]]'s {{qualcomm|Falkor|l=arch}}
 
* [[Qualcomm]]'s {{qualcomm|Falkor|l=arch}}

Latest revision as of 22:38, 22 May 2020

POWER9 µarch
General Info
Arch Type: CPU
Designer: IBM
Manufacturer: GlobalFoundries
Introduction: August, 2017
Phase-out: 2020
Process: 14 nm
Core Configs: 4, 8, 12, 16, 20, 24
Pipeline
Type: Superscalar
OoOE: Yes
Speculative: Yes
Reg Renaming: Yes
Stages: 12-16
Instructions
ISA: Power ISA v3.0B
Cache
L1I Cache: 32 KiB/core, 8-way set associative
L1D Cache: 32 KiB/core, 8-way set associative
L2 Cache: 512 KiB/core duplex, 8-way set associative
L3 Cache: 10 MiB/core duplex, 20-way set associative
Cores
Core Names: Sforza, Monza, LaGrange
Succession
Predecessor: POWER8+
Successor: POWER10

POWER9 is IBM's successor to POWER8: a 14 nm microarchitecture for Power-based server microprocessors, first introduced in the second half of 2017. POWER9-based processors are branded under the POWER family.

Code names

IBM introduced three flavors of POWER9.

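{| class="wikitable"
|-
! SoC Codename !! SoC Description !! Module !! Memory Channels !! PCIe !! XBUS !! OpenCAPI
|-
| rowspan="3" | Nimbus || rowspan="3" | Scale Out
| Sforza || 4 || 48 || 1 || No
|-
| Monza || 8 || 34 || 1 || 48
|-
| LaGrange || 8 || 42 || 2 || 16
|-
| Cumulus || Scale Up || ? || Centaur || ? || ? || ?
|-
| Axone || Advanced I/O || ? || OMI || 48 || 3 || 48
|}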

Process Technology

POWER9-based microprocessors are fabricated on GlobalFoundries' High-Performance 14 nm (14HP) FinFET Silicon-On-Insulator (SOI) process. The process was designed by IBM at what used to be its East Fishkill, New York fab, which has since been sold to GlobalFoundries.

Introduction

IBM introduced the scale-out variant of POWER9 in December 2017. Scale-up POWER9 processors were introduced in August 2018. The third variant, for high I/O, was slated for introduction in 2019.

Compatibility

Initial support for POWER9 started with Linux Kernel 4.8.

{| class="wikitable"
|-
! Vendor !! OS !! Version !! Notes
|-
| IBM || AIX || 7.? || Support
|-
| IBM || IBM i || ? || Support
|-
| Linux || Linux || Kernel 4.8 || Initial Support
|-
| Wind River || VxWorks || VxWorks 7.? || Support
|}

Compiler support

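{| class="wikitable"
|-
! Compiler !! CPU !! Arch-Favorable
|-
| GCC || <code>-mcpu=power9</code> || <code>-mtune=power9</code>
|-
| LLVM || <code>-mcpu=power9</code> || <code>-mtune=power9</code>
|-
| XL C/C++ || <code>-mcpu=pwr9</code> || <code>-mtune=pwr9</code>
|}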
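
As an illustrative sketch only, the flags in the table above can be assembled into a compiler command line. The driver names (`gcc`, `clang`, `xlc`), the helper `power9_compile_command`, and the source file name are assumptions made for this example; only the `-mcpu`/`-mtune` values are taken from the table.

```python
# Flags copied from the compiler-support table; the driver names are assumed.
POWER9_FLAGS = {
    "gcc": ("-mcpu=power9", "-mtune=power9"),
    "clang": ("-mcpu=power9", "-mtune=power9"),
    "xlc": ("-mcpu=pwr9", "-mtune=pwr9"),
}


def power9_compile_command(driver: str, source: str) -> list[str]:
    """Build an argument list that compiles `source` for POWER9 (hypothetical helper)."""
    cpu_flag, tune_flag = POWER9_FLAGS[driver]
    # "-c" asks the compiler to compile without linking.
    return [driver, cpu_flag, tune_flag, "-c", source]


print(" ".join(power9_compile_command("gcc", "kernel.c")))
```

The list form can be handed to `subprocess.run` on a machine where the chosen compiler is installed.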

Architecture

Key changes from POWER8/POWER8+

  • 14 nm process (from 22 nm)
    • 17-layer metal stack
    • 8,000,000,000 transistors
  • Support for Power ISA v3.0
  • Higher single-thread performance
  • New highly modular architecture
  • Pipeline
    • Shorter pipeline
      • 5 stages eliminated from fetch to compute vs POWER8
      • Roughly 5 stages were also eliminated for fixed-point operations
      • Up to 8 cycles were eliminated for floating-point operations
    • Instruction grouping at dispatch has been removed
    • Improved hazard avoidance / reduced hazard disruption
  • Improved branch prediction
  • Cache
    • 120 MiB NUCA L3
      • eDRAM
      • 7 TB/s on-chip bandwidth
  • Hardware Acceleration
  • I/O Subsystem
    • PCIe Gen4
    • Local SMP - 16 GT/s per lane interface
    • Remote SMP - 25 GT/s per lane interface
      • 48 PCIe lanes
      • IBM's SMP connect for their scale-up systems
      • Also available for the accelerators
  • Virtualization
    • QoS assistance
    • New Interrupt architecture
    • Workload-optimized frequency
    • Hardware enforced trusted execution

Block Diagram

This section is empty.

Memory Hierarchy

  • Cache
    • L1I Cache
      • 32 KiB, 8-way set associative
      • 128-byte lines (broken into four 32-byte sectors)
      • Per SMT4 Core
      • Critical-sector-first reload policy
    • L1D Cache
      • 32 KiB, 8-way set associative
      • 128-byte cache line with support for 64-byte sectors
      • Per SMT4 Core
      • Pseudo-LRU replacement policy
    • L2 Cache
      • 512 KiB 8-way set associative
      • 128-byte line
      • Per core pair
      • Inclusive of L1I/L1D
    • L3 Cache
      • 120 MiB eDRAM
        • 10 MiB/core pair
      • 12 chunks (regions) of 10 MiB 20-way set associative
      • 7 TB/s on-chip bandwidth
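
As a rough check on the geometry listed above, the number of sets in each cache follows from capacity / (ways x line size). The sketch below is a back-of-the-envelope calculation, not a description of IBM's implementation; the helper name `num_sets` is hypothetical, and the 128-byte line size for the L3 is an assumption (the list states it only for the L1 and L2 caches).

```python
KIB = 1024
MIB = 1024 * KIB


def num_sets(capacity_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache."""
    sets, remainder = divmod(capacity_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


l1_sets = num_sets(32 * KIB, ways=8, line_bytes=128)          # L1I and L1D
l2_sets = num_sets(512 * KIB, ways=8, line_bytes=128)
l3_region_sets = num_sets(10 * MIB, ways=20, line_bytes=128)  # assumed 128-byte lines
l3_total_mib = 12 * 10                                        # 12 regions of 10 MiB

print(l1_sets, l2_sets, l3_region_sets, l3_total_mib)
```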

Overview

POWER9 succeeds POWER8, introducing many core enhancements as well as large architectural changes. POWER9 has taken a highly modular design approach, with the same design supporting up to 12 cores with 96 threads (SMT8) or up to 24 cores with 96 threads (SMT4). IBM offers POWER9 as both scale up and scale out solutions. In total, there are four targeted chip implementations (24C/SO, 24C/SU, 12C/SO, and 12C/SU).

POWER9 comes in two flavors: scale out (SO) and scale up (SU). The scale-out variations are designed for traditional datacenter clusters utilizing single-socket and dual-socket setups. The scale-up variations are designed for NUMA servers with four or more sockets, supporting large amounts of memory capacity and throughput.
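
The four targeted implementations and the thread counts quoted above can be enumerated with a short sketch. The variable names are illustrative only; the core counts, SMT widths, and SO/SU labels come from the text.

```python
from itertools import product

# Core count label -> (physical cores, SMT threads per core), as quoted in the overview.
core_configs = {"24C": (24, 4), "12C": (12, 8)}
variants = ("SO", "SU")  # scale out, scale up

implementations = [f"{cores}/{variant}" for cores, variant in product(core_configs, variants)]
threads = {name: cores * smt for name, (cores, smt) in core_configs.items()}

print(implementations)
print(threads)
```

Both core configurations end up with the same number of hardware threads per chip, which is why the overview quotes 96 threads for each.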

Scale out

[Figure: Scale-out overview]

For the scale out there are two variations: a 12-core SMT8 model and a 24-core SMT4 model. The SMT4 model is optimized for the Linux ecosystem, whereas the SMT8 model is said to be optimized for the PowerVM ecosystem (AIX / IBM i customers). Both models support up to 8 channels of DDR4-2667 memory, for up to 4 TiB of memory per socket and up to 120 GiB/s of sustained bandwidth.

Scale-out processors have 48 PowerAXON lanes (x48) and come with two SMP links.
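
To put the sustained-bandwidth figure in context, the theoretical peak of eight DDR4-2667 channels can be estimated as follows. This is a sketch under stated assumptions: a standard 64-bit (8-byte) data bus per DDR4 channel and a nominal 2667 MT/s transfer rate; the peak value is derived here and is not a number given in the text.

```python
CHANNELS = 8
TRANSFERS_PER_S = 2667e6       # DDR4-2667, nominal
BYTES_PER_TRANSFER = 8         # assumed 64-bit data bus per channel

peak_gb_s = CHANNELS * TRANSFERS_PER_S * BYTES_PER_TRANSFER / 1e9
sustained_gb_s = 120 * 2**30 / 1e9   # 120 GiB/s from the text, in decimal GB/s
efficiency = sustained_gb_s / peak_gb_s

print(f"peak {peak_gb_s:.1f} GB/s, sustained {sustained_gb_s:.1f} GB/s, "
      f"ratio {efficiency:.2f}")
```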

Scale up

[Figure: Scale-up overview]

The POWER9 scale up is designed for IBM's enterprise servers and comes in two variations: a 12-core SMT8 model and a 24-core SMT4 model. The SMT4 model is optimized for the Linux ecosystem, whereas the SMT8 model is said to be optimized for the PowerVM ecosystem (AIX / IBM i customers). POWER9 inherits the buffered memory architecture first introduced with POWER8. It has two memory controllers, each capable of driving four differential memory interface (DMI) channels; each channel has a maximum signaling rate of 9.6 GT/s for a sustained bandwidth of up to 28.8 GB/s. Each DMI channel connects to one dedicated Centaur memory buffer chip which, in turn, provides four DDR4 memory channels running at up to 3200 MT/s as well as 16 MiB of L4 cache. All in all, POWER9 scale-up can use eight buffered memory channels to access up to 32 channels of DDR4 memory and provides an additional 128 MiB of level 4 cache.

[Figure: power9 memory buff.svg]

Scale-up processors have a different set of I/O interfaces: the two memory controllers drive eight memory-agnostic interfaces, and the chip comes with twice as many PowerAXON lanes (x96) as the scale-out parts and three SMP links.
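
The buffered-memory totals quoted in this section follow from simple multiplication, sketched below. The constant names are illustrative; the aggregate bandwidth is a plain sum of the per-channel figure and is an extrapolation, not a number stated in the text.

```python
MEMORY_CONTROLLERS = 2
DMI_CHANNELS_PER_CONTROLLER = 4
DDR4_CHANNELS_PER_CENTAUR = 4
L4_MIB_PER_CENTAUR = 16
SUSTAINED_GB_S_PER_DMI = 28.8

# One Centaur buffer chip hangs off each DMI channel.
dmi_channels = MEMORY_CONTROLLERS * DMI_CHANNELS_PER_CONTROLLER
ddr4_channels = dmi_channels * DDR4_CHANNELS_PER_CENTAUR
l4_cache_mib = dmi_channels * L4_MIB_PER_CENTAUR
aggregate_gb_s = dmi_channels * SUSTAINED_GB_S_PER_DMI  # extrapolated, simple sum

print(dmi_channels, ddr4_channels, l4_cache_mib, round(aggregate_gb_s, 1))
```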

Slice Design

Execution Slice Microarchitecture is POWER9's entirely new, refactored modular core design. The same modules are used to build both the SMT4 and SMT8 cores (and could in theory scale further to higher thread counts, although that is not offered in this iteration). These modules allow IBM to address the various processor models with support for different configurations, such as bandwidth/lines (from 128-byte to 64-byte sectors).

A Slice is the basic 64-bit computing block, incorporating a single Vector and Scalar Unit (VSU) coupled with a Load/Store Unit (LSU). The VSU has a heterogeneous mix of computing capabilities, including integer and floating point, supporting scalar and vector operations. IBM claims this setup allows for higher utilization of resources while providing efficient exchanges of data between the individual slices. Two slices coupled together make up the Super-Slice, a 128-bit POWER9 physical design building block. Two super-slices, together with an Instruction Fetch Unit (IFU) and an Instruction Sequencing Unit (ISU), form a single POWER9 SMT4 core. The SMT8 variant is effectively two SMT4 units.

{| class="wikitable"
|-
! POWER8 !! P9 SMT8 (4x Super-Slice) !! P9 SMT4 (2x Super-Slice) !! Super-Slice !! Slice
|-
| p8smt8comp.png || p94xsuper-slice.png || p92xsuper-slice.png || p9super-slice.png || p9slice.png
|}
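
The slice hierarchy described above composes multiplicatively. The sketch below counts the building blocks in each core type; `core_shape` and its field names are hypothetical, and the "aggregate width" is only the sum of the 64-bit slice widths, not a claim about any single datapath.

```python
SLICE_BITS = 64
SLICES_PER_SUPER_SLICE = 2
SUPER_SLICES_PER_SMT4_UNIT = 2
THREADS_PER_SMT4_UNIT = 4


def core_shape(smt4_units: int) -> dict:
    """Count the building blocks in a core made of `smt4_units` SMT4 units."""
    super_slices = smt4_units * SUPER_SLICES_PER_SMT4_UNIT
    slices = super_slices * SLICES_PER_SUPER_SLICE
    return {
        "super_slices": super_slices,
        "slices": slices,
        "aggregate_slice_bits": slices * SLICE_BITS,
        "threads": smt4_units * THREADS_PER_SMT4_UNIT,
    }


smt4_core = core_shape(1)
smt8_core = core_shape(2)  # "effectively two SMT4 units"
print(smt4_core)
print(smt8_core)
```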

Acceleration Platform (POWERAccel)

[Figure: p9links.png]

POWERAccel is the collective name for all the interfaces and acceleration protocols provided by the POWER microarchitecture. POWER9 offers two sets of acceleration attachments: PCIe Gen4, which offers 48 lanes at 192 GiB/s of duplex bandwidth, and a new 25G link, which offers an additional 48 lanes delivering up to 300 GiB/s of duplex bandwidth. On top of the two physical interfaces sits a set of open standard protocols that are integrated onto those signaling interfaces. The four prominent standards are:

  • CAPI 2.0 - POWER9 introduces CAPI 2.0 over PCIe, which quadruples the bandwidth of the original CAPI protocol offered in POWER8.
  • New CAPI - A new interface that runs on top of the POWER9 25G link (300 GiB/s), designed for CPU-to-accelerator applications.
  • NVLink 2.0 - High bandwidth and integration between the GPU and CPU.
  • On-Chip Acceleration - An array of accelerators offered by the POWER9 architecture itself:
    • 1x GZip
    • 2x 842 Compression
    • 2x AES/SHA
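
The duplex bandwidth figures quoted for the two physical interfaces can be reproduced from lane count and per-lane signaling rate. This is a sketch under stated assumptions: raw signaling only (no encoding or protocol overhead), 16 GT/s per PCIe Gen4 lane, and 25 GT/s per lane on the 25G link. The result is in decimal gigabytes per second, although the text labels the figures GiB/s. The helper name is hypothetical.

```python
def duplex_gbytes_per_s(lanes: int, gtransfers_per_s: float) -> float:
    """Raw bidirectional bandwidth, assuming 1 bit per transfer per lane."""
    per_direction = lanes * gtransfers_per_s / 8  # bits -> bytes
    return 2 * per_direction


pcie_gen4 = duplex_gbytes_per_s(lanes=48, gtransfers_per_s=16)
link_25g = duplex_gbytes_per_s(lanes=48, gtransfers_per_s=25)

print(pcie_gen4, link_25g)
```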

Pipeline

POWER9's modular design allowed IBM to reduce fetch-to-compute latency by 5 cycles. A similar number of cycles was also cut from fetch to retire for fixed-point operations, and up to 8 cycles were cut from fetch to retire for floating-point instructions. POWER9 further increased fusion and reduced the number of instructions cracked (POWER handles complex instructions by 'cracking' them into two or three simple µOPs). The instruction grouping at dispatch that was done in POWER8 has been entirely removed from POWER9.

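{|
| colspan="9" | || B0 || B1 || RES
|-
| IF || IC || D1 || D2 || Crack/Fuse || PD0 || PD1 || XFER || MAP || VS0 || VS1 || F2 || F3 || F4 || F5
|-
| colspan="9" | || LS0 || LS1 || AGEN || BRD || CA || FMT || CA
|}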

SMT4 core

[Figure: p9smt4core.png]

{| class="wikitable"
! Fetch/Branch !! Slices issue VSU & AGEN !! VSU Pipe !! LSU Slices
|-
|
* 32 KiB L1I$
* 8 fetch, 6 decode
* 1x branch execution
||
* 4x scalar-64b / 2x vector-128b
* 4x load/store AGEN
||
* 4x ALU
* 4x FP + FX-MUL + Complex (64b)
* 2x Permute (128b)
* 2x Quad Fixed (128b)
* 2x Fixed Divide (64b)
* 1x Quad FP & Decimal FP
* 1x Cryptography
||
* 32 KiB L1D$
* Up to 4 DW Load or Store
|}

Performance Claims

IBM claims a range of performance improvements across a wide array of workloads. The graph below (provided by IBM) compares POWER9 performance using POWER8 as a baseline. The graph represents a scale-out model of similar specs at a constant frequency.

[Figure: p9performance.png]

Die

Scale out

  • GlobalFoundries 14 nm FinFET on SOI Process
  • 17-layer metal stack
  • 8,000,000,000 transistors
    • 15 miles of wire
  • 693.37 mm² die size
  • 25.228 mm x 27.48416 mm
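
The die area and an average transistor density can be cross-checked from the dimensions and transistor count listed above. This is a simple arithmetic sketch; the density figure is derived here and does not appear in the list.

```python
DIE_WIDTH_MM = 25.228
DIE_HEIGHT_MM = 27.48416
TRANSISTORS = 8_000_000_000

die_area_mm2 = DIE_WIDTH_MM * DIE_HEIGHT_MM
# Average density over the whole die, in millions of transistors per mm^2.
density_mtr_per_mm2 = TRANSISTORS / die_area_mm2 / 1e6

print(f"{die_area_mm2:.2f} mm^2, {density_mtr_per_mm2:.2f} MTr/mm^2")
```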

[Figure: power9 so die.png]

[Figure: power9 so die (annotated).png]

Scale up

  • GlobalFoundries 14 nm FinFET on SOI Process
  • 17-layer metal stack
  • 8,000,000,000 transistors
    • 15 miles of wire
  • 693.37 mm² die size
  • 25.228 mm x 27.48416 mm

[Figure: power9 su die.png]

[Figure: power9 su die (annotated).png]

All POWER9 Processors

List of POWER9-based Processors

{| class="wikitable"
|-
! Model !! Launched !! Codename !! Cores !! Threads !! L2$ !! L3$ !! TDP !! Frequency !! Turbo
|-
| 02CY296 || November 2017 || Sforza || 22 || 88 || 5.5 MiB || 110 MiB || 190 W || 2.75 GHz || 3.8 GHz
|-
| 02CY227 || November 2017 || Sforza || 22 || 88 || 5.5 MiB || 110 MiB || 190 W || 2.6 GHz || 3.8 GHz
|-
| 02CY414 || November 2017 || Sforza || 22 || 88 || 5.5 MiB || 110 MiB || 160 W || 2.25 GHz || 3.8 GHz
|-
| 02CY228 || November 2017 || Sforza || 20 || 80 || 5 MiB || 100 MiB || 190 W || 2.7 GHz || 3.8 GHz
|-
| 02CY415 || November 2017 || Sforza || 20 || 80 || 5 MiB || 100 MiB || 160 W || 2.4 GHz || 3.8 GHz
|-
| 02CY416 || November 2017 || Sforza || 18 || 72 || 4.5 MiB || 90 MiB || 130 W || 2.25 GHz || 3.8 GHz
|-
| 02CY489 || November 2017 || Sforza || 18 || 72 || 4.5 MiB || 90 MiB || 190 W || 2.8 GHz || 3.8 GHz
|-
| 02CY230 || November 2017 || Sforza || 16 || 64 || 4 MiB || 80 MiB || 190 W || 2.9 GHz || 3.8 GHz
|-
| 02AA986 || November 2017 || Sforza || 16 || 64 || 4 MiB || 80 MiB || 190 W || 2.9 GHz || 3.8 GHz
|-
| 02CY417 || November 2017 || Sforza || 16 || 64 || 4 MiB || 80 MiB || 130 W || 2.3 GHz || 3.8 GHz
|-
| 02CY771 || November 2017 || Sforza || 12 || 48 || 3 MiB || 60 MiB || 105 W || 2.2 GHz || 3.8 GHz
|-
| 02CY089 || November 2017 || Sforza || 8 || 32 || 4 MiB || 80 MiB || 160 W || 3.5 GHz || 3.8 GHz
|-
| 02CY297 || November 2017 || Sforza || 4 || 16 || 2 MiB || 40 MiB || 90 W || 3.2 GHz || 3.8 GHz
|}

Count: 13
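
As a consistency check on the list above, each entry can be tested against two relationships implied elsewhere on this page: four threads per core (SMT4) and a 1:20 ratio between L2 and L3 capacity (512 KiB of L2 for every 10 MiB of L3). The sketch below is illustrative; the tuple layout and variable names are assumptions, and the data is transcribed from the list.

```python
# (model, cores, threads, L2 MiB, L3 MiB), transcribed from the processor list.
processors = [
    ("02CY296", 22, 88, 5.5, 110), ("02CY227", 22, 88, 5.5, 110),
    ("02CY414", 22, 88, 5.5, 110), ("02CY228", 20, 80, 5.0, 100),
    ("02CY415", 20, 80, 5.0, 100), ("02CY416", 18, 72, 4.5, 90),
    ("02CY489", 18, 72, 4.5, 90),  ("02CY230", 16, 64, 4.0, 80),
    ("02AA986", 16, 64, 4.0, 80),  ("02CY417", 16, 64, 4.0, 80),
    ("02CY771", 12, 48, 3.0, 60),  ("02CY089", 8, 32, 4.0, 80),
    ("02CY297", 4, 16, 2.0, 40),
]

smt4_consistent = all(threads == 4 * cores for _, cores, threads, _, _ in processors)
ratio_consistent = all(l3 == 20 * l2 for _, _, _, l2, l3 in processors)
# Number of 10 MiB L3 regions enabled on each part.
l3_regions = {model: int(l3 // 10) for model, _, _, _, l3 in processors}

print(len(processors), smt4_consistent, ratio_consistent)
```

On the 8-core and 4-core parts the number of enabled L3 regions equals the core count rather than half of it, so those parts appear to have a full L2/L3 slice per active core.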

Bibliography

  • IBM, IEEE Hot Chips 28 Symposium (HCS) 2016.
  • IBM, IEEE Hot Chips 30 Symposium (HCS) 2018.

See also

  • Intel's Skylake & Cascade Lake
  • AMD's Zen
  • Qualcomm's Falkor