From WikiChip
Editing amd/microarchitectures/zen

Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.

This page supports semantic in-text annotations (e.g. "[[Is specified as::World Heritage Site]]") to build structured and queryable content provided by Semantic MediaWiki. For a comprehensive description on how to use annotations or the #ask parser function, please have a look at the getting started, in-text annotation, or inline queries help pages.

Latest revision Your text
Line 10: Line 10:
 
|cores 2=6
 
|cores 2=6
 
|cores 3=8
 
|cores 3=8
|cores 4=12
+
|cores 4=32
|cores 5=16
 
|cores 6=24
 
|cores 7=32
 
 
|type=Superscalar
 
|type=Superscalar
 
|oooe=Yes
 
|oooe=Yes
Line 20: Line 17:
 
|stages=19
 
|stages=19
 
|decode=4-way
 
|decode=4-way
|isa=x86-64
+
|isa=x86-16
 +
|isa 2=x86-32
 +
|isa 3=x86-64
 
|extension=MOVBE
 
|extension=MOVBE
 
|extension 2=MMX
 
|extension 2=MMX
Line 57: Line 56:
 
|l3 per=core
 
|l3 per=core
 
|l3 desc=16-way set associative
 
|l3 desc=16-way set associative
|core name=Naples
+
|core name=Raven Ridge
|core name 2=Whitehaven
+
|core name 2=Summit Ridge
|core name 3=Summit Ridge
+
|core name 3=Snowy Owl
|core name 4=Raven Ridge
+
|core name 4=Naples
|core name 5=Snowy Owl
 
|core name 6=Great Horned Owl
 
|core name 7=Banded Kestrel
 
 
|predecessor=Excavator
 
|predecessor=Excavator
 
|predecessor link=amd/microarchitectures/excavator
 
|predecessor link=amd/microarchitectures/excavator
 
|predecessor 2=Puma
 
|predecessor 2=Puma
 
|predecessor 2 link=amd/microarchitectures/puma
 
|predecessor 2 link=amd/microarchitectures/puma
|successor=Zen+
+
|successor=Zen 2
|successor link=amd/microarchitectures/zen+
+
|successor link=amd/microarchitectures/zen 2
 
|pipeline=Yes
 
|pipeline=Yes
 
|issues=4
 
|issues=4
 +
|inst=Yes
 +
|cache=Yes
 
|core names=Yes
 
|core names=Yes
 +
|succession=Yes
 
}}
 
}}
'''Zen''' ('''family 17h''') is the [[microarchitecture]] developed by [[AMD]] as a successor to both {{\\|Excavator}} and {{\\|Puma}}. Zen is an entirely new design, built from the ground up for optimal balance of performance and power capable of covering the entire computing spectrum from fanless notebooks to high-performance desktop computers. Zen was officially launched on March 2, [[2017]]. Zen was replaced by {{\\|Zen+}} in [[2018]].
+
'''Zen''' ('''family 17h''') is the [[microarchitecture]] developed by [[AMD]] as a successor to both {{\\|Excavator}} and {{\\|Puma}}. Zen is an entirely new design, built from the ground up for optimal balance of performance and power capable of covering the entire computing spectrum from fanless notebooks to high-performance desktop computers. Zen was officially launched on March 2, [[2017]]. Zen is set to be eventually replaced by {{\\|Zen 2}}.
  
For performance desktop and mobile computing, Zen is branded as {{amd|Athlon}}, {{amd|Ryzen 3}}, {{amd|Ryzen 5}}, {{amd|Ryzen 7}}, {{amd|Ryzen 9}} and {{amd|Ryzen Threadripper}} processors. For servers, Zen is branded as {{amd|EPYC}}.
+
For performance desktop and mobile computing, Zen is branded as {{amd|Ryzen 3}}, {{amd|Ryzen 5}}, and {{amd|Ryzen 7}} processors. For servers, Zen is branded as {{amd|EPYC}}.
  
 
== Etymology ==
 
== Etymology ==
Line 83: Line 82:
  
 
== Codenames ==
 
== Codenames ==
 +
{{future information}}
 +
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
Line 89: Line 90:
 
| {{amd|Naples|l=core}} || Up to 32/64 || High-end server [[multiprocessors]]
 
| {{amd|Naples|l=core}} || Up to 32/64 || High-end server [[multiprocessors]]
 
|-
 
|-
| {{amd|Whitehaven|l=core}} || Up to 16/32 || Enthusiasts market processors
+
| {{amd|Snowy Owl|l=core}} || 16/32 || Mid-range server processors
|-
 
| {{amd|Summit Ridge|l=core}} || Up to 8/16 || Mainstream to high-end desktops
 
|-
 
| {{amd|Raven Ridge|l=core}} || Up to 4/8 || Mobile processors with {{\\|Vega}} GPU
 
 
|-
 
|-
| Dali || Up to 2/4 || Budget mobile processors with {{\\|Vega}} GPU
+
| {{amd|Summit Ridge|l=core}} || Up to 8/16 || Mainstream to high-end desktops & enthusiasts market processors
 
|-
 
|-
| {{amd|Snowy Owl|l=core}} || Up to 16/32 || Embedded edge processors
+
| {{amd|Raven Ridge|l=core}} || 4/8 || Mainstream desktop & mobile processors with GPU
|-
 
| {{amd|Great Horned Owl|l=core}} || Up to 4/8 || Embedded processors with {{\\|Vega}} GPU
 
|-
 
| {{amd|Banded Kestrel|l=core}} || Up to 2/4 || Low-power/Cost-sensitive embedded processors with {{\\|Vega}} GPU  
 
 
|}
 
|}
  
Line 115: Line 108:
 
! Cores !! Unlocked !! {{x86|AVX2}} !! [[SMT]]  !! {{amd|XFR}} !! [[IGP]] !! [[ECC]] !! [[Multiprocessing|MP]]
 
! Cores !! Unlocked !! {{x86|AVX2}} !! [[SMT]]  !! {{amd|XFR}} !! [[IGP]] !! [[ECC]] !! [[Multiprocessing|MP]]
 
|-
 
|-
! colspan="11" | Mainstream
+
| [[File:amd ryzen 3 logo.png|75px|link=Ryzen 3]] || {{amd|Ryzen 3}} || Low-end Performance || [[quad-core|Quad]] || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" | ✘ || style="background-color: #f9dcb3;" | ✔/✘ || style="background-color: #ffdad6;" | ✘ || style="background-color: #d6ffd8;" | ✔ || style="background-color: #ffdad6;" |
 
|-
 
|-
| [[File:amd ryzen 3 logo.png|75px|link=Ryzen 3]] || {{amd|Ryzen 3}} || Entry level Performance || [[quad-core|Quad]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|some}} || {{tchk|yes}} || {{tchk|no}}
+
| rowspan="2" | [[File:amd ryzen 5 logo.png|75px|link=Ryzen 5]] || rowspan="2" | {{amd|Ryzen 5}} || rowspan="2" | Mid-range Performance || [[quad-core|Quad]] || style="background-color: #d6ffd8;" | || style="background-color: #d6ffd8;" | || style="background-color: #d6ffd8;" | || style="background-color: #f9dcb3;" | ✔/✘ || style="background-color: #ffdad6;" | || style="background-color: #d6ffd8;" | || style="background-color: #ffdad6;" |
 
|-
 
|-
| rowspan="2" | [[File:amd ryzen 5 logo.png|75px|link=Ryzen 5]] || rowspan="2" | {{amd|Ryzen 5}} || rowspan="2" | Mid-range Performance || [[quad-core|Quad]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|some}} || {{tchk|yes}} || {{tchk|no}}
+
| [[hexa-core|Hexa]] || style="background-color: #d6ffd8;" | || style="background-color: #d6ffd8;" | || style="background-color: #d6ffd8;" | || style="background-color: #f9dcb3;" | ✔/✘ || style="background-color: #ffdad6;" | || style="background-color: #d6ffd8;" | || style="background-color: #ffdad6;" |
 
|-
 
|-
| [[hexa-core|Hexa]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}}
+
| [[File:amd ryzen 7 logo.png|75px|link=Ryzen 7]] || {{amd|Ryzen 7}} || High-end Performance / Enthusiasts || [[octa-core|Octa]] || style="background-color: #d6ffd8;" | ✔ || style="background-color: #d6ffd8;" | || style="background-color: #d6ffd8;" | || style="background-color: #f9dcb3;" | ✔/✘ || style="background-color: #ffdad6;" | || style="background-color: #d6ffd8;" | || style="background-color: #ffdad6;" |
 
|-
 
|-
| [[File:amd ryzen 7 logo.png|75px|link=Ryzen 7]] || {{amd|Ryzen 7}} || High-end Performance || [[octa-core|Octa]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}}
+
| [[File:amd epyc logo.png|75px|link=EPYC]] || {{amd|EPYC}} || High-performance Server Processor || [[24 cores|24]]-[[32 cores|32]] || style="background-color: #d6ffd8;" | || style="background-color: #d6ffd8;" | || style="background-color: #d6ffd8;" | ||   || style="background-color: #ffdad6;" | || style="background-color: #d6ffd8;" | || style="background-color: #d6ffd8;" |
|-
 
| [[File:amd ryzen 9 logo.png|75px|link=Ryzen 7]] || {{amd|Ryzen 9}} || High-end Performance || [[12 cores|12]]-[[16 cores|16]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}}
 
|-
 
! colspan="11" | Enthusiasts / Workstations
 
|-
 
| [[File:ryzen threadripper logo.png|75px|link=Ryzen Threadripper]] || {{amd|Ryzen Threadripper}} || Enthusiasts || [[8 cores|8]]-[[16 cores|16]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}}
 
|-
 
! colspan="11" | Servers
 
|-
 
| [[File:amd epyc logo.png|75px|link=amd/epyc]] || {{amd|EPYC}} || High-performance Server Processor || [[8 cores|8]]-[[32 cores|32]] || {{tchk|no}} || {{tchk|yes}} || {{tchk|yes}} ||   || {{tchk|no}} || {{tchk|yes}} || {{tchk|yes}}
 
|-
 
! colspan="11" | Embedded / Edge
 
|-
 
| [[File:epyc embedded logo.png|75px|link=amd/epyc embedded]] || {{amd|EPYC Embedded}} || Embedded / Edge Server Processor || [[8 cores|8]]-[[16 cores|16]] || {{tchk|no}} || {{tchk|yes}} || {{tchk|some}} ||   || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}}
 
|-
 
| [[File:ryzen embedded logo.png|75px|link=amd/ryzen embedded]] || {{amd|Ryzen Embedded}} || Embedded APUs || [[4 cores|4]] ||   || {{tchk|yes}} || {{tchk|some}} ||   || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}}
 
 
|}
 
|}
  
 
* '''Note:''' While a model has an unlocked multiplier, not all chipsets support overclocking. (see [[#Sockets/Platform|§Sockets]])
 
* '''Note:''' While a model has an unlocked multiplier, not all chipsets support overclocking. (see [[#Sockets/Platform|§Sockets]])
* '''Note:''' 'X' models will enjoy "Full XFR" providing an additional +100 MHz (200 for {{amd|1500X}} and {{amd|Threadripper}} line) when sufficient thermo/electric requirements are met. Non-X models are limited to just +50 MHz.
+
* '''Note:''' 'X' models will enjoy "Full XFR" providing an additional +100 MHz (200 for {{amd|1500X}}) when sufficient thermo/electric requirements are met. Non-X models are limited to just +50 MHz.
  
 
=== Identification ===
 
=== Identification ===
Line 156: Line 133:
 
| ex 7      = X
 
| ex 7      = X
 
| ex 2 1    = Ryzen
 
| ex 2 1    = Ryzen
| ex 2 2    = 5
+
| ex 2 2    = 3
 
| ex 2 3    =   
 
| ex 2 3    =   
| ex 2 4    = 3
+
| ex 2 4    = 1
| ex 2 5    = 5
+
| ex 2 5    = 2
| ex 2 6    = 50
+
| ex 2 6    = 00
| ex 2 7    = H
+
| ex 2 7    = M
| ex 3 1    = Ryzen
 
| ex 3 2    = 3
 
| ex 3 3    =   
 
| ex 3 4    = 2
 
| ex 3 5    = 2
 
| ex 3 6    = 00
 
| ex 3 7    = U
 
 
| desc 1    = '''Brand Name'''<br><table><tr><td style="width: 50px;">'''{{amd|Ryzen}}'''</td><td></td></tr></table>
 
| desc 1    = '''Brand Name'''<br><table><tr><td style="width: 50px;">'''{{amd|Ryzen}}'''</td><td></td></tr></table>
| desc 2    = '''Market segment'''<br><table><tr><td style="width: 50px;">'''3'''</td><td>Low-end performance</td></tr><tr><td>'''5'''</td><td>Mid-range performance</td></tr><tr><td>'''7'''</td><td>Enthusiast / High-end performance</td></tr><tr><td>'''9'''</td><td>High-end performance / Workstation</td></tr><tr><td>'''Threadripper'''</td><td>High-end performance / Workstation</td></tr></table>
+
| desc 2    = '''Market segment'''<br><table><tr><td style="width: 50px;">'''3'''</td><td>Low-end performance</td></tr><tr><td>'''5'''</td><td>Mid-range performance</td></tr><tr><td>'''7'''</td><td>Enthusiast / High-end performance</td></tr></table>
 
| desc 3    =  
 
| desc 3    =  
| desc 4    = '''Generation'''<br><table><tr><td style="width: 50px;">'''1'''</td><td>First generation Zen (2017)</td></tr><tr><td style="width: 50px;">'''2'''</td><td>First generation Zen for Mobile and Desktop APUs (2017); First generation Zen with enhanced node (Zen+)(2018)</td></tr><tr><td style="width: 50px;">'''3'''</td><td>First generation Zen with enhanced node (Zen+) for Mobile and Desktop APUs (2019); Second generation Zen (Zen 2)(2019)</td></tr><tr><td style="width: 50px;">'''4'''</td><td>Second generation Zen (Zen 2) for Mobile and Desktop APUs (2020)</td></tr><tr><td style="width: 50px;">'''5'''</td><td>Third generation Zen (Zen 3)(2020)</td></tr></table>
+
| desc 4    = '''Generation'''<br><table><tr><td style="width: 50px;">'''1'''</td><td>First generation Zen (2017)</td></tr></table>
| desc 5    = '''Performance Level'''<br><table><tr><td style="width: 50px;">'''9'''</td><td>Extreme (Ryzen Threadripper & Ryzen 9)</td></tr><tr><td>'''8'''</td><td>Highest (Ryzen 7)</td></tr><tr><td>'''6-7'''</td><td>High (Ryzen 5 & 7)</td></tr><tr><td>'''4-5'''</td><td>Mid (Ryzen 5)</td></tr><tr><td>'''1-3'''</td><td>Low (Ryzen 3)</td></tr></table>
+
| desc 5    = '''Performance Level'''<br><table><tr><td style="width: 50px;">'''8'''</td><td>Highest</td></tr><tr><td>'''6-7'''</td><td>High</td></tr><tr><td>'''4-5'''</td><td>Mid</td></tr><tr><td>'''1-3'''</td><td>Low</td></tr></table>
| desc 6    = '''Model Number'''<br>Speed bump and/or differentiator for high core count chips (8 cores+).
+
| desc 6    = '''Model Number'''<br>Reserved for future speed bump/differentiator. Currently all models are "00".
| desc 7    = '''Power Segment'''<br><table><tr><td style="width: 50px;">'''(none)'''</td><td>Standard Desktop</td></tr><tr><td>'''U'''</td><td>Standard Mobile</td></tr><tr><td>'''X'''</td><td>High Performance, with XFR</td></tr><tr><td style="width: 50px;">'''WX'''</td><td>High Core Count Workstation</td></tr><tr><td>'''G'''</td><td>Desktop + [[IGP]]</td></tr><tr><td>'''E'''</td><td>Low-power Desktop</td></tr><tr><td>'''GE'''</td><td>Low-power Desktop + [[IGP]]</td></tr><tr><td>'''M'''</td><td>Low-power Mobile</td></tr><tr><td>'''H'''</td><td>High-performance Mobile</td></tr><tr><td>'''S'''</td><td>Slim Mobile</td></tr><tr><td>'''HS'''</td><td>High-Performance Slim Mobile</td></tr><tr><td>'''XT'''</td><td>Extreme</td></tr></table>
+
| desc 7    = '''Power Segment'''<br><table><tr><td style="width: 50px;">'''(none)'''</td><td>Standard Desktop</td></tr><tr><td>'''U'''</td><td>Standard Mobile</td></tr><tr><td>'''X'''</td><td>High Performance, with XFR</td></tr><tr><td>'''G'''</td><td>Desktop + [[IGP]]</td></tr><tr><td>'''T'''</td><td>Low-power Desktop</td></tr><tr><td>'''S'''</td><td>Low-power Desktop + [[IGP]]</td></tr><tr><td>'''M'''</td><td>Low-power Mobile</td></tr><tr><td>'''H'''</td><td>High-performance Mobile</td></tr></table>
 
}}
 
}}
  
 
== Release Dates ==  
 
== Release Dates ==  
[[File:ryzen threadripper.png|right|thumb|First 16-core HEDT market CPU]]
+
[[File:ryzen threadripper.png|right|thumb|First 16-code HEDT market CPU]]
The first set of processors, as part of the {{amd|Ryzen 7}} family were introduced at an AMD event on February 22, 2017 before the Game Developer Conference (GDC). However initial models don't get shipped until March 2. {{amd|Ryzen 5}} [[hexa-core]] and [[quad-core]] variants were released on April 11, 2017. Server processors are set to be released in by the end of Q2, 2017. In October 2017, AMD launched mobile Zen-based processors featuring {{\\|Vega}} GPUs.
+
The first set of processors, as part of the {{amd|Ryzen 7}} family were introduced at an AMD event on February 22, 2017 before the Game Developer Conference (GDC). However initial models don't get shipped until March 2. {{amd|Ryzen 5}} [[hexa-core]] and [[quad-core]] variants were released on April 11, 2017. Server processors are set to be released in by the end of Q2, 2017. Mobile processors are expected to be released by the end of 2017.
 
 
 
 
::[[File:amd zen ryzen rollout.png|600px]]
 
 
 
{{clear}}
 
  
 
== Process Technology ==
 
== Process Technology ==
 
{{see also|14 nm process}}
 
{{see also|14 nm process}}
Zen is manufactured on [[Global Foundries]]' [[14 nm process]] Low Power Plus (14LPP). AMD's previous microarchitectures were based on [[32 nm|32]] and [[28 nm|28]] nanometer processes. The jump to 14 nm was part of AMD's attempt to remain competitive against Intel (Both {{intel|Skylake}} and {{intel|Kaby Lake}} are also manufactured on 14 nm). The move to 14 nm will bring along related benefits of a smaller node such as reduced heat, reduced power consumption, and higher density for identical designs.
+
Zen is planned to be manufactured on [[Global Foundries]]' [[14 nm process]], same one used by [[IBM]] for their {{ibm|POWER9|l=arch}}. AMD's previous microarchitectures were based on [[32 nm|32]] and [[28 nm|28]] nanometer processes. The jump to 14 nm is part of AMD's attempt to remain competitive against Intel (Both {{intel|SkyLake}} and {{intel|Kaby Lake}} are also manufactured on 14 nm although by late 2017 Intel plans on moving on to {{intel|Cannonlake}} and [[10 nm process]]). The move to 14 nm will bring along related benefits of a smaller node such as reduced heat, reduced power consumption, and higher density for identical designs.
  
 
== Compatibility ==
 
== Compatibility ==
[[Linux]] added initial support for Zen starting with Linux Kernel 4.10. [[Microsoft]] will only support Windows 10 for Zen.
+
[[Linux]] added initial support for Zen starting with Linux Kernel 4.10. Microsoft will only support Windows 10 for Zen.
  
 
{| class="wikitable"
 
{| class="wikitable"
 
! Vendor !! OS  !! Version !! Notes
 
! Vendor !! OS  !! Version !! Notes
 
|-
 
|-
| rowspan="3" | [[Microsoft]] || rowspan="3" | Windows || style="background-color: #ffdad6;" | Windows 7 || No Support
+
| rowspan="3" | Microsoft || rowspan="3" | Windows || style="background-color: #ffdad6;" | Windows 7 || No Support
 
|-
 
|-
 
| style="background-color: #ffdad6;" | Windows 8 || No Support
 
| style="background-color: #ffdad6;" | Windows 8 || No Support
Line 203: Line 168:
 
| style="background-color: #d6ffd8;" | Windows 10 || Support
 
| style="background-color: #d6ffd8;" | Windows 10 || Support
 
|-  
 
|-  
| rowspan="2" | Linux || rowspan="2" | Linux || style="background-color: #d6ffd8;" | Kernel 4.10 || Initial Support
+
| Linux || Linux || style="background-color: #d6ffd8;" | Kernel 4.10 || Initial Support
|-
 
| style="background-color: #d6ffd8;" | Kernel 4.15 || Full Support
 
 
|}
 
|}
  
Line 214: Line 177:
 
! Compiler !! Arch-Specific || Arch-Favorable
 
! Compiler !! Arch-Specific || Arch-Favorable
 
|-
 
|-
| [[AOCC]] || <code>-march=znver1</code> || <code>-mtune=znver1</code>
+
| [[AOCC]] || <code>‐march=znver1</code> || <code>-mtune=znver1</code>
 
|-
 
|-
 
| [[GCC]] || <code>-march=znver1</code> || <code>-mtune=znver1</code>
 
| [[GCC]] || <code>-march=znver1</code> || <code>-mtune=znver1</code>
Line 221: Line 184:
 
|-
 
|-
 
| [[Visual Studio]] || <code>/arch:AVX2</code> || ?
 
| [[Visual Studio]] || <code>/arch:AVX2</code> || ?
|}
 
 
=== CPUID ===
 
{{see also|amd/cpuid|l1=AMD CPUIDs}}
 
{| class="wikitable tc1 tc2 tc3 tc4"
 
! Core !! Extended<br>Family !! Family !! Extended<br>Model !! Model
 
|-
 
| rowspan="2" | {{amd|Naples|l=core}}, {{amd|Whitehaven|l=core}}, {{amd|Summit Ridge|l=core}}
 
|| 0x8 || 0xF || 0x0 || 0x1
 
|-
 
| colspan="4" | Family 23 Model 1
 
|-
 
| rowspan="2" | {{amd|Raven Ridge|l=core}} || 0x8 || 0xF || 0x1 || 0x1
 
|-
 
| colspan="4" | Family 23 Model 17
 
 
|}
 
|}
  
Line 242: Line 190:
  
 
=== Key changes from {{\\|Excavator}} ===
 
=== Key changes from {{\\|Excavator}} ===
* Zen was designed to succeed ''both'' {{\\|Excavator}} (High-performance) and {{\\|Puma}} (Low-power) covering the entire range in one architecture
+
* Zen was designed to succeed BOTH {{\\|Excavator}} (High-performance) and {{\\|Puma}} (Low-power) covering the entire range in one architecture
 
** Cover the entire spectrum from fanless notebooks to high-performance desktops
 
** Cover the entire spectrum from fanless notebooks to high-performance desktops
 
** More aggressive clock gating with multi-level regions
 
** More aggressive clock gating with multi-level regions
Line 248: Line 196:
 
*** >15% switching capacitance (C<sub>AC</sub>) improvement
 
*** >15% switching capacitance (C<sub>AC</sub>) improvement
 
* Utilizes [[14 nm process]] (from [[28 nm]])
 
* Utilizes [[14 nm process]] (from [[28 nm]])
* 52% improvement in IPC per core for a single-thread (AMD Claim)
+
* 52% improvement in IPC per core for a single-thread (From Excavator)
** From {{\\|Piledriver}} to Zen
+
** Based on the industry-standardized SPEC CPU2006 test suit
** Based on the industry-standardized SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz
+
* Up to 3.7x performance/watt improvment
* Up to 3.performance/watt improvment
 
 
* Return to conventional high-performance x86 design
 
* Return to conventional high-performance x86 design
 
** Traditional design for cores without shared blocks (e.g. shared SIMD units)
 
** Traditional design for cores without shared blocks (e.g. shared SIMD units)
 
** Large beefier core design
 
** Large beefier core design
 
* Core engine
 
* Core engine
** Simultaneous Multithreading (SMT) support, 2 threads/core (see [[#Simultaneous_MultiThreading (SMT)|§ Simultaneous MultiThreading]] for details)
+
** Simultaneous Multithreading (SMT) support, 2 threads/core (see [[#Simultaneous_MultiThreading_.28SMT.29|§ Simultaneous MultiThreading]] for details)
 
** Branch Predictor
 
** Branch Predictor
 
*** Improved branch mispredictions
 
*** Improved branch mispredictions
Line 262: Line 209:
 
**** Lower miss latency penalty
 
**** Lower miss latency penalty
 
*** BP is now decoupled from fetch stage
 
*** BP is now decoupled from fetch stage
** Large μop cache (2K instructions)
+
** Large Op cache (2K instructions)
 
** Wider μop dispatch (6, up from 4)
 
** Wider μop dispatch (6, up from 4)
 
** Larger instruction scheduler
 
** Larger instruction scheduler
Line 279: Line 226:
 
*** 64 KiB (double from previous capacity of 32 KiB)
 
*** 64 KiB (double from previous capacity of 32 KiB)
 
*** Write-back L1 cache eviction policy (From write-through)
 
*** Write-back L1 cache eviction policy (From write-through)
*** the bandwidth
+
*** 2x the bandwidth
 
** L2
 
** L2
*** the bandwidth
+
*** 2x the bandwidth
 
*** Faster L2 cache
 
*** Faster L2 cache
 
** Faster L3 cache
 
** Faster L3 cache
 +
** Large Op cache
 
** Better L1$ and L2$ data prefetcher
 
** Better L1$ and L2$ data prefetcher
** L3 bandwidth
+
** 5x L3 bandwidth
 
** Move elimination block added
 
** Move elimination block added
 
** Page Table Entry (PTE) Coalescing
 
** Page Table Entry (PTE) Coalescing
Line 301: Line 249:
 
While not new, Zen also supports {{x86|AVX}}, {{x86|AVX2}}, {{x86|FMA3}}, {{x86|BMI1}}, {{x86|BMI2}}, {{x86|AES}}, {{x86|RdRand}}, {{x86|SMEP}}. Note that with Zen, AMD dropped support for {{x86|XOP}}, {{x86|TBM}}, and {{x86|LWP}}.
 
While not new, Zen also supports {{x86|AVX}}, {{x86|AVX2}}, {{x86|FMA3}}, {{x86|BMI1}}, {{x86|BMI2}}, {{x86|AES}}, {{x86|RdRand}}, {{x86|SMEP}}. Note that with Zen, AMD dropped support for {{x86|XOP}}, {{x86|TBM}}, and {{x86|LWP}}.
  
'''Note:''' WikiChip's testing shows {{x86|FMA4}} still works despite not being officially supported and not even reported by [[CPUID]]. This has also been confirmed by [http://agner.org/optimize/blog/read.php?i=838 Agner here]. Those tests were not exhaustive. Never use them in production.
+
'''Note:''' WikiChip's testing shows {{x86|FMA4}} still works despite not being officially supported and not even reported by [[CPUID]]. This has also been confirmed by [http://agner.org/optimize/blog/read.php?i=838 Agner here].
  
 
=== Block Diagram ===
 
=== Block Diagram ===
 
==== Client Configuration ====
 
==== Client Configuration ====
 
===== Entire SoC Overview =====
 
===== Entire SoC Overview =====
[[File:zen soc block.svg|600px]]
+
[[File:zen soc block.svg|900px]]
 
===== Individual Core =====
 
===== Individual Core =====
 
[[File:zen block diagram.svg]]
 
[[File:zen block diagram.svg]]
 +
===== {{amd|Summit Ridge|l=core}} =====
 +
[[File:AMD Summit Ridge SoC.svg]]
  
==== Single/Multi-chip Packages ====
+
==== Server Configuration ====
===== Single-die =====
 
Single-die as used in {{amd|Summit Ridge|l=core}}:
 
:[[File:AMD Summit Ridge SoC.svg|700px]]
 
 
 
===== 2-die MCP =====
 
2-die MCP used for {{amd|Threadripper}}:
 
 
 
:[[File:AMD Threadripper SoC.svg|700px]]
 
===== 4-die MCP =====
 
4-die MCP used for {{amd|EPYC}}:
 
 
 
:[[File:AMD Naples SoC.svg|800px]]
 
 
 
====== 4-die CCX configs ======
 
 
<div style="display: inline-block">
 
<div style="display: inline-block">
<div style="float: left; margin: 15px;">'''32-core configuration:'''<br>[[File:zen soc block (32 cores).svg|350px]]</div>
+
<div style="float: left; margin: 15px;">'''32-core configuration:'''<br>[[File:zen soc block (32 cores).svg|450px]]</div>
<div style="float: left; margin: 15px;">'''24-core configuration:'''<br>[[File:zen soc block (24 cores).svg|350px]]</div>
+
<div style="float: left; margin: 15px;">'''24-core configuration:'''<br>[[File:zen soc block (24 cores).svg|450px]]</div>
<div style="float: left; margin: 15px;">'''16-core configuration:'''<br>[[File:zen soc block (16 cores).svg|350px]]</div>
+
<div style="float: left; margin: 15px;">'''16-core configuration:'''<br>[[File:zen soc block (16 cores).svg|450px]]</div>
<div style="float: left; margin: 15px;">'''8-core configuration:'''<br>[[File:zen soc block (8 cores).svg|350px]]</div>
+
<div style="float: left; margin: 15px;">'''8-core configuration:'''<br>[[File:zen soc block (8 cores).svg|450px]]</div>
 
</div>
 
</div>
 +
 +
===== {{amd|Naples|l=core}} =====
 +
[[File:AMD Naples SoC.svg|900px]]
  
 
=== Memory Hierarchy ===
 
=== Memory Hierarchy ===
Line 337: Line 276:
 
*** 2,048 µOPs, 8-way set associative
 
*** 2,048 µOPs, 8-way set associative
 
**** 32-sets, 8-µOP line size
 
**** 32-sets, 8-µOP line size
*** Parity protected
 
 
** L1I Cache:
 
** L1I Cache:
 
*** 64 KiB 4-way set associative
 
*** 64 KiB 4-way set associative
 
**** 256-sets, 64 B line size
 
**** 256-sets, 64 B line size
**** Shared by the two threads, per core
+
**** shared by the two threads, per core
*** Parity protected
 
 
** L1D Cache:
 
** L1D Cache:
 
*** 32 KiB 8-way set associative
 
*** 32 KiB 8-way set associative
 
**** 64-sets, 64 B line size
 
**** 64-sets, 64 B line size
**** Write-back policy
+
**** write-back policy
 
*** 4-5 cycles latency for Int
 
*** 4-5 cycles latency for Int
 
*** 7-8 cycles latency for FP
 
*** 7-8 cycles latency for FP
*** SEC-DED ECC
 
 
** L2 Cache:
 
** L2 Cache:
 
*** 512 KiB 8-way set associative
 
*** 512 KiB 8-way set associative
 
*** 1,024-sets, 64 B line size
 
*** 1,024-sets, 64 B line size
*** Write-back policy
+
*** write-back policy
 
*** Inclusive of L1
 
*** Inclusive of L1
*** Latency:
+
*** 17 cycles latency
**** 17 cycles latency (ONLY {{amd|Summit Ridge|l=core}})
 
**** 12 cycles latency (All others)
 
*** DEC-TED ECC
 
 
** L3 Cache:
 
** L3 Cache:
 
*** Victim cache
 
*** Victim cache
*** Summit Ridge, Naples: 8 MiB/CCX, shared across all cores.
+
*** 8 MiB/CCX, shared across all cores.
*** Raven Ridge: 4 MiB/CCX, shared across all cores.
 
 
*** 16-way set associative
 
*** 16-way set associative
 
**** 8,192-sets, 64 B line size
 
**** 8,192-sets, 64 B line size
 
*** 40 cycles latency
 
*** 40 cycles latency
*** DEC-TED ECC
 
 
** System DRAM:
 
** System DRAM:
*** 2 channels per die
+
*** 2 Channels
*** Summit Ridge: up to PC4-21300U (DDR4-2666 UDIMM), ECC supported
+
*** Up to DRR4-2666
*** Raven Ridge: up to PC4-23466U (DDR4-2933 UDIMM), ECC supported by PRO models
+
*** ECC Support
*** Naples: up to PC4-21300L (DDR4-2666 RDIMM/LRDIMM), ECC supported
 
*** ECC: x4 DRAM device failure correction (Chipkill), x8 SEC-DED ECC, Patrol and Demand scrubbing, Data poisoning
 
  
 
Zen TLB consists of dedicated level one TLB for instruction cache and another one for data cache.
 
Zen TLB consists of dedicated level one TLB for instruction cache and another one for data cache.
Line 381: Line 310:
 
*** 64 entry L1 TLB, all page sizes
 
*** 64 entry L1 TLB, all page sizes
 
*** 512 entry L2 TLB, no 1G pages
 
*** 512 entry L2 TLB, no 1G pages
*** Parity protected
 
 
** DTLB
 
** DTLB
 
*** 64 entry L1 TLB, all page sizes
 
*** 64 entry L1 TLB, all page sizes
 
*** 1,532-entry L2 TLB, no 1G pages
 
*** 1,532-entry L2 TLB, no 1G pages
*** Parity protected
 
  
 
== Core ==
 
== Core ==
Line 401: Line 328:
 
Unlike many of Intel's recent microarchitectures (such as {{intel|Skylake|l=arch}} and {{intel|Kaby Lake|l=arch}}) which make use of a unified scheduler, AMD continue to use a split pipeline design. µOP are decoupled at the µOP Queue and are sent through the two distinct pipelines to either the Integer side or the FP side. The two sections are completely separate, each featuring separate schedulers, queues, and execution units. The Integer side splits up the µOPs via a set of individual schedulers that feed the various ALU units. On the floating point side, there is a different scheduler to handle the 128-bit FP operations. Zen support all modern {{x86|extensions|x86 extensions}} including {{x86|AVX}}/{{x86|AVX2}}, {{x86|BMI1}}/{{x86|BMI2}}, and {{x86|AES}}. Zen also supports {{x86|SHA}}, secure hash implementation instructions that are currently only found in [[Intel]]'s ultra-low power microarchitectures (e.g. {{intel|Goldmont|l=arch}}) but not in their mainstream processors.
 
Unlike many of Intel's recent microarchitectures (such as {{intel|Skylake|l=arch}} and {{intel|Kaby Lake|l=arch}}) which make use of a unified scheduler, AMD continue to use a split pipeline design. µOP are decoupled at the µOP Queue and are sent through the two distinct pipelines to either the Integer side or the FP side. The two sections are completely separate, each featuring separate schedulers, queues, and execution units. The Integer side splits up the µOPs via a set of individual schedulers that feed the various ALU units. On the floating point side, there is a different scheduler to handle the 128-bit FP operations. Zen support all modern {{x86|extensions|x86 extensions}} including {{x86|AVX}}/{{x86|AVX2}}, {{x86|BMI1}}/{{x86|BMI2}}, and {{x86|AES}}. Zen also supports {{x86|SHA}}, secure hash implementation instructions that are currently only found in [[Intel]]'s ultra-low power microarchitectures (e.g. {{intel|Goldmont|l=arch}}) but not in their mainstream processors.
  
From the memory subsystem point of view, data is fed into the execution units from the [[L1D$]] via the load and store queue (both of which were almost doubled in capacity) via the two [[Address Generation Units]] (AGUs) at the rate of 2 loads and 1 store per cycle. Each core also has a 512 KiB level 2 cache. L2 feeds both the the level 1 data and level 1 instruction caches at 32B per cycle (32B can be sent in either direction (bidirectional bus) each cycle). L2 is connected to the L3 cache which is shared across all cores. As with the L1 to L2 transfers, the L2 also transfers data to the L3 and vice versa at 32B per cycle (32B in either direction each cycle).
+
From the memory subsystem point of view, data is fed into the execution units from the [[L1D$]] via the load and store queue (both of which were almost doubled in capacity) via the two [[Address Generation Units]] (AGUs) at the rate of 2 loads and 1 store per cycle. Each core also has a 512 KiB level 2 cache. L2 feeds both the the level 1 data and level 1 instruction caches at 32B per cycle (32B can be send in either direction (bidirectional bus) each cycle). L2 is connected to the L3 cache which is shared across all cores. As with the L1 to L2 transfers, the L2 also transfers data to the L3 and vice versa at 32B per cycle (32B in either direction each cycle).
  
 
{{clear}}
 
{{clear}}
Line 417: Line 344:
 
* 512-entry L2 TLB, no 1G pages
 
* 512-entry L2 TLB, no 1G pages
  
===== Fetching =====
+
===== fetching =====
 
Instructions are fetched from the [[L2 cache]] at the rate of 32B/cycle. Zen features an asymmetric [[level 1 cache]] with a 64 [[KiB]] [[instruction cache]], double the size of the L1 data cache. Depending on the branch prediction decision instructions may be fetched from the instruction cache or from the [[µOPs]] cache in which eliminates the need for performing the costly instruction decoding.
 
Instructions are fetched from the [[L2 cache]] at the rate of 32B/cycle. Zen features an asymmetric [[level 1 cache]] with a 64 [[KiB]] [[instruction cache]], double the size of the L1 data cache. Depending on the branch prediction decision instructions may be fetched from the instruction cache or from the [[µOPs]] cache in which eliminates the need for performing the costly instruction decoding.
 
[[File:amd zen hc28 decode.png|left|300px]]
 
[[File:amd zen hc28 decode.png|left|300px]]
On the traditional side of decode, instructions are fetched from the L1$ at 32B aligned bytes per cycle and go to the instruction byte buffer and through the pick stage to the decode. Actual tests show the effective throughput is generally much lower (around 16-20 bytes). This is slightly higher than the fetch window in [[Intel]]'s {{intel|Skylake}} which has a 16-byte fetch window. The size of the instruction byte buffer is of 20 entries (10 entries per thread in SMT).
+
On the traditional side of decode, instructions are fetched from the L1$ at 32B aligned bytes per cycle and go to the instruction byte buffer and through the pick stage to the decode. Actual tests show the effective throughput is generally much lower (around 16-20 bytes). This is slightly higher than the fetch window in [[Intel]]'s {{intel|Skylake}} which has a 16-byte fetch window. The size of the instruction byte buffer was not given by AMD but it's expected to be larger than the 16-entry structure found in {{amd|microarchitectures|their previous}} architecture.
  
 
===== µOP cache & x86 tax =====
 
===== µOP cache & x86 tax =====
Line 427: Line 354:
 
The [[µOP cache]] used in Zen is not a [[trace cache]] and much closely resembles the one used by Intel in their microarchitectures since {{intel|Sandy Bridge|l=arch}}. The µOP cache is an independent unit not part of the [[L1I$]] and is not a necessarily a subset of the L1I cache either; I.e., there are instances where there could be a hit in the µOP cache but a miss in the L1$. This happens when an instruction that got stored in the µOP cache gets evicted from L1. During the fetch stage probing must be done from both paths. Zen has a specific unit called 'Micro-Tags' which does the probing and determines whether the instruction should be accessed from the µOP cache or from the L1I$. The µOP cache itself has a dedicated $tags for accessing those µOPs.
 
The [[µOP cache]] used in Zen is not a [[trace cache]] and much closely resembles the one used by Intel in their microarchitectures since {{intel|Sandy Bridge|l=arch}}. The µOP cache is an independent unit not part of the [[L1I$]] and is not a necessarily a subset of the L1I cache either; I.e., there are instances where there could be a hit in the µOP cache but a miss in the L1$. This happens when an instruction that got stored in the µOP cache gets evicted from L1. During the fetch stage probing must be done from both paths. Zen has a specific unit called 'Micro-Tags' which does the probing and determines whether the instruction should be accessed from the µOP cache or from the L1I$. The µOP cache itself has a dedicated $tags for accessing those µOPs.
  
===== Decode =====
+
===== decode =====
 
[[File:amd fastpath single-double (zen).svg|right|450px]]
 
[[File:amd fastpath single-double (zen).svg|right|450px]]
 
Having to execute [[x86]], there are instructions that actually include multiple operations. Some of those operations cannot be realized efficiently in an OoOE design and therefore must be converted into simpler operations. In the front-end, complex x86 instructions are broken down into simpler fixed-length operations called [[macro-operations]] or MOPs (sometimes also called complex OPs or COPs). Those are often mistaken for being "[[RISC]]ish" in nature but they retain their CISC characteristics. MOPS can perform both an arithmetic operation and memory operation (e.g. you can read, modify, and write in a single MOP). MOPs can be further cracked into smaller simpler single fixed length operation called [[micro-operations]] (µOPs). µOPs are a fixed length operation that performs just a single operation (i.e., only a single load, store, or an arithmetic). Traditionally AMD used to distinguish between the two ops, however with Zen AMD simply refers to everything as µOPs although internally they are still two separate concepts.
 
Having to execute [[x86]], there are instructions that actually include multiple operations. Some of those operations cannot be realized efficiently in an OoOE design and therefore must be converted into simpler operations. In the front-end, complex x86 instructions are broken down into simpler fixed-length operations called [[macro-operations]] or MOPs (sometimes also called complex OPs or COPs). Those are often mistaken for being "[[RISC]]ish" in nature but they retain their CISC characteristics. MOPS can perform both an arithmetic operation and memory operation (e.g. you can read, modify, and write in a single MOP). MOPs can be further cracked into smaller simpler single fixed length operation called [[micro-operations]] (µOPs). µOPs are a fixed length operation that performs just a single operation (i.e., only a single load, store, or an arithmetic). Traditionally AMD used to distinguish between the two ops, however with Zen AMD simply refers to everything as µOPs although internally they are still two separate concepts.
Line 433: Line 360:
 
Decoding is done by the 4 Zen decoders. The decode stage allows for four [[x86]] instructions to be decoded per cycle which are in turn sent to the µOP Queue. Previously, in the {{\\|Bulldozer}}/{{\\|Jaguar}}-based designs AMD had two paths: a FastPath Single which emitted a single MOP and a FastPath Double which emitted two MOPs which are in turn sent down the pipe to the schedulers. Michael Clark (Zen's lead architect) noted that Zen has significantly denser MOPs meaning almost all instructions will be a FastPath Single (i.e., one to one transformations). What would normally get broken down into two MOPs in {{\\|Bulldozer}} is now translated into a single dense MOP. It's for those reasons that while up to 8MOPs/cycle can be emitted, usually only 4MOPs/cycle are emitted from the [[instruction decoder|decoders]].
 
Decoding is done by the 4 Zen decoders. The decode stage allows for four [[x86]] instructions to be decoded per cycle which are in turn sent to the µOP Queue. Previously, in the {{\\|Bulldozer}}/{{\\|Jaguar}}-based designs AMD had two paths: a FastPath Single which emitted a single MOP and a FastPath Double which emitted two MOPs which are in turn sent down the pipe to the schedulers. Michael Clark (Zen's lead architect) noted that Zen has significantly denser MOPs meaning almost all instructions will be a FastPath Single (i.e., one to one transformations). What would normally get broken down into two MOPs in {{\\|Bulldozer}} is now translated into a single dense MOP. It's for those reasons that while up to 8MOPs/cycle can be emitted, usually only 4MOPs/cycle are emitted from the [[instruction decoder|decoders]].
  
Dispatch is capable of sending up to 6 µOP to [[Integer]] EX and an up to 4 µOP to the [[Floating Point]] (FP) EX. Zen can dispatch to both at the same time (i.e. for a maximum of 10 µOP per cycle), however, since the retire control unit (RCU) can only handle up to 6 MOPs/cycle, the effective number of dispatched µOPs is likely lower.
+
Dispatch is capable of sending up to 6 µOP to [[Integer]] EX and an additional 4 µOP to the [[Floating Point]] (FP) EX. Zen can dispatch to both at the same time (i.e. for a maximum of 10 µOP per cycle).
  
 
====== MSROM ======
 
====== MSROM ======
Line 441: Line 368:
 
A number of optimization opportunities are exploited at this stage.
 
A number of optimization opportunities are exploited at this stage.
 
====== Stack Engine ======
 
====== Stack Engine ======
At the decode stage Zen incorporates the the Stack Engine Memfile (SEM). Note that while AMD refers to SEM as a new unit, they have had a Stack Engine in their designs since {{\\|K10}}. The Memfile sits between the queue and dispatch monitoring the MOP traffic. The Memfile is capable of performing [[store-to-load forwarding]] right at dispatch for loads that trail behind known stores with physical addresses. Other things such as eliminating stack PUSH/POP operations are also done at this stage so they are effectively a zero-latency instructions; proceeding instructions that rely on the stack pointer are not delayed. This is a fairly effective low-power solution that off-loads some of the work that would otherwise be done by [[AGU]].
+
At the decode stage Zen incorporates the the Stack Engine Memfile (SEM). Note that while AMD refers to SEM as a new unit, they have had a Stack Engine in their designs since {{\\|K10}}. The Memfile sits between the queue and dispatch monitoring the MOP traffic. The Memfile is capable of performing [[store-to-load forwarding]] right at dispatch for loads that trail behind known stores with physical addresses. Other things such as eliminating stack PUSH/POP operations are also done at this stage so they are effectively a zero-latency instructions; proceeding instructions that relay on the stack pointer are not delayed. This is a fairly effective low-power solution that off-loads some of the work that would otherwise be done by [[AGU]].
  
 
====== µOP-Fusion  ======
 
====== µOP-Fusion  ======
Line 451: Line 378:
 
==== Execution Engine ====
 
==== Execution Engine ====
 
[[File:amd zen hc28 integer.png|350px|right]]
 
[[File:amd zen hc28 integer.png|350px|right]]
As mentioned early, Zen returns to a fully partitioned core design with a private L2 cache and private [[FP]]/[[SIMD]] units. Previously those units shared resources spanning two cores. Zen's Execution Engine (Back-End) is split into two major sections: [[integer]] & memory operations and [[floating point]] operations. The two sections are decoupled with independent [[register renaming|renaming]], [[schedulers]], [[queues]], and execution units. Both Integer and FP sections have access to the [[Retire Queue]] which is 192 entries (96 per thread) and can [[retire]] 8 instructions per cycle (independent of either Integer or FP). The wider-than-dispatch retire allows Zen to catch up and free the resources much quicker (previous architectures saw bottleneck at this point in situations where an older op is stalling causing a reduction in performance due to retire needing to catch up to the front of the machine).
+
As mentioned early, Zen returns to a fully partitioned core design with a private L2 cache and private [[FP]]/[[SIMD]] units. Previously those units shared resources spanning two cores. Zen's Execution Engine (Back-End) is split into two major sections: [[integer]] & memory operations and [[floating point]] operations. The two sections are decoupled with independent [[register renaming|renaming]], [[schedulers]], [[queues]], and execution units. Both Integer and FP sections have access to the [[Retire Queue]] which is 192 entries and can [[retire]] 8 instructions per cycle (independent of either Integer or FP). The wider-than-dispatch retire allows Zen to catch up and free the resources much quicker (previous architectures saw bottleneck at this point in situations where an older op is stalling causing a reduction in performance due to retire needing to catch up to the front of the machine).
  
 
Because the two regions are entirely divided, a penalty of one cycle latency will incur for operands that crosses boundaries; for example, if an [[operand]] of an integer arithmetic µOP depends on the result of a floating point µOP operation. This applies both ways. This is a similar to the inter-[[Common Data Bus]] exchanges in Intel's designs (e.g., {{intel|Skylake|l=arch}}) which incur a delay of 1 to 2 cycles when dependent operands cross domains.
 
Because the two regions are entirely divided, a penalty of one cycle latency will incur for operands that crosses boundaries; for example, if an [[operand]] of an integer arithmetic µOP depends on the result of a floating point µOP operation. This applies both ways. This is a similar to the inter-[[Common Data Bus]] exchanges in Intel's designs (e.g., {{intel|Skylake|l=arch}}) which incur a delay of 1 to 2 cycles when dependent operands cross domains.
Line 472: Line 399:
 
The FP deals with all vector operations. The simple integer vector operations (e.g. shift, add) can all be done in one cycle, half the latency of AMD's previous architecture. Basic [[floating point]] math has a latency of three cycles including [[multiplication]] (one additional cycle for double precision). [[Fused multiply-add]] are five cycles.
 
The FP deals with all vector operations. The simple integer vector operations (e.g. shift, add) can all be done in one cycle, half the latency of AMD's previous architecture. Basic [[floating point]] math has a latency of three cycles including [[multiplication]] (one additional cycle for double precision). [[Fused multiply-add]] are five cycles.
  
The FP has a single pipe for 128-bit load operations. In fact, the entire FP side is optimized for 128-bit operations. Zen supports all the latest instructions such as SSE and {{x86|AVX1}}/{{x86|AVX2|2}}. The way 256-bit AVX was designed was so that they can be carried out as two independent 128-bit operations. Zen takes advantage of that by operating on those instructions as two operations; i.e., Zen splits up 256-bit operations into two µOPs so they are effectively half the throughput of their 128-bit operations counterparts. Likewise, stores are also done on 128-bit chunks, making 256-bit loads have an effective throughput of one store every two cycles. The pipes are fairly well balanced, therefore most operations will have at least two pipes to be scheduled on retaining the throughput of at least one such instruction each cycle. As implies, 256-bit operations will use up twice the resources to complete (i.e., 2x register, scheduler, and ports). This is a compromise [[AMD]] has taken which helps conserve die space and power. By contrast, [[Intel]]'s competing design, {{intel|Skylake}}, does have dedicated 256-bit circuitry. It's also worth noting that Intel's contemporary {{intel|Skylake SP|server class models|l=core}} have extended this further to incorporate dedicated 512-bit circuitry supporting {{x86|AVX-512}} with the highest performance models {{intel|Skylake (server)#Execution_engine|l=arch|having a whole second}} dedicated AVX-512 unit.
+
The FP has a single pipe for 128-bit load operations. In fact, the entire FP side is optimized for 128-bit operations. Zen supports all the latest instructions such as SSE and {{x86|AVX1}}/{{x86|AVX2|2}}. The way 256-bit AVX was designed was so that they can be carried out as two independent 128-bit operations. Zen takes advantage of that by operating on those instructions as two operations; i.e., Zen splits up 256-bit operations into two µOPs so they are effectively half the throughput of their 128-bit operations counterparts. Likewise, stores are also done on 128-bit chunks, making 256-bit loads have an effective throughput of one store every two cycles. The pipes are fairly well balances, therefore most operations will have at least two pipes to be scheduled on retaining the throughput of at least on such instruction each cycle. As implies, 256-bit operations will use up twice the resources to complete (i.e., 2x register, scheduler, and ports). This is a compromise [[AMD]] has taken which helps conserve die space and power. By contrast, [[Intel]]'s competing design, {{intel|Skylake}}, does have dedicated 256-bit circuitry.  
  
 
Additionally Zen also supports {{x86|SHA}} and {{x86|AES}} with 2 AES units implemented in an attempt to improve encryption performance. Those units can be found on pipes 0 and 1 of the floating point scheduler.
 
Additionally Zen also supports {{x86|SHA}} and {{x86|AES}} with 2 AES units implemented in an attempt to improve encryption performance. Those units can be found on pipes 0 and 1 of the floating point scheduler.
Line 479: Line 406:
 
==== Memory Subsystem ====
 
==== Memory Subsystem ====
 
[[File:amd zen hc28 memory.png|300px|right]]
 
[[File:amd zen hc28 memory.png|300px|right]]
Loads and Stores are conducted via the two AGUs which can operate simultaneously. Zen has a 44-entry Load Queue and a 44-entry Store Queue. Taking the two 14-entry deep AGU schedulers into account, the processor can keep up to 72 out-of-order loads in flight (same as Intel's {{intel|Skylake|l=arch}}). Zen employs a split TLB-data pipe design which allows TLB tag access to take place while the data cache is being fed in order to determine if the data is available and send their address to the L2 to start prefetching early on. Zen is capable of up to two loads per cycle (2x16B each) and up to one store per cycle (1x16B). The L1 TLB is 64-entry for all page sizes and the L2 TLB is a 1536-entry with no 1 GiB pages.
+
Loads and Stores are conducted via the two AGUs which can operate simultaneously. Zen has a much larger load queue capable of supporting 72 out-of-order loads (same as Intel's {{intel|Skylake|l=arch}}). There is also a 44-entry Store Queue. Zen employs a split TLB-data pipe design which allows TLB tag access to take place while the data cache is being fed in order to determine if the data is available and send their address to the L2 to start prefetching early on. Zen is capable of up to two loads per cycle (2x16B each) and up to one store per cycle (1x16B). The L1 TLB is 64-entry for all page sizes and the L2 TLB is a 1536-entry with no 1 GiB pages.
  
Zen incorporates a 64 KiB 4-way set associative L1 instruction cache and a 32 KiB 8-way set associative L1 data cache. Both the instruction cache and the data cache can fetch from the L2 cache at 32 Bytes per cycle. The L2 cache is a 512 KiB 8-way set associative unified cache, inclusive, and private to the core. The L2 cache can fetch and write 32B/cycle into the 8MB L3 cache (32B in either direction each cycle, i.e. bidirectional bus).
+
Zen incorporates a 64 KiB 4-way set associative L1 instruction cache and a 32 KiB 8-way set associative L2 data cache. Both the instruction cache and the data cache can fetch from the L2 cache at 32 Bytes per cycle. The L2 cache is a 512 KiB 8-way set associative unified cache, inclusive, and private to the core. The L2 cache can fetch and write 32B/cycle into the L3 (32B in either direction each cycle, i.e. bidirectional bus).
  
 
== Infinity Fabric ==
 
== Infinity Fabric ==
Line 500: Line 427:
  
 
[[File:zen soc clock domain.svg|650px]]
 
[[File:zen soc clock domain.svg|650px]]
 
== Security ==
 
[[File:amd sme.png|right|200px]]
 
AMD incorporated a number of new security technologies into their server-class Zen processors (e.g., {{amd|EPYC}}). The various security features are offered via a new dedicated security subsystem which integrates an {{armh|Cortex-A5|l=arch}} core. The dedicated secure processor runs a secured kernel with the firmware which sits externally (e.g., on an SPI ROM). The secure processor is responsible for the cryptographic functionalities for the secure key generation and management as well as hardware-validated boots.
 
 
<table class="wikitable">
 
<tr><td></td><th>SME</th><th>SEV</th></tr>
 
<tr><th>Protection Per</th><td>Whole Machine</td><td>Individual VMs</td></tr>
 
<tr><th>Type of Protection</th><td>Physical Memory Attack</td><td>Physical Memory Attack<br>Vulnerable VM</td></tr>
 
<tr><th>Encryption Per</th><td>Native page table</td><td>Guest page table</td></tr>
 
<tr><th>Key Management</th><td>Key/Machine</td><td>Key/VM</td></tr>
 
<tr><th>Requires Driver</th><td>No</td><td>Yes</td></tr>
 
</table>
 
 
 
=== Secure Memory Encryption (SME) ===
 
{{main|x86/secure memory encryption|l1=Secure Memory Encryption}}
 
'''Secure Memory Encryption''' ('''SME''') is a new feature which offers full hardware memory encryption against physical memory attacks. A single key is used for the encryption. An [[AES-128]] Encryption engine sits on the [[integrated memory controller]] thereby offering real-time per page table entry encryption - this works across execution cores, network, storage, graphics, and any other I/O access that goes through the DMA. SME incurs additional latency tax only for encrypted pages.
 
 
AMD also supports '''Transparent SME''' ('''TSME''') on their workstation-class PRO (Performance, Reliability, Opportunity) processors in addition to the server models. TSME is subset of SME limited to base encryption without OS/HV involvement, allowing for legacy OS/HV software support. In this mode, all memory is encrypted regardless of the value of the C-bit on any particular page. When this mode is enabled, SME and SEV are not available.
 
 
=== Secure Encrypted Virtualization (SEV) ===
 
{{main|x86/secure encrypted virtualization|l1=Secure Encrypted Virtualization}}[[File:amd sev.png|right|150px]]
 
'''Secure Encrypted Virtualization''' ('''SEV''') is a more specialized version of SME whereby individual keys can be used per hypervisor and per VM, a cluster of VMs, or a container. This allows the hypervisor memory to be encrypted and cryptographically isolated from the guest machines. Additionally SEV can work alongside unencrypted VMs from the same hypervisor. All this functionality is integrated and works with existing AMD-V technology.
 
 
 
: [[File:amd sev architecture.png|700px]]
 
 
{{clear}}
 
  
 
== Power ==
 
== Power ==
Line 539: Line 437:
 
* '''VDDM''' - [[L2]]/[[L3]] [[SRAM]] supply
 
* '''VDDM''' - [[L2]]/[[L3]] [[SRAM]] supply
 
</div>
 
</div>
Zen presented AMD with a number of new challenges in the area of power largely due to their decision to cover the entire spectrum of systems from ultra-low power to high performance. Previously AMD handled this by designing two independent architectures (i.e., {{\\|Excavator}} and {{\\|Puma}}). In Zen, SoC voltage coming from the [[Voltage Regulator Module]] (VRM) is fed to the RVDD, a package metal plane that distributes the highest VID request from all cores. In Zen, each core has a digital [[LDO regulator]] (low-dropout) and a [[digital frequency synthesizer]] (DFS) to vary frequency and voltage across power states on individual core basis. The LDO regulates RVDD for each power domain and create an optimal VDD per core using a system of sensors they've embedded across the entire chip; this is in addition to other properties such as countermeasures against droop. This is in contrast to some alternative solutions by [[Intel]] which attempted to integrated the voltage regulator (FIVR) on die in {{intel|Haswell|l=arch}} (and consequently removing it in {{intel|Skylake|l=arch}} due to a number of thermal restrictions it created). Zen's new voltage control is an attempt at a much finer power tuning on a per core level based on a collection of information it has on that core and overall chip.
+
Zen presented AMD with a number of new challenges in the area of power largely due to their decision to cover the entire spectrum of systems from ultra-low power to high performance. Previously AMD handled this by designing two independent architectures (i.e., {{\\|Excavator}} and {{\\|Puma}}). In Zen, SoC voltage coming from the Voltage Regulator Module (VRM) is fed to the RVDD, a package metal plane that distributes the highest VID request from all cores. In Zen, each core has a digital [[LDO regulator]] (low-dropout) and a [[digital frequency synthesizer]] (DFS) to vary frequency and voltage across power states on individual core basis. The LDO regulates RVDD for each power domain and create an optimal VDD per core using a system of sensors they've embedded across the entire chip; this is in addition to other properties such as countermeasures against droop. This is in contrast to some alternative solutions by [[Intel]] which attempted to integrated the voltage regulator (FIVR) on die in {{intel|Haswell|l=arch}} (and consequently removing it in {{intel|Skylake|l=arch}} due to a number of thermal restrictions it created). Zen's new voltage control is an attempt at a much finer power tuning on a per core level based on a collection of information it has on that core and overall chip.
  
 
<div style="display: block;">
 
<div style="display: block;">
Line 556: Line 454:
 
Zen implements over 1300 sensors to monitor the state of the die over all [[critical paths]] including the CCX and external components such as the memory fabric. Additionally the CCX also incorporates 48 high-speed power supply monitors, 20 [[thermal diodes]], and 9 high-speed droop detectors.
 
Zen implements over 1300 sensors to monitor the state of the die over all [[critical paths]] including the CCX and external components such as the memory fabric. Additionally the CCX also incorporates 48 high-speed power supply monitors, 20 [[thermal diodes]], and 9 high-speed droop detectors.
 
<div style="text-align: center;">[[File:zen pure power sensory.png|600px]]</div>
 
<div style="text-align: center;">[[File:zen pure power sensory.png|600px]]</div>
 
=== System Management Unit  ===
 
{{empty section}}
 
  
 
== Features ==
 
== Features ==
Line 585: Line 480:
  
 
[[File:10682-icon-neural-net-prediction-140x140.png|50px|left]]
 
[[File:10682-icon-neural-net-prediction-140x140.png|50px|left]]
'''Neural Net Prediction''' - This appears to be largely marketing term for Zen's much beefier and more finely tuned [[branch prediction]] unit. Zen uses a [[perceptron branch predictor|hashed perceptron system]] to intelligently anticipate future code flows, allowing warming up of cold blocks in order to avoid possible waits. Most of that functionality is already found on every modern high-end microprocessor (including AMD's own previous microarchitectures). Because AMD has not disclosed any more specific information about BP, it can only be speculated that no new groundbreaking logic was introduced in Zen.
+
'''Neural Net Prediction''' - This appears to be largely marketing term for Zen's much beefier and more finely tune [[branch prediction]] unit. Zen uses a [[perceptron branch predictor|hashed perceptron system]] to intelligently anticipate future code flows, allowing warming up of cold blocks in order to avoid possible waits. Most of that functionality is already found on every modern high-end microprocessor (including AMD's own previous microarchitectures). Because AMD has not disclosed any more specific information about BP, it can only be speculated that no new groundbreaking logic was introduced in Zen.
 
{{clear}}
 
{{clear}}
 
[[File:10682-icon-smart-prefetch-140x140.png|50px|left]]
 
[[File:10682-icon-smart-prefetch-140x140.png|50px|left]]
Line 595: Line 490:
 
{{clear}}
 
{{clear}}
 
[[File:10682-icon-precision-boost-140x140.png|50px|left]]
 
[[File:10682-icon-precision-boost-140x140.png|50px|left]]
'''{{amd|Precision Boost}}''' - A feature that provides the ability to adjust the frequency of the processor on-the-fly given sufficient headroom (e.g. thermal limits based on the sensory data collected by a network of sensors across the chip), i.e. "Turbo Frequency". Precision Boost adjusts in 25 MHz increments. With Zen-based APUs, AMD introduced '''{{amd|Precision Boost 2}}''' - an enhancement of the original PB feature that uses a new algorithm that controls the boost frequency on a per-thread basis depending on the headroom.
+
'''Precision Boost''' - A feature that provides the ability to adjust the frequency of the processor on-the-fly given sufficient headroom (e.g. thermal limits based on the sensory data collected by a network of sensors across the chip), i.e. "Turbo Frequency". Precision Boost adjusts in 25 MHz increments, considerably more granular when compared to Intel's {{intel|Turbo Boost}} which operates at 100 MHz bin increments. Having more granular boost increments in theory could allow it to clock slightly higher than competitor's products without reaching thermal limits (e.g., complex workloads involving {{x86|AVX2}}).
 
{{clear}}
 
{{clear}}
 
[[File:amd zen xfr.jpg|300px|right]]
 
[[File:amd zen xfr.jpg|300px|right]]
 
[[File:10682-icon-frequency-range-140x140.png|50px|left]]
 
[[File:10682-icon-frequency-range-140x140.png|50px|left]]
'''{{amd|XFR|Extended Frequency Range}}''' ('''XFR''') - This is a fully automated solution that attempts to allow higher upper limit on the maximum frequency based on the cooling technique used (e.g. air, water, LN2). Whenever the chip senses that it's suitable enough for a given frequency, it will attempt to increase that limit further. XFR is partially enabled on all models, providing an extra +50 MHz frequency boost whenever possible. For 'X' models, full XFR is enabled providing twice the headroom of up to +100 MHz. With Zen-based APUs, AMD introduced '''{{amd|Mobile XFR}}''' (mXFR) which offers mobile devices with premium cooling a sustainable higher boost frequency for a longer period of time.
+
'''Extended Frequency Range''' ('''XFR''') - This is a fully automated solution that attempts to allow higher upper limit on the maximum frequency based on the cooling technique used (e.g. air, water, LN2). Whenever the chip senses that it's suitable enough for a given frequency, it will attempt to increase that limit further. XFR is partially enabled on all models, providing an extra +50 MHz frequency boost whenever possible. For 'X' models, full XFR is enabled providing twice the headroom of up to +100 MHz.
 
{{clear}}
 
{{clear}}
The AMD presentation slide on the right depicts a normal use case for the {{amd|Ryzen 7}} {{amd|Ryzen 7/1800X|1800X}}. When under normal workload, the processor will operate at around its base frequency of 3.6 GHz. When experiencing heavier workload, Precision Boost will kick in increment it as necessary up to its maximum frequency of 4 GHz. With adequate cooling, {{amd|XFR}} will bump it up an additional 100 MHz. This boost is sustainable for the first two active cores, at which point the boost frequency will drop to the "all core" frequency. When light workload get experienced, the processor will reduce its frequency. As Pure Power senses the workload and CPU state, it can also drastically downclock the CPU when appropriate (such as in the graph during mostly idle scenarios).
+
The AMD presentation slide on the right depicts a normal use case for the {{amd|Ryzen 7}} {{amd|Ryzen 7/1800X|1800X}}. When under normal workload, the processor will operate at around its base frequency of 3.6 GHz. When expericing heavier workload, Precision Boost will kick in increment it as necessary up to its maximum frequency of 4 GHz. With adequate cooling, {{amd|XFR}} will bump it up an additional 100 MHz. When light workload get experienced, the processor will reduce its frequency. As Pure Power senses the workload and CPU state, it can also drastically downclock the CPU when appropriate (such as in the graph during mostly idle).
 
<div style="text-align: center;">[[File:ryzen-xfr-1800x example.jpg|700px]]</div>
 
<div style="text-align: center;">[[File:ryzen-xfr-1800x example.jpg|700px]]</div>
 
{{clear}}
 
{{clear}}
Line 628: Line 523:
 
Each Zeppelin provides 32 Gen 3.0 [[PCIe]] lanes for a total of 128 lanes. In a single-socket configuration, all 128 lanes may be used for general purpose I/O - for example 6 GPUs over x16 and x8 more lanes for additional storage. This is considerably more than any comparable contemporary [[Intel]] model (either {{intel|Broadwell EP|l=core}} or {{intel|Skylake SP|l=core}}). {{amd|Naples|l=core}}-based processors scale all the way up to [[32 cores]] with 64 [[threads]] (for up to 64 cores and 128 threads per complete system). The caveat is that when in 2-way MP mode, half of the lanes are lost. 64 of the 128 of the PCIe lanes get allocated for interchip communication via AMD's {{amd|Infinity Fabrics}} protocols with the remaining 64 lanes left for the system. 64 PCIe lanes for socket-to-socket communication provides a maximum bandiwdth of  This setup still leaves the system with 128 PCIe lanes, but it's not any more than in a single-socket configuration.
 
Each Zeppelin provides 32 Gen 3.0 [[PCIe]] lanes for a total of 128 lanes. In a single-socket configuration, all 128 lanes may be used for general purpose I/O - for example 6 GPUs over x16 and x8 more lanes for additional storage. This is considerably more than any comparable contemporary [[Intel]] model (either {{intel|Broadwell EP|l=core}} or {{intel|Skylake SP|l=core}}). {{amd|Naples|l=core}}-based processors scale all the way up to [[32 cores]] with 64 [[threads]] (for up to 64 cores and 128 threads per complete system). The caveat is that when in 2-way MP mode, half of the lanes are lost. 64 of the 128 of the PCIe lanes get allocated for interchip communication via AMD's {{amd|Infinity Fabrics}} protocols with the remaining 64 lanes left for the system. 64 PCIe lanes for socket-to-socket communication provides a maximum bandiwdth of  This setup still leaves the system with 128 PCIe lanes, but it's not any more than in a single-socket configuration.
  
In addition to PCIe lanes, each Zeppelin provides a memory controller supporting dual-channel [[ECC]] DDR4 memory. With EPYC packing 4 such dies, each chip sports 4 memory controllers supporting up 16 DIMMs of 2 [[TiB]] octa-channel DDR4 ECC memory.
+
In addition to PCIe lanes, each Zeppelin provides a memory controller supporting dual-channel [[ECC]] DDR4 memory. With EPYC packing 4 suck dies, each chip sports 4 memory controllers supporting up 16 DIMMs of 2 [[TiB]] octa-channel DDR4 ECC memory.
  
 
<div style="display: block;">
 
<div style="display: block;">
Line 635: Line 530:
 
</div>
 
</div>
  
 
+
In addition to the large amount of memory supported by the four Zeppelin, all EPYC offer the full 64 MiB a result from 8 MiB from each of the 8 CCXs. The way binning is done for the various EPYC models is by disabing either 1, 2, 3, or 4 cores per CCX from each of the Zeppelin dies to form either [[8 cores|8]], [[16 cores|16]], [[24 cores|24]], or [[32 cores|32]].
[[File:amd naples mcp.png|250px|right]]
 
In addition to the large amount of memory supported by the four Zeppelin, all EPYC offer the full 64 MiB a result of 8 MiB from each of the 8 CCXs. The way binning is done for the various EPYC models is by disabling either 1, 2, 3, or 4 cores per CCX from each of the Zeppelin dies to form either [[8 cores|8]], [[16 cores|16]], [[24 cores|24]], or [[32 cores|32]].
 
  
 
<div style="display: block;">
 
<div style="display: block;">
 +
[[File:amd naples mcp.png|250px]]
 
[[File:amd epyc interconnect.png|650px]]
 
[[File:amd epyc interconnect.png|650px]]
This image originates from a slide presented at AMD EPYC Tech Day, June 20, 2017 and shows one layer of die interconnects on the EPYC package substrate. The pink lines fanning out at the top and bottom connect to the UMCs on the respective chip. The light blue and pink lines in the center are bidirectional GMI links. The UMC connections of the top left and bottom right chip, the GMI link from the bottom left to the top right chip, and the PCIe connections are not visible in this picture. Of the four GMI interfaces on each die only the three closest to the other dies are used. It should be noted that its creator pasted Zeppelin die shots onto the image and improperly reflected the top left and bottom right chip. In reality four identical dies are used, with the top left chip mounted in the same orientation as the top right chip, and both bottom chips rotated by 180 degrees.</div>
+
</div>
{{clear}}
 
  
==== Die-die memory latencies ====
 
The following is the average die-to-die latencies across two cpu sockets of 4 dies each over the {{amd|infinity fabric}} on the [[EPYC 7601]] (32-core, 2.2 GHz) using 16 DDR4-2666 DIMMS.
 
  
<table class="wikitable">
+
{{clear}}
<tr><th colspan="9">Die-to-die Latency</th></tr>
 
<tr><th>Die</th><td>0</td><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td><td>7</td></tr>
 
<tr><td>0</td><td style="background-color: #63f800;">85 ns</td><td style="background-color: #f2f800;">140 ns</td><td style="background-color: #daf800;">135 ns</td><td style="background-color: #daf800;">135 ns</td><td style="background-color: #f81700;">245 ns</td><td style="background-color: #f81700;">245 ns</td><td style="background-color: #f86f00;">200 ns</td><td style="background-color: #f84000;">235 ns</td></tr>
 
<tr><td>1</td><td style="background-color: #f2f800;">140 ns</td><td style="background-color: #63f800;">85 ns</td><td style="background-color: #daf800;">135 ns</td><td style="background-color: #daf800;">135 ns</td><td style="background-color: #f81700;">245 ns</td><td style="background-color: #f82f00;">240 ns</td><td style="background-color: #f84000;">235 ns</td><td style="background-color: #f86f00;">200 ns</td></tr>
 
<tr><td>2</td><td style="background-color: #daf800;">135 ns</td><td style="background-color: #c6f800;">130 ns</td><td style="background-color: #63f800;">85 ns</td><td style="background-color: #f2f800;">140 ns</td><td style="background-color: #f86f00;">200 ns</td><td style="background-color: #f80000;">250 ns</td><td style="background-color: #f84000;">235 ns</td><td style="background-color: #f84000;">235 ns</td></tr>
 
<tr><td>3</td><td style="background-color: #daf800;">135 ns</td><td style="background-color: #c6f800;">130 ns</td><td style="background-color: #f2f800;">140 ns</td><td style="background-color: #63f800;">85 ns</td><td style="background-color: #f80000;">250 ns</td><td style="background-color: #f86f00;">200 ns</td><td style="background-color: #f84000;">235 ns</td><td style="background-color: #f84000;">235 ns</td></tr>
 
<tr><td>4</td><td style="background-color: #f82f00;">240 ns</td><td style="background-color: #f82f00;">240 ns</td><td style="background-color: #f86f00;">200 ns</td><td style="background-color: #f82f00;">240 ns</td><td style="background-color: #63f800;">85 ns</td><td style="background-color: #f2f800;">140 ns</td><td style="background-color: #daf800;">135 ns</td><td style="background-color: #daf800;">135 ns</td></tr>
 
<tr><td>5</td><td style="background-color: #f82f00;">240 ns</td><td style="background-color: #f82f00;">240 ns</td><td style="background-color: #f84000;">235 ns</td><td style="background-color: #f86f00;">200 ns</td><td style="background-color: #f2f800;">140 ns</td><td style="background-color: #63f800;">85 ns</td><td style="background-color: #daf800;">135 ns</td><td style="background-color: #daf800;">135 ns</td></tr>
 
<tr><td>6</td><td style="background-color: #f86f00;">200 ns</td><td style="background-color: #f81700;">245 ns</td><td style="background-color: #f84000;">235 ns</td><td style="background-color: #f84000;">235 ns</td><td style="background-color: #daf800;">135 ns</td><td style="background-color: #daf800;">135 ns</td><td style="background-color: #63f800;">85 ns</td><td style="background-color: #f2f800;">140 ns</td></tr>
 
<tr><td>7</td><td style="background-color: #f81700;">245 ns</td><td style="background-color: #f86f00;">200 ns</td><td style="background-color: #f84000;">235 ns</td><td style="background-color: #f84000;">235 ns</td><td style="background-color: #daf800;">135 ns</td><td style="background-color: #daf800;">135 ns</td><td style="background-color: #f2f800;">140 ns</td><td style="background-color: #63f800;">85 ns</td></tr>
 
</table>
 
  
 
=== Modules (Zeppelin) ===
 
=== Modules (Zeppelin) ===
In order to reduce various development costs (e.g., [[masks]]), AMD kept the number of die variations to a minimum. Zen is composed of individual modules (i.e., dies) called '''Zeppelins''' that can be interconnected in a [[multi-chip module]] to form larger systems. The {{amd|Threadripper}} die is the same as the {{amd|Ryzen}} die. Likewise the {{amd|EPYC}} family uses the same die. The differences between the processors is how those dies are connected together and which features are enabled and exposed.
+
Zen is composed of individual modules (i.e., dies) called '''Zeppelins''' that can be interconnected in a multi-chip module to form larger systems. Each module consist of:
 
 
<gallery widths=550px heights=300px>
 
File:zen-1zep.svg|Single-Zeppelin Configuration, as found in {{amd|Ryzen 3}}, {{amd|Ryzen 5}}, and {{amd|Ryzen 7}}.
 
File:zen-2zep.svg|Dual-Zeppelin Configuration, as found in {{amd|Ryzen Threadripper}}
 
File:zen-4zep.svg|Quad-Zeppelin Configuration, as found in {{amd|EPYC}}.
 
</gallery>
 
 
 
 
 
Each module consist of:
 
  
 
* 2 Core Complexes (CCX)
 
* 2 Core Complexes (CCX)
Line 677: Line 548:
 
** 2x Unified Memory Controllers (UMC) - one DRAM channel each; 64-bit data + [[ECC]] support, 2 DIMMs, DDR4 1333MT/s-3200MT/s
 
** 2x Unified Memory Controllers (UMC) - one DRAM channel each; 64-bit data + [[ECC]] support, 2 DIMMs, DDR4 1333MT/s-3200MT/s
 
* PSP (MP0) and SMU (MP1) microcontrollers
 
* PSP (MP0) and SMU (MP1) microcontrollers
** AMD Secure Processor, formerly Platform Security Processor
+
** AMD Secure Processor technology as Platform Security Processor (PSP)
** System Management Unit
 
 
* NBIO
 
* NBIO
 
** 2 SYSHUBs, 1 IOHUB with IOMMU v2.x
 
** 2 SYSHUBs, 1 IOHUB with IOMMU v2.x
 
** 2x8 PCIe Gen1/Gen2/Gen3
 
** 2x8 PCIe Gen1/Gen2/Gen3
 
* 6 x4 PHYs plus 5 x2 PHYs
 
* 6 x4 PHYs plus 5 x2 PHYs
** Support PCIe, WAFL, xGMI (Inter-Chip [[infinity_fabric|Global Memory Interconnect]]), SATA, and Ethernet
+
** Support PCIe, WAFL, xGMI, SATA, and Ethernet
 
*** Ethernet complex: Up to 4 lanes of 10/100/1000 SGMII, or 10GBASE-KR, or 1000BASE-KX Ethernet operation
 
*** Ethernet complex: Up to 4 lanes of 10/100/1000 SGMII, or 10GBASE-KR, or 1000BASE-KX Ethernet operation
 
* Southbridge
 
* Southbridge
Line 695: Line 565:
 
[[File:zen soc.png|900px]]
 
[[File:zen soc.png|900px]]
 
</div>
 
</div>
 
== Memory Modes ==
 
Some Zen-based models such as {{amd|Ryzen Threadripper}} which are based on a two-Zeppelin configuration can have large variations in performance depending on how the software was designed to make use of the available memory. Threadripper offers two different memory access modes:
 
 
* UMA ([[Uniform Memory Access]]) or Distributed Mode
 
 
* NUMA ([[Non-uniform Memory Access]]) or Local Mode
 
 
The difference between the two modes is how access to the four memory controllers is done.
 
 
[[File:zen uma.png|left|300px]]
 
In '''UMA''' or '''Distributed Mode''', memory access transactions are distributed uniformly across the four memory channels. Transactions are handled by all four channels simultaneously interleaving the distributed transactions across all channels. UMA allows applications to take advantage of the entire memory bandwidth delivered by the four memory channels. Unfortunately this means that the latency of memory access will vary depending on the path it takes, for example accesses from a channel that is physically located on the second die will be slower than local accesses. This consequently means that the average memory latency will also be slightly higher therefore applications where higher bandwidth is more important than latency will enjoy the benefits of this mode greatly.
 
 
[[File:zen numa.png|right|300px]]
 
In '''NUMA''' or '''Local Mode''', the two local memory channels that are connected to the same die as the CPU that is executing the application is prioritized for the memory access. I.e. memory access transactions are done on the two local memory controllers that are physically located on the same die in order to deliver lower latency. In contrast to distributed mode, the total memory bandwidth is effectively halved. However this mode is important for many applications that are more sensitive to memory latency.
 
{{clear}}
 
 
== Accelerated Processing Units ==
 
[[File:zen apu if.png|right|400px]]
 
{{see also|amd/microarchitectures/vega|amd/cores/raven ridge|l1=AMD Vega|l2=Codename Raven Ridge}}
 
In October 2017, AMD introduced the Zen-based Accelerated Processing Unit (APUs) which incorporate four "Zen" [[physical core|cores]] along with various number of {{\\|Vega}}-based Compute Units (CUs) under the codename {{amd|Raven Ridge|l=core}}. Zen-based APUs are based on an entirely separate [[die]] consisting of a single [[#CPU Complex (CCX)|CCX]] and a GPU. It's worth noting that on this die, the CCX is configured with half the [[L3 cache]] (i.e., 1 MiB/core instead of 2). The cache amount was most likely reduced due to power constraints. The GPU is based on the {{\\|Vega}} graphics microarchitecture featuring up to 11 Compute Units (CUs), each with 64 32-bit floating point arithmetic units. Though it's worth noting that it's currently unknown if 16-bit [[SIMD]] operations are supported like the discrete GPUs. Operating at 1 GHz, a Zen/Vega-based APU with 11 compute units will have a peak performance of 1.408 [[TFLOPs]], likewise, the lower-end parts with just 8 CUs will have a peak performance of 1.024 TFLOPS.
 
 
Up until Zen APUs, AMD had two separate buses: Fusion Compute Link (ONION) - a coherent bus that linked the GPU and CPU together which was used for cache snooping, and a Radeon Memory Bus (GARLIC) - a non-coherent bus that linked the GPU directly to the memory controller. Starting with Zen APUs, everything is now handled by the [[#Infinity Fabric|Infinity Fabric]]. The desktop [[#Modules (Zeppelin)|Zeppelin]] die featured a 32-byte wide data fabric, AMD has not stated if that is also the case with Raven Ridge. On Raven Ridge, the {{amd|Infinity Fabric}} services 6 clients: CCX, GPU, Multimedia Engine, Display Engine, Memory Controller, and the I/O and System Hub. It's worth noting that unlike previous APUs, on Raven Ridge, both the Multimedia Engine and the Display Engine are now separate from the GPU and are interconnected via the fabric.
 
 
Zen-based APUs also introduced {{amd|Precision Boost 2}} and {{amd|XFR|Mobile Extended Frequency Range}} (mXFR). See [[#Features]] for more details.
 
 
=== Power ===
 
{{see also|#Power|l1=§ Power}}
 
As with the desktop parts, voltage control is done on a per-core basis with the digital Low Drop-Out (LDO) regulator. Raven Ridge extended this to the GPU as well. The regulator drops the voltage supplied to RVDD from the on-board [[voltage regulator module]] based on the highest VID. Historically, the CPU and GPU were supplied their own separate VDD. With Raven Ridge, both the CPU and GPU supply comes from a unified RVDD [[power rail]]. Power is supplied to the 4 [[physical core|cores]] and the GPU, allowing each to have their own independent [[P-state]] (i.e., voltage and frequency). Note that all the compute units in the GPU operate at the same frequency and voltage.
 
 
<div>
 
<div style="float: left;">[[File:zen apu p-states.png|400px]]</div>
 
<div style="float: left; margin-left: 50px;">[[File:zen apu p-states utilization.png|400px]]</div>
 
</div>
 
{{clear}}
 
 
==== Enhanced power gating ====
 
[[File:raven ridge power regions.png|right|300px]]
 
Raven Ridge incorporates an enhanced power gating scheme to lower the average power consumption of the chip. Upon exiting [[P-State]], the CPU enters the [[C-state|CC6 Idle State]]. When all the CPU cores enter CC6, the ''CPUOFF'' state is asserted and the shared [[L3 cache]] power is lowered. Likewise, when the GPU enters idle state, up to 95% of the GPU is power gated. A ''GPUOFF'' state further power down the GPU uncore. When both ''CPUOFF'' and ''GPUOFF'' states are asserted, the system VDD regulator is switched off.
 
 
 
::[[File:zen apu power gating.png|500px]]
 
 
 
Power gating on Raven Ridge is split into two regions:
 
 
* Region A - the interface between the CPU, GPU, and I/O Hub
 
* Region B - the memory controller, multimedia engine, and display interface
 
 
The two regions can be independently power gated depending on the workload. For example, during a typical movie playback, Region B is mostly active while Region A is mostly power gates only become briefly active when necessary.
 
 
{{clear}}
 
  
 
== Die ==
 
== Die ==
Line 754: Line 572:
 
* 7 mm² area
 
* 7 mm² area
 
* L2 512 KiB; 1.5 mm²/core
 
* L2 512 KiB; 1.5 mm²/core
 
+
[[File:amd zen core die.png|400px]]
 
 
:[[File:amd zen core.png|500px]]
 
 
 
 
 
:[[File:amd zen core (annotated).png|500px]]
 
  
 
=== CCX ===
 
=== CCX ===
Line 765: Line 578:
 
* L3 8 MiB; 16 mm²
 
* L3 8 MiB; 16 mm²
 
* 1,400,000,000 transistors
 
* 1,400,000,000 transistors
 +
[[File:amd zen ccx.png|450px]]
  
 
+
=== Zeppelin (Octa-Core Die) ===
:[[File:amd zen ccx.png|450px]]
 
 
 
 
 
:[[File:amd zen ccx 2.png|700px]]
 
 
 
 
 
:[[File:amd zen ccx 2 (annotated).png|700px]]
 
 
 
=== Memory Controller ===
 
* 15 mm²
 
* Two DDR4 channels
 
** 72-bits each
 
 
 
 
 
:[[File:amd zeppelin memory controller.png|650px]]
 
 
 
 
 
:[[File:amd zeppelin memory controller (annotated).png|650px]]
 
 
 
=== Zeppelin ===
 
 
* [[14 nm process]]
 
* [[14 nm process]]
 
* 12 metal layers
 
* 12 metal layers
 
* 2,000 meters of signals
 
* 2,000 meters of signals
 
* 4,800,000,000 transistors
 
* 4,800,000,000 transistors
* ~22.058 mm x ~9.655 mm (Estimated)
+
* 22.01 mm x 8.87 mm
* 212.97 mm² die size (note that our initial measurement from tech day was off by half a millimeter on each side)
+
* ~195.228 mm² die size
 
 
 
 
:[[File:amd zen octa-core die shot.png|class=wikichip_ogimage|950px]]
 
 
 
 
 
:[[File:amd zen octa-core die shot (annotated).png|950px]]
 
  
=== APU ===
+
[[File:amd zen octa-core die shot.png|950px]]
* [[quad-core]] "Zen" CPUs + "{{\\|Vega}}" GPU with 11 CUs
 
* [[14 nm process]]
 
* 4,950,000,000 transistors
 
* ~19.213 mm x ~10.919 mm (Estimated)
 
* 209.78 mm²
 
  
: [[File:raven ridge die.png|950px]]
 
  
 +
{{future information}}
  
: [[File:raven ridge die (annotated).png|950px]]
+
[[File:amd zen octa-core die shot (annotated).png|950px]]
  
 
== Sockets/Platform ==
 
== Sockets/Platform ==
All Zen-based mainstream consumer microprocessors utilizes AMD's {{amd|Socket AM4}}, a unified socket infrastructure. All those processors are a complete [[system on a chip]] integrating the [[northbridge]] ([[memory controller]]) and the [[southbridge]] including 16 [[PCIe]] lanes for the [[GPU]], 4 PCIe lanes for the [[NVMe]]/SATA controllers as well as USB 3.0. The chipset, however, extends the processor with a number of additional connections beyond that offered by the SoC.  
+
All Zen-based consumer microprocessors utilizes AMD's {{amd|Socket AM4}}, a unified socket infrastructure. It's interesting to note that every {{amd|Ryzen 7}} processor is actually a complete [[system on a chip]] integrating the [[northbridge]] ([[memory controller]]) and the [[southbridge]] including 16 [[PCIe]] lanes for the [[GPU]], 4 PCIe lanes for I/O along with an [[NVMe controller]] as well as USB 3.0 and SATA controllers. Therefore in theory, Ryzen 7 processors do not even require a [[chipset]]. The role of the chipsets for Zen is to simply provide a number of additional connections beyond that offered by the SoC.  
  
 
{{amd socket am4 chipsets}}
 
{{amd socket am4 chipsets}}
 
[[File:x399 platform.png|right|400px]]
 
{{amd|Threadripper}} uses a different socket called "{{amd|Socket TR4}}" (or sTR4 or simply TR4). This socket allows for 4 memory channels, double the number available for {{amd|Ryzen}} providing up to 60 PCIe lanes (64 with 4 reserved for the chipset).
 
 
{{amd socket tr4 chipsets}}
 
 
{{clear}}
 
  
 
== All Zen Chips ==
 
== All Zen Chips ==
Line 835: Line 611:
 
{{comp table start}}
 
{{comp table start}}
 
<table class="comptable sortable tc13 tc14 tc15 tc16 tc17 tc18 tc19">
 
<table class="comptable sortable tc13 tc14 tc15 tc16 tc17 tc18 tc19">
<tr class="comptable-header"><th>&nbsp;</th><th colspan="20">List of all Zen-based Processors</th></tr>
+
<tr class="comptable-header"><th>&nbsp;</th><th colspan="19">List of all Zen-based Processors</th></tr>
<tr class="comptable-header"><th>&nbsp;</th><th colspan="14">Processor</th><th colspan="6">Features</th></tr>
+
<tr class="comptable-header"><th>&nbsp;</th><th colspan="14">Processor</th><th colspan="3">Features</th></tr>
{{comp table header 1|cols=Price, Process, Launched, Family, Core, C, T, L3$, L2$, L1$, Freq, Turbo, TDP, Max Mem, SMT, AMD-V, XFR, SEV, SME, TSME}}
+
{{comp table header 1|cols=Price, Process, Launched, Family, Core, C, T, L3$, L2$, L1$, Freq, Turbo, TDP, Max Mem, SMT, AMD-V, XFR}}
 
<tr class="comptable-header comptable-header-sep"><th>&nbsp;</th><th colspan="25">[[Uniprocessors]]</th></tr>
 
<tr class="comptable-header comptable-header-sep"><th>&nbsp;</th><th colspan="25">[[Uniprocessors]]</th></tr>
 
{{#ask: [[Category:microprocessor models by amd]] [[instance of::microprocessor]] [[microarchitecture::Zen]] [[max cpu count::1]]
 
{{#ask: [[Category:microprocessor models by amd]] [[instance of::microprocessor]] [[microarchitecture::Zen]] [[max cpu count::1]]
Line 859: Line 635:
 
  |?has amd amd-vi technology
 
  |?has amd amd-vi technology
 
  |?has amd extended frequency range
 
  |?has amd extended frequency range
|?has amd secure encrypted virtualization technology
 
|?has amd secure memory encryption technology
 
|?has amd transparent secure memory encryption technology
 
 
  |format=template
 
  |format=template
 
  |template=proc table 3
 
  |template=proc table 3
  |userparam=22:17
+
  |userparam=19:17
 
  |mainlabel=-
 
  |mainlabel=-
 
  |valuesep=,
 
  |valuesep=,
|limit=100
 
 
}}
 
}}
 
<tr class="comptable-header comptable-header-sep"><th>&nbsp;</th><th colspan="25">[[Multiprocessors]] (dual-socket)</th></tr>
 
<tr class="comptable-header comptable-header-sep"><th>&nbsp;</th><th colspan="25">[[Multiprocessors]] (dual-socket)</th></tr>
Line 890: Line 662:
 
  |?has amd amd-vi technology
 
  |?has amd amd-vi technology
 
  |?has amd extended frequency range
 
  |?has amd extended frequency range
|?has amd secure encrypted virtualization technology
 
|?has amd secure memory encryption technology
 
|?has amd transparent secure memory encryption technology
 
 
  |format=template
 
  |format=template
 
  |template=proc table 3
 
  |template=proc table 3
  |userparam=22:17
+
  |userparam=19:17
 
  |mainlabel=-
 
  |mainlabel=-
 
  |valuesep=,
 
  |valuesep=,
|limit=100
 
 
}}
 
}}
 
{{comp table count|ask=[[Category:microprocessor models by amd]] [[instance of::microprocessor]] [[microarchitecture::Zen]]}}
 
{{comp table count|ask=[[Category:microprocessor models by amd]] [[instance of::microprocessor]] [[microarchitecture::Zen]]}}
Line 904: Line 672:
 
{{comp table end}}
 
{{comp table end}}
  
== Designers ==
+
== References ==
* Mike Clark, chief architect
+
* Michael Clark, AMD's senior fellow and lead architect, Hot Chips 28, 2016
 
 
== Bibliography ==
 
* IEEE Hot Chips 28 Symposium (HCS) 2016
 
* AMD x86 Memory Encryption Technologies, Linux Security Summit 2016, David Kaplan, Security Architect, August 25, 2016
 
 
* Lisa Su, AMD CEO, AMD: New Horizon Live Event
 
* Lisa Su, AMD CEO, AMD: New Horizon Live Event
 
* Lisa Su, AMD CEO, AMD Annual Meeting of Shareholders Q4 2016
 
* Lisa Su, AMD CEO, AMD Annual Meeting of Shareholders Q4 2016
 
* Meet the AMD Experts - AMD Monthly Partner Training, January 2017
 
* Meet the AMD Experts - AMD Monthly Partner Training, January 2017
* IEEE ISSCC 2017
+
* Zen: A Next-Generation High-Performance x86 Core, ISSCC 2017
 
* AMD 'Tech Day', February 22, 2017
 
* AMD 'Tech Day', February 22, 2017
 
* AMD Infinity Fabric introduction by Mark Papermaster, April 6, 2017
 
* AMD Infinity Fabric introduction by Mark Papermaster, April 6, 2017
 
* AMD Zen at GDC 2017, March 3, 2017
 
* AMD Zen at GDC 2017, March 3, 2017
 +
* Processor Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1 Processors
 
* AMD 2017 Financial Analyst Day, May 16, 2017
 
* AMD 2017 Financial Analyst Day, May 16, 2017
 
* AMD EPYC Tech Day, June 20, 2017
 
* AMD EPYC Tech Day, June 20, 2017
* IEEE Hot Chips 29 Symposium (HCS) 2017
 
* AMD Ryzen Processor With Radeon Vega Graphics, October, 2017
 
* IEEE ISSCC 2018
 
* Processor Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1 Processors
 
  
 
== Documents ==
 
== Documents ==
Line 932: Line 693:
 
** [[:File:amd epyc performance brief.pdf|AMD EPYC Performance Brief]], June 2017
 
** [[:File:amd epyc performance brief.pdf|AMD EPYC Performance Brief]], June 2017
 
** [[:File:amd epyc solution brief.pdf|AMD EPYC Solution Brief]], June 2017
 
** [[:File:amd epyc solution brief.pdf|AMD EPYC Solution Brief]], June 2017
* [[:File:amd x86 memory encryption technology.pdf|AMD x86 Memory Encryption Technologies]], David Kaplan, Security Architect, LSS 2016 August 25, 2016
 
=== Manuals ===
 
* [[:File:56255 OSRR.pdf|Open-Source Register Reference For AMD Family 17h Processors]]
 
  
 
== See also ==
 
== See also ==
 
* {{intel|Kaby Lake}}
 
* {{intel|Kaby Lake}}
* {{intel|Cannon Lake}}
+
* {{intel|Cannonlake}}

Please note that all contributions to WikiChip may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see WikiChip:Copyrights for details). Do not submit copyrighted work without permission!

Cancel | Editing help (opens in new window)

This page is a member of 1 hidden category:

codenameZen +
core count4 +, 6 +, 8 +, 16 +, 24 +, 32 + and 12 +
designerAMD +
first launchedMarch 2, 2017 +
full page nameamd/microarchitectures/zen +
instance ofmicroarchitecture +
instruction set architecturex86-64 +
manufacturerGlobalFoundries +
microarchitecture typeCPU +
nameZen +
pipeline stages19 +
process14 nm (0.014 μm, 1.4e-5 mm) +