From WikiChip
Editing amd/microarchitectures/zen
Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.
The edit can be undone.
Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.
This page supports semantic in-text annotations (e.g. "[[Is specified as::World Heritage Site]]") to build structured and queryable content provided by Semantic MediaWiki. For a comprehensive description on how to use annotations or the #ask parser function, please have a look at the getting started, in-text annotation, or inline queries help pages.
Latest revision | Your text | ||
Line 63: | Line 63: | ||
|core name 5=Snowy Owl | |core name 5=Snowy Owl | ||
|core name 6=Great Horned Owl | |core name 6=Great Horned Owl | ||
− | |||
|predecessor=Excavator | |predecessor=Excavator | ||
|predecessor link=amd/microarchitectures/excavator | |predecessor link=amd/microarchitectures/excavator | ||
Line 74: | Line 73: | ||
|core names=Yes | |core names=Yes | ||
}} | }} | ||
− | '''Zen''' ('''family 17h''') is the [[microarchitecture]] developed by [[AMD]] as a successor to both {{\\|Excavator}} and {{\\|Puma}}. Zen is an entirely new design, built from the ground up for optimal balance of performance and power capable of covering the entire computing spectrum from fanless notebooks to high-performance desktop computers. Zen was officially launched on March 2, [[2017]]. Zen | + | '''Zen''' ('''family 17h''') is the [[microarchitecture]] developed by [[AMD]] as a successor to both {{\\|Excavator}} and {{\\|Puma}}. Zen is an entirely new design, built from the ground up for optimal balance of performance and power capable of covering the entire computing spectrum from fanless notebooks to high-performance desktop computers. Zen was officially launched on March 2, [[2017]]. Zen is set to be eventually replaced by {{\\|Zen+}}. |
− | For performance desktop and mobile computing, Zen is branded as | + | For performance desktop and mobile computing, Zen is branded as {{amd|Ryzen 3}}, {{amd|Ryzen 5}}, {{amd|Ryzen 7}} and {{amd|Ryzen Threadripper}} processors. For servers, Zen is branded as {{amd|EPYC}}. |
== Etymology == | == Etymology == | ||
Line 94: | Line 93: | ||
|- | |- | ||
| {{amd|Raven Ridge|l=core}} || Up to 4/8 || Mobile processors with {{\\|Vega}} GPU | | {{amd|Raven Ridge|l=core}} || Up to 4/8 || Mobile processors with {{\\|Vega}} GPU | ||
− | |||
− | |||
|- | |- | ||
| {{amd|Snowy Owl|l=core}} || Up to 16/32 || Embedded edge processors | | {{amd|Snowy Owl|l=core}} || Up to 16/32 || Embedded edge processors | ||
|- | |- | ||
| {{amd|Great Horned Owl|l=core}} || Up to 4/8 || Embedded processors with {{\\|Vega}} GPU | | {{amd|Great Horned Owl|l=core}} || Up to 4/8 || Embedded processors with {{\\|Vega}} GPU | ||
− | |||
− | |||
|} | |} | ||
Line 117: | Line 112: | ||
! colspan="11" | Mainstream | ! colspan="11" | Mainstream | ||
|- | |- | ||
− | | [[File:amd ryzen 3 logo.png|75px|link=Ryzen 3]] || {{amd|Ryzen 3}} || Entry level Performance || [[quad-core|Quad]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} || {{tchk| | + | | [[File:amd ryzen 3 logo.png|75px|link=Ryzen 3]] || {{amd|Ryzen 3}} || Entry level Performance || [[quad-core|Quad]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} || {{tchk|some}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}} |
|- | |- | ||
− | | rowspan="2" | [[File:amd ryzen 5 logo.png|75px|link=Ryzen 5]] || rowspan="2" | {{amd|Ryzen 5}} || rowspan="2" | Mid-range Performance || [[quad-core|Quad]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk| | + | | rowspan="2" | [[File:amd ryzen 5 logo.png|75px|link=Ryzen 5]] || rowspan="2" | {{amd|Ryzen 5}} || rowspan="2" | Mid-range Performance || [[quad-core|Quad]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|some}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}} |
|- | |- | ||
− | | [[hexa-core|Hexa]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk| | + | | [[hexa-core|Hexa]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|some}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}} |
|- | |- | ||
− | | [[File:amd ryzen 7 logo.png|75px|link=Ryzen 7]] || {{amd|Ryzen 7}} || High-end Performance || [[octa-core|Octa]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk| | + | | [[File:amd ryzen 7 logo.png|75px|link=Ryzen 7]] || {{amd|Ryzen 7}} || High-end Performance || [[octa-core|Octa]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|some}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}} |
− | |||
− | |||
|- | |- | ||
! colspan="11" | Enthusiasts / Workstations | ! colspan="11" | Enthusiasts / Workstations | ||
|- | |- | ||
− | | [[File:ryzen threadripper logo.png|75px|link=Ryzen Threadripper]] || {{amd|Ryzen Threadripper}} || Enthusiasts || [[8 cores|8]]-[[16 cores|16]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk| | + | | [[File:ryzen threadripper logo.png|75px|link=Ryzen Threadripper]] || {{amd|Ryzen Threadripper}} || Enthusiasts || [[8 cores|8]]-[[16 cores|16]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|some}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}} |
|- | |- | ||
! colspan="11" | Servers | ! colspan="11" | Servers | ||
|- | |- | ||
− | | [[File:amd epyc logo.png|75px|link=amd/epyc]] || {{amd|EPYC}} || High-performance Server Processor || [[8 cores|8]]-[[32 cores|32]] || | + | | [[File:amd epyc logo.png|75px|link=amd/epyc]] || {{amd|EPYC}} || High-performance Server Processor || [[8 cores|8]]-[[32 cores|32]] || || {{tchk|yes}} || {{tchk|yes}} || || {{tchk|no}} || {{tchk|yes}} || {{tchk|yes}} |
|- | |- | ||
! colspan="11" | Embedded / Edge | ! colspan="11" | Embedded / Edge | ||
|- | |- | ||
− | | [[File:epyc embedded logo.png|75px|link=amd/epyc embedded]] || {{amd|EPYC Embedded}} || Embedded / Edge Server Processor || [[8 cores|8]]-[[16 cores|16]] || | + | | [[File:epyc embedded logo.png|75px|link=amd/epyc embedded]] || {{amd|EPYC Embedded}} || Embedded / Edge Server Processor || [[8 cores|8]]-[[16 cores|16]] || || {{tchk|yes}} || {{tchk|some}} || || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}} |
|- | |- | ||
| [[File:ryzen embedded logo.png|75px|link=amd/ryzen embedded]] || {{amd|Ryzen Embedded}} || Embedded APUs || [[4 cores|4]] || || {{tchk|yes}} || {{tchk|some}} || || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} | | [[File:ryzen embedded logo.png|75px|link=amd/ryzen embedded]] || {{amd|Ryzen Embedded}} || Embedded APUs || [[4 cores|4]] || || {{tchk|yes}} || {{tchk|some}} || || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} | ||
Line 143: | Line 136: | ||
* '''Note:''' While a model has an unlocked multiplier, not all chipsets support overclocking. (see [[#Sockets/Platform|§Sockets]]) | * '''Note:''' While a model has an unlocked multiplier, not all chipsets support overclocking. (see [[#Sockets/Platform|§Sockets]]) | ||
− | * '''Note:''' 'X' models will enjoy "Full XFR" providing an additional +100 MHz (200 for {{amd|1500X}} | + | * '''Note:''' 'X' models will enjoy "Full XFR" providing an additional +100 MHz (200 for {{amd|1500X}}) when sufficient thermo/electric requirements are met. Non-X models are limited to just +50 MHz. |
=== Identification === | === Identification === | ||
Line 156: | Line 149: | ||
| ex 7 = X | | ex 7 = X | ||
| ex 2 1 = Ryzen | | ex 2 1 = Ryzen | ||
− | | ex 2 2 = | + | | ex 2 2 = 3 |
| ex 2 3 = | | ex 2 3 = | ||
− | | ex 2 4 = | + | | ex 2 4 = 1 |
− | | ex 2 5 = | + | | ex 2 5 = 2 |
− | + | | ex 2 6 = 00 | |
− | | ex 2 | + | | ex 2 7 = M |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | | ex | ||
| desc 1 = '''Brand Name'''<br><table><tr><td style="width: 50px;">'''{{amd|Ryzen}}'''</td><td></td></tr></table> | | desc 1 = '''Brand Name'''<br><table><tr><td style="width: 50px;">'''{{amd|Ryzen}}'''</td><td></td></tr></table> | ||
− | | desc 2 = '''Market segment'''<br><table><tr><td style="width: 50px;">'''3'''</td><td>Low-end performance</td></tr><tr><td>'''5'''</td><td>Mid-range performance</td></tr><tr><td>'''7'''</td><td>Enthusiast / High-end performance | + | | desc 2 = '''Market segment'''<br><table><tr><td style="width: 50px;">'''3'''</td><td>Low-end performance</td></tr><tr><td>'''5'''</td><td>Mid-range performance</td></tr><tr><td>'''7'''</td><td>Enthusiast / High-end performance</td></tr></table> |
| desc 3 = | | desc 3 = | ||
− | | desc 4 = '''Generation'''<br><table><tr><td style="width: 50px;">'''1'''</td><td>First generation Zen (2017)</td></tr><tr><td style="width: 50px;">'''2'''</td><td>First generation Zen | + | | desc 4 = '''Generation'''<br><table><tr><td style="width: 50px;">'''1'''</td><td>First generation Zen (2017)</td></tr><tr><td style="width: 50px;">'''2'''</td><td>First generation Zen (2017) for Mobile APUs</td></tr></table> |
− | | desc 5 = '''Performance Level'''<br><table><tr><td style="width: 50px;">'''9'''</td><td>Extreme (Ryzen Threadripper | + | | desc 5 = '''Performance Level'''<br><table><tr><td style="width: 50px;">'''9'''</td><td>Extreme (Ryzen Threadripper)</td></tr><tr><td>'''8'''</td><td>Highest (Ryzen 7)</td></tr><tr><td>'''6-7'''</td><td>High (Ryzen 5 & 7)</td></tr><tr><td>'''4-5'''</td><td>Mid (Ryzen 5)</td></tr><tr><td>'''1-3'''</td><td>Low (Ryzen 3)</td></tr></table> |
− | | desc 6 = '''Model Number'''<br> | + | | desc 6 = '''Model Number'''<br>Reserved for future speed bump/differentiator. Currently all models are "00". |
− | | desc 7 = '''Power Segment'''<br><table><tr><td style="width: 50px;">'''(none)'''</td><td>Standard Desktop</td></tr><tr><td>'''U'''</td><td>Standard Mobile</td></tr><tr><td>'''X'''</td><td>High Performance, with XFR | + | | desc 7 = '''Power Segment'''<br><table><tr><td style="width: 50px;">'''(none)'''</td><td>Standard Desktop</td></tr><tr><td>'''U'''</td><td>Standard Mobile</td></tr><tr><td>'''X'''</td><td>High Performance, with XFR</td></tr><tr><td>'''G'''</td><td>Desktop + [[IGP]]</td></tr><tr><td>'''T'''</td><td>Low-power Desktop</td></tr><tr><td>'''S'''</td><td>Low-power Desktop + [[IGP]]</td></tr><tr><td>'''M'''</td><td>Low-power Mobile</td></tr><tr><td>'''H'''</td><td>High-performance Mobile</td></tr></table> |
}} | }} | ||
Line 189: | Line 175: | ||
== Process Technology == | == Process Technology == | ||
{{see also|14 nm process}} | {{see also|14 nm process}} | ||
− | Zen is manufactured on [[Global Foundries]]' [[14 nm process]] Low Power Plus (14LPP). AMD's previous microarchitectures were based on [[32 nm|32]] and [[28 nm|28]] nanometer processes. The jump to 14 nm was part of AMD's attempt to remain competitive against Intel (Both {{intel| | + | Zen is manufactured on [[Global Foundries]]' [[14 nm process]] Low Power Plus (14LPP). AMD's previous microarchitectures were based on [[32 nm|32]] and [[28 nm|28]] nanometer processes. The jump to 14 nm was part of AMD's attempt to remain competitive against Intel (Both {{intel|SkyLake}} and {{intel|Kaby Lake}} are also manufactured on 14 nm). The move to 14 nm will bring along related benefits of a smaller node such as reduced heat, reduced power consumption, and higher density for identical designs. |
== Compatibility == | == Compatibility == | ||
Line 203: | Line 189: | ||
| style="background-color: #d6ffd8;" | Windows 10 || Support | | style="background-color: #d6ffd8;" | Windows 10 || Support | ||
|- | |- | ||
− | + | | Linux || Linux || style="background-color: #d6ffd8;" | Kernel 4.10 || Initial Support | |
− | |||
− | |||
|} | |} | ||
Line 214: | Line 198: | ||
! Compiler !! Arch-Specific || Arch-Favorable | ! Compiler !! Arch-Specific || Arch-Favorable | ||
|- | |- | ||
− | | [[AOCC]] || <code> | + | | [[AOCC]] || <code>‐march=znver1</code> || <code>-mtune=znver1</code> |
|- | |- | ||
| [[GCC]] || <code>-march=znver1</code> || <code>-mtune=znver1</code> | | [[GCC]] || <code>-march=znver1</code> || <code>-mtune=znver1</code> | ||
Line 221: | Line 205: | ||
|- | |- | ||
| [[Visual Studio]] || <code>/arch:AVX2</code> || ? | | [[Visual Studio]] || <code>/arch:AVX2</code> || ? | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
|} | |} | ||
Line 242: | Line 211: | ||
=== Key changes from {{\\|Excavator}} === | === Key changes from {{\\|Excavator}} === | ||
− | * Zen was designed to succeed | + | * Zen was designed to succeed BOTH {{\\|Excavator}} (High-performance) and {{\\|Puma}} (Low-power) covering the entire range in one architecture |
** Cover the entire spectrum from fanless notebooks to high-performance desktops | ** Cover the entire spectrum from fanless notebooks to high-performance desktops | ||
** More aggressive clock gating with multi-level regions | ** More aggressive clock gating with multi-level regions | ||
Line 251: | Line 220: | ||
** From {{\\|Piledriver}} to Zen | ** From {{\\|Piledriver}} to Zen | ||
** Based on the industry-standardized SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz | ** Based on the industry-standardized SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz | ||
− | * Up to 3. | + | * Up to 3.7x performance/watt improvment |
* Return to conventional high-performance x86 design | * Return to conventional high-performance x86 design | ||
** Traditional design for cores without shared blocks (e.g. shared SIMD units) | ** Traditional design for cores without shared blocks (e.g. shared SIMD units) | ||
** Large beefier core design | ** Large beefier core design | ||
* Core engine | * Core engine | ||
− | ** Simultaneous Multithreading (SMT) support, 2 threads/core (see [[# | + | ** Simultaneous Multithreading (SMT) support, 2 threads/core (see [[#Simultaneous_MultiThreading_.28SMT.29|§ Simultaneous MultiThreading]] for details) |
** Branch Predictor | ** Branch Predictor | ||
*** Improved branch mispredictions | *** Improved branch mispredictions | ||
Line 262: | Line 231: | ||
**** Lower miss latency penalty | **** Lower miss latency penalty | ||
*** BP is now decoupled from fetch stage | *** BP is now decoupled from fetch stage | ||
− | ** Large | + | ** Large Op cache (2K instructions) |
** Wider μop dispatch (6, up from 4) | ** Wider μop dispatch (6, up from 4) | ||
** Larger instruction scheduler | ** Larger instruction scheduler | ||
Line 279: | Line 248: | ||
*** 64 KiB (double from previous capacity of 32 KiB) | *** 64 KiB (double from previous capacity of 32 KiB) | ||
*** Write-back L1 cache eviction policy (From write-through) | *** Write-back L1 cache eviction policy (From write-through) | ||
− | *** | + | *** 2x the bandwidth |
** L2 | ** L2 | ||
− | *** | + | *** 2x the bandwidth |
*** Faster L2 cache | *** Faster L2 cache | ||
** Faster L3 cache | ** Faster L3 cache | ||
+ | ** Large Op cache | ||
** Better L1$ and L2$ data prefetcher | ** Better L1$ and L2$ data prefetcher | ||
− | ** | + | ** 5x L3 bandwidth |
** Move elimination block added | ** Move elimination block added | ||
** Page Table Entry (PTE) Coalescing | ** Page Table Entry (PTE) Coalescing | ||
Line 369: | Line 339: | ||
** System DRAM: | ** System DRAM: | ||
*** 2 channels per die | *** 2 channels per die | ||
− | *** Summit Ridge: up to PC4-21300U (DDR4-2666 UDIMM) | + | *** Summit Ridge: up to PC4-21300U (DDR4-2666 UDIMM) |
− | *** Raven Ridge: up to PC4-23466U (DDR4-2933 UDIMM) | + | *** Raven Ridge: up to PC4-23466U (DDR4-2933 UDIMM) |
− | *** Naples: up to PC4-21300L (DDR4-2666 RDIMM/LRDIMM) | + | *** Naples: up to PC4-21300L (DDR4-2666 RDIMM/LRDIMM) |
− | *** ECC: x4 DRAM device failure correction (Chipkill), x8 SEC-DED ECC, Patrol and Demand scrubbing, Data poisoning | + | *** ECC support: x4 DRAM device failure correction (Chipkill), x8 SEC-DED ECC, Patrol and Demand scrubbing, Data poisoning |
Zen TLB consists of dedicated level one TLB for instruction cache and another one for data cache. | Zen TLB consists of dedicated level one TLB for instruction cache and another one for data cache. | ||
Line 401: | Line 371: | ||
Unlike many of Intel's recent microarchitectures (such as {{intel|Skylake|l=arch}} and {{intel|Kaby Lake|l=arch}}) which make use of a unified scheduler, AMD continue to use a split pipeline design. µOP are decoupled at the µOP Queue and are sent through the two distinct pipelines to either the Integer side or the FP side. The two sections are completely separate, each featuring separate schedulers, queues, and execution units. The Integer side splits up the µOPs via a set of individual schedulers that feed the various ALU units. On the floating point side, there is a different scheduler to handle the 128-bit FP operations. Zen support all modern {{x86|extensions|x86 extensions}} including {{x86|AVX}}/{{x86|AVX2}}, {{x86|BMI1}}/{{x86|BMI2}}, and {{x86|AES}}. Zen also supports {{x86|SHA}}, secure hash implementation instructions that are currently only found in [[Intel]]'s ultra-low power microarchitectures (e.g. {{intel|Goldmont|l=arch}}) but not in their mainstream processors. | Unlike many of Intel's recent microarchitectures (such as {{intel|Skylake|l=arch}} and {{intel|Kaby Lake|l=arch}}) which make use of a unified scheduler, AMD continue to use a split pipeline design. µOP are decoupled at the µOP Queue and are sent through the two distinct pipelines to either the Integer side or the FP side. The two sections are completely separate, each featuring separate schedulers, queues, and execution units. The Integer side splits up the µOPs via a set of individual schedulers that feed the various ALU units. On the floating point side, there is a different scheduler to handle the 128-bit FP operations. Zen support all modern {{x86|extensions|x86 extensions}} including {{x86|AVX}}/{{x86|AVX2}}, {{x86|BMI1}}/{{x86|BMI2}}, and {{x86|AES}}. Zen also supports {{x86|SHA}}, secure hash implementation instructions that are currently only found in [[Intel]]'s ultra-low power microarchitectures (e.g. {{intel|Goldmont|l=arch}}) but not in their mainstream processors. | ||
− | From the memory subsystem point of view, data is fed into the execution units from the [[L1D$]] via the load and store queue (both of which were almost doubled in capacity) via the two [[Address Generation Units]] (AGUs) at the rate of 2 loads and 1 store per cycle. Each core also has a 512 KiB level 2 cache. L2 feeds both the the level 1 data and level 1 instruction caches at 32B per cycle (32B can be | + | From the memory subsystem point of view, data is fed into the execution units from the [[L1D$]] via the load and store queue (both of which were almost doubled in capacity) via the two [[Address Generation Units]] (AGUs) at the rate of 2 loads and 1 store per cycle. Each core also has a 512 KiB level 2 cache. L2 feeds both the the level 1 data and level 1 instruction caches at 32B per cycle (32B can be send in either direction (bidirectional bus) each cycle). L2 is connected to the L3 cache which is shared across all cores. As with the L1 to L2 transfers, the L2 also transfers data to the L3 and vice versa at 32B per cycle (32B in either direction each cycle). |
{{clear}} | {{clear}} | ||
Line 417: | Line 387: | ||
* 512-entry L2 TLB, no 1G pages | * 512-entry L2 TLB, no 1G pages | ||
− | ===== | + | ===== fetching ===== |
Instructions are fetched from the [[L2 cache]] at the rate of 32B/cycle. Zen features an asymmetric [[level 1 cache]] with a 64 [[KiB]] [[instruction cache]], double the size of the L1 data cache. Depending on the branch prediction decision instructions may be fetched from the instruction cache or from the [[µOPs]] cache in which eliminates the need for performing the costly instruction decoding. | Instructions are fetched from the [[L2 cache]] at the rate of 32B/cycle. Zen features an asymmetric [[level 1 cache]] with a 64 [[KiB]] [[instruction cache]], double the size of the L1 data cache. Depending on the branch prediction decision instructions may be fetched from the instruction cache or from the [[µOPs]] cache in which eliminates the need for performing the costly instruction decoding. | ||
[[File:amd zen hc28 decode.png|left|300px]] | [[File:amd zen hc28 decode.png|left|300px]] | ||
− | On the traditional side of decode, instructions are fetched from the L1$ at 32B aligned bytes per cycle and go to the instruction byte buffer and through the pick stage to the decode. Actual tests show the effective throughput is generally much lower (around 16-20 bytes). This is slightly higher than the fetch window in [[Intel]]'s {{intel|Skylake}} which has a 16-byte fetch window. The size of the instruction byte buffer | + | On the traditional side of decode, instructions are fetched from the L1$ at 32B aligned bytes per cycle and go to the instruction byte buffer and through the pick stage to the decode. Actual tests show the effective throughput is generally much lower (around 16-20 bytes). This is slightly higher than the fetch window in [[Intel]]'s {{intel|Skylake}} which has a 16-byte fetch window. The size of the instruction byte buffer was not given by AMD but it's expected to be larger than the 16-entry structure found in {{amd|microarchitectures|their previous}} architecture. |
===== µOP cache & x86 tax ===== | ===== µOP cache & x86 tax ===== | ||
Line 427: | Line 397: | ||
The [[µOP cache]] used in Zen is not a [[trace cache]] and much closely resembles the one used by Intel in their microarchitectures since {{intel|Sandy Bridge|l=arch}}. The µOP cache is an independent unit not part of the [[L1I$]] and is not a necessarily a subset of the L1I cache either; I.e., there are instances where there could be a hit in the µOP cache but a miss in the L1$. This happens when an instruction that got stored in the µOP cache gets evicted from L1. During the fetch stage probing must be done from both paths. Zen has a specific unit called 'Micro-Tags' which does the probing and determines whether the instruction should be accessed from the µOP cache or from the L1I$. The µOP cache itself has a dedicated $tags for accessing those µOPs. | The [[µOP cache]] used in Zen is not a [[trace cache]] and much closely resembles the one used by Intel in their microarchitectures since {{intel|Sandy Bridge|l=arch}}. The µOP cache is an independent unit not part of the [[L1I$]] and is not a necessarily a subset of the L1I cache either; I.e., there are instances where there could be a hit in the µOP cache but a miss in the L1$. This happens when an instruction that got stored in the µOP cache gets evicted from L1. During the fetch stage probing must be done from both paths. Zen has a specific unit called 'Micro-Tags' which does the probing and determines whether the instruction should be accessed from the µOP cache or from the L1I$. The µOP cache itself has a dedicated $tags for accessing those µOPs. | ||
− | ===== | + | ===== decode ===== |
[[File:amd fastpath single-double (zen).svg|right|450px]] | [[File:amd fastpath single-double (zen).svg|right|450px]] | ||
Having to execute [[x86]], there are instructions that actually include multiple operations. Some of those operations cannot be realized efficiently in an OoOE design and therefore must be converted into simpler operations. In the front-end, complex x86 instructions are broken down into simpler fixed-length operations called [[macro-operations]] or MOPs (sometimes also called complex OPs or COPs). Those are often mistaken for being "[[RISC]]ish" in nature but they retain their CISC characteristics. MOPS can perform both an arithmetic operation and memory operation (e.g. you can read, modify, and write in a single MOP). MOPs can be further cracked into smaller simpler single fixed length operation called [[micro-operations]] (µOPs). µOPs are a fixed length operation that performs just a single operation (i.e., only a single load, store, or an arithmetic). Traditionally AMD used to distinguish between the two ops, however with Zen AMD simply refers to everything as µOPs although internally they are still two separate concepts. | Having to execute [[x86]], there are instructions that actually include multiple operations. Some of those operations cannot be realized efficiently in an OoOE design and therefore must be converted into simpler operations. In the front-end, complex x86 instructions are broken down into simpler fixed-length operations called [[macro-operations]] or MOPs (sometimes also called complex OPs or COPs). Those are often mistaken for being "[[RISC]]ish" in nature but they retain their CISC characteristics. MOPS can perform both an arithmetic operation and memory operation (e.g. you can read, modify, and write in a single MOP). MOPs can be further cracked into smaller simpler single fixed length operation called [[micro-operations]] (µOPs). µOPs are a fixed length operation that performs just a single operation (i.e., only a single load, store, or an arithmetic). Traditionally AMD used to distinguish between the two ops, however with Zen AMD simply refers to everything as µOPs although internally they are still two separate concepts. | ||
Line 433: | Line 403: | ||
Decoding is done by the 4 Zen decoders. The decode stage allows for four [[x86]] instructions to be decoded per cycle which are in turn sent to the µOP Queue. Previously, in the {{\\|Bulldozer}}/{{\\|Jaguar}}-based designs AMD had two paths: a FastPath Single which emitted a single MOP and a FastPath Double which emitted two MOPs which are in turn sent down the pipe to the schedulers. Michael Clark (Zen's lead architect) noted that Zen has significantly denser MOPs meaning almost all instructions will be a FastPath Single (i.e., one to one transformations). What would normally get broken down into two MOPs in {{\\|Bulldozer}} is now translated into a single dense MOP. It's for those reasons that while up to 8MOPs/cycle can be emitted, usually only 4MOPs/cycle are emitted from the [[instruction decoder|decoders]]. | Decoding is done by the 4 Zen decoders. The decode stage allows for four [[x86]] instructions to be decoded per cycle which are in turn sent to the µOP Queue. Previously, in the {{\\|Bulldozer}}/{{\\|Jaguar}}-based designs AMD had two paths: a FastPath Single which emitted a single MOP and a FastPath Double which emitted two MOPs which are in turn sent down the pipe to the schedulers. Michael Clark (Zen's lead architect) noted that Zen has significantly denser MOPs meaning almost all instructions will be a FastPath Single (i.e., one to one transformations). What would normally get broken down into two MOPs in {{\\|Bulldozer}} is now translated into a single dense MOP. It's for those reasons that while up to 8MOPs/cycle can be emitted, usually only 4MOPs/cycle are emitted from the [[instruction decoder|decoders]]. | ||
− | Dispatch is capable of sending up to 6 µOP to [[Integer]] EX and an | + | Dispatch is capable of sending up to 6 µOP to [[Integer]] EX and an additional 4 µOP to the [[Floating Point]] (FP) EX. Zen can dispatch to both at the same time (i.e. for a maximum of 10 µOP per cycle). |
====== MSROM ====== | ====== MSROM ====== | ||
Line 441: | Line 411: | ||
A number of optimization opportunities are exploited at this stage. | A number of optimization opportunities are exploited at this stage. | ||
====== Stack Engine ====== | ====== Stack Engine ====== | ||
− | At the decode stage Zen incorporates the the Stack Engine Memfile (SEM). Note that while AMD refers to SEM as a new unit, they have had a Stack Engine in their designs since {{\\|K10}}. The Memfile sits between the queue and dispatch monitoring the MOP traffic. The Memfile is capable of performing [[store-to-load forwarding]] right at dispatch for loads that trail behind known stores with physical addresses. Other things such as eliminating stack PUSH/POP operations are also done at this stage so they are effectively a zero-latency instructions; proceeding instructions that | + | At the decode stage Zen incorporates the the Stack Engine Memfile (SEM). Note that while AMD refers to SEM as a new unit, they have had a Stack Engine in their designs since {{\\|K10}}. The Memfile sits between the queue and dispatch monitoring the MOP traffic. The Memfile is capable of performing [[store-to-load forwarding]] right at dispatch for loads that trail behind known stores with physical addresses. Other things such as eliminating stack PUSH/POP operations are also done at this stage so they are effectively a zero-latency instructions; proceeding instructions that relay on the stack pointer are not delayed. This is a fairly effective low-power solution that off-loads some of the work that would otherwise be done by [[AGU]]. |
====== µOP-Fusion ====== | ====== µOP-Fusion ====== | ||
Line 451: | Line 421: | ||
==== Execution Engine ==== | ==== Execution Engine ==== | ||
[[File:amd zen hc28 integer.png|350px|right]] | [[File:amd zen hc28 integer.png|350px|right]] | ||
− | As mentioned early, Zen returns to a fully partitioned core design with a private L2 cache and private [[FP]]/[[SIMD]] units. Previously those units shared resources spanning two cores. Zen's Execution Engine (Back-End) is split into two major sections: [[integer]] & memory operations and [[floating point]] operations. The two sections are decoupled with independent [[register renaming|renaming]], [[schedulers]], [[queues]], and execution units. Both Integer and FP sections have access to the [[Retire Queue]] which is 192 entries | + | As mentioned early, Zen returns to a fully partitioned core design with a private L2 cache and private [[FP]]/[[SIMD]] units. Previously those units shared resources spanning two cores. Zen's Execution Engine (Back-End) is split into two major sections: [[integer]] & memory operations and [[floating point]] operations. The two sections are decoupled with independent [[register renaming|renaming]], [[schedulers]], [[queues]], and execution units. Both Integer and FP sections have access to the [[Retire Queue]] which is 192 entries and can [[retire]] 8 instructions per cycle (independent of either Integer or FP). The wider-than-dispatch retire allows Zen to catch up and free the resources much quicker (previous architectures saw bottleneck at this point in situations where an older op is stalling causing a reduction in performance due to retire needing to catch up to the front of the machine). |
Because the two regions are entirely divided, a penalty of one cycle latency will incur for operands that crosses boundaries; for example, if an [[operand]] of an integer arithmetic µOP depends on the result of a floating point µOP operation. This applies both ways. This is a similar to the inter-[[Common Data Bus]] exchanges in Intel's designs (e.g., {{intel|Skylake|l=arch}}) which incur a delay of 1 to 2 cycles when dependent operands cross domains. | Because the two regions are entirely divided, a penalty of one cycle latency will incur for operands that crosses boundaries; for example, if an [[operand]] of an integer arithmetic µOP depends on the result of a floating point µOP operation. This applies both ways. This is a similar to the inter-[[Common Data Bus]] exchanges in Intel's designs (e.g., {{intel|Skylake|l=arch}}) which incur a delay of 1 to 2 cycles when dependent operands cross domains. | ||
Line 479: | Line 449: | ||
==== Memory Subsystem ==== | ==== Memory Subsystem ==== | ||
[[File:amd zen hc28 memory.png|300px|right]] | [[File:amd zen hc28 memory.png|300px|right]] | ||
− | Loads and Stores are conducted via the two AGUs which can operate simultaneously. Zen has a | + | Loads and Stores are conducted via the two AGUs which can operate simultaneously. Zen has a much larger load queue capable of supporting 72 out-of-order loads (same as Intel's {{intel|Skylake|l=arch}}). There is also a 44-entry Store Queue. Zen employs a split TLB-data pipe design which allows TLB tag access to take place while the data cache is being fed in order to determine if the data is available and send their address to the L2 to start prefetching early on. Zen is capable of up to two loads per cycle (2x16B each) and up to one store per cycle (1x16B). The L1 TLB is 64-entry for all page sizes and the L2 TLB is a 1536-entry with no 1 GiB pages. |
− | Zen incorporates a 64 KiB 4-way set associative L1 instruction cache and a 32 KiB 8-way set associative | + | Zen incorporates a 64 KiB 4-way set associative L1 instruction cache and a 32 KiB 8-way set associative L2 data cache. Both the instruction cache and the data cache can fetch from the L2 cache at 32 Bytes per cycle. The L2 cache is a 512 KiB 8-way set associative unified cache, inclusive, and private to the core. The L2 cache can fetch and write 32B/cycle into the L3 (32B in either direction each cycle, i.e. bidirectional bus). |
== Infinity Fabric == | == Infinity Fabric == | ||
Line 585: | Line 555: | ||
[[File:10682-icon-neural-net-prediction-140x140.png|50px|left]] | [[File:10682-icon-neural-net-prediction-140x140.png|50px|left]] | ||
− | '''Neural Net Prediction''' - This appears to be largely marketing term for Zen's much beefier and more finely | + | '''Neural Net Prediction''' - This appears to be largely marketing term for Zen's much beefier and more finely tune [[branch prediction]] unit. Zen uses a [[perceptron branch predictor|hashed perceptron system]] to intelligently anticipate future code flows, allowing warming up of cold blocks in order to avoid possible waits. Most of that functionality is already found on every modern high-end microprocessor (including AMD's own previous microarchitectures). Because AMD has not disclosed any more specific information about BP, it can only be speculated that no new groundbreaking logic was introduced in Zen. |
{{clear}} | {{clear}} | ||
[[File:10682-icon-smart-prefetch-140x140.png|50px|left]] | [[File:10682-icon-smart-prefetch-140x140.png|50px|left]] | ||
Line 637: | Line 607: | ||
[[File:amd naples mcp.png|250px|right]] | [[File:amd naples mcp.png|250px|right]] | ||
− | In addition to the large amount of memory supported by the four Zeppelin, all EPYC offer the full 64 MiB a result | + | In addition to the large amount of memory supported by the four Zeppelin, all EPYC offer the full 64 MiB a result from 8 MiB from each of the 8 CCXs. The way binning is done for the various EPYC models is by disabing either 1, 2, 3, or 4 cores per CCX from each of the Zeppelin dies to form either [[8 cores|8]], [[16 cores|16]], [[24 cores|24]], or [[32 cores|32]]. |
<div style="display: block;"> | <div style="display: block;"> | ||
[[File:amd epyc interconnect.png|650px]] | [[File:amd epyc interconnect.png|650px]] | ||
− | This image originates from a slide presented at AMD EPYC Tech Day, June 20, 2017 and shows one layer of die interconnects on the EPYC | + | This image originates from a slide presented at AMD EPYC Tech Day, June 20, 2017 and shows one layer of die interconnects on the EPYC interposer. The pink lines fanning out at the top and bottom connect to the UMCs on the respective chip. The light blue and pink lines in the center are bidirectional GMI links. The UMC connections of the top left and bottom right chip, the GMI link from the bottom left to the top right chip, and the PCIe connections are not visible in this picture. Of the four GMI interfaces on each die only the three closest to the other dies are used. It should be noted that its creator pasted Zeppelin die shots onto the image and improperly reflected the top left and bottom right chip. In reality four identical dies are used, with the top left chip mounted in the same orientation as the top right chip, and both bottom chips rotated by 180 degrees.</div> |
{{clear}} | {{clear}} | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
=== Modules (Zeppelin) === | === Modules (Zeppelin) === | ||
− | In order to reduce various development costs (e.g., [[masks]]), AMD kept the number of die variations to | + | In order to reduce various development costs (e.g., [[masks]]), AMD kept the number of die variations to minimum. Zen is composed of individual modules (i.e., dies) called '''Zeppelins''' that can be interconnected in a [[multi-chip module]] to form larger systems. The {{amd|Threadripper}} die is the same as the {{amd|Ryzen}} die. Likewise the {{amd|EPYC}} family uses the same die. The differences between the processors is how those dies are connected together and which features are enabled and exposed. |
<gallery widths=550px heights=300px> | <gallery widths=550px heights=300px> | ||
Line 795: | Line 749: | ||
− | :[[File:amd zen octa-core die shot.png | + | :[[File:amd zen octa-core die shot.png|950px]] |
Line 904: | Line 858: | ||
{{comp table end}} | {{comp table end}} | ||
− | == | + | == References == |
− | * | + | * Clark, Mike. "A new x86 core architecture for the next generation of computing." Hot Chips 28 Symposium (HCS), 2016 IEEE. IEEE, 2016. |
− | |||
− | |||
− | |||
* AMD x86 Memory Encryption Technologies, Linux Security Summit 2016, David Kaplan, Security Architect, August 25, 2016 | * AMD x86 Memory Encryption Technologies, Linux Security Summit 2016, David Kaplan, Security Architect, August 25, 2016 | ||
* Lisa Su, AMD CEO, AMD: New Horizon Live Event | * Lisa Su, AMD CEO, AMD: New Horizon Live Event | ||
* Lisa Su, AMD CEO, AMD Annual Meeting of Shareholders Q4 2016 | * Lisa Su, AMD CEO, AMD Annual Meeting of Shareholders Q4 2016 | ||
* Meet the AMD Experts - AMD Monthly Partner Training, January 2017 | * Meet the AMD Experts - AMD Monthly Partner Training, January 2017 | ||
− | * IEEE | + | * Singh, Teja, et al. "3.2 Zen: A next-generation high-performance x86 core." Solid-State Circuits Conference (ISSCC), 2017 IEEE International. IEEE, 2017. |
* AMD 'Tech Day', February 22, 2017 | * AMD 'Tech Day', February 22, 2017 | ||
* AMD Infinity Fabric introduction by Mark Papermaster, April 6, 2017 | * AMD Infinity Fabric introduction by Mark Papermaster, April 6, 2017 | ||
* AMD Zen at GDC 2017, March 3, 2017 | * AMD Zen at GDC 2017, March 3, 2017 | ||
+ | * Processor Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1 Processors | ||
* AMD 2017 Financial Analyst Day, May 16, 2017 | * AMD 2017 Financial Analyst Day, May 16, 2017 | ||
* AMD EPYC Tech Day, June 20, 2017 | * AMD EPYC Tech Day, June 20, 2017 | ||
− | |||
* AMD Ryzen Processor With Radeon Vega Graphics, October, 2017 | * AMD Ryzen Processor With Radeon Vega Graphics, October, 2017 | ||
− | |||
− | |||
== Documents == | == Documents == |
Facts about "Zen - Microarchitectures - AMD"
codename | Zen + |
core count | 4 +, 6 +, 8 +, 16 +, 24 +, 32 + and 12 + |
designer | AMD + |
first launched | March 2, 2017 + |
full page name | amd/microarchitectures/zen + |
instance of | microarchitecture + |
instruction set architecture | x86-64 + |
manufacturer | GlobalFoundries + |
microarchitecture type | CPU + |
name | Zen + |
pipeline stages | 19 + |
process | 14 nm (0.014 μm, 1.4e-5 mm) + |