 
|manufacturer=GlobalFoundries
|manufacturer 2=TSMC
|introduction=July 2019
|process=GloFo 14LPP
|process 2=TSMC N7
|process 3=GloFo 12LP
|cores=4
|cores 2=6
|cores 3=8
|cores 4=12
|cores 5=16
|cores 6=24
|cores 7=32
|cores 8=64
|type=Superscalar
|oooe=Yes
|extension 27=UMIP
|extension 28=CLZERO
|core name=Renoir (APU/Mobile)
|core name 2=Matisse (Desktop)
|core name 3=Castle Peak (HEDT)
|core name 4=Rome (Server)
|predecessor=Zen+
|predecessor link=amd/microarchitectures/zen+
|successor link=amd/microarchitectures/zen 3
}}
'''Zen 2''' is [[AMD]]'s successor to {{\\|Zen+}}, and is a [[7 nm process]] [[microarchitecture]] for mainstream mobile, desktops, workstations, and servers. Zen 2 was replaced by {{\\|Zen 3}}.

For performance desktop and mobile computing, Zen 2 is branded as {{amd|Athlon}}, {{amd|Ryzen 3}}, {{amd|Ryzen 5}}, {{amd|Ryzen 7}}, {{amd|Ryzen 9}}, and {{amd|Ryzen Threadripper}} processors. For servers, Zen 2 is branded as {{amd|EPYC}}.
[[File:amd zen2-3 roadmap.png|thumb|right|Zen 2 on the roadmap]]

'''Product Codenames:'''
 
 
{| class="wikitable"
|-
| {{amd|Renoir|l=core}} || Up to 8/16 || Mainstream APUs with {{\\|Vega}} GPUs
|}

'''Architectural Codenames:'''
{| class="wikitable"
|-
! Arch !! Codename
|-
| Core || Valhalla
|-
| CCD || Aspen Highlands
|}
  
 
! colspan="10" | Mainstream
|-
| [[File:amd ryzen 3 logo.png|75px|link=Ryzen 3]] || {{amd|Ryzen 3}} || Entry level Performance || [[quad-core|Quad]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|some}} || {{tchk|some}}<sup>1</sup> || {{tchk|some}}<sup>2</sup> || {{tchk|no}}
|-
| [[File:amd ryzen 5 logo.png|75px|link=Ryzen 5]] || {{amd|Ryzen 5}} || Mid-range Performance || [[hexa-core|Hexa]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|some}} || {{tchk|some}}<sup>1</sup> || {{tchk|some}}<sup>2</sup> || {{tchk|no}}
|-
| [[File:amd ryzen 7 logo.png|75px|link=Ryzen 7]] || {{amd|Ryzen 7}} || High-end Performance || [[octa-core|Octa]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|some}} || {{tchk|some}}<sup>1</sup> || {{tchk|some}}<sup>2</sup> || {{tchk|no}}
|-
| [[File:amd ryzen 9 logo.png|75px|link=Ryzen 9]] || {{amd|Ryzen 9}} || High-end Performance || [[12 cores|12]]-[[16 cores|16]] (Desktop) <br /> [[octa-core|Octa]] (Mobile) || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|some}}<sup>1</sup> || {{tchk|some}}<sup>2</sup> || {{tchk|no}}
|-
! colspan="10" | Enthusiasts / Workstations
|-
! colspan="10" | Embedded / Edge
|-
| [[File:epyc embedded logo.png|75px|link=amd/epyc embedded]] || {{amd|EPYC Embedded}} || Embedded / Edge Server Processor || [[8 cores|8]]-[[64 cores|64]] || {{tchk|no}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}}
|-
| [[File:ryzen embedded logo.png|75px|link=amd/ryzen embedded]] || {{amd|Ryzen Embedded}} || Embedded APUs || [[6 cores|6]]-[[8 cores|8]] || {{tchk|no}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}}
|}
<sup>1</sup> Only available on G, GE, H, HS, HX and U SKUs. <br />
<sup>2</sup> ECC support is unavailable on AMD APUs.
 
  
 
== Process technology ==
Zen 2 comprises multiple different components:

* The Core Complex Die (CCD) is fabricated on [[TSMC]] [[7 nm process|7 nm HPC process]]
* The client I/O Die (cIOD) is fabricated on [[GlobalFoundries]] [[12 nm process]]
* The server I/O Die (sIOD) is fabricated on [[GlobalFoundries]] [[14 nm process]]
=== Key changes from {{\\|Zen+}} ===
* [[7 nm process]] (from [[12 nm]])
** I/O die utilizes [[12 nm]]
  
 
Furthermore, Zen 2 adds the User-Mode Instruction Prevention ({{x86|UMIP}}) extension.
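
Software can detect these additions at run time with CPUID. The following is a minimal sketch (not part of the original article) using the GCC/Clang <cpuid.h> helpers; UMIP is reported in CPUID leaf 7 ECX bit 2 and CLZERO in CPUID leaf 8000_0008h EBX bit 0.

<syntaxhighlight lang="c">
/* Hedged example: feature detection for UMIP and CLZERO on x86-64. */
#include <cpuid.h>   /* GCC/Clang CPUID helpers */
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* Structured extended feature flags: leaf 7, sub-leaf 0; UMIP is ECX bit 2. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        printf("UMIP:   %s\n", (ecx & (1u << 2)) ? "supported" : "not supported");

    /* AMD extended leaf 0x80000008; CLZERO is EBX bit 0. */
    if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
        printf("CLZERO: %s\n", (ebx & (1u << 0)) ? "supported" : "not supported");

    return 0;
}
</syntaxhighlight>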
 
  
 
=== Block Diagram ===
 
** L3 Cache:
*** Matisse, Castle Peak, Rome: 16 MiB/CCX, shared across all cores
*** Renoir: 4 MiB/CCX, shared across all cores
*** 16-way set associative
**** 16,384 sets, 64 B line size
  
 
==== Branch Prediction Unit ====
The branch prediction unit guides instruction fetching and attempts to predict branches and their target to avoid pipeline stalls or the pursuit of incorrect execution paths. The Zen 2 BPU almost doubles the branch target buffer capacity, doubles the size of the indirect target array, and introduces a TAGE predictor. According to AMD it exhibits a 30% lower misprediction rate than its perceptron counterpart in the {{\\|Zen}}/{{\\|Zen+}} microarchitecture.

Once per cycle the next address logic determines if branch instructions have been identified in the current 64-byte instruction fetch block, and if so, consults several branch prediction facilities about the most likely target and calculates a new fetch block address. If no branches are expected it calculates the address of the next sequential block. Branches are evaluated much later in the integer execution unit which provides the actual branch outcome to redirect instruction fetching and refine the predictions. The dispatch unit can also cause redirects to handle mispredictions and exceptions.
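
The cost of mispredictions can be illustrated with a small experiment (illustrative only, not from AMD's material): the loop below runs noticeably faster once its data-dependent branch becomes predictable, because sorted input lets the predictor settle into a stable pattern.

<syntaxhighlight lang="c">
/* Hedged sketch: time the same branchy loop over random vs. sorted data.
   Sizes and the threshold are arbitrary illustration values. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static long sum_above_threshold(const int *v, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        if (v[i] >= 128)      /* hard to predict on random data, trivial on sorted data */
            sum += v[i];
    return sum;
}

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int *v = malloc(N * sizeof *v);
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;

    clock_t t0 = clock();
    long a = sum_above_threshold(v, N);   /* random order: frequent mispredictions */
    clock_t t1 = clock();

    qsort(v, N, sizeof *v, cmp_int);

    clock_t t2 = clock();
    long b = sum_above_threshold(v, N);   /* sorted order: near-perfect prediction */
    clock_t t3 = clock();

    printf("random: %ld in %.3f s, sorted: %ld in %.3f s\n",
           a, (double)(t1 - t0) / CLOCKS_PER_SEC,
           b, (double)(t3 - t2) / CLOCKS_PER_SEC);
    free(v);
    return 0;
}
</syntaxhighlight>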
The integer execution (EX) unit consists of a dedicated rename unit, five schedulers, a 180-entry physical register file (PRF), four ALU and three AGU pipelines, and a 224-entry retire queue shared with the floating point unit. The depth of the four ALU scheduler queues increased from 14 to 16 entries in Zen 2. The two AGU schedulers with a 14-entry queue of the Zen/Zen+ microarchitecture were replaced by one unified scheduler with a 28-entry queue. A third AGU pipeline only for store operations was added as well. The size of the PRF increased from 168 to 180 entries, the capacity of the retire queue from 192 to 224 entries.

The retire queue and the integer and floating point rename units form the retire control unit (RCU) tracking instructions, registers, and dependencies in the out-of-order execution units. Macro-ops stay in the retire queue until completion or until an exception occurs. When all macro-ops constituting an instruction have completed successfully the instruction becomes eligible for retirement. Instructions are retired in program order. The retire queue can track up to 224 macro-ops in flight, 112 per thread in SMT mode, and retire up to eight macro-ops per cycle.

The rename unit receives up to six macro-ops per cycle from the dispatch unit. It maps general purpose architectural registers and temporary registers used by microcoded instructions to physical registers and allocates physical registers to receive ALU results. The PRF has 180 entries. Up to 38 registers per thread are mapped to architectural or temporary registers, the rest are available for out-of-order renames.
Zen 2 inherits the move elimination and XMM register merge optimizations from its predecessors. Register to register moves are performed by renaming the destination register and do not occupy any scheduling or execution resources. XMM register merging occurs when SSE instructions such as SQRTSS leave the upper 64 or 96 bits of the destination register unchanged, causing a dependency on previous instructions writing to this register. AMD family 15h and later processors can eliminate this dependency in a sequence of scalar FP instructions by recording in the RAT if those bits were zeroed. By setting a Z-bit in the RAT the rename unit also tracks if an architectural register was completely zeroed. All-zero registers are not mapped, the zero data bits are injected at the bypass network which conserves power and PRF entries, and allows for more instructions in flight. Earlier AMD microarchitectures, apparently including Zen/Zen+, recognize zeroing idioms such as XORPS combining a register with itself and eliminate the dependency on the source register. AMD did not disclose if or how this is implemented in Zen 2. Family 16h processors recognize zeroing idioms in the instruction decode unit and set the Z-bit on the destination register in the floating point rename unit, completing the operation without consuming scheduling or execution resources.
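
The zeroing idioms and register-register moves described above correspond to simple source patterns. The snippet below is illustrative only; as noted, AMD has not disclosed exactly how zeroing-idiom handling is implemented in Zen 2, so this merely shows the instruction forms compilers typically emit (XORPS/VXORPS, MOVAPS/VMOVAPS) that such optimizations target.

<syntaxhighlight lang="c">
#include <immintrin.h>

/* "XORPS a register with itself": the classic zeroing idiom. The result is all
   zeros regardless of v, so the instruction need not wait for v's producer. */
__m128 zero_by_xor(__m128 v) {
    return _mm_xor_ps(v, v);
}

/* Compilers also emit xorps/vxorps for an explicit zero constant. */
__m128 zero_constant(void) {
    return _mm_setzero_ps();
}

/* A plain register-to-register copy (movaps/vmovaps); on Zen 2 such moves are
   performed by renaming the destination and use no execution resources. */
__m128 copy_register(__m128 src) {
    __m128 dst = src;
    return dst;
}
</syntaxhighlight>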
  
As in the Zen/Zen+ microarchitecture the 64-entry non-scheduling queue decouples dispatch and the FP scheduler. This allows dispatch to send operations to the integer side, in particular to expedite floating point loads and store address calculations, while the FP scheduler, whose capacity cannot be arbitrarily increased for complexity reasons, is busy with higher latency FP operations. The 36-entry out-of-order scheduler issues up to four micro-ops per cycle to the execution pipelines. A 160-entry physical register file holds the speculative and committed contents of architectural and temporary registers. The PRF has 8 read ports and 4 write ports for ALU results, each 256 bits wide in Zen 2, and two additional write ports supporting up to two 256-bit load operations per cycle, up from two 128-bit loads in Zen. The load convert (LDCVT) logic converts data to the internal register file format. The FPU is capable of <i>superforwarding</i> load data to dependent instructions through the bypass network, obviating a write and read of the PRF. The bypass network enables back-to-back execution of dependent instructions by forwarding results from the FP ALUs to the previous pipeline stage.

The floating point pipelines are asymmetric, each supporting a different set of operations, and the ALUs are grouped in domains, to conserve die space and reduce signal path lengths which permits higher clock frequencies. The number of execution resources available reflects the density of different instruction types in x86 code. Each pipe receives two operands from the PRF or bypass network. Pipe 0 and 1 support
A two-level translation lookaside buffer (TLB) assists load and store address translation. The fully-associative L1 data TLB contains 64 entries and holds 4-Kbyte, 2-Mbyte, and 1-Gbyte page table entries. The L2 data TLB is a unified 12-way set-associative cache with 2048 entries, up from 1536 entries in Zen, holding 4-Kbyte and 2-Mbyte page table entries, as well as page directory entries (PDEs) to speed up DTLB and ITLB table walks. 1-Gbyte pages are <i>smashed</i> into 2-Mbyte entries but installed as 1-Gbyte entries when reloaded into the L1 TLB.

Two hardware page table walkers handle L2 TLB misses, presumably one serving the DTLB, another the ITLB. In addition to the PDE entries, the table walkers include a 64-entry page directory cache which holds page-map-level-4 and page-directory-pointer entries speeding up DTLB and ITLB walks. Like the Zen/Zen+ microarchitecture, Zen 2 supports page table entry (PTE) coalescing. When the table walker loads a PTE, which occupies 8 bytes in the x86-64 architecture, from memory it also examines the other PTEs in the same 64-byte cache line. If a 16-Kbyte aligned block of four consecutive 4-Kbyte pages is also consecutive and 16-Kbyte aligned in physical address space and has identical page attributes, it is stored into a single TLB entry, greatly improving the efficiency of this cache. This is only done when the processor is operating in long mode.
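
A simplified model of that coalescing check is sketched below. This is not AMD's implementation and not the real x86-64 PTE layout; the structure fields are placeholders, assuming the frame number and attributes have already been decoded from each PTE.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t pfn;     /* physical frame number (physical address >> 12) */
    uint64_t attrs;   /* access/protection attributes, already extracted */
} pte_t;

/* 'ptes' covers a 16-KiB aligned virtual block of four 4-KiB pages. The block
   can share one TLB entry if the frames are contiguous, 16-KiB aligned in
   physical address space, and carry identical attributes. */
static bool can_coalesce_16k(const pte_t ptes[4]) {
    if (ptes[0].pfn % 4 != 0)                            /* physical 16-KiB alignment */
        return false;
    for (int i = 1; i < 4; i++) {
        if (ptes[i].pfn != ptes[0].pfn + (uint64_t)i)    /* consecutive frames */
            return false;
        if (ptes[i].attrs != ptes[0].attrs)              /* matching attributes */
            return false;
    }
    return true;
}
</syntaxhighlight>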
  
 
The LS unit relies on a 32 KiB, 8-way set associative, write-back, ECC-protected L1 data cache (DC). It supports two loads, if they access different DC banks, and one store per cycle, each up to 256 bits wide. The line width is 64 bytes; however, cache stores are aligned to a 32-byte boundary. Loads spanning a 64-byte boundary and stores spanning a 32-byte boundary incur a penalty of one cycle. In the Zen/Zen+ microarchitecture 256-bit vectors are loaded and stored as two 128-bit halves; the load and store boundaries are 32 and 16 bytes respectively. Zen 2 can load and store 256-bit vectors in a single operation, but stores must be 32-byte aligned now to avoid the penalty. As in Zen, aligned and unaligned load and store instructions (for example MOVUPS/MOVAPS) provide identical performance.
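
For vectorized code this translates into simple guidance: keep 256-bit accesses on 32-byte boundaries where possible. The sketch below is an illustration, not AMD-provided code; per the text above, the unaligned instruction forms cost nothing extra when the addresses happen to be aligned, so the loop simply uses them over 32-byte aligned buffers.

<syntaxhighlight lang="c">
#include <immintrin.h>
#include <stdlib.h>

/* Multiply n floats by a factor using 256-bit AVX operations. */
void scale(float *dst, const float *src, float factor, size_t n) {
    __m256 f = _mm256_set1_ps(factor);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);              /* full speed when src is aligned */
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(v, f));   /* aligned dst: no 32-byte boundary penalty */
    }
    for (; i < n; i++)                                    /* scalar tail */
        dst[i] = src[i] * factor;
}

int main(void) {
    size_t n = 1024;
    /* 32-byte aligned allocations keep every 256-bit store within one
       32-byte store boundary. */
    float *src = aligned_alloc(32, n * sizeof(float));
    float *dst = aligned_alloc(32, n * sizeof(float));
    for (size_t i = 0; i < n; i++)
        src[i] = (float)i;
    scale(dst, src, 2.0f, n);
    free(src);
    free(dst);
    return 0;
}
</syntaxhighlight>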
== Core Complex ==
Zen 2 organizes CPU cores in a core complex (CCX). A CCX comprises four cores sharing a 16 MiB, 16-way set associative, write-back, ECC protected, L3 cache. The L3 capacity doubled over the Zen/Zen+ microarchitecture. The cache is divided into four slices of 4 MiB capacity. Each core can access all slices with the same average load-to-use latency of 39 cycles, compared to 35 cycles in the previous generation. The Zen CCX is a flexible design allowing AMD to omit cores or cache slices in APUs and embedded processors. All Zen 2-based processors introduced as of late 2019 have the same CCX configuration, only the number of usable cores and L3 slices varies by processor model.
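
Load-to-use latencies like the figure above are commonly measured with dependent pointer chasing. The sketch below is illustrative (footprint, iteration count, and names are arbitrary): it reports nanoseconds per dependent load for a working set that overflows the 512 KiB L2 but fits within a single CCX's 16 MiB L3.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE_BYTES 64
#define FOOTPRINT  (8u << 20)                 /* 8 MiB: larger than L2, within one CCX's L3 */
#define NODES      (FOOTPRINT / LINE_BYTES)
#define ITERS      (50u * 1000u * 1000u)

/* One node per cache line so every hop touches a new line. */
struct node { struct node *next; char pad[LINE_BYTES - sizeof(struct node *)]; };

int main(void) {
    struct node *buf = aligned_alloc(LINE_BYTES, NODES * sizeof *buf);
    size_t *order = malloc(NODES * sizeof *order);
    for (size_t i = 0; i < NODES; i++)
        order[i] = i;
    srand(1);
    for (size_t i = NODES - 1; i > 0; i--) {  /* shuffle the visit order to defeat prefetchers */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < NODES; i++)        /* link the nodes into one long random cycle */
        buf[order[i]].next = &buf[order[(i + 1) % NODES]];

    struct node *p = &buf[order[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint32_t i = 0; i < ITERS; i++)
        p = p->next;                          /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per dependent load (p=%p)\n", ns / ITERS, (void *)p);
    free(order);
    free(buf);
    return 0;
}
</syntaxhighlight>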
 
 
The width of a L3 cache line is 64 bytes. The data path between the L3 and L2 caches is 32 bytes wide. AMD did not disclose the size of miss buffers. Processors based on the Zen/Zen+ microarchitecture support 50 outstanding misses per core from L2 to L3, 96 from L3 to memory.<!-- EPYC Tech Day 2017, ISSCC 2018. -->
 
 
Each CPU core is supported by a private L2 cache. The L3 cache is a victim cache filled from L2 victims of all four cores and exclusive of L2, unless the data in the L3 cache is likely being accessed by multiple cores or is requested by an instruction fetch (non-inclusive hierarchy).
 
 
The L3 cache maintains shadow tags for all cache lines of each L2 cache in the CCX. This simplifies coupled fill/victim transactions between the L2 and L3 cache, and allows the L3 cache to act as a probe filter for requests between the L2 caches in the CCX, external probes and, taking advantage of its knowledge that a cache line shared by two or more L2 caches is exclusive to this CCX, probe traffic to the rest of the system. If a core misses in its L2 cache and the L3 cache, and the shadow tags indicate a hit in another L2 cache, a cache-to-cache transfer within the CCX is initiated. CCXs are not directly connected, even if they reside on the same die. Requests leaving the CCX pass through the scalable data fabric on the I/O die.
 
 
Zen 2 introduces the AMD64 Technology Platform Quality of Service Extensions which aim for compatibility with Intel Resource Director Technology, specifically CMT, MBM, CAT, and CDP.<!-- AMD did not advertise CDP support in Zen 2 but the feature is documented in AMD Publ #56375. --> An AMD-specific PQE-BW extension supports read bandwidth enforcement equivalent to Intel's MBA. The L3 cache is not a last level cache shared by all cores in a package as on Intel CPUs, so each CCX corresponds to one QoS domain. L2 QoS monitoring and enforcement is not supported.
 
 
Resources are tracked using a Resource Monitoring ID (RMID) which abstracts an application, thread, VM, or cgroup. For each logical processor only one RMID is active at a given time. CMT (Cache Monitoring Technology) allows an OS, hypervisor, or VMM to measure the amount of L3 cache occupied by a thread. MBM (Memory Bandwidth Monitoring) counts read requests to the rest of the system. CAT (Cache Allocation Technology) divides the L3 cache into a number of logical segments, possibly corresponding to ways, and allows system software to restrict a thread to an arbitrary, possibly empty, set of segments. Note that only a single copy of the data is stored in the L3 cache even if it is accessed by threads with mutually exclusive sets. CDP (Code and Data Prioritization) extends CAT by differentiating between sets for code and data accesses. MBA (Memory Bandwidth Allocation) and PQE-BW (Platform Quality of Service Enforcement for Memory Bandwidth) limit the memory bandwidth a thread can consume within its QoS domain. Benefits of QoS include the ability to protect time critical processes from cache intensive background tasks, to reduce contention by scheduling threads according to their resource usage, and to mitigate interference from <i>noisy neighbors</i> in multitenant virtual machines.
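
On Linux these facilities are exposed through the resctrl filesystem, the same interface used for Intel RDT. The following is a hedged sketch, assuming a kernel with resctrl support mounted at /sys/fs/resctrl and root privileges; the group name, segment mask, and PID are arbitrary examples, not recommendations.

<syntaxhighlight lang="c">
#include <stdio.h>
#include <sys/stat.h>

/* Write a string to a sysfs-style file; returns 0 on success. */
static int write_str(const char *path, const char *s) {
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int rc = (fputs(s, f) >= 0) ? 0 : -1;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}

int main(void) {
    /* Prerequisite (as root): mount -t resctrl resctrl /sys/fs/resctrl */

    /* Create a resource control group for background work. */
    if (mkdir("/sys/fs/resctrl/background", 0755) != 0)
        perror("mkdir");

    /* Restrict the group to 4 of the 16 L3 segments of cache domain 0 (one CCX). */
    if (write_str("/sys/fs/resctrl/background/schemata", "L3:0=000f\n") != 0)
        perror("schemata");

    /* Assign a process to the group; PID 1234 is purely illustrative. */
    if (write_str("/sys/fs/resctrl/background/tasks", "1234\n") != 0)
        perror("tasks");

    return 0;
}
</syntaxhighlight>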
 
 
Rome, Castle Peak, and Matisse are multi-die designs combining an I/O die tailored for their market and between one and eight identical core complex dies (CCDs), each containing two independent core complexes, a system management unit (SMU), and a global memory interconnect version 2 (GMI2) interface.
 
 
The GMI2 interface extends the scalable data fabric from the I/O die to the CCDs, presumably a bi-directional 32-lane [[amd/infinity_fabric|IFOP]] link comparable to the die-to-die links in first and second generation EPYC and Threadripper processors. According to AMD the die-to-die bandwidth increased from 16 B read + 16 B write to 32 B read + 16 B write per fclk.
 
 
Inferring its function from earlier Family 17h processors the SMU is a microcontroller which captures temperatures, voltage and current levels, adjusts CPU core frequencies and voltages, and applies local limits in a fast local closed loop and a global loop with a master SMU on the I/O die. The SMUs communicate through the scalable control fabric, presumably including a dedicated single lane IFOP SerDes on each CCD.
 
  
 
== Rome ==
  
 
The centralized I/O die incorporates eight {{amd|Infinity Fabric}} links, 128 [[PCIe]] Gen 4 lanes, and eight [[DDR4]] memory channels. The full capabilities of the I/O die have not been disclosed yet. Attached to the I/O die are eight compute dies - each with eight Zen 2 cores - for a total of 64 cores and 128 threads per chip.
 
== Die ==
=== Zen 2 CPU core ===
* TSMC [[N7|7-nanometer process]]
* 13 metal layers<ref name="isscc2020j-zen2">Singh, Teja; Rangarajan, Sundar; John, Deepesh; Schreiber, Russell; Oliver, Spence; Seahra, Rajit; Schaefer, Alex (2020). <i>2.1 Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core</i>. 2020 IEEE International Solid-State Circuits Conference. pp. 42-44. doi:[https://doi.org/10.1109/ISSCC19947.2020.9063113 10.1109/ISSCC19947.2020.9063113]</ref>
* 475,000,000 transistors incl. 512 KiB L2 cache and one 4 MiB L3 cache slice<ref name="isscc2020j-zen2"/>
* Core size incl. L2 cache and one L3 cache slice: 7.83 mm²<ref name="isscc2020j-zen2"/>
* Core size incl. L2 cache: 3.64 mm² (estimated)

=== Core Complex Die ===
* TSMC [[N7|7-nanometer process]]
* 13 metal layers<ref name="isscc2020j-zen2"/>
* 3,800,000,000 transistors<ref name="isscc2020p-chiplet">Naffziger, Samuel. <i>[https://www.slideshare.net/AMD/amd-chiplet-architecture-for-highperformance-server-and-desktop-products AMD Chiplet Architecture for High-Performance Server and Desktop Products]</i>. IEEE ISSCC 2020, February 17, 2020.</ref>
* Die size: 74 mm²<ref name="isscc2020p-chiplet"/><ref name="isscc2020j-chiplet">Naffziger, Samuel; Lepak, Kevin; Paraschou, Milam; Subramony, Mahesh (2020). <i>2.2 AMD Chiplet Architecture for High-Performance Server and Desktop Products</i>. 2020 IEEE International Solid-State Circuits Conference. pp. 44-45. doi:[https://doi.org/10.1109/ISSCC19947.2020.9063103 10.1109/ISSCC19947.2020.9063103]</ref>
* CCX size: 31.3 mm²<ref name="e3-2019-nhg"><i>Next Horizon Gaming</i>. Electronic Entertainment Expo 2019 (E3 2019), June 10, 2019.</ref><ref name="isscc2020p-zen2">Singh, Teja; Rangarajan, Sundar; John, Deepesh; Schreiber, Russell; Oliver, Spence; Seahra, Rajit; Schaefer, Alex. <i>[https://www.slideshare.net/AMD/zen-2-the-amd-7nm-energyefficient-highperformance-x8664-microprocessor-core Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core]</i>. IEEE ISSCC 2020, February 17, 2020.</ref>
* 2 × 16 MiB L3 cache: 2 × 16.8 mm² (estimated)

:[[File:AMD_Zen_2_CCD.jpg|500px]]

=== Client I/O Die ===
* GlobalFoundries [[14_nm_lithography_process#GlobalFoundries|12-nanometer process]]
* 2,090,000,000 transistors<ref name="isscc2020p-chiplet"/><ref name="isscc2020j-chiplet"/>
* 125 mm² die size<ref name="isscc2020p-chiplet"/><ref name="isscc2020j-chiplet"/>
* Reused as AMD X570 chipset

=== Server I/O Die ===
* GlobalFoundries [[14_nm_lithography_process#GlobalFoundries|12-nanometer process]]
* 8,340,000,000 transistors<ref name="isscc2020p-chiplet"/><ref name="isscc2020j-chiplet"/>
* 416 mm² die size<ref name="isscc2020p-chiplet"/><ref name="isscc2020j-chiplet"/>

=== Renoir Die ===
* TSMC [[N7|7-nanometer process]]
* 13 metal layers<ref name="hc32-renoir">Arora, Sonu; Bouvier Dan; Weaver, Chris. <i>[https://www.hotchips.org/archives AMD Next Generation 7nm Ryzen™ 4000 APU "Renoir"]</i>. Hot Chips 32, August 17, 2020.</ref>
* 9,800,000,000 transistors<ref name="hc32-renoir"/>
* 156 mm² die size<ref name="hc32-renoir"/>

:[[File:renoir die.png|500px]]
 
  
 
== All Zen 2 Chips ==
<!-- NOTE:
           This table is generated automatically from the data in the actual articles.
           If a microprocessor is missing from the list, an appropriate article for it needs to be
           created. -->
</table>
{{comp table end}}
 
== Designers ==
* David Suggs, Zen 2 core chief architect<ref name="mm2020-zen2">Suggs, David; Subramony, Mahesh; Bouvier, Dan (2020). <i>The AMD "Zen 2" Processor</i>. IEEE Micro. 40 (4): 45-52. doi:[https://doi.org/10.1109/MM.2020.2974217 10.1109/MM.2020.2974217]</ref>
* Mahesh Subramony, "Matisse" SoC architect<ref name="mm2020-zen2"/>

== Bibliography ==
 
* AMD 'Next Horizon', November 6, 2018
* AMD. Lisa Su ''Keynote''. May 26, 2019
* AMD 'Next Horizon Gaming' event at E3, June 10, 2019
<references/>
 
  
 
== See Also ==
* Intel {{intel|Ice Lake|l=arch}}
