From WikiChip
Editing amd/microarchitectures/zen 2
Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.
The edit can be undone.
Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.
This page supports semantic in-text annotations (e.g. "[[Is specified as::World Heritage Site]]") to build structured and queryable content provided by Semantic MediaWiki. For a comprehensive description on how to use annotations or the #ask parser function, please have a look at the getting started, in-text annotation, or inline queries help pages.
Latest revision | Your text | ||
Line 6: | Line 6: | ||
|manufacturer=GlobalFoundries | |manufacturer=GlobalFoundries | ||
|manufacturer 2=TSMC | |manufacturer 2=TSMC | ||
− | |introduction= | + | |introduction=2019 |
− | |process= | + | |process=14 nm |
− | |process 2= | + | |process 2=7 nm |
− | |process 3 | + | |process 3=12 nm |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
|type=Superscalar | |type=Superscalar | ||
|oooe=Yes | |oooe=Yes | ||
Line 31: | Line 23: | ||
|extension 5=SSE3 | |extension 5=SSE3 | ||
|extension 6=SSSE3 | |extension 6=SSSE3 | ||
− | |extension 7 | + | |extension 7=SSE4.1 |
− | + | |extension 8=SSE4.2 | |
− | |extension | + | |extension 9=POPCNT |
− | |extension | + | |extension 10=AVX |
− | |extension | + | |extension 11=AVX2 |
− | |extension | + | |extension 12=AES |
− | |extension | + | |extension 13=PCLMUL |
− | |extension | + | |extension 14=RDRND |
− | |extension | + | |extension 15=F16C |
− | + | |extension 16=BMI | |
− | |extension | + | |extension 17=BMI2 |
− | + | |extension 18=RDSEED | |
− | |extension | + | |extension 19=ADCX |
− | |extension | + | |extension 20=PREFETCHW |
− | |extension | + | |extension 21=CLFLUSHOPT |
− | |extension | + | |extension 22=XSAVE |
− | |extension | + | |extension 23=SHA |
− | |extension | + | |extension 24=CLZERO |
− | |extension | + | |core name=Rome |
− | |extension | ||
− | |extension | ||
− | |||
− | |core name | ||
− | |||
− | |||
− | |||
|predecessor=Zen+ | |predecessor=Zen+ | ||
|predecessor link=amd/microarchitectures/zen+ | |predecessor link=amd/microarchitectures/zen+ | ||
Line 62: | Line 47: | ||
|successor link=amd/microarchitectures/zen 3 | |successor link=amd/microarchitectures/zen 3 | ||
}} | }} | ||
− | '''Zen 2''' is [[AMD]]'s successor to {{\\|Zen+}}, and is a [[7 nm process]] [[microarchitecture]] for mainstream mobile, desktops, workstations, and servers. Zen 2 | + | '''Zen 2''' is [[AMD]]'s successor to {{\\|Zen+}}, and is a [[7 nm process]] [[microarchitecture]] for mainstream mobile, desktops, workstations, and servers. Zen 2 will eventually be replaced by {{\\|Zen 3}}. |
For performance desktop and mobile computing, Zen is branded as {{amd|Athlon}}, {{amd|Ryzen 3}}, {{amd|Ryzen 5}}, {{amd|Ryzen 7}}, {{amd|Ryzen 9}}, and {{amd|Ryzen Threadripper}} processors. For servers, Zen is branded as {{amd|EPYC}}. | For performance desktop and mobile computing, Zen is branded as {{amd|Athlon}}, {{amd|Ryzen 3}}, {{amd|Ryzen 5}}, {{amd|Ryzen 7}}, {{amd|Ryzen 9}}, and {{amd|Ryzen Threadripper}} processors. For servers, Zen is branded as {{amd|EPYC}}. | ||
Line 73: | Line 58: | ||
[[File:amd zen2-3 roadmap.png|thumb|right|Zen 2 on the roadmap]] | [[File:amd zen2-3 roadmap.png|thumb|right|Zen 2 on the roadmap]] | ||
− | |||
{| class="wikitable" | {| class="wikitable" | ||
|- | |- | ||
Line 85: | Line 69: | ||
|- | |- | ||
| {{amd|Renoir|l=core}} || Up to 8/16 || Mainstream APUs with {{\\|Vega}} GPUs | | {{amd|Renoir|l=core}} || Up to 8/16 || Mainstream APUs with {{\\|Vega}} GPUs | ||
− | |||
− | |||
− | |||
− | |||
|- | |- | ||
− | + | | {{amd|Dali|l=core}} || Up to 4/8? || Low-power/Cost-sensitive embedded processors with {{\\|Vega}} GPU | |
− | | | ||
− | | | ||
− | | | ||
− | | | ||
|} | |} | ||
Line 110: | Line 86: | ||
! colspan="11" | Mainstream | ! colspan="11" | Mainstream | ||
|- | |- | ||
− | | [[File:amd ryzen 3 logo.png|75px|link=Ryzen 3]] || {{amd|Ryzen 3}} || Entry level Performance || | + | | [[File:amd ryzen 3 logo.png|75px|link=Ryzen 3]] || {{amd|Ryzen 3}} || Entry level Performance || colspan="8" | ? |
|- | |- | ||
− | | [[File:amd ryzen 5 logo.png|75px|link=Ryzen 5]] || {{amd|Ryzen 5}} || Mid-range Performance || [[hexa-core|Hexa]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk| | + | | [[File:amd ryzen 5 logo.png|75px|link=Ryzen 5]] || {{amd|Ryzen 5}} || Mid-range Performance || [[hexa-core|Hexa]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}} |
|- | |- | ||
− | | [[File:amd ryzen 7 logo.png|75px|link=Ryzen 7]] || {{amd|Ryzen 7}} || High-end Performance || [[octa-core|Octa]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk| | + | | [[File:amd ryzen 7 logo.png|75px|link=Ryzen 7]] || {{amd|Ryzen 7}} || High-end Performance || [[octa-core|Octa]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}} |
|- | |- | ||
− | | | + | | || {{amd|Ryzen 9}} || High-end Performance || [[12 cores|12]]-[[16 cores|16]] || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|no}} || {{tchk|yes}} || {{tchk|no}} |
|- | |- | ||
! colspan="10" | Enthusiasts / Workstations | ! colspan="10" | Enthusiasts / Workstations | ||
Line 131: | Line 107: | ||
! colspan="10" | Embedded / Edge | ! colspan="10" | Embedded / Edge | ||
|- | |- | ||
− | | [[File:epyc embedded logo.png|75px|link=amd/epyc embedded]] || {{amd|EPYC Embedded}} || Embedded / Edge Server Processor || | + | | [[File:epyc embedded logo.png|75px|link=amd/epyc embedded]] || {{amd|EPYC Embedded}} || Embedded / Edge Server Processor || colspan="8" | ? |
|- | |- | ||
− | | [[File:ryzen embedded logo.png|75px|link=amd/ryzen embedded]] || {{amd|Ryzen Embedded}} || Embedded APUs || | + | | [[File:ryzen embedded logo.png|75px|link=amd/ryzen embedded]] || {{amd|Ryzen Embedded}} || Embedded APUs || colspan="8" | ? |
|} | |} | ||
− | |||
− | |||
== Process technology == | == Process technology == | ||
Zen 2 comprises multiple different components: | Zen 2 comprises multiple different components: | ||
− | * The Core | + | * The Core Chiplet Die (CCD) is fabricated on [[TSMC]] [[7 nm process|7 nm HPC process]] |
* The client I/O Die (cIOD) is fabricated on [[GlobalFoundries]] [[12 nm process]] | * The client I/O Die (cIOD) is fabricated on [[GlobalFoundries]] [[12 nm process]] | ||
* The server I/O Die (sIOD) is fabricated on [[GlobalFoundries]] [[14 nm process]] | * The server I/O Die (sIOD) is fabricated on [[GlobalFoundries]] [[14 nm process]] | ||
Line 163: | Line 137: | ||
− | + | === Key changes from {{\\|Zen+}} === | |
* [[7 nm process]] (from [[12 nm]]) | * [[7 nm process]] (from [[12 nm]]) | ||
** I/O die utilizes [[12 nm]] | ** I/O die utilizes [[12 nm]] | ||
Line 176: | Line 150: | ||
*** Increased dispatch bandwidth | *** Increased dispatch bandwidth | ||
** Back-end | ** Back-end | ||
− | + | *** Increased retire bandwidth (??-wide, up from 8-wide) | |
*** FPU | *** FPU | ||
**** 2x wider datapath (256-bit, up from 128-bit) | **** 2x wider datapath (256-bit, up from 128-bit) | ||
Line 215: | Line 189: | ||
* <code>{{x86|RDPID}}</code> - Read Processor ID | * <code>{{x86|RDPID}}</code> - Read Processor ID | ||
* <code>{{x86|RDPRU}}</code> - Read Processor Register | * <code>{{x86|RDPRU}}</code> - Read Processor Register | ||
− | |||
− | |||
− | |||
=== Block Diagram === | === Block Diagram === | ||
Line 250: | Line 221: | ||
** L3 Cache: | ** L3 Cache: | ||
*** Matisse, Castle Peak, Rome: 16 MiB/CCX, shared across all cores | *** Matisse, Castle Peak, Rome: 16 MiB/CCX, shared across all cores | ||
− | *** | + | *** TBD: 4 MiB/CCX, shared across all cores |
*** 16-way set associative | *** 16-way set associative | ||
**** 16,384 sets, 64 B line size | **** 16,384 sets, 64 B line size | ||
Line 293: | Line 264: | ||
==== Branch Prediction Unit ==== | ==== Branch Prediction Unit ==== | ||
− | The branch prediction unit guides instruction fetching and attempts to predict branches and their target to avoid pipeline stalls or the pursuit of incorrect execution paths. The Zen 2 BPU almost doubles the branch target buffer capacity, doubles the size of the indirect target array, and introduces a TAGE predictor. According to AMD it exhibits a 30% lower misprediction rate than its | + | The branch prediction unit guides instruction fetching and attempts to predict branches and their target to avoid pipeline stalls or the pursuit of incorrect execution paths. The Zen 2 BPU almost doubles the branch target buffer capacity, doubles the size of the indirect target array, and introduces a TAGE predictor. According to AMD it exhibits a 30% lower misprediction rate than its counterpart in the {{\\|Zen}}/{{\\|Zen+}} microarchitecture. |
Once per cycle the next address logic determines if branch instructions have been identified in the current 64-byte instruction fetch block, and if so, consults several branch prediction facilities about the most likely target and calculates a new fetch block address. If no branches are expected it calculates the address of the next sequential block. Branches are evaluated much later in the integer execution unit which provides the actual branch outcome to redirect instruction fetching and refine the predictions. The dispatch unit can also cause redirects to handle mispredictions and exceptions. | Once per cycle the next address logic determines if branch instructions have been identified in the current 64-byte instruction fetch block, and if so, consults several branch prediction facilities about the most likely target and calculates a new fetch block address. If no branches are expected it calculates the address of the next sequential block. Branches are evaluated much later in the integer execution unit which provides the actual branch outcome to redirect instruction fetching and refine the predictions. The dispatch unit can also cause redirects to handle mispredictions and exceptions. | ||
Line 338: | Line 309: | ||
They are prefetched from addresses generated by the branch prediction unit and prepared in the prediction queue. | They are prefetched from addresses generated by the branch prediction unit and prepared in the prediction queue. | ||
− | A 20-entry instruction byte queue decouples the fetch and decode units. Each entry holds 16 instruction bytes aligned on a 16-byte boundary. 10 entries are available to each thread in SMT mode | + | A 20-entry instruction byte queue decouples the fetch and decode units. Each entry holds 16 instruction bytes aligned on a 16-byte boundary. 10 entries are available to each |
+ | thread in SMT mode. | ||
An align stage scans two 16-byte windows per cycle to determine the boundaries of up to four x86 instructions. The length of x86 instructions is variable and ranges from 1 to | An align stage scans two 16-byte windows per cycle to determine the boundaries of up to four x86 instructions. The length of x86 instructions is variable and ranges from 1 to | ||
15 bytes. Only the first slot can pick instructions longer than 8 bytes. There is no penalty for decoding instructions with many prefixes in AMD Family 16h and later processors. | 15 bytes. Only the first slot can pick instructions longer than 8 bytes. There is no penalty for decoding instructions with many prefixes in AMD Family 16h and later processors. | ||
− | In another pipeline stage | + | In another pipeline stage the instruction decoder converts up to four x86 instructions per cycle to macro-ops. According to AMD the instruction decoder can send up to four |
+ | instructions per cycle to the op cache and micro-op queue. This has to encompass at least four, and various sources suggest no more than four, macro-ops. | ||
A macro-op is a fixed length, uniform representation of (usually) one x86 instruction, comprising an ALU and/or a memory operation. The latter can be a load, a store, or a load | A macro-op is a fixed length, uniform representation of (usually) one x86 instruction, comprising an ALU and/or a memory operation. The latter can be a load, a store, or a load | ||
Line 370: | Line 343: | ||
=== Execution Engine === | === Execution Engine === | ||
AMD stated that both the dispatch bandwidth and the retire bandwidth has been increased. | AMD stated that both the dispatch bandwidth and the retire bandwidth has been increased. | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
==== Floating Point Unit ==== | ==== Floating Point Unit ==== | ||
Line 398: | Line 352: | ||
Zen 2 doubles the width of the physical registers, execution units, and data paths to 256 bits. The L1 data cache bandwidth was doubled to match. The number of micro-ops issued by the FP scheduler remains four, implying most AVX-256 instructions decode to a single macro-op which conserves queue entries and reduces pressure on RCU and scheduling resources. AMD did not disclose how the FPU was restructured. Die shots suggest two execution blocks splitting the PRF and FP ALUs, one operating on the lower 128 bits of a YMM register, executing x87, MMX, SSE, and AVX instructions, the other on the upper 128 bits for AVX-256 instructions. This improvement doubles the peak throughput of AVX-256 instructions to four per cycle, or in other words, up to 32 [[FLOPs]]/cycle in single precision or up to 16 [[FLOPs]]/cycle in double precision. Another improvement reduces the latency of double-precision vector multiplications from 4 to 3 cycles, equal to the latency of single-precision multiplications. The latency of fused multiply-add (FMA) instructions remains 5 cycles. | Zen 2 doubles the width of the physical registers, execution units, and data paths to 256 bits. The L1 data cache bandwidth was doubled to match. The number of micro-ops issued by the FP scheduler remains four, implying most AVX-256 instructions decode to a single macro-op which conserves queue entries and reduces pressure on RCU and scheduling resources. AMD did not disclose how the FPU was restructured. Die shots suggest two execution blocks splitting the PRF and FP ALUs, one operating on the lower 128 bits of a YMM register, executing x87, MMX, SSE, and AVX instructions, the other on the upper 128 bits for AVX-256 instructions. This improvement doubles the peak throughput of AVX-256 instructions to four per cycle, or in other words, up to 32 [[FLOPs]]/cycle in single precision or up to 16 [[FLOPs]]/cycle in double precision. Another improvement reduces the latency of double-precision vector multiplications from 4 to 3 cycles, equal to the latency of single-precision multiplications. The latency of fused multiply-add (FMA) instructions remains 5 cycles. | ||
− | The rename unit receives up to four macro-ops per cycle from dispatch. It maps x87, MMX, SSE, and AVX architectural registers and temporary registers used by microcoded instructions to physical registers | + | The rename unit receives up to four macro-ops per cycle from dispatch. It maps x87, MMX, SSE, and AVX architectural registers and temporary registers used by microcoded instructions to physical registers. MMX and x87 registers occupy the lowest 64 or 80 bits of a PR. The <i>Zen/Zen+</i> rename unit allocates two 128-bit PRs for each YMM register. Only one PR is needed for SSE and AVX-128 instructions, the upper half of the destination YMM register in another PR remains unchanged or is zeroed, respectively, which consumes no execution resources. (SSE instructions behave this way for compatibility with legacy software which is unaware of, and does not preserve, the upper half of YMM registers through library calls or interrupts.) Zen 2 allocates a single 256-bit PR and tracks in the register allocation table (RAT) if the upper half of the YMM register was zeroed. This necessitates register merging when SSE and AVX instructions are mixed and the upper half of the YMM register contains non-zero data. To avoid this the AVX ISA exposes an SSE mode where the FPU maintains the upper half of YMM registers separately. Zen 2 handles transitions between the SSE and AVX mode by microcode which takes approximately 100 cycles in either direction. Zeroing the upper half of all YMM registers with the VZEROUPPER or VZEROALL instruction before executing SSE instructions prevents the transition. |
Zen 2 inherits the move elimination and XMM register merge optimizations from its predecessors. Register to register moves are performed by renaming the destination register and do not occupy any scheduling or execution resources. XMM register merging occurs when SSE instructions such as SQRTSS leave the upper 64 or 96 bits of the destination register unchanged, causing a dependency on previous instructions writing to this register. AMD family 15h and later processors can eliminate this dependency in a sequence of scalar FP instructions by recording in the RAT if those bits were zeroed. By setting a Z-bit in the RAT the rename unit also tracks if an architectural register was completely zeroed. All-zero registers are not mapped, the zero data bits are injected at the bypass network which conserves power and PRF entries, and allows for more instructions in flight. Earlier AMD microarchitectures, apparently including Zen/Zen+, recognize zeroing idioms such as XORPS combining a register with itself and eliminate the dependency on the source register. AMD did not disclose if or how this is implemented in Zen 2. Family 16h processors recognize zeroing idioms in the instruction decode unit and set the Z-bit on the destination register in the floating point rename unit, completing the operation without consuming scheduling or execution resources. | Zen 2 inherits the move elimination and XMM register merge optimizations from its predecessors. Register to register moves are performed by renaming the destination register and do not occupy any scheduling or execution resources. XMM register merging occurs when SSE instructions such as SQRTSS leave the upper 64 or 96 bits of the destination register unchanged, causing a dependency on previous instructions writing to this register. AMD family 15h and later processors can eliminate this dependency in a sequence of scalar FP instructions by recording in the RAT if those bits were zeroed. By setting a Z-bit in the RAT the rename unit also tracks if an architectural register was completely zeroed. All-zero registers are not mapped, the zero data bits are injected at the bypass network which conserves power and PRF entries, and allows for more instructions in flight. Earlier AMD microarchitectures, apparently including Zen/Zen+, recognize zeroing idioms such as XORPS combining a register with itself and eliminate the dependency on the source register. AMD did not disclose if or how this is implemented in Zen 2. Family 16h processors recognize zeroing idioms in the instruction decode unit and set the Z-bit on the destination register in the floating point rename unit, completing the operation without consuming scheduling or execution resources. | ||
− | As in the Zen/Zen+ microarchitecture the 64-entry non-scheduling queue decouples dispatch and the FP scheduler. This allows dispatch to send operations to the integer side, in particular to expedite floating point loads and store address calculations, while the FP scheduler, whose capacity cannot be arbitrarily increased for complexity reasons, is busy with higher latency FP operations. The 36-entry out-of-order scheduler issues up to four micro-ops per cycle to the execution pipelines. A 160-entry physical register file holds the speculative and committed contents of architectural and temporary registers. The PRF has 8 read ports and 4 write ports for ALU results, each 256 bits wide in Zen 2, and two additional write ports supporting up to two 256-bit load operations per cycle, up from two 128-bit loads in Zen | + | As in the Zen/Zen+ microarchitecture the 64-entry non-scheduling queue decouples dispatch and the FP scheduler. This allows dispatch to send operations to the integer side, in particular to expedite floating point loads and store address calculations, while the FP scheduler, whose capacity cannot be arbitrarily increased for complexity reasons, is busy with higher latency FP operations. The 36-entry out-of-order scheduler issues up to four micro-ops per cycle to the execution pipelines. A 160-entry physical register file holds the speculative and committed contents of architectural and temporary registers. The PRF has 8 read ports and 4 write ports for ALU results, each 256 bits wide in Zen 2, and two additional write ports supporting up to two 256-bit load operations per cycle, up from two 128-bit loads in Zen. The FPU is capable of <i>superforwarding</i> load data to dependent instructions through the bypass network, obviating a write and read of the PRF. The bypass network enables back-to-back execution of dependent instructions by forwarding results from the FP ALUs to the previous pipeline stage. |
The floating point pipelines are asymmetric, each supporting a different set of operations, and the ALUs are grouped in domains, to conserve die space and reduce signal path lengths which permits higher clock frequencies. The number of execution resources available reflects the density of different instruction types in x86 code. Each pipe receives two operands from the PRF or bypass network. Pipe 0 and 1 support | The floating point pipelines are asymmetric, each supporting a different set of operations, and the ALUs are grouped in domains, to conserve die space and reduce signal path lengths which permits higher clock frequencies. The number of execution resources available reflects the density of different instruction types in x86 code. Each pipe receives two operands from the PRF or bypass network. Pipe 0 and 1 support | ||
Line 410: | Line 364: | ||
Same as Zen/Zen+ the Zen 2 FPU handles denormal floating-point values natively, this can still incur a small penalty in some instances (MUL/DIV/SQRT). | Same as Zen/Zen+ the Zen 2 FPU handles denormal floating-point values natively, this can still incur a small penalty in some instances (MUL/DIV/SQRT). | ||
− | + | === Load-Store Unit === | |
The load-store unit handles memory reads and writes. The width of data paths and buffers doubled from 128 bits in the {{\\|Zen}}/{{\\|Zen+}} microarchitecture to 256 bits. | The load-store unit handles memory reads and writes. The width of data paths and buffers doubled from 128 bits in the {{\\|Zen}}/{{\\|Zen+}} microarchitecture to 256 bits. | ||
Line 420: | Line 374: | ||
A two-level translation lookaside buffer (TLB) assists load and store address translation. The fully-associative L1 data TLB contains 64 entries and holds 4-Kbyte, 2-Mbyte, and 1-Gbyte page table entries. The L2 data TLB is a unified 12-way set-associative cache with 2048 entries, up from 1536 entries in Zen, holding 4-Kbyte and 2-Mbyte page table entries, as well as page directory entries (PDEs) to speed up DTLB and ITLB table walks. 1-Gbyte pages are <i>smashed</i> into 2-Mbyte entries but installed as 1-Gbyte entries when reloaded into the L1 TLB. | A two-level translation lookaside buffer (TLB) assists load and store address translation. The fully-associative L1 data TLB contains 64 entries and holds 4-Kbyte, 2-Mbyte, and 1-Gbyte page table entries. The L2 data TLB is a unified 12-way set-associative cache with 2048 entries, up from 1536 entries in Zen, holding 4-Kbyte and 2-Mbyte page table entries, as well as page directory entries (PDEs) to speed up DTLB and ITLB table walks. 1-Gbyte pages are <i>smashed</i> into 2-Mbyte entries but installed as 1-Gbyte entries when reloaded into the L1 TLB. | ||
− | Two hardware page table walkers handle L2 TLB misses, presumably one serving the DTLB, another the ITLB. In addition to the PDE entries, the table walkers include a 64-entry page directory cache which holds page-map-level-4 and page-directory-pointer entries speeding up DTLB and ITLB walks. | + | Two hardware page table walkers handle L2 TLB misses, presumably one serving the DTLB, another the ITLB. In addition to the PDE entries, the table walkers include a 64-entry page directory cache which holds page-map-level-4 and page-directory-pointer entries speeding up DTLB and ITLB walks. The Zen/Zen+ microarchitecture supports page table entry (PTE) coalescing; this feature is probably also present in Zen 2. When the table walker loads a PTE, which occupies 8 bytes in the x86-64 architecture, from memory it also examines the seven other PTEs in the same 64-byte cache line. If they assign a contiguous 32 KiB block of physical pages with the same attributes to the 8 virtual pages, they are coalesced into a single TLB entry greatly improving the efficiency of this cache. |
The LS unit relies on a 32 KiB, 8-way set associative, write-back, ECC-protected L1 data cache (DC). It supports two loads, if they access different DC banks, and one store per cycle, each up to 256 bits wide. The line width is 64 bytes, however cache stores are aligned to a 32-byte boundary. Loads spanning a 64-byte boundary and stores spanning a 32-byte boundary incur a penalty of one cycle. In the Zen/Zen+ microarchitecture 256-bit vectors are loaded and stored as two 128-bit halves; the load and store boundaries are 32 and 16 bytes respectively. Zen 2 can load and store 256-bit vectors in a single operation, but stores must be 32-byte aligned now to avoid the penalty. As in Zen aligned and unaligned load and store instructions (for example MOVUPS/MOVAPS) provide identical performance. | The LS unit relies on a 32 KiB, 8-way set associative, write-back, ECC-protected L1 data cache (DC). It supports two loads, if they access different DC banks, and one store per cycle, each up to 256 bits wide. The line width is 64 bytes, however cache stores are aligned to a 32-byte boundary. Loads spanning a 64-byte boundary and stores spanning a 32-byte boundary incur a penalty of one cycle. In the Zen/Zen+ microarchitecture 256-bit vectors are loaded and stored as two 128-bit halves; the load and store boundaries are 32 and 16 bytes respectively. Zen 2 can load and store 256-bit vectors in a single operation, but stores must be 32-byte aligned now to avoid the penalty. As in Zen aligned and unaligned load and store instructions (for example MOVUPS/MOVAPS) provide identical performance. | ||
Line 430: | Line 384: | ||
The LS unit supports memory type range register (MTRR) and the page attribute table (PAT) extensions. Write-combining, if enabled for a memory range, merges multiple stores targeting locations within the address range of a write buffer to reduce memory bus utilization. Write-combining is also used for non-temporal store instructions such as MOVNTI. The LS unit can gather writes from 8 different 64-byte cache lines. | The LS unit supports memory type range register (MTRR) and the page attribute table (PAT) extensions. Write-combining, if enabled for a memory range, merges multiple stores targeting locations within the address range of a write buffer to reduce memory bus utilization. Write-combining is also used for non-temporal store instructions such as MOVNTI. The LS unit can gather writes from 8 different 64-byte cache lines. | ||
− | Each core benefits from a private 512 KiB, 8-way set associative, write-back, ECC-protected L2 cache. The line width is 64 bytes. The data path between the L1 data or instruction cache and the L2 cache is 32 bytes wide. The L2 cache has a variable load-to-use latency of no less than 12 cycles | + | Each core benefits from a private 512 KiB, 8-way set associative, write-back, ECC-protected L2 cache. The line width is 64 bytes. The data path between the L1 data or instruction cache and the L2 cache is 32 bytes wide. The L2 cache has a variable load-to-use latency of no less than 12 cycles. |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
== Rome == | == Rome == | ||
Line 455: | Line 390: | ||
The centralized I/O die incorporates eight {{amd|Infinity Fabric}} links, 128 [[PCIe]] Gen 4 lanes, and eight [[DDR4]] memory channels. The full capabilities of the I/O have not been disclosed yet. Attached to the I/O die are eight compute dies - each with eight Zen 2 core - for a total of 64 cores and 128 threads per chip. | The centralized I/O die incorporates eight {{amd|Infinity Fabric}} links, 128 [[PCIe]] Gen 4 lanes, and eight [[DDR4]] memory channels. The full capabilities of the I/O have not been disclosed yet. Attached to the I/O die are eight compute dies - each with eight Zen 2 core - for a total of 64 cores and 128 threads per chip. | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
== All Zen 2 Chips == | == All Zen 2 Chips == | ||
− | <!-- NOTE: | + | <!-- NOTE: |
This table is generated automatically from the data in the actual articles. | This table is generated automatically from the data in the actual articles. | ||
If a microprocessor is missing from the list, an appropriate article for it needs to be | If a microprocessor is missing from the list, an appropriate article for it needs to be | ||
Line 564: | Line 462: | ||
</table> | </table> | ||
{{comp table end}} | {{comp table end}} | ||
− | |||
− | |||
− | |||
− | |||
== Bibliography == | == Bibliography == | ||
Line 575: | Line 469: | ||
* AMD 'Next Horizon', November 6, 2018 | * AMD 'Next Horizon', November 6, 2018 | ||
* AMD. Lisa Su ''Keynote''. May 26, 2019 | * AMD. Lisa Su ''Keynote''. May 26, 2019 | ||
− | + | * AMD 'Next Horizon Gaming' event at E3, June 10, 2019 | |
− | |||
== See Also == | == See Also == | ||
* Intel {{intel|Ice Lake|l=arch}} | * Intel {{intel|Ice Lake|l=arch}} |
Facts about "Zen 2 - Microarchitectures - AMD"
codename | Zen 2 + |
core count | 4 +, 6 +, 8 +, 12 +, 16 +, 24 +, 32 + and 64 + |
designer | AMD + |
first launched | July 2019 + |
full page name | amd/microarchitectures/zen 2 + |
instance of | microarchitecture + |
instruction set architecture | x86-64 + |
manufacturer | TSMC + and GlobalFoundries + |
microarchitecture type | CPU + |
name | Zen 2 + |
pipeline stages | 19 + |