From WikiChip
Editing amd/microarchitectures/zen 2

Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.

This page supports semantic in-text annotations (e.g. "[[Is specified as::World Heritage Site]]") to build structured and queryable content provided by Semantic MediaWiki. For a comprehensive description on how to use annotations or the #ask parser function, please have a look at the getting started, in-text annotation, or inline queries help pages.

Latest revision Your text
Line 396: Line 396:
 
In the Zen/Zen+ microarchitecture the floating point physical registers, execution units, and data paths are 128 bits wide. For efficiency AVX-256 instructions which perform the same operation on the 128-bit upper and lower half of a YMM register are decoded into two macro-ops which pass through the FPU individually as execution resources become available and retire together. Accordingly the peak throughput is four SSE/AVX-128 instructions or two AVX-256 instructions per cycle.
 
In the Zen/Zen+ microarchitecture the floating point physical registers, execution units, and data paths are 128 bits wide. For efficiency AVX-256 instructions which perform the same operation on the 128-bit upper and lower half of a YMM register are decoded into two macro-ops which pass through the FPU individually as execution resources become available and retire together. Accordingly the peak throughput is four SSE/AVX-128 instructions or two AVX-256 instructions per cycle.
  
Zen 2 doubles the width of the physical registers, execution units, and data paths to 256 bits. The L1 data cache bandwidth was doubled to match. The number of micro-ops issued by the FP scheduler remains four, implying most AVX-256 instructions decode to a single macro-op which conserves queue entries and reduces pressure on RCU and scheduling resources. AMD did not disclose how the FPU was restructured. Die shots suggest two execution blocks splitting the PRF and FP ALUs, one operating on the lower 128 bits of a YMM register, executing x87, MMX, SSE, and AVX instructions, the other on the upper 128 bits for AVX-256 instructions. This improvement doubles the peak throughput of AVX-256 instructions to four per cycle, or in other words, up to 32 [[FLOPs]]/cycle in single precision or up to 16 [[FLOPs]]/cycle in double precision. Another improvement reduces the latency of double-precision vector multiplications from 4 to 3 cycles, equal to the latency of single-precision multiplications. The latency of fused multiply-add (FMA) instructions remains 5 cycles.
+
Zen 2 doubles the width of the physical registers, execution units, and data paths to 256 bits. The L1 data cache bandwidth was doubled to match. The number of micro-ops issued by the FP scheduler remains four, implying most AVX-256 instructions decode to a single macro-op which conserves queue entries and reduces pressure on RCU and scheduling resources. AMD did not disclose how the FPU was restructured. Die shots suggest two execution blocks splitting the PRF and FP ALUs, one operating on the lower 128 bits of a YMM register, executing x87, MMX, SSE, and AVX instructions, the other on the upper 128 bits for AVX-256 instructions. This improvement doubles the peak throughput of AVX-256 instructions to four per cycle, or in other words, up to 32 [[FLOPs]]/cycle in single precision or up to 16 [[FLOPs]]/cycle in double precision. (Citation needed - This would mean the 16 core 3950X would have a theoretical 2.2 TFLOPs of FP32 compute at 4.3 GHZ) Another improvement reduces the latency of double-precision vector multiplications from 4 to 3 cycles, equal to the latency of single-precision multiplications. The latency of fused multiply-add (FMA) instructions remains 5 cycles.
  
 
The rename unit receives up to four macro-ops per cycle from dispatch. It maps x87, MMX, SSE, and AVX architectural registers and temporary registers used by microcoded instructions to physical registers. The floating point control/status register is renamed as well. MMX and x87 registers occupy the lowest 64 or 80 bits of a PR. The <i>Zen/Zen+</i> rename unit allocates two 128-bit PRs for each YMM register. Only one PR is needed for SSE and AVX-128 instructions, the upper half of the destination YMM register in another PR remains unchanged or is zeroed, respectively, which consumes no execution resources. (SSE instructions behave this way for compatibility with legacy software which is unaware of, and does not preserve, the upper half of YMM registers through library calls or interrupts.) Zen 2 allocates a single 256-bit PR and tracks in the register allocation table (RAT) if the upper half of the YMM register was zeroed. This necessitates register merging when SSE and AVX instructions are mixed and the upper half of the YMM register contains non-zero data. To avoid this the AVX ISA exposes an SSE mode where the FPU maintains the upper half of YMM registers separately. Zen 2 handles transitions between the SSE and AVX mode by microcode which takes approximately 100 cycles in either direction. Zeroing the upper half of all YMM registers with the VZEROUPPER or VZEROALL instruction before executing SSE instructions prevents the transition.
 
The rename unit receives up to four macro-ops per cycle from dispatch. It maps x87, MMX, SSE, and AVX architectural registers and temporary registers used by microcoded instructions to physical registers. The floating point control/status register is renamed as well. MMX and x87 registers occupy the lowest 64 or 80 bits of a PR. The <i>Zen/Zen+</i> rename unit allocates two 128-bit PRs for each YMM register. Only one PR is needed for SSE and AVX-128 instructions, the upper half of the destination YMM register in another PR remains unchanged or is zeroed, respectively, which consumes no execution resources. (SSE instructions behave this way for compatibility with legacy software which is unaware of, and does not preserve, the upper half of YMM registers through library calls or interrupts.) Zen 2 allocates a single 256-bit PR and tracks in the register allocation table (RAT) if the upper half of the YMM register was zeroed. This necessitates register merging when SSE and AVX instructions are mixed and the upper half of the YMM register contains non-zero data. To avoid this the AVX ISA exposes an SSE mode where the FPU maintains the upper half of YMM registers separately. Zen 2 handles transitions between the SSE and AVX mode by microcode which takes approximately 100 cycles in either direction. Zeroing the upper half of all YMM registers with the VZEROUPPER or VZEROALL instruction before executing SSE instructions prevents the transition.

Please note that all contributions to WikiChip may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see WikiChip:Copyrights for details). Do not submit copyrighted work without permission!

Cancel | Editing help (opens in new window)
codenameZen 2 +
core count4 +, 6 +, 8 +, 12 +, 16 +, 24 +, 32 + and 64 +
designerAMD +
first launchedJuly 2019 +
full page nameamd/microarchitectures/zen 2 +
instance ofmicroarchitecture +
instruction set architecturex86-64 +
manufacturerTSMC + and GlobalFoundries +
microarchitecture typeCPU +
nameZen 2 +
pipeline stages19 +