=== Back-end ===
 
The back-end deals with the [[out-of-order]] execution of instructions. CNS introduces major improvements to the back-end over prior generations. From the front-end, micro-operations are fetched from the micro-op queue, which decouples the front-end from the back-end. Each cycle, up to four instructions can be renamed (and later retired), an increase over the previous 3-wide [[instruction rename|rename]]. The widening of rename and retire matches the decode rate of the front-end. Once renamed, micro-operations are sent to the scheduler. Fused operations remain fused throughout the pipeline and retire fused; retirement therefore has a maximum bandwidth of five x86 instructions per cycle.
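As a rough illustration of why retirement can exceed the 4-wide rename in x86 terms, consider the compare-and-branch pair that ends a typical loop, the classic x86 fusion candidate (whether CNS fuses exactly this pattern is an assumption; the sketch only shows how five x86 instructions can fit in four fused slots):

<source lang="c">
/* A minimal sketch, not CNS-specific code. The loop bottom compiles to
 * roughly: add-from-memory, increment, and a cmp+jne pair. If cmp+jne
 * fuse into one micro-op, the pair occupies a single rename/retire slot,
 * letting more x86 instructions retire per cycle than the 4-wide
 * machine width alone would suggest. */
long sum_array(const long *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
</source>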
  
Prior Centaur chips were manufactured on relatively old [[process nodes]] such as [[65 nm]] and later [[45 nm]]. The move to a more leading-edge node ([[TSMC]] [[16-nanometer]] [[FinFET]], in this case) provided a significantly higher transistor budget, which Centaur takes advantage of to build a wider out-of-order core. To that end, the CNS core supports 192 out-of-order instructions in flight, identical to both {{intel|Haswell|Intel Haswell|l=arch}} and {{amd|Zen|AMD Zen|l=arch}}.
  
 
==== Execution ports ====
 
 
CNS incorporates three dedicated ports for [[floating-point]] and vector operations. Two of the ports support [[fused-multiply-add|FMA operations]] while the third houses the divide and crypto units. All three pipes are 256 bits wide. In terms of raw compute power, each core can execute 16 double-precision [[FLOPS|FLOPs]] per cycle – reaching parity with AMD {{amd|Zen 2|l=arch}} as well as Intel's {{intel|Haswell|l=arch}}, {{intel|Broadwell|l=arch}}, and {{intel|Skylake (Client)|l=arch}}.
 
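The peak rate follows directly from the port configuration: each of the two FMA pipes is 256 bits wide and thus holds four 64-bit doubles, and each FMA counts as two floating-point operations:

<math>2\ \mathrm{FMA\ ports} \times \frac{256\ \mathrm{bits}}{64\ \mathrm{bits/double}} \times 2\ \mathrm{\frac{FLOPs}{FMA}} = 16\ \mathrm{DP\ FLOPs/cycle}</math>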
  
CNS added extensive [[x86]] ISA support, including new support for {{x86|AVX-512}}. CNS supports all the AVX-512 extensions found in Intel's {{intel|Skylake (Server)|l=arch}} as well as those in {{intel|Palm Cove|l=arch}}. From an implementation point of view, the CNS core's vector lanes are 256 bits wide; AVX-512 operations are therefore cracked into two 256-bit operations which are then scheduled independently. In other words, there is no throughput advantage over 256-bit operations. The design is similar to how AMD handled 256-bit AVX operations in the original {{amd|Zen core|l=arch}}, where they had to be executed as two 128-bit operations. Note that the implementation of AVX-512 on CNS usually exhibits no downclocking; the core is designed to operate at the full frequency of the core and the rest of the SoC. Centaur does implement a power management engine capable of downclocking certain power-sensitive SKUs if necessary.
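In practice the cracking is invisible at the ISA level. The sketch below uses standard AVX-512 and AVX2 intrinsics (portable code, not Centaur-specific); on CNS, the single 512-bit FMA would match the throughput of the two explicit 256-bit halves:

<source lang="c">
#include <immintrin.h>

/* One 512-bit FMA instruction; on CNS it is cracked into two
 * independently scheduled 256-bit micro-ops. */
__m512d fma512(__m512d a, __m512d b, __m512d c) {
    return _mm512_fmadd_pd(a, b, c);
}

/* The two 256-bit halves issued explicitly, roughly what the cracked
 * form looks like internally (illustrative, not the actual micro-ops). */
void fma2x256(const __m256d a[2], const __m256d b[2], const __m256d c[2],
              __m256d r[2]) {
    r[0] = _mm256_fmadd_pd(a[0], b[0], c[0]);
    r[1] = _mm256_fmadd_pd(a[1], b[1], c[1]);
}
</source>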
  
 
=== Memory subsystem ===
 
 
[[File:cns mem subsys.svg|right|450px]]
 
The memory subsystem on CNS features three ports - two generic AGUs and one store AGU. In other words, two of the ports can dispatch either loads or store addresses, while the third dispatches store addresses only. CNS supports 116 memory operations in flight: the memory order buffer (MOB) consists of a 72-entry load buffer and a 44-entry store buffer. Store data is produced by the execution units, which forward it to the store buffer. Up to four values may be forwarded to store each cycle - any combination of up to four integer values and two media values.
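One way to picture the port split is a kernel that needs two loads and one store per element, such as the STREAM-style triad below (a generic example, not Centaur code), which maps onto the two load-capable AGUs plus the store-address port:

<source lang="c">
/* Each iteration performs two loads (b[i], c[i]) and one store (a[i]),
 * matching two load/store-address AGUs plus one store-address AGU,
 * so the address-generation ports are fully occupied at one element
 * per cycle. */
void triad(double *a, const double *b, const double *c, long n) {
    for (long i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
</source>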
  
CNS features a [[level 1 data cache]] with a capacity of 32 KiB, organized as 8 ways of 64 sets. It is fully multi-ported, capable of supporting two reads and one write every cycle. Each port is 32 bytes wide; 512-bit memory operations, like their arithmetic counterparts, therefore have to be cracked into two 256-bit operations. Using both load ports, CNS can perform one 512-bit load each cycle. Note that although there is a single write port, if two 256-bit stores are consecutive stores to a single [[cache line]], they can both be committed in a single cycle.
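The organization accounts for the capacity: 64 sets × 8 ways × 64-byte lines = 32 KiB (the 64-byte line size is implied by the capacity and organization). A sketch of the resulting address split:

<source lang="c">
#include <stdint.h>

/* For a cache with 64-byte lines and 64 sets: bits [5:0] are the byte
 * offset within the line and bits [11:6] select one of the 64 sets. */
static inline unsigned l1d_set_index(uint64_t addr) {
    return (addr >> 6) & 0x3F;
}
</source>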
 
 
Lines to the L1 data cache as well as the L1 instruction cache come from the [[level 2 cache]]. Each core features a private [[level 2 cache]] with a capacity of 256 KiB which is 16-way set associative. Each cycle, up to 512 bits may be transferred from the L2 to either of the L1 caches.
 
  
 
== NCORE NPU ==
 
