 
! Core !! Extended<br>Family !! Family !! Extended<br>Model !! Model !! Stepping
|-
| rowspan="2" | Centaur CNS || 0 || 0x6 || 0x4 || 0x7 || 0x2
|-
| colspan="5" | Family 6 Model 71 Stepping 2
|}
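The display values in the last row follow the standard x86 CPUID composition rules rather than anything Centaur-specific. A minimal sketch of the computation, using the field values from the table above:

<syntaxhighlight lang="python">
# Standard x86 CPUID convention: the extended model field extends the
# model for family 0x6 and 0xF parts; the extended family only applies
# when the base family is 0xF. Field values are CHA's, from the table.
ext_family, family, ext_model, model, stepping = 0x0, 0x6, 0x4, 0x7, 0x2

display_family = family + ext_family if family == 0xF else family
display_model = (ext_model << 4) | model if family in (0x6, 0xF) else model

print(f"Family {display_family} Model {display_model} Stepping {stepping}")
# Family 6 Model 71 Stepping 2
</syntaxhighlight>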
== Architecture ==
  
 
=== Instruction set ===
The CHA SoC integrates up to eight cores, each featuring the {{x86|x86-64}} [[ISA]] along with the following {{x86|extensions}}:
  
 
{| class="wikitable"
 
== Overview ==
 
[[File:cha soc overview.svg|thumb|right|CHA Overview]]
Announced in 2019 and expected to be introduced in 2020, '''CHA''' (pronounced ''C-H-A'') is a new ground-up [[x86]] SoC designed by [[Centaur]] for the server, edge, and AI markets. Fabricated on the TSMC [[16 nm process]], the chip integrates eight high-performance [[x86]] "CNS" cores along with a clean-sheet, high-performance "NCORE" [[neural processor]]. This is the first server x86 chip to integrate an AI [[accelerator]]. The integrated NPU is designed to reduce platform cost by offering an AI inference coprocessor "free" on-die alongside the standard server-class x86 cores. For many workloads, this accelerator removes the need for a third-party PCIe-based [[accelerator card]] unless considerably higher performance is required.

The CHA SoC features new CNS cores which deliver considerably higher [[single-thread performance]] than prior Centaur designs. The cores also introduce the {{x86|AVX-512}} extension in order to offer better performance, more flexibility, and better ISA compatibility with other [[x86]] vendors such as Intel.

CHA is a fully integrated SoC. It incorporates both the [[south bridge]] and [[north bridge]] on-die. The chip supports up to quad-channel [[DDR4 memory]] and up to 44 PCIe Gen 3 lanes, and the southbridge provides all the usual legacy I/O functionality. CHA also supports directly linking to a second CHA SoC in a 2-way [[multiprocessing]] configuration. All the cores, along with the NCORE, the southbridge, and the memory controller, are [[ring interconnect|interconnected on a ring]].
  
 
== CNS Core ==
 
Each cycle, up to 32 bytes (half a line) of the instruction stream are fetched from the [[instruction cache]] into the instruction pre-decode queue. Since [[x86]] instructions may range from a single byte to 15 bytes, this buffer receives an unstructured byte stream which is then marked at instruction boundaries. In addition to marking instruction boundaries, the pre-decode stage also performs various prefix processing. From the pre-decode queue, up to five individual instructions are fed into the formatted instruction queue (FIQ).
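As a rough illustration of the boundary-marking step, the toy sketch below walks one 32-byte fetch window given already-known instruction lengths. It is not Centaur's implementation; real pre-decode must derive each length from the prefix and opcode bytes themselves:

<syntaxhighlight lang="python">
# Toy model of pre-decode boundary marking. Lengths are supplied directly;
# real x86 pre-decode computes them from prefixes, opcode, ModRM, SIB,
# displacement, and immediate bytes.
def mark_boundaries(lengths, window=32):
    """Start offsets of instructions inside one 32-byte fetch window."""
    marks, pos = [], 0
    for n in lengths:       # each x86 instruction is 1 to 15 bytes
        if pos >= window:   # the remainder spills into the next fetch
            break
        marks.append(pos)
        pos += n
    return marks

print(mark_boundaries([1, 3, 2, 5, 15, 4, 2, 7]))
# [0, 1, 4, 6, 11, 26, 30]
</syntaxhighlight>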
  
Prior to being sent to decode, the FIQ has the ability to do [[macro-op fusion]]. CNS can detect certain pairs of adjacent instructions, such as a simple arithmetic operation followed by a conditional jump, and couple them together so that they are decoded at the same time into a single fused operation. This capability was improved further in the new CNS core.
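Conceptually, the fusion check examines adjacent queue entries and couples qualifying pairs. The sketch below illustrates the idea; the exact set of instruction pairs CNS fuses has not been published, so the fusible sets here are illustrative assumptions:

<syntaxhighlight lang="python">
# Illustrative sketch of macro-op fusion in the FIQ. The fusible pairs
# below are assumptions for demonstration, not Centaur's actual rules.
ARITH = {"add", "sub", "and", "cmp", "test"}          # assumed fusible ALU ops
COND_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}  # conditional jumps

def fuse(instructions):
    """Couple adjacent arithmetic + conditional-jump pairs into one entry."""
    fused, i = [], 0
    while i < len(instructions):
        if (i + 1 < len(instructions)
                and instructions[i].split()[0] in ARITH
                and instructions[i + 1].split()[0] in COND_JUMPS):
            fused.append(instructions[i] + " ; " + instructions[i + 1])
            i += 2                      # both decode as one fused operation
        else:
            fused.append(instructions[i])
            i += 1
    return fused

print(fuse(["add eax, 1", "cmp eax, 10", "jne loop"]))
# ['add eax, 1', 'cmp eax, 10 ; jne loop']
</syntaxhighlight>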
  
 
[[File:cns decode.svg|right|500px]]
 
==== Neural processing unit (NPU) ====
 
Each cycle, the neural processing unit reads data out of one or two of the four registers in the neural data unit. Alternatively, input data can also be moved from one neural register to the next. This is designed to efficiently handle fully connected neural networks. The neural processing unit (NPU) performs various computations such as MAC operations, shifting, and min/max, along with various other functions designed to add flexibility in preparation for future AI functionality. There is also extensive support for predication, with eight predication registers. The unit is optimized for 8-bit integers (9-bit calculations) but can also operate on [[16-bit integers]] as well as [[bfloat16]]. Wider [[data types]] allow for higher precision but incur a latency penalty: 8-bit operations complete in a single cycle while 16-bit integer and floating-point operations require three cycles. The neural processing unit incorporates a 4K-entry, 32-bit accumulator which can operate in both 32-bit integer and 32-bit [[floating-point]] modes. The accumulator saturates on overflow to prevent wrap-around (e.g., from the largest positive value to the largest negative value). Following the millions or billions of repeated MAC operations, the output is sent to the output unit for post-processing.
 
<table class="wikitable">
<tr><th colspan="4">Peak Compute</th></tr>
<tr><th>Data Type</th><td>[[Int8]]</td><td>[[Int16]]</td><td>[[bfloat16]]</td></tr>
<tr><th>MACs/cycle</th><td>4,096</td><td>682.67</td><td>682.67</td></tr>
<tr><th>Peak OPs</th><td>20.5 [[TOPS]]</td><td>3.42 [[TOPS]]</td><td>3.42 [[TFLOPS]]</td></tr>
<tr><th>Frequency</th><td colspan="3" style="text-align: center">2.5 GHz</td></tr>
</table>
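The table's figures follow directly from the unit counts and cycle times described above. Below is a minimal sketch of the saturating accumulation and the peak-rate arithmetic; the 16-bit derivation assumes each 16-bit operand occupies two 8-bit lanes, which is an inference from the three-cycle latency, not a published breakdown:

<syntaxhighlight lang="python">
# Minimal sketch, not Centaur's RTL: a 32-bit accumulator that clamps on
# overflow instead of wrapping, plus the peak-rate arithmetic. A MAC
# counts as two operations (multiply + add).
INT32_MAX, INT32_MIN = 2**31 - 1, -(2**31)

def mac_saturate(acc, a, b):
    """One multiply-accumulate into a saturating 32-bit accumulator."""
    return max(INT32_MIN, min(INT32_MAX, acc + a * b))

acc = mac_saturate(INT32_MAX - 5, 127, 127)   # would overflow; clamps
print(acc == INT32_MAX)                       # True

# Int8: 4,096 MACs/cycle x 2 ops x 2.5 GHz = 20.48 TOPS (table: 20.5).
print(4096 * 2 * 2.5e9 / 1e12)                # 20.48
# 16-bit: assuming 2 lanes/operand and 3 cycles/op, 4096 / 6 = 682.67
# MACs/cycle, i.e. ~3.41 TOPS (table: 3.42).
print(4096 / 6 * 2 * 2.5e9 / 1e12)            # ~3.41
</syntaxhighlight>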
 
  
 
==== Output unit ====
  
 
=== Communication ===
There are a number of ways to communicate with the NCORE. The individual CNS cores can directly read and write to the NCORE using the [[virtual address space]] of the process (e.g., <code>open()</code>). AVX-512 <code>mov</code> operations can also be used. The cores can also read the control and status registers. In turn, the NCORE can interrupt back to the cores for follow-up post-processing. The two [[DMA controllers]] in the NCORE are also capable of communicating with the cache slices in the cores, the DRAM controllers, and, optionally, other PCIe I/O devices.
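As a hedged illustration of the memory-mapped path, the sketch below maps a page of NCORE registers into a process and accesses it with ordinary loads and stores. The device node, offsets, and register meanings are hypothetical; Centaur has not published the NCORE driver interface:

<syntaxhighlight lang="python">
import mmap
import os

# Hypothetical sketch: map one page of NCORE registers into the process's
# virtual address space. "/dev/ncore0" and the register offsets are
# invented for illustration; the real driver interface is not public.
fd = os.open("/dev/ncore0", os.O_RDWR | os.O_SYNC)
regs = mmap.mmap(fd, 4096)                    # MMIO page, now plain memory

regs[0:4] = (1).to_bytes(4, "little")         # e.g., write a control word
status = int.from_bytes(regs[4:8], "little")  # e.g., poll a status register

regs.close()
os.close(fd)
</syntaxhighlight>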
  
 
=== Instructions ===
[[Instructions]] are 128 bits wide and execute in one clock cycle (including 0-cycle branches). Most instructions typically require 64-80 bits (roughly one half to three quarters of the full width). Detailed definitions of the instructions are not made public: they are highly hardware-dependent by design, which simplifies the hardware and extracts additional power efficiency. To that end, the instructions will likely change with new hardware versions.
  
 
* 30b: control of 2 RAM read & index operations
 
== Die ==
 
=== SoC ===
* [[TSMC]] [[16 nm process]] (16FFC)
* 194 mm²
  
:[[File:centaur cha soc die (2).png|500px|class=wikichip_ogimage]]
:[[File:centaur cha soc die (2) annotated.png|500px]]
:[[File:cha soc.png|500px]]
 
=== Core group ===
: ~63.2 mm²
:[[File:cha core group 2.png|400px]]
:[[File:cha core group.png|400px]]
 
  
 
==== CNS Core ====
: ~4.29 mm²
:[[File:cha cns core die 2.png|250px]]
:[[File:cha cns core die.png|250px]]
  
 
=== NCORE ===
 
<div style="float: left;">[[File:cha soc ncore.png|300px]]</div>
<div style="float: left;">[[File:cha soc ncore (2).png|300px]]</div>
<div style="float: left;">[[File:cha soc ncore 3.png|300px]]</div>
</div>
  
