From WikiChip
Latest revision as of 06:26, 19 July 2019

Matrix-2000
[Image: matrix-2000 (front).png. Matrix-2000, package front]

General Info
Designer: NUDT
Model Number: Matrix-2000
Introduction: 2015 (announced), 2017 (launched)

General Specs
Frequency: 1,200 MHz

Microarchitecture
Technology: CMOS
Word Size: 64 bit
Cores: 128
Threads: 128

Electrical
Power dissipation: 240 W

Packaging
Package: FCCLGA-4201 (CLGA)
Dimension: 66 mm x 66 mm
Contacts: 4201

[Image: Matrix-2000 ceramic LGA package, back side]

Matrix-2000 (MT-2000) is a 64-bit 128-core many-core processor designed by NUDT and introduced in 2017. The chip was designed exclusively as an accelerator for China's Tianhe-2 supercomputer, to upgrade and replace Intel's aging Knights Corner accelerators after the Obama administration banned the sale of high-performance accelerators to China. The Matrix-2000 features 128 RISC cores operating at 1.2 GHz, achieving a peak of 2.46 / 4.92 TFLOPS (DP/SP) with a peak power dissipation of 240 W.

The Matrix-2000 is said to be fabricated on a leading-edge process technology in China, although the exact process was not disclosed.

Overview

The original TianHe-2 (Milkyway-2) was powered by 16,000 servers built around Intel's Xeon processors and Xeon Phi accelerators. Those accelerators were based on Knights Corner. Originally, NUDT announced that the supercomputer would be upgraded to Intel's then-latest Xeon Phi accelerators, Knights Landing. The upgraded supercomputer was renamed 'TianHe-2A'. In February 2015, under the Obama administration, the Department of Commerce (DoC) blacklisted NSCC-G (the site of the TianHe-2A) and NUDT, as well as the previous supercomputer center. The DoC cited concerns regarding nuclear explosive devices and related computer research and simulations.

Chuck Mulloy, an Intel spokesperson, later gave the following statement:

Intel was informed in August by the U.S Department of Commerce that an export license was required for the shipment of Xeon and Xeon Phi parts for use in specific previously disclosed supercomputer projects with Chinese customer INSPUR. Intel complied with the notification and applied for the license which was denied. We are in compliance with the U.S. law.

Due to the ban, NUDT was unable to obtain the Xeon Phis it had hoped to use to upgrade the system. Although the Matrix-2000 was already being planned before the ban, NUDT accelerated its development in order to achieve the desired upgrade without the embargoed parts. While not nearly as powerful as Knights Landing, the chips were more powerful than the first-generation Knights Corner parts they replaced. The original Knights Landing (KL) system was planned to exceed 110 PFLOPS using the Intel parts; the Matrix-2000-based system achieved 94.97 PFLOPS.

Architecture

The Matrix-2000 consists of 128 cores, eight DDR4 memory channels, and 16 PCIe lanes (x16). The chip is organized as four SuperNodes (SN) of 32 cores each. The cores operate at 1.2 GHz, and the chip has a peak power dissipation of 240 W.

matrix-2000.svg

NoC

Four SuperNodes make up the chip. Each SN features three Fast Interconnect Transport (FIT) links. A FIT is a point-to-point interconnect with a bidirectional bandwidth of 25.6 GB/s per link and a reported round-trip delay of roughly 20 ns. Each FIT includes a cyclic redundancy check (CRC) and a retry mechanism to ensure correct transmission. Each of an SN's three FIT ports connects to one of the other three SNs, so every pair of SNs is directly linked. The Matrix-2000 supports a DMA mode to improve FIT link bandwidth utilization, with a reported utilization of 93.8% in that mode.
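The FIT topology described above can be sketched in a few lines of Python. This is only an illustration built from the figures quoted in this section: the SuperNode numbering (0 to 3) is hypothetical, and treating the 93.8% utilization as a simple multiplier on the 25.6 GB/s per-link figure is an assumption, not something the sources spell out.

```python
from itertools import combinations

NUM_SUPERNODES = 4        # SuperNodes per chip (from the text)
LINK_BW_GBPS = 25.6       # bidirectional bandwidth per FIT link, GB/s (from the text)
DMA_UTILIZATION = 0.938   # reported link utilization in DMA mode (from the text)

# Every pair of SuperNodes is joined by one point-to-point FIT link.
# The 0..3 numbering is hypothetical and only used for illustration.
links = list(combinations(range(NUM_SUPERNODES), 2))

# Count how many links terminate at each SuperNode.
links_per_sn = {
    sn: sum(sn in link for link in links) for sn in range(NUM_SUPERNODES)
}

# Assumption: utilization scales the per-link bandwidth linearly.
effective_link_bw = LINK_BW_GBPS * DMA_UTILIZATION

print(f"FIT links on the chip: {len(links)}")
print(f"FIT links per SuperNode: {links_per_sn}")
print(f"Usable bandwidth per link in DMA mode: {effective_link_bw:.2f} GB/s")
```

With four SuperNodes and three ports each, the pairing works out to six links in total, which is consistent with each SN reaching every other SN in a single hop.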


matrix-2000 supernode connections.svg

SuperNode (SN)

The network on a chip (NoC) of each SuperNode is implemented as a 4 by 2 mesh topology, for a total of 8 CPU clusters. Each cluster consists of a router, a directory control unit (DCU), 4 CPU cores and a shared cache. Attached to each SuperNode are two DDR4 memory controllers at opposite ends. With 4 cores per cluster and 8 clusters per SuperNode, there are a total of 32 cores per SN. Cache-coherence compliance is handled by the core.

Routing is done via the router in each cluster. The router has four communication channels: Response, Request, Snoop, and Acknowledgement. Each channel is 128 bits wide.
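As a rough illustration of the SuperNode layout, the following Python sketch enumerates the clusters of a 4 by 2 mesh and derives the core counts and the combined router channel width. The (x, y) coordinates and the assumption that each cluster links only to its horizontal and vertical neighbours are hypothetical conventions chosen here for a plain mesh; the sources do not give the actual cluster numbering or routing algorithm.

```python
MESH_COLS, MESH_ROWS = 4, 2   # 4 by 2 mesh of CPU clusters (from the text)
CORES_PER_CLUSTER = 4         # from the text
SUPERNODES_PER_CHIP = 4       # from the text
CHANNELS = ("Response", "Request", "Snoop", "Acknowledgement")
CHANNEL_WIDTH_BITS = 128      # from the text

# Hypothetical (x, y) coordinates for the 8 clusters of one SuperNode.
clusters = [(x, y) for y in range(MESH_ROWS) for x in range(MESH_COLS)]


def mesh_neighbors(x, y):
    """Return the clusters directly adjacent to (x, y) in a plain 2D mesh."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [
        (cx, cy)
        for cx, cy in candidates
        if 0 <= cx < MESH_COLS and 0 <= cy < MESH_ROWS
    ]


cores_per_sn = len(clusters) * CORES_PER_CLUSTER
cores_per_chip = cores_per_sn * SUPERNODES_PER_CHIP
router_width_bits = len(CHANNELS) * CHANNEL_WIDTH_BITS

# Each mesh link is seen from both of its endpoints, so halve the sum.
mesh_links = sum(len(mesh_neighbors(x, y)) for x, y in clusters) // 2

print(f"Clusters per SN: {len(clusters)}")
print(f"Cores per SN: {cores_per_sn}, cores per chip: {cores_per_chip}")
print(f"Combined channel width per router port: {router_width_bits} bits")
print(f"Router-to-router links in the mesh: {mesh_links}")
```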


matrix-2000 supernode.svg


Core

Each core is a reduced instruction set computer (RISC) featuring an in-order pipeline with 8 to 12 stages. The core incorporates an extended 256-bit vector instruction set architecture along with two 256-bit vector processing units (VPU). Each core is capable of performing 16 double-precision floating point operations each cycle.

Operating at 1.2 GHz, each core has a peak performance of 19.2 GFLOPS (1.2 GHz * 16 FLOP/cycle). With 32 such cores in each SuperNode, the peak performance of each SN is 614.4 GFLOPS. Likewise, with four SNs per chip, the peak chip performance is 2.458 TFLOPS double-precision or 4.916 TFLOPS single-precision.
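The peak-performance arithmetic above can be reproduced with a short Python sketch. The constants come from this section; the only assumption is that single-precision throughput is exactly twice the double-precision throughput, which is implied by the 4.916 TFLOPS figure rather than stated as a per-cycle rate.

```python
CLOCK_GHZ = 1.2            # core clock (from the text)
DP_FLOP_PER_CYCLE = 16     # double-precision FLOP per core per cycle (from the text)
CORES_PER_SN = 32          # from the text
SN_PER_CHIP = 4            # from the text
SP_TO_DP_RATIO = 2         # assumption implied by the quoted SP figure

core_dp_gflops = CLOCK_GHZ * DP_FLOP_PER_CYCLE          # per core
sn_dp_gflops = core_dp_gflops * CORES_PER_SN            # per SuperNode
chip_dp_tflops = sn_dp_gflops * SN_PER_CHIP / 1000      # per chip, DP
chip_sp_tflops = chip_dp_tflops * SP_TO_DP_RATIO        # per chip, SP

print(f"Per core:      {core_dp_gflops:.1f} GFLOPS (DP)")
print(f"Per SuperNode: {sn_dp_gflops:.1f} GFLOPS (DP)")
print(f"Per chip:      {chip_dp_tflops:.3f} TFLOPS (DP), "
      f"{chip_sp_tflops:.3f} TFLOPS (SP)")
```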

It's worth noting that the core fully supports IEEE-standard double-precision and single-precision (64/32-bit FP) arithmetic, but not half-precision (16-bit).

Memory controller

The Matrix-2000 supports eight channels of DDR4 operating at 2,400 MT/s distributed among the SuperNodes with 2 controllers per SN.

Integrated Memory Controller
Max Type: DDR4-2400
Supports ECC: Yes
Controllers: 8
Channels: 8
Width: 64 bit
Max Bandwidth: 143.1 GiB/s (153.6 GB/s)

Bandwidth by channel count
Single: 17.88 GiB/s
Double: 35.76 GiB/s
Quad: 71.53 GiB/s
Octa: 143.1 GiB/s
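The memory bandwidth figures follow from the transfer rate, the channel width, and the channel count. The Python sketch below redoes that calculation under the usual assumption that each 64-bit channel moves 8 bytes per transfer at the full 2,400 MT/s; the function name is hypothetical.

```python
TRANSFER_RATE_MTS = 2400   # DDR4-2400, mega-transfers per second (from the text)
CHANNEL_WIDTH_BITS = 64    # per-channel data width (from the text)
MAX_CHANNELS = 8           # from the text


def bandwidth_bytes_per_s(channels):
    """Theoretical peak bandwidth in bytes/s for the given number of channels."""
    bytes_per_transfer = CHANNEL_WIDTH_BITS // 8
    return TRANSFER_RATE_MTS * 1_000_000 * bytes_per_transfer * channels


for name, channels in (("Single", 1), ("Double", 2), ("Quad", 4), ("Octa", 8)):
    bw = bandwidth_bytes_per_s(channels)
    print(f"{name:<6} {bw / 1e9:6.1f} GB/s = {bw / 2**30:6.2f} GiB/s")
```

Under these assumptions the eight-channel peak is 153.6 GB/s, which is about 143.05 GiB/s and rounds to the 143.1 GiB/s quoted for the chip.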

References

  • Third International High Performance Computing Forum 2017 (IHPCF2017)
  • REPORT ON THE TIANHE-2A SYSTEM, Tech Report No. ICL-UT-17-04, Jack Dongarra, University of Tennessee, Knoxville, Oak Ridge National Laboratory, September 24 2017