From WikiChip
Editing nec/microarchitectures/sx-aurora
Latest revision | Your text | ||
Line 25: | Line 25: | ||
|predecessor link=nec/microarchitectures/sx-ace | |predecessor link=nec/microarchitectures/sx-ace | ||
}} | }} | ||
− | '''SX-Aurora''' is | + | '''SX-Aurora''' is [[NEC]]'s successor to the {{\\|SX-ACE}}, a [[16 nm]] microarchitecture for [[vector processors]] first introduced in [[2018]]. |
== History == | == History == | ||
Line 32: | Line 32: | ||
== Architecture == | == Architecture == | ||
=== Key changes from {{\\|SX-ACE}} === | === Key changes from {{\\|SX-ACE}} === | ||
− | |||
− | |||
* [[16 nm process]] (from [[28 nm]]) | * [[16 nm process]] (from [[28 nm]]) | ||
* 1.6x frequency (1.6 GHz, up from 1 GHz) | * 1.6x frequency (1.6 GHz, up from 1 GHz) | ||
Line 42: | Line 40: | ||
** 3x [[FLOPs]]/cycle (192 FLOPs/cycle, up from 64 FLOPs/cycle) | ** 3x [[FLOPs]]/cycle (192 FLOPs/cycle, up from 64 FLOPs/cycle) | ||
* Memory | * Memory | ||
− | ** | + | ** 16 MiB L3 [[LLC]] |
− | |||
** 6x [[HBM2]] (from 12x [[DDR3]]) | ** 6x [[HBM2]] (from 12x [[DDR3]]) | ||
*** 4.7x memory bandwidth (1.2 TB/s, up from 256 GB/s) | *** 4.7x memory bandwidth (1.2 TB/s, up from 256 GB/s) | ||
Line 50: | Line 47: | ||
== Block Diagram == | == Block Diagram == | ||
=== Entire SoC === | === Entire SoC === | ||
− | :[[File:sx-aurora block diagram.svg| | + | :[[File:sx-aurora block diagram.svg|700px]] |
=== Vector core === | === Vector core === | ||
− | :[[File:sx-aurora vector core block diagram.svg| | + | :[[File:sx-aurora vector core block diagram.svg|1200px]] |
== Memory Hierarchy == | == Memory Hierarchy == | ||
Line 80: | Line 77: | ||
== Overview == | == Overview == | ||
− | [[File:sx-aurora overview.svg|thumb|right| | + | [[File:sx-aurora overview.svg|thumb|right|400px|Overview of the SX-Aurora]] |
The SX-Aurora is [[NEC]]'s successor to the {{\\|SX-ACE}}, a [[vector processor]] designed for [[high-performance]] scientific/research applications and supercomputers. The SX-Aurora deviates from all prior chips in the kind of markets it's designed to address. Therefore, NEC made slightly different design choices compared to prior generations of vector processors. In an attempt to broaden its market, NEC extended beyond supercomputers to the conventional server and workstation market. This is done through the use of [[PCIe]]-based [[accelerator cards]]. | The SX-Aurora is [[NEC]]'s successor to the {{\\|SX-ACE}}, a [[vector processor]] designed for [[high-performance]] scientific/research applications and supercomputers. The SX-Aurora deviates from all prior chips in the kind of markets it's designed to address. Therefore, NEC made slightly different design choices compared to prior generations of vector processors. In an attempt to broaden its market, NEC extended beyond supercomputers to the conventional server and workstation market. This is done through the use of [[PCIe]]-based [[accelerator cards]]. | ||
Line 107: | Line 104: | ||
=== Vector processing unit === | === Vector processing unit === | ||
[[File:sx-aurora-vpu.svg|thumb|right|vector processing unit (VPU) and 32 VPPs|400px]] | [[File:sx-aurora-vpu.svg|thumb|right|vector processing unit (VPU) and 32 VPPs|400px]] | ||
− | The bulk of the compute work is done on the vector processing unit (VPU). The VPU has a fairly simple pipeline, though it does | + | The bulk of the compute work is done on the vector processing unit (VPU). The VPU has a fairly simple pipeline, though it does employ [[out-of-order scheduling]]. [[Instructions]] issued by the SPU are sent to the [[instruction buffer]] where they await renaming, reordering, and scheduling. NEC renames the 64 vector registers (VRs) into 256 physical registers. There is support for enhanced preloading, and renaming avoids [[WAR]]/[[WAW]] dependencies. Scheduling is relatively simple. There is a dedicated pipeline for complex operations. Operations such as vector summation, division, and mask [[population count]] are sent to this execution unit. The dedicated execution unit for complex operations is there to prevent stalls due to the high latency involved in those operations. |
− | The majority of the operations are handled by the vector parallel pipeline (VPP). The SX-Aurora doubles the number of VPPs per VPU from the | + | The majority of the operations are handled by the vector parallel pipeline (VPP). The SX-Aurora doubles the number of VPPs per VPU from the SX-ACE. Each VPU now has 32 VPPs, all identical. Note that all of the control logic described above is outside of the VPP, which is a relatively simple block of vector execution. The VPP has an eight-port vector register, 16 mask registers, six execution pipes, and a set of forwarding logic between them. |
− | The six execution pipes include three [[floating-point]] pipes, two integer [[ALU]]s, and a complex and store pipe for data output. Note that ALU1 and the Store pipe share the same read ports. Likewise, FMA2 and ALU0 share a read port. All in all, the effective number of pipelines executing each cycle is actually four. Compared to the | + | The six execution pipes include three [[floating-point]] pipes, two integer [[ALU]]s, and a complex and store pipe for data output. Note that ALU1 and the Store pipe share the same read ports. Likewise, FMA2 and ALU0 share a read port. All in all, the effective number of pipelines executing each cycle is actually four. Compared to the SX-ACE, the SX-Aurora now has one extra FMA unit per VPP. |
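The renaming step described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the 64 architectural and 256 physical register counts come from the text, while the <code>RenameTable</code> class, its free list, and the identity initial mapping are hypothetical and are not taken from NEC documentation. It shows why giving every write a fresh physical register removes WAR and WAW hazards.

```python
from collections import deque

NUM_ARCH = 64    # architectural vector registers (from the text)
NUM_PHYS = 256   # physical registers (from the text)


class RenameTable:
    """Toy rename stage; a hypothetical structure, not NEC's actual design."""

    def __init__(self):
        # Assumption: architectural register i starts mapped to physical register i.
        self.mapping = list(range(NUM_ARCH))
        # Remaining physical registers are free. Reclaiming registers at
        # retirement is not modeled here.
        self.free = deque(range(NUM_ARCH, NUM_PHYS))

    def rename(self, dst, srcs):
        """Return (physical dst, physical srcs) for one instruction."""
        # Sources are looked up before the destination is remapped, so an
        # instruction that reads and writes the same register sees the old value.
        phys_srcs = [self.mapping[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register; a real pipeline would stall")
        # Every write gets a fresh physical register, so it can never clobber
        # a value that an older instruction still has to read (WAR) or write (WAW).
        phys_dst = self.free.popleft()
        self.mapping[dst] = phys_dst
        return phys_dst, phys_srcs


rt = RenameTable()
first = rt.rename(dst=1, srcs=[2, 3])   # v1 = v2 op v3
second = rt.rename(dst=2, srcs=[1])     # writes v2, which `first` reads (WAR)
third = rt.rename(dst=1, srcs=[4])      # writes v1 again (WAW with `first`)
print(first, second, third)
```

In this model the second and third instructions write to physical registers that differ from anything the first instruction touches, so only the true (read-after-write) dependency of the second instruction on the first remains.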
The peak theoretical performance that can be achieved is 3 FMAs per VPP per cycle. With 32 VPPs per VPU, there are a total of 96 FMAs/cycle for a total of 192 DP FLOPs/cycle. With a peak frequency of 1.6 GHz for the SX-Aurora Tsubasa vector processor, each VPU has a peak performance of 307.2 [[gigaFLOPS]]. Each FMA can perform operations on packed data types. That is, single-precision floating-point throughput is doubled by packing two 32-bit elements, for a peak performance of 614.4 [[gigaFLOPS]]. | The peak theoretical performance that can be achieved is 3 FMAs per VPP per cycle. With 32 VPPs per VPU, there are a total of 96 FMAs/cycle for a total of 192 DP FLOPs/cycle. With a peak frequency of 1.6 GHz for the SX-Aurora Tsubasa vector processor, each VPU has a peak performance of 307.2 [[gigaFLOPS]]. Each FMA can perform operations on packed data types. That is, single-precision floating-point throughput is doubled by packing two 32-bit elements, for a peak performance of 614.4 [[gigaFLOPS]]. | ||
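The peak-performance arithmetic above can be reproduced with a short calculation. This is a minimal sketch: the figures (3 FMAs per VPP, 32 VPPs per VPU, 1.6 GHz) come from the text, while the constant names and the <code>peak_gflops</code> helper are illustrative only. Counting each FMA as two floating-point operations (one multiply, one add) is the usual convention and matches the 96 FMAs to 192 FLOPs step in the text.

```python
FMAS_PER_VPP = 3     # FMA pipes per VPP (from the text)
VPPS_PER_VPU = 32    # VPPs per VPU (from the text)
FLOPS_PER_FMA = 2    # one multiply plus one add
FREQ_GHZ = 1.6       # peak frequency (from the text)


def peak_gflops(packed_elems: int = 1) -> float:
    """Peak GFLOPS of one VPU.

    packed_elems=1 models double precision; packed_elems=2 models two
    32-bit elements packed into each operand (single precision).
    """
    fmas_per_cycle = FMAS_PER_VPP * VPPS_PER_VPU              # 96
    flops_per_cycle = fmas_per_cycle * FLOPS_PER_FMA * packed_elems
    return flops_per_cycle * FREQ_GHZ                         # GHz * FLOPs/cycle = GFLOPS


print(f"DP: {peak_gflops():.1f} GFLOPS")    # 307.2
print(f"SP: {peak_gflops(2):.1f} GFLOPS")   # 614.4
```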
== Memory subsystem == | == Memory subsystem == | ||
− | + | {{empty section}} | |
− | + | :[[File:sx-aurora memory subsystem.svg|thumb|700px|center|SX-Aurora Memory Subsystem.]] | |
− | |||
− | [[File:sx-aurora memory subsystem.svg|700px|center | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |- | ||
− | |||
− | |||
− | |||
− | |||
== Mesh interconnect == | == Mesh interconnect == | ||
− | + | {{empty section}} | |
− | + | :[[File:sx-aurora 2d 16-layer mesh.svg|thumb|900px|center|SX-Aurora utilizes a 16-layer 2D mesh.]] | |
− | |||
− | [[File:sx-aurora 2d 16-layer mesh.svg|900px|center | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
== Package == | == Package == | ||
Line 164: | Line 141: | ||
:[[File:sx-aurora-package-xsection.svg|800px]] | :[[File:sx-aurora-package-xsection.svg|800px]] | ||
== Vector engine (VE) card == | == Vector engine (VE) card == | ||
− | + | {{empty section}} | |
== Die == | == Die == | ||
Line 216: | Line 152: | ||
== Bibliography == | == Bibliography == | ||
− | * {{ | + | * {{hcbib|30}} |
* Supercomputing 2018, NEC Aurora Forum | * Supercomputing 2018, NEC Aurora Forum | ||
− | |||
* ''Some information was obtained directly from NEC'' | * ''Some information was obtained directly from NEC'' |
Facts about "SX-Aurora - Microarchitectures - NEC"
codename | SX-Aurora
core count | 8
designer | NEC
first launched | 2018
full page name | nec/microarchitectures/sx-aurora
instance of | microarchitecture
manufacturer | TSMC
name | SX-Aurora
pipeline stages | 8