From WikiChip
 
Maintaining a high [[bytes Per FLOP]] ratio is important for vector operations that rely on large data sets. With over five times the FLOPS per core, the SX-Aurora had to significantly improve the memory subsystem to keep workloads from becoming memory-bound, which would prevent them from reaching the chip's peak compute performance. The {{\\|SX-Ace}} reached 256 GB/s of [[memory bandwidth]] using a whopping 16 channels of DDR3 memory, and scaling that DDR3 configuration any further was not a viable way to achieve a sufficiently large bandwidth improvement. For this reason, NEC opted to use [[HBM2]] memory instead. The SX-Aurora has six HBM2 modules delivering 1.22 TB/s of bandwidth, a nearly 5-fold improvement over the {{\\|SX-Ace}}. However, because peak compute grew even faster than bandwidth, the SX-Aurora achieves only 0.5 [[bytes/FLOPs]], half that of the {{\\|SX-Ace}}.
 
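The relationship between these figures can be checked with a short back-of-the-envelope Python sketch. Peak FLOPS values are not quoted above, so the sketch back-calculates them from the stated bandwidths and bytes-per-FLOP ratios. It assumes that "half of the SX-Ace" means the SX-Ace had 1.0 byte/FLOP, and it writes 1.22 TB/s as 1220 GB/s. The resulting peaks are implied values, not official specifications.

```python
# Back-of-the-envelope machine-balance check for the figures quoted above.
# Bandwidths are in GB/s (1.22 TB/s written as 1220 GB/s).
SX_ACE_BW_GBS = 256.0          # 16 channels of DDR3
SX_AURORA_BW_GBS = 1220.0      # six HBM2 modules
SX_AURORA_BYTES_PER_FLOP = 0.5
# Assumption: "half of the SX-Ace" read as the SX-Ace having 1.0 byte/FLOP.
SX_ACE_BYTES_PER_FLOP = 2 * SX_AURORA_BYTES_PER_FLOP


def bytes_per_flop(bandwidth_gbs: float, peak_gflops: float) -> float:
    """Bytes of memory bandwidth available per floating-point operation."""
    return bandwidth_gbs / peak_gflops


def implied_peak_gflops(bandwidth_gbs: float, balance: float) -> float:
    """Back out a per-chip peak GFLOPS from bandwidth and a bytes/FLOP ratio."""
    return bandwidth_gbs / balance


bandwidth_gain = SX_AURORA_BW_GBS / SX_ACE_BW_GBS
ace_peak = implied_peak_gflops(SX_ACE_BW_GBS, SX_ACE_BYTES_PER_FLOP)
aurora_peak = implied_peak_gflops(SX_AURORA_BW_GBS, SX_AURORA_BYTES_PER_FLOP)
compute_gain = aurora_peak / ace_peak

print(f"Bandwidth gain: {bandwidth_gain:.2f}x")  # 4.77x
print(f"Implied peak: {ace_peak:.0f} -> {aurora_peak:.0f} GFLOPS ({compute_gain:.2f}x)")
print(f"SX-Aurora balance: {bytes_per_flop(SX_AURORA_BW_GBS, aurora_peak):.2f} B/FLOP")
```

Because machine balance is bandwidth divided by compute, halving it while bandwidth grows by roughly 4.8x implies that per-chip peak compute grew by roughly 9.5x.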
  
The SX-Aurora got rid of the 1 MiB assignable data buffer (ADB) of the {{\\|SX-Ace}} and instead added a memory-side cache designed to avoid snoop traffic. It's worth pointing out that this new LLC retains an ADB-like feature whereby the priority of a [[cache line]] is controlled via a flag in vector memory access instructions. The cache is sliced into eight 2 MiB chunks, each consisting of 16 [[memory banks]], for a total of 16 MiB and 128 memory banks. The LLC is [[inclusive]] of both the [[L1]] and [[L2]] caches. The [[last level cache|LLC]] interfaces with the IMC at 200 GB/s per chunk (1600 GB/s, or 1.6 TB/s, in total), and the IMC in turn provides 1.22 TB/s of memory bandwidth through its six HBM2 modules.
  
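The totals implied by the per-chunk figures can be tallied with a few lines of Python. All constants come from the text (with 1.22 TB/s written as 1220 GB/s). The final ratio is only an aggregate comparison; it is an assumption-free division of the quoted numbers and says nothing about how traffic is actually distributed across the slices.

```python
# Tally of the LLC figures quoted above.
MIB = 1024 * 1024

LLC_SLICES = 8                  # eight chunks
SLICE_CAPACITY_BYTES = 2 * MIB  # 2 MiB per chunk
BANKS_PER_SLICE = 16
SLICE_TO_IMC_GBS = 200.0        # per-chunk LLC <-> IMC bandwidth
HBM2_BW_GBS = 1220.0            # 1.22 TB/s from six HBM2 modules

total_llc_mib = LLC_SLICES * SLICE_CAPACITY_BYTES // MIB
total_banks = LLC_SLICES * BANKS_PER_SLICE
llc_to_imc_gbs = LLC_SLICES * SLICE_TO_IMC_GBS
# Aggregate ratio only; it ignores how traffic is spread over the slices.
llc_over_hbm = llc_to_imc_gbs / HBM2_BW_GBS

print(f"LLC: {total_llc_mib} MiB in {total_banks} banks")
print(f"LLC-IMC: {llc_to_imc_gbs:.0f} GB/s ({llc_to_imc_gbs / 1000:.1f} TB/s)")
print(f"LLC-IMC / HBM2 bandwidth: {llc_over_hbm:.2f}")
```

On these figures alone, the aggregate 1.6 TB/s between the LLC slices and the IMC exceeds the 1.22 TB/s the HBM2 modules can supply.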
 
[[File:sx-aurora memory subsystem.svg|700px|center]]
