* Performance
** [[IPC]] uplift ([[Intel]] self-reported an average 18-20% IPC gain across proxy benchmarks such as [[SPEC CPU2006]]/[[SPEC CPU2017]])
* Front-end
** 1.5x larger µOP cache (2,304 entries, up from 1,536)
** Improved [[branch predictor]]
** ITLB
*** 2x 2 MiB page entries (16 entries, up from 8)
** Larger IDQ (70 µOPs, up from 64)
** LSD can detect up to 70 µOP loops (up from 64)
* Back-end
** Wider allocation (5-way, up from 4-way)
** Delivery throughput remains 6 µOPs per cycle, same as Skylake
** Wider decode with an additional simple decoder (4 simple + 1 complex in Sunny Cove's 5-way decoder, up from 3 simple + 1 complex in Skylake's 4-way decoder)
** 1.6x larger ROB (352 entries, up from 224)
** Scheduler
*** 1.65x larger scheduler (160 entries, up from 97)
*** Larger dispatch (10-way, up from 8-way)
** 1.55x larger integer register file (280 entries, up from 180)
** 1.33x larger vector register file (224 entries, up from 168)
** Distributed scheduling queues (4 scheduling queues, up from 2)
*** New dedicated queue for store data
*** Replaced 2 generic AGUs with 2 load AGUs
*** Load/store pairs have dedicated queues
**** New paired store capabilities
* Execution Engine
** Execution ports rebalanced
** 2x store data ports (up from 1)
** 2x store address AGUs (up from 1)
* Memory subsystem
** Data Cache
*** DTLB now split for loads and stores
*** Load
**** 4 KiB TLB competitively shared (from fixed partitioning)
**** 2 MiB / 4 MiB TLB competitively shared (from fixed partitioning)
**** 2x larger 1 GiB page TLB (8 entries, up from 4)
*** Store
**** New dedicated store DTLB
**** 16 entries, all page sizes
** STLB
*** Single unified TLB for all page sizes (from 4 KiB + 2/4 MiB and a separate 1 GiB TLB)
*** STLB uses dynamic partitioning (from fixed partitioning)
** LSU
*** 1.8x more in-flight loads (128 entries, up from 72)
** 2x larger L2 cache (512 KiB, up from 256 KiB)
** Larger STLBs
*** 1.33x larger 4 KiB table (2,048 entries, up from 1,536)
*** New 1,024-entry 2 MiB/4 MiB table
*** Larger 1 GiB table (1,024 entries, up from 16)
** 5-Level Paging
*** Larger virtual address (57 bits, up from 48 bits; see the address-split sketch below)
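With 5-level paging, a 4 KiB translation gains a fifth 9-bit table index on top of the existing four, which is exactly where the wider virtual address comes from: a 12-bit page offset plus 5 × 9 index bits gives 57 bits, versus 12 + 4 × 9 = 48 bits with 4-level paging. Below is a minimal C sketch of that field split, assuming the standard x86-64 LA57 bit layout; the function name and the sample address are illustrative and not taken from any Intel reference code.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

/* LA57 (5-level paging), 4 KiB pages: 12-bit page offset followed by five
 * 9-bit table indices, for 12 + 5*9 = 57 virtual-address bits. */
static void split_la57(uint64_t va)
{
    uint64_t offset = va & 0xFFFull;       /* bits 11:0  - page offset      */
    uint64_t pt     = (va >> 12) & 0x1FF;  /* bits 20:12 - page table       */
    uint64_t pd     = (va >> 21) & 0x1FF;  /* bits 29:21 - page directory   */
    uint64_t pdpt   = (va >> 30) & 0x1FF;  /* bits 38:30 - PDPT             */
    uint64_t pml4   = (va >> 39) & 0x1FF;  /* bits 47:39 - PML4             */
    uint64_t pml5   = (va >> 48) & 0x1FF;  /* bits 56:48 - PML5 (new level) */

    printf("PML5=%3llu PML4=%3llu PDPT=%3llu PD=%3llu PT=%3llu offset=0x%03llx\n",
           (unsigned long long)pml5, (unsigned long long)pml4,
           (unsigned long long)pdpt, (unsigned long long)pd,
           (unsigned long long)pt,   (unsigned long long)offset);
}

int main(void)
{
    /* Arbitrary example address, truncated to 57 bits. */
    split_la57(0x0123456789ABCDEFull & ((1ull << 57) - 1));
    return 0;
}
</syntaxhighlight>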
  
 
=== Block diagram ===
:[[File:sunny cove block diagram.svg|950px]]
 
 
=== Memory Hierarchy ===
* Cache (geometries are cross-checked in the sketch after this list)
** L0 µOP cache:
*** 2,304 µOPs, 8-way set associative
**** 48 sets, 6-µOP line size
**** statically divided between threads, per core, inclusive with L1I
** L1I Cache:
*** 32 [[KiB]], 8-way set associative
**** 64 sets, 64 B line size
**** shared by the two threads, per core
** L1D Cache:
*** 48 KiB, 12-way set associative
*** 64 sets, 64 B line size
*** shared by the two threads, per core
*** 4 cycles for fastest load-to-use (simple pointer accesses)
**** 5 cycles for complex addresses
*** bandwidth
**** 2x 64 B/cycle load + 1x 64 B/cycle store
**** OR 2x 32 B/cycle store
*** Write-back policy
** L2 Cache:
*** Client
**** Unified, 512 KiB, 8-way set associative
**** 1,024 sets, 64 B line size
*** Server
**** Unified, 1,280 KiB, 20-way set associative
**** 1,024 sets, 64 B line size
*** Non-inclusive
*** 13 cycles for fastest load-to-use
*** 64 B/cycle bandwidth to L1$
*** Write-back policy
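The set counts listed above follow directly from the capacities: sets = capacity / (ways × line size), and the µOP cache capacity is sets × ways × µOPs per line. The small C sketch below cross-checks those figures; the helper name is illustrative and the parameters are simply the ones listed above.

<syntaxhighlight lang="c">
#include <stdio.h>

/* sets = capacity / (ways * line_size) */
static unsigned sets(unsigned capacity_bytes, unsigned ways, unsigned line_bytes)
{
    return capacity_bytes / (ways * line_bytes);
}

int main(void)
{
    printf("L1I sets        : %u\n", sets(32 * 1024, 8, 64));    /* 32 KiB, 8-way     -> 64    */
    printf("L1D sets        : %u\n", sets(48 * 1024, 12, 64));   /* 48 KiB, 12-way    -> 64    */
    printf("L2 (client) sets: %u\n", sets(512 * 1024, 8, 64));   /* 512 KiB, 8-way    -> 1,024 */
    printf("L2 (server) sets: %u\n", sets(1280 * 1024, 20, 64)); /* 1,280 KiB, 20-way -> 1,024 */
    printf("uop cache uops  : %u\n", 48 * 8 * 6);                /* 48 sets x 8 ways x 6 uops -> 2,304 */
    return 0;
}
</syntaxhighlight>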
 
 
 
The Sunny Cove TLB consists of a dedicated L1 TLB for the instruction cache (ITLB) and another for the data cache (DTLB). Additionally, there is a unified L2 TLB (STLB).
 
* TLBs (reach is worked out in the sketch after this list):
** ITLB
*** 4 KiB page translations:
**** 128 entries; 8-way set associative
**** dynamic partitioning
*** 2 MiB / 4 MiB page translations:
**** 16 entries per thread; fully associative
**** Duplicated for each thread
** DTLB
*** Load
**** 4 KiB page translations:
***** 64 entries; 4-way set associative
***** competitively shared
**** 2 MiB / 4 MiB page translations:
***** 32 entries; 4-way set associative
***** competitively shared
**** 1 GiB page translations:
***** 8 entries; 8-way set associative
***** competitively shared
*** Store
**** All pages:
***** 16 entries; 16-way set associative
***** competitively shared
** STLB
*** All pages:
**** 2,048 entries; 16-way set associative
**** Partitioning:
***** 4 KiB pages can use all 2,048 entries
***** 2/4 MiB pages can use 1,024 entries (8-way sets), shared with 4 KiB pages
***** 1 GiB pages can use 1,024 entries (8-way sets), shared with 4 KiB pages
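A convenient way to read these entry counts is as TLB reach (entries × page size), i.e. how much memory a structure can map before a miss forces a page walk. The short C sketch below works that out from the entry counts listed above; it treats each structure in isolation and ignores sharing between threads.

<syntaxhighlight lang="c">
#include <stdio.h>

/* reach = entries * page_size, reported in MiB */
static double reach_mib(unsigned entries, unsigned long long page_bytes)
{
    return (double)entries * (double)page_bytes / (1024.0 * 1024.0);
}

int main(void)
{
    printf("load DTLB,  4 KiB pages: %9.2f MiB\n", reach_mib(64,   4ull << 10));
    printf("load DTLB,  2 MiB pages: %9.2f MiB\n", reach_mib(32,   2ull << 20));
    printf("load DTLB,  1 GiB pages: %9.2f MiB\n", reach_mib(8,    1ull << 30));
    printf("store DTLB, 4 KiB pages: %9.2f MiB\n", reach_mib(16,   4ull << 10));
    printf("STLB,       4 KiB pages: %9.2f MiB\n", reach_mib(2048, 4ull << 10));
    return 0;
}
</syntaxhighlight>

The jump in reach with 2 MiB and 1 GiB pages is the usual argument for backing large working sets with huge pages.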
 
  
 
== Overview ==
Sunny Cove is Intel's microarchitecture for their [[big core|big CPU core]], succeeding {{\\|Palm Cove}} (and effectively the {{\\|Skylake (client)|Skylake}} series of derivatives). The core is incorporated into a number of client and server chips made by Intel, including {{\\|Lakefield}}, {{\\|Ice Lake (Client)}}, and {{\\|Ice Lake (Server)}}, as well as the [[Nervana]] {{nervana|NNP}} accelerator. Sunny Cove introduces a large set of enhancements that improve the performance of both legacy and new code through the extraction of additional parallelism as well as through new features. Those include a deeper [[out-of-order execution|out-of-order]] window, a wider execution back-end, higher load-store bandwidth, lower effective access latencies, and bigger caches.
  
 
== Pipeline ==
