References

* Linley Gwennap, "Nvidia's First CPU Is a Winner: Denver Uses Dynamic Translation to Outperform Mobile Rivals" (August 18, 2014)
* [https://www.hotchips.org/wp-content/uploads/hc_archives/hc26/HC26-11-day1-epub/HC26.11-2-Mobile-Processors-epub/HC26.11.234-Denver-Darrell.Boggs-NVIDIA-rev4.pdf IEEE HotChips 26 (HC26), 2014] - Darrell Boggs "Nvidia's Denver Processor"
* Parker Series SoC Technical Reference Manual, Nvidia
* https://www.anandtech.com/tag/project-denver

Latest revision as of 12:36, 15 October 2025

Denver µarch

General Info
Arch Type: CPU
Designer: Nvidia
Manufacturer: TSMC
Introduction: 2014
Process: 28 nm, 16 nm
Core Configs: 2

Pipeline
Type: Superscalar
OoOE: No
Decode: 2-way

Instructions
ISA: ARMv8

Cache
L1I Cache: 128 KiB/core, 4-way set associative
L1D Cache: 64 KiB/core, 4-way set associative
L2 Cache: 2 MiB/cluster, 16-way set associative

Succession
Successor: Carmel

Denver is a CPU microarchitecture from Nvidia, introduced in 2014, that executes ARMv8 code either natively or with the help of dynamic code optimization. The native ARM decoder can issue up to 2 instructions per cycle; when dynamic code translation is used, up to 7 micro-operations can be started per cycle.

Architecture

Denver is a 7-wide, in-order superscalar design. Its hardware ARMv8 decoder (A32, T32, and A64 modes) can generate up to 2 micro-ops per cycle. Alternatively, the core can execute up to 7 micro-ops per cycle directly from the L1 instruction cache. Denver has 7 execution units: 1 branch unit, 2 integer units (one of which has a hardware multiplier), 2 FP/NEON units (128-bit), and 2 load/store units.
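
The execution-unit mix can be illustrated with a small check of whether a group of micro-ops could start in the same cycle. This is a minimal sketch, not Nvidia's scheduling logic: the tuple encoding, the unit names, and the function name are hypothetical, and real issue rules are certainly more detailed than a per-unit count.

```python
from collections import Counter

# Execution-unit counts taken from the text: 1 branch, 2 integer,
# 2 FP/NEON, 2 load/store (7 units in total).
UNIT_CAPACITY = {"branch": 1, "integer": 2, "fp_neon": 2, "load_store": 2}
# Only one of the two integer units has a hardware multiplier.
MAX_MULTIPLIES = 1


def fits_in_one_cycle(bundle):
    """Return True if every micro-op in `bundle` can get its own unit.

    `bundle` is a list of (unit, is_multiply) tuples. This encoding is
    hypothetical and exists only for this sketch.
    """
    per_unit = Counter(unit for unit, _ in bundle)
    # Reject micro-ops that name a unit the model does not know about.
    if any(unit not in UNIT_CAPACITY for unit in per_unit):
        return False
    # No unit type may be oversubscribed.
    if any(per_unit[unit] > cap for unit, cap in UNIT_CAPACITY.items()):
        return False
    # At most one integer multiply per cycle.
    multiplies = sum(1 for unit, is_mul in bundle if unit == "integer" and is_mul)
    return multiplies <= MAX_MULTIPLIES


full_bundle = [
    ("branch", False),
    ("integer", True),
    ("integer", False),
    ("fp_neon", False),
    ("fp_neon", False),
    ("load_store", False),
    ("load_store", False),
]
print(fits_in_one_cycle(full_bundle))
```

Because the per-unit capacities sum to 7, the 7-wide limit follows from the unit counts and needs no separate check in this model.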

Denver 2 has dynamic branch prediction with a Branch Target Buffer and a Global History Buffer (the conditional direction predictor is gshare-agree). It also has a Return Stack Buffer, an Indirect Target Predictor, and a static predictor.
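
For readers unfamiliar with the term, the following is a textbook-style sketch of a gshare-agree direction predictor. It is not a description of Denver 2's hardware: the table size, the 2-bit counters, the rule that the first observed outcome sets the bias bit, and all names are assumptions made for illustration.

```python
class GshareAgreePredictor:
    """Toy gshare-agree predictor (illustrative assumptions throughout)."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # 2-bit saturating counters; values 2 and 3 mean "agree with bias".
        self.counters = [2] * (1 << index_bits)
        self.history = 0   # global branch history register
        self.bias = {}     # pc -> bias bit (hardware would store this per branch)

    def _index(self, pc):
        # gshare: XOR the branch address with the global history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        bias = self.bias.get(pc, True)  # unseen branches default to taken
        agree = self.counters[self._index(pc)] >= 2
        return bias if agree else not bias

    def update(self, pc, taken):
        # The first observed outcome becomes the bias bit for this branch.
        bias = self.bias.setdefault(pc, taken)
        i = self._index(pc)
        if taken == bias:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = GshareAgreePredictor()
for _ in range(20):
    predictor.update(0x1000, False)  # train on a never-taken branch
print(predictor.predict(0x1000))
```

The counters record whether a branch agrees with its bias bit rather than whether it is taken, so two branches that share a counter but have opposite biases can both push it toward "agree".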

The Denver 1 pipeline has 15 stages; the branch mispredict penalty is 13 cycles.

Stage name    Stage action
IP1           ITLB
IC2           I$ Rd
IW3           Way Sel
IN4           Decode
IN5           Fetch Q
SB1           Pick
SB2           Sched
EB0           RF Rd
EB1           Bypass
EA2           Ld Addr
ED3           D$ Read
EL4           Bypass
EE5           ALU/Execute
ES6           St Addr
EW7           RF Wr
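
The stage list and the mispredict penalty can be written down as data for quick sanity checks. In this minimal sketch, grouping stages by the first letter of their name and the helper `cycles_lost_to_mispredicts` are illustrative assumptions, not something the sources define.

```python
from collections import Counter

# (stage name, stage action) pairs, in pipeline order, as listed in the text.
STAGES = [
    ("IP1", "ITLB"), ("IC2", "I$ Rd"), ("IW3", "Way Sel"),
    ("IN4", "Decode"), ("IN5", "Fetch Q"),
    ("SB1", "Pick"), ("SB2", "Sched"),
    ("EB0", "RF Rd"), ("EB1", "Bypass"), ("EA2", "Ld Addr"),
    ("ED3", "D$ Read"), ("EL4", "Bypass"), ("EE5", "ALU/Execute"),
    ("ES6", "St Addr"), ("EW7", "RF Wr"),
]
MISPREDICT_PENALTY_CYCLES = 13  # from the text


def cycles_lost_to_mispredicts(mispredicted_branches):
    """Cycles lost if every mispredict costs the full penalty."""
    return mispredicted_branches * MISPREDICT_PENALTY_CYCLES


# Group stages by the first letter of the stage name (I, S, E).
stages_per_group = Counter(name[0] for name, _ in STAGES)

print(len(STAGES))
print(dict(stages_per_group))
print(cycles_lost_to_mispredicts(50))
```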

Dynamic Code Optimization

When a code region is executed often, a micro-interrupt can be generated to start the firmware-based optimizer. Using "Dynamic Profile Information", the optimizer can translate ARMv8 instructions into an optimized microcode sequence and save it in the Optimization Cache.

Denver then executes that code directly from the Optimization Cache (part of a 128 MiB microcode carve-out) without using the hardware ARMv8 decoder. Several microcode sequences may be chained.

In 2014 Nvidia listed several optimizations performed by the dynamic code translation:

  • Unrolls loops
  • Renames registers
  • Reorders loads and stores
  • Improves control flow
  • Removes unused computation
  • Hoists redundant computation
  • Sinks uncommonly executed computation
  • Improves scheduling
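
The overall flow (profile, optimize once a region is hot, then run from the Optimization Cache) can be sketched as a toy model. Everything specific here is an assumption for illustration: the class name, the execution-count threshold, keying the cache by address, and the stand-in "optimization" that only drops `nop` entries. The real optimizer is firmware whose internals are not described in the text.

```python
class DynamicOptimizerModel:
    """Toy model of hot-region detection plus an optimization cache."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold  # hypothetical value
        self.exec_counts = {}               # address -> times run via the decoder
        self.optimization_cache = {}        # address -> optimized sequence

    def run_region(self, address, arm_instructions):
        """Return which path executed the region: 'decoder' or 'optimized'."""
        if address in self.optimization_cache:
            return "optimized"
        count = self.exec_counts.get(address, 0) + 1
        self.exec_counts[address] = count
        if count >= self.hot_threshold:
            # Stand-in for the micro-interrupt that starts the optimizer.
            self.optimization_cache[address] = self._optimize(arm_instructions)
        return "decoder"

    @staticmethod
    def _optimize(arm_instructions):
        # Placeholder for "removes unused computation": drop nops only.
        return [insn for insn in arm_instructions if insn != "nop"]


model = DynamicOptimizerModel(hot_threshold=3)
region = ["add", "nop", "mul"]
paths = [model.run_region(0x4000, region) for _ in range(4)]
print(paths)
```

In this model the third execution still goes through the decoder and triggers the optimizer, so the fourth is the first to run from the cache.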

Cache

For two Denver cores, the total cache sizes are:

Cache Organization
Cache is a hardware component containing a relatively small and extremely fast memory designed to speed up the performance of a CPU by preparing ahead of time the data it needs to read from a relatively slower medium such as main memory.

The organization and amount of cache can have a large impact on the performance, power consumption, die size, and consequently cost of the IC.

Cache is specified by its size, number of sets, associativity, block size, sub-block size, and fetch and write-back policies.

Note: All units are in kibibytes and mebibytes.
L1$: 384 KiB (393,216 B, 0.375 MiB)
L1I$: 256 KiB (262,144 B, 0.25 MiB), 2x128 KiB, 4-way set associative
L1D$: 128 KiB (131,072 B, 0.125 MiB), 2x64 KiB, 4-way set associative

L2$: 2 MiB (2,048 KiB, 2,097,152 B, 0.00195 GiB), 1x2 MiB, 16-way set associative
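
The totals above follow from the per-core sizes, and the number of sets follows from size, associativity, and line size. In this minimal sketch the 64-byte line size is an assumption (the text does not state it), and the function names are illustrative.

```python
KIB = 1024

CORES = 2
L1I_PER_CORE_KIB = 128
L1D_PER_CORE_KIB = 64
L2_SHARED_KIB = 2 * 1024         # 2 MiB shared by the cluster
ASSUMED_LINE_BYTES = 64          # assumption: not given in the text


def total_cache_kib(cores=CORES):
    """Total cache sizes in KiB for `cores` Denver cores sharing one L2."""
    l1i = cores * L1I_PER_CORE_KIB
    l1d = cores * L1D_PER_CORE_KIB
    return {"l1i": l1i, "l1d": l1d, "l1": l1i + l1d, "l2": L2_SHARED_KIB}


def number_of_sets(size_kib, ways, line_bytes=ASSUMED_LINE_BYTES):
    """Sets = cache size / (associativity * line size)."""
    return (size_kib * KIB) // (ways * line_bytes)


totals = total_cache_kib()
print(totals)
print(number_of_sets(L1I_PER_CORE_KIB, ways=4))
print(number_of_sets(L2_SHARED_KIB, ways=16))
```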

The L1i TLB has 128 entries for 4 KiB pages and is 4-way set-associative. The L1d TLB has 280 entries and supports 4 KiB, 64 KiB, 1 MiB, and 2 MiB pages. TLB walks are accelerated by a 2048-entry, 4-way set-associative L2 TLB.
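
A common way to read these figures is TLB reach, the amount of memory that can be mapped without a TLB miss (entries times page size). In this minimal sketch the function names are illustrative, and using 4 KiB pages for the L2 TLB is an assumption, since the text does not say which page sizes it holds.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries, page_bytes):
    """Memory covered when every entry maps one page of `page_bytes`."""
    return entries * page_bytes


def tlb_sets(entries, ways):
    """Number of sets in a set-associative TLB."""
    if entries % ways:
        raise ValueError("entries must be a multiple of ways")
    return entries // ways


# L1i TLB: 128 entries, 4 KiB pages, 4-way set-associative.
print(tlb_reach_bytes(128, 4 * KIB) // KIB, "KiB reach,", tlb_sets(128, 4), "sets")
# L2 TLB: 2048 entries, 4-way; 4 KiB pages assumed for this illustration.
print(tlb_reach_bytes(2048, 4 * KIB) // MIB, "MiB reach,", tlb_sets(2048, 4), "sets")
# L1d TLB: 280 entries; reach depends on the page size in use.
print(tlb_reach_bytes(280, 2 * MIB) // MIB, "MiB reach with 2 MiB pages")
```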

Features

Supported ARM Extensions & Processor Features
Thumb-2: Thumb-2 Extension
VFPv3: Vector Floating Point (VFP) v3 Extension
VFPv4: Vector Floating Point (VFP) v4 Extension
NEON: Advanced SIMD extension
PMUv3: ARMv8 PMUv3 Performance Monitors Extension
CRC32: CRC-32 checksum Extension
Crypto: Cryptographic Extension
SIMD: Advanced SIMD extension

Products

Denver is used in Nvidia's Tegra K1-64 (2014, 28 nm, model T132). That SoC is used in Google's Nexus 9 tablet, produced by HTC.

Denver 2 is used in Nvidia's Tegra X2 "Parker" (2016, 16 nm, model T186). The Parker SoC has four Cortex-A57 cores and two Denver 2 cores. It is used in the Nvidia Drive PX2 and the Nvidia Jetson TX2.

Die
