{{nvidia title|denver}}
{{microarchitecture
| atype         = CPU
| name          = Denver
| designer      = Nvidia
| manufacturer  = TSMC <!-- ??? -->
| introduction  = 2014
| phase-out     =
| process       = 28 nm
| cores         = 2
| cores 2       = 
| cores N       = 

| type          = Superscalar
| type 2        = 
| type N        = 
| oooe          = No
| speculative   = <!-- Yes or No only -->
| renaming      = <!-- Yes or No only -->
| stages        = <!-- ONLY IF FIXED SIZE, otherwise use below for range -->
| stages min    = 
| stages max    =
| decode        = 2-way

| isa           = ARMv8
| isa 2         = 
| isa N         = 
| feature       = 
| extension     = 
| extension 2   = 
| extension N   = 

| l1i           = 128 KiB
| l1i per       = core
| l1i desc      = 4-way set associative
| l1d           = 64 KiB
| l1d per       = core
| l1d desc      = 4-way set associative
| l2            = 2 MiB
| l2 per        = cluster
| l2 desc       = 16-way set associative
| l3            = 
| l3 per        = 
| l3 desc       = 

| core name        =
| core name 2      =
| core name N      =

| predecessor      = 
| predecessor link = 
| successor        = 
| successor link   = 
| successor 2      = 
| successor 2 link = 
| successor N      = 
| successor N link = 
}}
'''Denver''' is a CPU microarchitecture from [[Nvidia]] introduced in 2014. It can execute ARMv8 code either natively or with the help of dynamic code optimization. The native ARM decoder can issue up to 2 instructions per cycle, and up to 7 micro-operations can be started per cycle when dynamic code translation is used.

== Architecture ==
Denver is a 7-wide superscalar design. Its hardware ARMv8 decoder can generate up to 2 micro-ops per cycle. It can also execute up to 7 micro-ops per cycle directly from the L1i cache. Denver has 7 execution units: 1 branch unit, 2 integer units (one of which has a hardware multiplier), 2 FP/NEON units (128-bit), and 2 load/store units.
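
The widths quoted above can be checked with a little arithmetic. The following Python sketch is illustrative only: it assumes each execution unit accepts one micro-op per cycle, and it ignores dependencies and the mix of unit types, so <code>min_cycles</code> (a hypothetical helper) gives an ideal lower bound rather than a performance prediction.

```python
# Execution units as listed in the text: 1 branch, 2 integer, 2 FP/NEON, 2 load/store.
EXECUTION_UNITS = {"branch": 1, "integer": 2, "fp_neon": 2, "load_store": 2}

# Micro-ops per cycle produced by the hardware ARMv8 decoder (from the text).
DECODER_WIDTH = 2

# Assumption: one micro-op per unit per cycle, so the peak width is the unit count.
ISSUE_WIDTH = sum(EXECUTION_UNITS.values())


def min_cycles(micro_ops: int, width: int) -> int:
    """Ideal lower bound on cycles needed to start `micro_ops` at `width` per cycle."""
    return -(-micro_ops // width)  # ceiling division


if __name__ == "__main__":
    print("peak width:", ISSUE_WIDTH)
    print("14 micro-ops via decoder:", min_cycles(14, DECODER_WIDTH), "cycles")
    print("14 micro-ops at full width:", min_cycles(14, ISSUE_WIDTH), "cycles")
```

Under these assumptions the unit counts add up to the stated width of 7, and the gap between the two bounds shows why bypassing the 2-wide decoder matters for throughput.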

The pipeline has 15 stages: IP1 (ITLB), IC2 (I$ Rd), IW3 (Way Sel), IN4 (Dec), IN5 (PB), SB1 (Pick), SB2 (Sch), EB0 (RF Rd), EB1 (Bypass), EA2, ED3, EL4 (Bypass), EE5 (ALU), ES6, EW7 (RF wr). The mispredict penalty is 13 cycles.
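
The stage list can be written down as data to check the stage count. The sketch below also tests one possible reading of the 13-cycle mispredict penalty, namely that branches resolve in the EE5 (ALU) stage; that reading is an assumption made for illustration and is not stated in the sources. Stages without a description in the list above are left as <code>None</code>.

```python
# Pipeline stages in order, with the short action labels from the text.
# None means the text gives no label for that stage.
PIPELINE = [
    ("IP1", "ITLB"), ("IC2", "I$ Rd"), ("IW3", "Way Sel"),
    ("IN4", "Dec"), ("IN5", "PB"),
    ("SB1", "Pick"), ("SB2", "Sch"),
    ("EB0", "RF Rd"), ("EB1", "Bypass"), ("EA2", None), ("ED3", None),
    ("EL4", "Bypass"), ("EE5", "ALU"), ("ES6", None), ("EW7", "RF wr"),
]


def stage_position(name: str) -> int:
    """Return the 1-based position of a stage in the pipeline."""
    names = [stage for stage, _ in PIPELINE]
    return names.index(name) + 1


if __name__ == "__main__":
    print("stages:", len(PIPELINE))
    # Assumption: a branch is resolved in EE5, so a mispredicted branch
    # has occupied every stage up to and including EE5.
    print("EE5 position:", stage_position("EE5"))
```

EE5 is the 13th of the 15 stages, which is consistent with the quoted penalty under the assumption above.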

=== Dynamic Code Optimization ===
For frequently executed code, an optimization micro-interrupt can be generated, which starts a firmware-based optimizer. Using "Dynamic Profile Information", the optimizer can translate ARMv8 instructions into an optimized microcode sequence and save it in the Optimization Cache. Denver then executes the code directly from the Optimization Cache without using the hardware ARMv8 decoder. Several microcode sequences may be chained.
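
The flow described above (profile, optimize once a region is hot, then run from the Optimization Cache) can be modeled with a toy Python sketch. This is not Nvidia's implementation: the class <code>ToyDynamicOptimizer</code>, the <code>HOT_THRESHOLD</code> value, and the <code>drop_nops</code> "optimization" are all hypothetical stand-ins, and the real trigger condition and microcode format are not given in the sources.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical execution count after which a region is treated as hot.
HOT_THRESHOLD = 3


class ToyDynamicOptimizer:
    """Toy model: count executions per region, optimize hot ones, cache the result."""

    def __init__(self, optimize: Callable[[List[str]], List[str]]) -> None:
        self.optimize = optimize
        self.exec_counts: Dict[int, int] = {}             # stand-in for profile data
        self.optimization_cache: Dict[int, List[str]] = {}

    def run(self, address: int, arm_instructions: List[str]) -> Tuple[str, List[str]]:
        """Return (path taken, operations executed) for one pass over a region."""
        cached = self.optimization_cache.get(address)
        if cached is not None:
            # Hot region already translated: the hardware decoder is bypassed.
            return "optimization-cache", cached
        self.exec_counts[address] = self.exec_counts.get(address, 0) + 1
        if self.exec_counts[address] >= HOT_THRESHOLD:
            # Stand-in for the micro-interrupt that starts the optimizer.
            self.optimization_cache[address] = self.optimize(arm_instructions)
        return "hardware-decoder", list(arm_instructions)


def drop_nops(instructions: List[str]) -> List[str]:
    """Placeholder optimization: remove instructions that do no work."""
    return [op for op in instructions if op != "nop"]


opt = ToyDynamicOptimizer(drop_nops)
loop_body = ["add", "nop", "ldr"]
paths = [opt.run(0x1000, loop_body)[0] for _ in range(5)]
print(paths)
```

In this model the first three passes go through the decoder path and later passes are served from the cache. Chaining of several cached sequences is not modeled here.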

In 2014 Nvidia listed several optimizations for the dynamic code translation:
*Unrolls loops
*Renames registers
*Reorders Loads and Stores
*Improves control flow
*Removes unused computation
*Hoists redundant computation
*Sinks uncommonly executed computation
*Improves scheduling

== Products ==
Denver is used in Nvidia's Tegra K1-64 (2014, 28 nm).

Denver 2 is used in Nvidia's Tegra Parker (2016, 16 nm TSMC). The Parker SoC also uses 4 [[Cortex-A57]] cores.

== Die ==

== All Denver Chips ==
<!-- NOTE: 
           This table is generated automatically from the data in the actual articles.
           If a microprocessor is missing from the list, an appropriate article for it needs to be
           created and tagged accordingly.

           Missing a chip? please dump its name here: http://en.wikichip.org/wiki/WikiChip:wanted_chips
-->
{{comp table start}}
<table class="comptable sortable tc18 tc19 tc20 tc21 tc22 tc23">
<tr class="comptable-header"><th>&nbsp;</th><th colspan="25">List of all Denver Chips</th></tr>
<tr class="comptable-header"><th>&nbsp;</th><th colspan="10">Main processor</th><th colspan="3">IGP</th></tr>
{{comp table header 1|cols=Launched, Designer, Family, Process, Core, C, T, L2$, L3$, Frequency, Max Mem, Designer, Name, Frequency}}
{{#ask: [[Category:all microprocessor models]] [[microarchitecture::denver]]
 |?full page name
 |?model number
 |?first launched
 |?designer
 |?microprocessor family
 |?process
 |?core name
 |?core count
 |?thread count
 |?l2$ size
 |?l3$ size
 |?base frequency#GHz
 |?max memory#GiB
 |?integrated gpu designer
 |?integrated gpu
 |?integrated gpu base frequency
 |format=template
 |template=proc table 3
 |searchlabel=
 |sort=microprocessor family, model number
 |order=asc,asc
 |userparam=15
 |mainlabel=-
 |limit=100
 |valuesep=,
}}
{{comp table count|ask=[[Category:all microprocessor models]] [[microarchitecture::denver]]}}
</table>
{{comp table end}}


== References ==
* NVIDIA’S FIRST CPU IS A WINNER. Denver Uses Dynamic Translation to Outperform Mobile Rivals. - Linley Gwennap (August 18, 2014) <!-- Nvidia_Denverreprint.pdf -->
* [https://www.hotchips.org/wp-content/uploads/hc_archives/hc26/HC26-11-day1-epub/HC26.11-2-Mobile-Processors-epub/HC26.11.234-Denver-Darrell.Boggs-NVIDIA-rev4.pdf IEEE HotChips 26 (HC26), 2014] - Darrell Boggs "Nvidia's Denver Processor"
* https://www.anandtech.com/tag/project-denver
[[category:nvidia]]