Revision as of 23:11, 29 August 2018
| Denver µarch | |
| General Info | |
| Arch Type | CPU | 
| Designer | Nvidia | 
| Manufacturer | TSMC | 
| Introduction | 2014 | 
| Process | 28 nm, 16 nm | 
| Core Configs | 2 | 
| Pipeline | |
| Type | Superscalar | 
| OoOE | No | 
| Decode | 2-way | 
| Instructions | |
| ISA | ARMv8 | 
| Cache | |
| L1I Cache | 128 KiB/core 4-way set associative | 
| L1D Cache | 64 KiB/core 4-way set associative | 
| L2 Cache | 2 MiB/cluster 16-way set associative | 
| Succession | |
Denver is a CPU microarchitecture from Nvidia, introduced in 2014, that can execute ARMv8 code both natively and with the help of dynamic code optimization. The native ARM decoder can issue up to 2 instructions per cycle, and up to 7 micro-operations can be started per cycle when dynamic code translation is used.
Architecture
Denver is a 7-wide, in-order superscalar design. Its hardware ARMv8 decoder (A32, T32, and A64 modes) can generate up to 2 micro-ops per cycle, while optimized code can be executed at up to 7 micro-ops per cycle directly from the L1 instruction cache. Denver has 7 execution units: 1 branch, 2 integer (one with a hardware multiplier), 2 FP/NEON (128-bit), and 2 load/store.
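To make the difference between the two issue widths concrete, the following minimal sketch computes a lower bound on the number of issue cycles for a stream of micro-ops. It assumes ideal conditions (no dependencies, stalls, or unit conflicts), and the names `EXECUTION_UNITS` and `min_cycles` are illustrative only.

```python
import math

# Execution units as listed in the text above.
EXECUTION_UNITS = {"branch": 1, "integer": 2, "fp_neon": 2, "load_store": 2}

# Micro-ops per cycle from the hardware ARMv8 decoder.
DECODER_WIDTH = 2
# Micro-ops per cycle when running optimized code (one per execution unit).
OPTIMIZED_WIDTH = sum(EXECUTION_UNITS.values())


def min_cycles(micro_ops: int, width: int) -> int:
    """Lower bound on issue cycles, ignoring dependencies and stalls."""
    return math.ceil(micro_ops / width)


if __name__ == "__main__":
    n = 140  # arbitrary example stream length
    print("optimized width:", OPTIMIZED_WIDTH)
    print("decoder path cycles:", min_cycles(n, DECODER_WIDTH))
    print("optimized path cycles:", min_cycles(n, OPTIMIZED_WIDTH))
```

Real code rarely sustains the peak width, so these figures are upper bounds on throughput rather than expected performance.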
Denver 2 has dynamic branch prediction with a Branch Target Buffer and a Global History Buffer (the conditional direction predictor is of the gshare-agree type). It also has a Return Stack Buffer, an Indirect Target Predictor, and a static predictor.
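The gshare-agree scheme named above can be illustrated with a textbook model: the branch address is XORed with the global history to index a table of 2-bit counters, and each counter predicts whether the branch will agree with a per-branch bias bit. This is a generic sketch, not Denver 2's implementation. The table size, the bias-bit policy (first observed outcome), and the class name `GshareAgreePredictor` are all assumptions.

```python
class GshareAgreePredictor:
    """Textbook gshare indexing combined with an agree predictor."""

    def __init__(self, index_bits: int = 10):  # table size is an assumption
        self.mask = (1 << index_bits) - 1
        self.history = 0                              # global branch history
        self.counters = [2] * (1 << index_bits)       # 2-bit, start weakly "agree"
        self.bias = {}                                # pc -> bias bit

    def _index(self, pc: int) -> int:
        # gshare: XOR the branch address with the global history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        bias = self.bias.get(pc, False)               # unseen branch: assume not taken
        agree = self.counters[self._index(pc)] >= 2
        return bias if agree else not bias

    def update(self, pc: int, taken: bool) -> None:
        bias = self.bias.setdefault(pc, taken)        # bias bit = first outcome
        idx = self._index(pc)
        if taken == bias:
            self.counters[idx] = min(3, self.counters[idx] + 1)
        else:
            self.counters[idx] = max(0, self.counters[idx] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def count_mispredicts(predictor, pc, outcomes):
    """Run a sequence of outcomes for one branch and count wrong predictions."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


if __name__ == "__main__":
    p = GshareAgreePredictor()
    print(count_mispredicts(p, 0x1000, [True] * 100))
```

Because the counters track agreement rather than direction, two branches that alias to the same counter but have opposite biases tend to reinforce each other instead of interfering, which is the usual motivation for the agree scheme.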
The Denver 1 pipeline has 15 stages; the branch mispredict penalty is 13 cycles.
| Stage name: | IP1 | IC2 | IW3 | IN4 | IN5 | SB1 | SB2 | EB0 | EB1 | EA2 | ED3 | EL4 | EE5 | ES6 | EW7 | 
| Stage action: | ITLB | I$ Rd | Way Sel | Decode | Fetch Q | Pick | Sched | RF Rd | Bypass | Ld Addr | D$ Read | Bypass | ALU/Execute | St Addr | RF wr | 
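The stage table can be cross-checked against the quoted figures with a few lines of code. The sketch below copies the stage names and actions from the table and counts them. It also shows that ALU/Execute (EE5) is the 13th stage, which lines up with the 13-cycle mispredict penalty. That correspondence is an observation from the table, not something the sources state explicitly.

```python
# Stage names and actions copied from the Denver 1 pipeline table above.
STAGES = [
    ("IP1", "ITLB"), ("IC2", "I$ Rd"), ("IW3", "Way Sel"), ("IN4", "Decode"),
    ("IN5", "Fetch Q"), ("SB1", "Pick"), ("SB2", "Sched"), ("EB0", "RF Rd"),
    ("EB1", "Bypass"), ("EA2", "Ld Addr"), ("ED3", "D$ Read"), ("EL4", "Bypass"),
    ("EE5", "ALU/Execute"), ("ES6", "St Addr"), ("EW7", "RF wr"),
]


def stage_position(name: str) -> int:
    """Return the 1-based position of a stage in the pipeline."""
    names = [stage for stage, _ in STAGES]
    return names.index(name) + 1


if __name__ == "__main__":
    print("total stages:", len(STAGES))
    print("position of EE5:", stage_position("EE5"))
```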
Dynamic Code Optimization
When a code region is executed often, a micro-interrupt can be generated to start a firmware-based optimizer. Using "Dynamic Profile Information", the optimizer translates the ARMv8 instructions into an optimized microcode sequence and saves it in the Optimization Cache. Denver then executes that code directly from the Optimization Cache (part of a 128 MiB microcode carve-out) without using the hardware ARMv8 decoder. Several microcode sequences may be chained.
In 2014 Nvidia listed several optimizations for the dynamic code translation:
- Unrolls loops
- Renames registers
- Reorders Loads and Stores
- Improves control flow
- Removes unused computation
- Hoists redundant computation
- Sinks uncommonly executed computation
- Improves scheduling
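The profile-then-translate flow described in this section can be modelled in a few lines. The sketch below is a toy model only: the hotness threshold, the region keying by address, and the placeholder `_optimize` step (which merely drops `nop` entries as a stand-in for "removes unused computation") are assumptions made for illustration and do not reflect Nvidia's firmware.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical: executions before a region counts as hot


class DynamicOptimizerModel:
    """Toy model of hot-region detection with an optimization cache."""

    def __init__(self, threshold: int = HOT_THRESHOLD):
        self.threshold = threshold
        self.exec_counts = Counter()       # stands in for profile information
        self.optimization_cache = {}       # region address -> optimized sequence

    def run_region(self, address: int, arm_instructions: list) -> str:
        """Return which path executed the region: 'decoder' or 'optimized'."""
        if address in self.optimization_cache:
            return "optimized"             # bypasses the hardware decoder
        self.exec_counts[address] += 1
        if self.exec_counts[address] >= self.threshold:
            # Region became hot: translate it for future executions.
            self.optimization_cache[address] = self._optimize(arm_instructions)
        return "decoder"

    @staticmethod
    def _optimize(arm_instructions: list) -> tuple:
        # Placeholder transformation, not a real optimization pass.
        return tuple(i for i in arm_instructions if i != "nop")


if __name__ == "__main__":
    model = DynamicOptimizerModel()
    region = ["ldr", "nop", "add", "str"]
    print([model.run_region(0x4000, region) for _ in range(5)])
```

The model shows the key property of the scheme: the translation cost is paid once, when the region becomes hot, and every later execution takes the wide path.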
Cache
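For the two Denver cores, the total cache size is 256 KiB of L1 instruction cache (2 × 128 KiB), 128 KiB of L1 data cache (2 × 64 KiB), and 2 MiB of L2 cache per cluster.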
Cache Organization
Cache is a hardware component containing a relatively small and extremely fast memory designed to speed up the performance of a CPU by preparing ahead of time the data it needs to read from a relatively slower medium such as main memory. The organization and amount of cache can have a large impact on the performance, power consumption, die size, and consequently cost of the IC. Cache is specified by its size, number of sets, associativity, block size, sub-block size, and fetch and write-back policies. Note: All units are in kibibytes and mebibytes.
The L1 instruction TLB has 128 entries for 4 KiB pages and is 4-way set-associative. The L1 data TLB has 280 entries and supports 4 KiB, 64 KiB, 1 MiB, and 2 MiB pages. TLB walks are accelerated by a 4-way set-associative L2 TLB with 2048 entries.
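The cache and TLB figures above lend themselves to a short worked calculation. The sketch below derives the total L1 capacity for two cores, the number of sets in each set-associative TLB, and the TLB reach (the amount of memory covered by all entries). The page sizes supported by the L2 TLB are not stated in the text, so the 4 KiB page size used for its reach is an assumption. The helper names are illustrative.

```python
CORES = 2
L1I_PER_CORE_KIB = 128
L1D_PER_CORE_KIB = 64


def total_l1_kib(cores: int = CORES) -> int:
    """Combined L1 instruction and data cache capacity across all cores."""
    return cores * (L1I_PER_CORE_KIB + L1D_PER_CORE_KIB)


def tlb_sets(entries: int, ways: int) -> int:
    """Number of sets in a set-associative TLB."""
    return entries // ways


def tlb_reach_kib(entries: int, page_kib: int) -> int:
    """Memory covered when every entry maps a page of the given size."""
    return entries * page_kib


if __name__ == "__main__":
    print("total L1 (KiB):", total_l1_kib())
    print("L1i TLB sets:", tlb_sets(128, 4))
    print("L1i TLB reach (KiB):", tlb_reach_kib(128, 4))
    # Assumes 4 KiB pages for the L2 TLB; the text does not state its page sizes.
    print("L2 TLB sets:", tlb_sets(2048, 4))
    print("L2 TLB reach (KiB):", tlb_reach_kib(2048, 4))
```

With 4 KiB pages, the L1 instruction TLB covers 512 KiB, four times the size of the 128 KiB L1 instruction cache it serves.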
Features
Supported ARM Extensions & Processor Features
Products
Denver is used in Nvidia's Tegra K1-64 (2014, 28 nm, model T132), which powers Google's Nexus 9 tablet, produced by HTC.
Denver 2 is used in Nvidia's Tegra X2 "Parker" (2016, 16 nm, model T186). The Parker SoC has four Cortex-A57 cores and two Denver 2 cores. It is used in the Nvidia Drive PX2 and the Nvidia Jetson TX2.
Die
All Denver Chips
| Model | Launched | Designer | Family | Process | Core | C | T | L2$ | L3$ | Frequency | Max Mem | IGP Designer | IGP Name | IGP Frequency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Count: 0 |
References
- NVIDIA’S FIRST CPU IS A WINNER. Denver Uses Dynamic Translation to Outperform Mobile Rivals. - Linley Gwennap (August 18, 2014)
- IEEE HotChips 26 (HC26), 2014 - Darrell Boggs "Nvidia's Denver Processor"
- Parker Series SoC Technical Reference Manual, Nvidia
- https://www.anandtech.com/tag/project-denver