Gen9 LP µarch | |
General Info | |
Arch Type | GPU |
Designer | Intel |
Manufacturer | Intel |
Introduction | August 5, 2015 |
Process | 14 nm |
Succession | |
Gen9 LP (Generation 9 Low Power) is the microarchitecture for Intel's graphics processing unit utilized by Skylake-based microprocessors. Gen9 LP is the successor to Gen8 LP used by Broadwell. The Gen9 microarchitecture is designed separately by Intel and then integrated onto the same Skylake SoC die.
Codenames
Models are offered in different Graphics Tiers (GT), which provide different levels of performance. Some models also include an additional eDRAM side cache.
Code Name | Description |
---|---|
GT1 | Contains 1 slice with 12 execution units. |
GT2 | Contains 1 slice with 24 execution units. |
GT3 | Contains 2 slices with 48 execution units. |
GT3e | Contains 2 slices with 48 execution units. Has an additional eDRAM side cache. |
Halo (GT4e) | Contains 3 slices with 72 execution units. Has an additional eDRAM side cache. |
Models
Gen9 LP IGP Models | Standards | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Name | Execution Units | Tier | Series | eDRAM | Vulkan | Direct3D | OpenGL | OpenCL | |||||
Windows | Linux | Windows | Linux | HLSL | Windows | Linux | Windows | Linux | |||||
HD Graphics (Skylake) | 12 | GT1 | Y | - | 1.0 | 12 | N/A | 5.1 | 4.4 | 4.5 | 2.0 | ||
HD Graphics 510 | 12 | GT1 | U, S | - | |||||||||
HD Graphics 515 | 24 | GT2 | Y | - | |||||||||
HD Graphics 520 | 24 | GT2 | U | - | |||||||||
HD Graphics 530 | 24 | GT2 | H, S | - | |||||||||
HD Graphics P530 | 24 | GT2 | H | - | |||||||||
Iris Graphics 540 | 48 | GT3e | U | 64 MiB | |||||||||
Iris Graphics 550 | 48 | GT3e | U | 64 MiB | |||||||||
Iris Pro Graphics 580 | 72 | GT4e | H | 128 MiB |
Model | SKU | EUs | CPU Stepping[devID 1] | GT Stepping[devID 2] | Device2 ID[devID 3] | GT Device2 ID Revision[devID 4] |
---|---|---|---|---|---|---|
HD Graphics 510 | SKL 2+1F DT | 12 | S0 | G0 | 0x1902 | 0x6 |
HD Graphics 510 | SKL U - ULT 2+1F | 12 | D0 | H0 | 0x1906 | 0x7 |
HD Graphics 510 | SKL - H 4+1F | 12 | D0 | H0 | 0x190B | 0x7 |
HD Graphics 515 | SKL Y – ULX 2+2 | 24 | D0 | H0 | 0x191E | 0x7 |
HD Graphics 520 | SKL U – ULT 2+2 | 24 | D0 | H0 | 0x1916 | 0x7 |
HD Graphics 530 | SKL 4+2 DT | 24 | R0 | G0 | 0x191B | 0x6 |
HD Graphics 530 | SKL 2+2 DT | 24 | S0 | G0 | 0x1912 | 0x6 |
HD Graphics 530 | SKL 4+2 DT | 24 | R0 | G0 | 0x1912 | 0x6 |
HD Graphics P530 | SKL WKS 4+2 | 24 | R1 | G1 | 0x191D | 0x6 |
Iris Graphics 540 | SKL U – ULT 2+3E (15W) | 48 | K1 | L1 | 0x1926 | 0xA |
Iris Graphics 550 | SKL U - ULT 2+3E (28W) | 48 | K1 | L1 | 0x1927 | 0xA |
HD Graphics 535 | SKL U - ULT 2+3 | 48 | K1 | L1 | 0x1923 | 0xA |
Iris Graphics P555 | SKL Media Server 4+3FE | 48 | N0 | J0 | 0x192D | 0x9 |
Iris Pro Graphics P580 | SKL H Halo 4+4E | 72 | N0 | J0 | 0x193B | 0x9 |
Iris Pro Graphics P580 | SKL WKS 4+4E | 72 | N0 | J0 | 0x193D | 0x9 |
- [devID 1] The CPU Stepping is the actual CPU design stepping.
- [devID 2] The GT Stepping refers to the GT design stepping.
- [devID 3] The Device2 ID is the PCI device ID that identifies the GT SKU for driver software.
- [devID 4] The GT Device2 Revision ID identifies the silicon stepping for driver software.
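Since the Device2 ID is what driver software uses to tell the GT SKUs apart, the table can be turned into a simple lookup. The following is a minimal sketch only: the dictionary layout and the helper name `identify_gen9` are hypothetical, and the EU counts for rows whose EU cell is blank in the flattened table are assumed to carry down from the row above.

```python
# PCI device IDs copied from the table above. The structure and names here
# are illustrative assumptions, not part of any Intel driver.
GEN9_DEVICE_IDS = {
    0x1902: ("HD Graphics 510", 12),
    0x1906: ("HD Graphics 510", 12),
    0x190B: ("HD Graphics 510", 12),
    0x191E: ("HD Graphics 515", 24),
    0x1916: ("HD Graphics 520", 24),
    0x191B: ("HD Graphics 530", 24),
    0x1912: ("HD Graphics 530", 24),
    0x191D: ("HD Graphics P530", 24),
    0x1926: ("Iris Graphics 540", 48),
    0x1927: ("Iris Graphics 550", 48),
    0x1923: ("HD Graphics 535", 48),
    0x192D: ("Iris Graphics P555", 48),
    0x193B: ("Iris Pro Graphics P580", 72),
    0x193D: ("Iris Pro Graphics P580", 72),
}


def identify_gen9(device_id):
    """Return (model, eu_count) for a listed Gen9 LP device ID, else None."""
    return GEN9_DEVICE_IDS.get(device_id)


if __name__ == "__main__":
    for dev in (0x1916, 0x193B, 0x0000):
        print(hex(dev), identify_gen9(dev))
```

An unknown ID returns `None` rather than raising, so a caller can fall back to a generic path for devices not listed here.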
Hardware Accelerated Video
Skylake (Gen9) Hardware Accelerated Video Capabilities | |||||||
---|---|---|---|---|---|---|---|
Codec | Encode Profiles | Encode Levels | Encode Max Resolution | Decode Profiles | Decode Levels | Decode Max Resolution |
MPEG-2 (H.262) | Main | High | 1080p (FHD) | Main | Main, High | 1080p (FHD) |
MPEG-4 AVC (H.264) | High, Main | 5.1 | 2160p (4K) | Main, High, SHP, MHP | 5.1 | 2160p (4K) |
JPEG/MJPEG | Baseline | - | 16k x 16k | Baseline | Unified | 16k x 16k |
HEVC (H.265) | Main | 5.1 | 2160p (4K) | Main, Main 10 | 5.1 | 2160p (4K) |
VC-1 | ✘ | ✘ | ✘ | Advanced, Main, Simple | 3, High | 3840x3840 |
VP8 | Unified | Unified | - | 0 | Unified | 1080p |
VP9 | ✘ | ✘ | ✘ | 0 | Unified | 2160p (4K) |
Process Technology
- Main article: Broadwell § Process Technology
Gen9 LP is part of the Skylake SoC die, which uses the same 14 nm process used for the Broadwell microarchitecture.
Architecture
Gen9 LP represents a large departure from Gen8 LP and previous architectures.
Key changes from Gen8 LP
- Architecture is drastically different
- Gen9 LP is composed of 3 truly independent major components: the Display block, the Unslice, and the Slice.
- Shared Virtual Memory (SVM) improvements
- Improved cache coherency performance
- Unslice
- Now sits on its own power gating/clock domain
- Capable of running at higher speeds if the situation allows (irrespective of slice clock)
- Allows pure fixed-function media workloads to run on the Unslice alone
- Higher throughput
- Tessellator AutoStrip
- Fixed function video encoder in the Quick Sync engine
- Codec (decode and encode) support for HEVC, VP8, and MJPEG
- RAW imaging capabilities
- Slice
- L3 Cache
- Increased to 768 KiB/slice (up from 576 KiB/slice)
- Request queue size was increased
- Subslice
- Adaptive scalable texture compression (ASTC)
- 16x multi-sample anti-aliasing (MSAA)
- Post depth test coverage mask
- Multi-plane overlays
- Texture samplers now natively support the NV12 YUV format
- Preemption of execution is now supported at the thread level
- Round-robin scheduling of threads within an execution unit
- New native support for the 32-bit float atomic operations min, max, and compare/exchange
- 16-bit floating point capability is improved with native support for denormals and gradual underflow
- L4$
- The eDRAM is now a side cache instead of an L4$ like it was in Gen8 LP. (See Skylake §eDRAM architectural changes for the reason)
- Side-cache eDRAM was moved into the system agent adjacent to the display controller
Block Diagram
Entire SoC Overview
Gen9 LP
This block is for the most common setup, which is GT2 with 24 execution units.
Individual Core
Display
Unslice
The Unslice is one of Gen9's major components. It is responsible for the fixed-function geometry capabilities and the fixed-function media capabilities, and it provides the interface to the memory fabric. One of the big changes in Gen9 is that the Unslice now sits on its own power/clock domain. This allows the Unslice to operate at its own speed, providing higher on-demand performance when desired. It also has a number of other benefits, such as being able to turn off one or more slices when they are not in use, for example when only fixed-function media is used. Additionally, the Unslice can run at a higher clock while the slice runs at a slower clock when the scenario demands it (such as when fixed-function geometry or memory demands are high).
The Command Stream (CS) unit manages the flow of execution for the FF Pipeline (3D Pipeline) and the Media pipelines. The CS unit performs the switching between pipelines and forwards command streams to the different stages. Data in the pipeline is passed to the next unit using a messaging network. Messages can be passed directly through registers or by using the URB. The Command Stream also manages the allocation of the URB and supports the Constant URB Entry (CURB) function. The Unified Return Buffer (URB) is globally shared and is explicitly addressed. The pipeline's fixed-function blocks have both read and write access to the URB; additionally, the shader cores have write access to the URB.
The media general-purpose pipeline consists of two fixed-function units: the Video Front End (VFE) and the Thread Spawner (TS). The VFE unit handles the interfacing with the Command Streamer, writes thread payload data into the Unified Return Buffer, and prepares threads to be dispatched through the TS unit. The VFE unit also contains the hardware Variable Length Decode (VLD) engine for MPEG-2 video decode. The TS unit is primarily responsible for interfacing with the Thread Dispatcher (TD) unit. It spawns new root-node parent threads originating from the VFE unit as well as child threads (either leaf-node child threads or branch-node parent threads).
3D Pipeline Stages
Pipeline Stage | Functions Performed |
---|---|
Command Stream (CS) | The Command Stream stage is responsible for managing the 3D pipeline and passing commands down the pipeline. In addition, the CS unit reads “constant data” from memory buffers and places it in the URB. Note that the CS stage is shared between the 3D, GPGPU and Media pipelines. |
Vertex Fetch (VF) | The Vertex Fetch stage, in response to 3D Primitive Processing commands, is responsible for reading vertex data from memory, reformatting it, and writing the results into Vertex URB Entries. It then outputs primitives by passing references to the VUEs down the pipeline. |
Vertex Shader (VS) | The Vertex Shader stage is responsible for processing (shading) incoming vertices by passing them to VS threads. |
Hull Shader (HS) | The Hull Shader is responsible for processing (shading) incoming patch primitives as part of the tessellation process. |
Tessellation Engine (TE) | The Tessellation Engine is responsible for using tessellation factors (computed in the HS stage) to tessellate U,V parametric domains into domain point topologies. |
Domain Shader (DS) | The Domain Shader stage is responsible for processing (shading) the domain points (generated by the TE stage) into corresponding vertices. |
Geometry Shader (GS) | The Geometry Shader stage is responsible for processing incoming objects by passing each object’s vertices to a GS thread. |
Stream Output Logic (SOL) | The Stream Output Logic is responsible for outputting incoming object vertices into Stream Out Buffers in memory. |
Clipper (CLIP) | The Clipper stage performs Clip Tests on incoming objects and clips objects if required. Objects are clipped using fixed-function hardware. |
Strip/Fan (SF) | The Strip/Fan stage performs object setup. Object setup uses fixed-function hardware. |
Windower/Masker (WM) | The Windower/Masker performs object rasterization and determines visibility coverage. |
Slice
A slice is a cluster of subslices. For most configurations in Gen9, 3 subslices are aggregated into 1 slice to form a total of 24 execution units (some low-end models have fewer). The slice incorporates the thread dispatch routine, level 3 cache (L3$), a highly banked shared local memory structure, fixed-function logic for atomics and barriers, and a number of fixed-function units for various media capabilities. The Global Thread Dispatcher (GTD) is responsible for load balancing thread distribution across the entire device. The global thread dispatcher works in concert with local thread dispatchers in each subslice.
Execution Unit (EU)
The Execution Units (EUs) are the programmable shader units. Each one is an independent computational unit used for execution of 3D shaders, media, and GPGPU kernels. Internally, each unit is hardware multi-threaded and capable of executing multi-issue SIMD operations. Execution is multi-issue per clock to pipelines capable of integer, single and double precision floating point operations, SIMD branch capability, logical operations, transcendental operations, and other miscellaneous operations. Communication between the EUs and the support units (shared function units handling operations such as texture sampling or scatter/gather loads and stores) is done via programmatically constructed messages. Dependency hardware allows threads to sleep until the requested data is returned from those units.
Shared Functions are hardware units that provide a set of specialized supplemental functionality for the EUs. As their name implies, they implement functions with insufficient demand to justify the cost of being implemented in the individual EUs. Functionality in these units is shared among the EUs in the subslice. Communication between the EUs and the Shared Functions is done via lightweight message passing. A message is a small self-contained packet of information created by a kernel and directed to a specific shared function. EU threads awaiting the return of a message from a Shared Function unit go into temporary sleep.
Each Execution Unit supports 7 hardware threads. Each thread has 128 SIMD-8 32-bit registers in a General-Purpose Register File (GRF) and supporting architecture-specific registers (ARF). The EU can co-issue to four instruction processing units: two FPUs, a branch unit, and a message send unit.
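The register file figures above imply a fixed amount of GRF storage per thread and per EU. The short calculation below works that out from the stated numbers (7 threads, 128 registers, SIMD-8, 32-bit lanes); the constant and function names are illustrative only.

```python
# Figures taken from the paragraph above; names are illustrative.
THREADS_PER_EU = 7
REGISTERS_PER_THREAD = 128
LANES_PER_REGISTER = 8   # SIMD-8
BYTES_PER_LANE = 4       # 32-bit elements


def grf_bytes_per_thread():
    """GRF storage owned by a single hardware thread, in bytes."""
    return REGISTERS_PER_THREAD * LANES_PER_REGISTER * BYTES_PER_LANE


def grf_bytes_per_eu():
    """GRF storage across all hardware threads of one EU, in bytes."""
    return THREADS_PER_EU * grf_bytes_per_thread()


if __name__ == "__main__":
    print(grf_bytes_per_thread() // 1024, "KiB per thread")  # 4 KiB
    print(grf_bytes_per_eu() // 1024, "KiB per EU")          # 28 KiB
```

Each register is therefore 32 bytes wide, each thread owns 4 KiB of GRF, and one EU holds 28 KiB in total.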
Preemption Granularity
Preemption in Gen9 was improved over Gen8 in a number of ways. Preemption is important for multi-tasking systems and especially for responsiveness (i.e., the ability to stop and start operations quickly with minimal latency for the end user). In Broadwell (Gen8), Intel added the ability to stop 3D workloads at the object level, such as on a triangle boundary (i.e., the beginning of a triangle, between two triangles, or between two lines or points), and to restore those operations afterward. In Gen9, Intel added the ability to stop execution units on an instruction boundary and restore them later; previously such preemption was only possible at the boundary of a kernel, i.e., the entire kernel execution had to complete before preemption was possible. For compute workloads, Gen9 thus moved the preemption granularity from thread-group (complete kernel execution) to mid-thread (instruction boundary):
Example of responsiveness (Source: IDF15)
Application | Thread-Group Preemption | Mid-Thread Preemption | ||
---|---|---|---|---|
U Series | Y Series | U Series | Y Series | |
Adobe Photoshop | 4-6 ms | 17-22 ms | 300 µs | 800 µs |
Sample App1 | 200-500 ms | 200-500 ms | 300 µs | 280-320 µs |
Sample App2 | 17 ms | 24 ms | 240 µs | 200-430 µs |
Scalability
Gen9 can scale from 1 to 3 slices producing SKUs ranging from 12 to 72 execution units (note that the 12 EUs are formed from half a slice effectively).
GT1 (ULP)
GT1 is the most compact configuration, offering two benefits: reduced cost and reduced power. GT1 is made of 1 slice containing 2 subslices with 6 EUs/subslice for a total of 12 EUs. With the scale-down, GT1 changes the EU:sampler ratio to 6:1. Note that this retains the same ratio of 12 texels/clock and 8 pixels/clock at the backend. This configuration is better suited for some of the low-power workloads (e.g. ASTC-LDR+HDR, ETC1/2 compression). Note that the software stack remains unchanged compared to the larger models.
GT1.5
GT1.5 offers 3 subslices of 6 EUs each for a total of 18 EUs.
GT2
GT2 is the standard configuration consisting of 1 slice with 3 subslices and 8 EU/subslice for a total of 24 EUs.
GT3
GT3 consists of 2 slices with 3 subslices in each and 8 EU/subslice for a total of 48 EUs.
Halo (GT4)
Codename Halo (GT4) is the most complex configuration, offering the highest execution unit count. Halo incorporates 3 slices with 3 subslices/slice and 8 EU/subslice for a total of 72 EUs.
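The EU totals quoted for each tier follow directly from multiplying slices, subslices per slice, and EUs per subslice. The sketch below reproduces that arithmetic; the `CONFIGS` dictionary and the `eu_count` helper are illustrative names, with the figures taken from the tier descriptions above.

```python
# (slices, subslices per slice, EUs per subslice) for each tier,
# as described in the Scalability section. Names are illustrative.
CONFIGS = {
    "GT1": (1, 2, 6),
    "GT1.5": (1, 3, 6),
    "GT2": (1, 3, 8),
    "GT3": (2, 3, 8),
    "GT4": (3, 3, 8),
}


def eu_count(slices, subslices_per_slice, eus_per_subslice):
    """Total execution units for a slice/subslice/EU configuration."""
    return slices * subslices_per_slice * eus_per_subslice


if __name__ == "__main__":
    for tier, shape in CONFIGS.items():
        print(tier, shape, "->", eu_count(*shape), "EUs")
```

This yields 12, 18, 24, 48, and 72 EUs for GT1 through GT4 respectively.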
Configuration
Configuration Attribute (Source: Intel's Programmer's Ref Manual) | ||||||
---|---|---|---|---|---|---|
Attribute | GT1F (1x2x6) | GT1.5F (1x3x6) | GT2 (1x3x8) | GT3 (2x3x8) | GT4 (3x3x8) |
Global Attributes | ||||||
Slice count | 1 | 1 | 1 | 2 | 3 | |
Subslice Count | 2 | 3 | 3 | 6 | 9 | |
EU/Subslice | 6 | 6 | 8 | 8 | 8 | |
EU count (total) | 12 | 18 | 24 | 48 | 72 | |
Thread Count | 7 | 7 | 7 | 7 | 7 | |
Thread Count (Total) | 84 | 126 | 161 / 168 | 329 / 336 | 497 / 504 | |
FLOPs/Clk - Half Precision, MAD (peak) | 384 | 576 | 736 / 768 | 1504 / 1536 | 2272 / 2304 | |
FLOPs/Clk - Single Precision, MAD (peak) | 192 | 288 | 368 / 384 | 752 / 768 | 1136 / 1152 | |
FLOPs/Clk - Double Precision, MAD (peak) | 48 | 72 | 92 / 96 | 188 / 192 | 284 / 288 | |
Unslice clocking (coupled/decoupled from Cr slice) | coupled | coupled | coupled | coupled | coupled | |
GTI / Ring Interfaces | 1 | 1 | 1 | 1 | 1 | |
GTI bandwidth (bytes/unslice-clk) | 64: R 64: W | 64: R 64: W | 64: R 64: W | 64: R 64: W | 64: R 64: W |
eDRAM Support | N/A | N/A | N/A | 0, 64 MiB | 0, 128 MiB | |
Graphics Virtual Address Range | 48 bit | 48 bit | 48 bit | 48 bit | 48 bit | |
Graphics Physical Address Range | 39 bit | 39 bit | 39 bit | 39 bit | 39 bit | |
Caches & Dedicated Memories | ||||||
L3 Cache, total size (bytes) | 384K | 768K | 768K | 1536K | 2304K | |
L3 Cache, bank count | 2 | 4 | 4 | 8 | 12 | |
L3 Cache, bandwidth (bytes/clk) | 2x 64: R 2x 64: W | 4x 64: R 4x 64: W | 4x 64: R 4x 64: W | 8x 64: R 8x 64: W | 12x 64: R 12x 64: W |
L3 Cache, D$ Size (Kbytes) | 192K - 256K | 512K | 512K | 1024K | 1536K | |
URB Size (kbytes) | 128K - 192K | 384K | 384K | 768K | 1008K | |
SLM Size (kbytes) | 0, 128K | 0, 192K | 0, 192K | 0, 384K | 0, 576K | |
LLC/L4 size (bytes) | ~2MiB/CPU core | ~2MiB/CPU core | ~2MiB/CPU core | ~2MiB/CPU core | ~2MiB/CPU core | |
Instruction Cache (IC, bytes) | 2x 48K | 3x 48K | 3x 48K | 6x 48K | 9x 48K | |
Color Cache (RCC, bytes) | 24K | 24K | 24K | 2x 24K | 3x 24K | |
MSC Cache (MSC, bytes) | 16K | 16K | 16K | 2x 16K | 3x 16K | |
HiZ Cache (HZC, bytes) | 12K | 12K | 12K | 2x 12K | 2x 12K | |
Z Cache (RCZ, bytes) | 32K | 32K | 32K | 2x 32K | 3x 32K | |
Stencil Cache (STC, bytes) | 8K | 8K | 8K | 2x 8K | 3x 8K | |
Instruction Issue Rates | ||||||
FMAD, SP (ops/EU/clk) | 8 | 8 | 8 | 8 | 8 | |
FMUL, SP (ops/EU/clk) | 8 | 8 | 8 | 8 | 8 | |
FADD, SP (ops/EU/clk) | 8 | 8 | 8 | 8 | 8 | |
MIN,MAX, SP (ops/EU/clk) | 8 | 8 | 8 | 8 | 8 | |
CMP, SP (ops/EU/clk) | 8 | 8 | 8 | 8 | 8 | |
INV, SP (ops/EU/clk) | 2 | 2 | 2 | 2 | 2 | |
SQRT, SP (ops/EU/clk) | 2 | 2 | 2 | 2 | 2 | |
RSQRT, SP (ops/EU/clk) | 2 | 2 | 2 | 2 | 2 | |
LOG, SP (ops/EU/clk) | 2 | 2 | 2 | 2 | 2 | |
EXP, SP (ops/EU/clk) | 2 | 2 | 2 | 2 | 2 | |
POW, SP (ops/EU/clk) | 1 | 1 | 1 | 1 | 1 | |
IDIV, SP (ops/EU/clk) | 1-6 | 1-6 | 1-6 | 1-6 | 1-6 | |
TRIG, SP (ops/EU/clk) | 2 | 2 | 2 | 2 | 2 | |
FDIV, SP (ops/EU/clk) | 1 | 1 | 1 | 1 | 1 | |
Load/Store | ||||||
Data Ports (HDC) | 2 | 3 | 3 | 6 | 9 | |
L3 Load/Store (bytes/clk) | 2x 64 | 3x 64 | 3x 64 | 6x 64 | 9x 64 | |
SLM Load/Store (bytes/clk) | 2x 64 | 3x 64 | 3x 64 | 6x 64 | 9x 64 | |
Atomic Inc, 32b - sequential addresses (bytes/clk) | 2x 64 | 3x 64 | 3x 64 | 6x 64 | 9x 64 | |
Atomic Inc, 32b - same address (bytes/clk) | 2x 4 | 3x 4 | 3x 4 | 6x 4 | 9x 4 | |
Atomic CmpWr, 32b - sequential addresses (bytes/clk) | 2x 32 | 3x 32 | 3x 32 | 6x 32 | 9x 32 | |
Atomic CmpWr, 32b - same address (bytes/clk) | 2x 4 | 3x 4 | 3x 4 | 6x 4 | 9x 4 | |
3D Attributes | ||||||
Geometry pipes | 1 | 1 | 1 | 1 | 1 | |
Samplers (3D) | 2 | 3 | 3 | 6 | 9 | |
Texel Rate, point, 32b (tex/clk) | 8 | 12 | 12 | 24 | 36 | |
Texel Rate, point, 64b (tex/clk) | 8 | 12 | 12 | 24 | 36 | |
Texel Rate, point, 128b (tex/clk) | 8 | 12 | 12 | 24 | 36 | |
Texel Rate, bilinear, 32b (tex/clk) | 8 | 12 | 12 | 24 | 36 | |
Texel Rate, bilinear, 64b (tex/clk) | 8 | 12 | 12 | 24 | 36 | |
Texel Rate, bilinear, 128b (tex/clk) | 2 | 3 | 3 | 6 | 9 | |
Texel Rate, trilinear, 32b (tex/clk) | 4 | 6 | 6 | 12 | 18 | |
Texel Rate, trilinear, 64b (tex/clk) | 2 | 3 | 3 | 6 | 9 | |
Texel Rate, trilinear, 128b (tex/clk) | 1 | 1.5 | 1.5 | 3 | 4.5 | |
Texel Rate, aniso 2x, 32b (tex/clk) | 2 | 3 | 3 | 6 | 9 | |
Texel Rate, aniso 4x, 32b (tex/clk) | 1 | 1.5 | 1.5 | 3 | 4.5 | |
Texel Rate, aniso 8x, 32b (tex/clk) | 0.5 | 0.75 | 0.75 | 1.5 | 2.25 | |
Texel Rate, aniso 16x, 32b (tex/clk) | 0.25 | 0.375 | 0.375 | 0.75 | 1.125 | |
HiZ Rate, (ppc) | 64 | 64 | 64 | 2x 64 | 3x 64 | |
IZ Rate, (ppc) | 16 | 16 | 16 | 2x 16 | 3x 16 | |
Stencil Rate (ppc) | 64 | 64 | 64 | 2x 64 | 3x 64 | |
Pixel Rate, fill, 32bpp (pix/clk, RCC hit) | 8 | 8 | 8 | 16 | 24 | |
Pixel Rate, blend, 32bpp (p/clk, RCC hit) | 8 | 8 | 8 | 16 | 24 | |
Media Attributes | ||||||
Samplers (media) | 2 | 3 | 3 | 6 | 9 | |
VDBox Instances | 1 | 1 | 1 | 2 | 2 | |
VEBox Instances | 1 | 1 | 1 | 2 | 2 | |
SFC Instances | 1 | 1 | 1 | 1 | 1 |
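The thread and peak FLOPs/clk rows of the table scale linearly with the EU count. Dividing the table's totals by the EU count gives 7 threads and 32 / 16 / 4 FLOPs per clock per EU for half, single, and double precision (a MAD counted as two operations). The sketch below rebuilds those rows from the per-EU figures. The lower of the paired values (e.g. 161 / 168) matches a part with one fewer EU, which is an observation about the numbers rather than something the table states. The `peak_gflops` helper and its clock argument are hypothetical; no clock frequency is given in this article.

```python
# Per-EU figures derived by dividing the table's totals by the EU count.
THREADS_PER_EU = 7
FLOPS_PER_EU_PER_CLK = {"half": 32, "single": 16, "double": 4}


def total_threads(eu_count):
    """Hardware threads across the whole GPU."""
    return eu_count * THREADS_PER_EU


def peak_flops_per_clk(eu_count, precision="single"):
    """Peak MAD FLOPs per clock for the given precision."""
    return eu_count * FLOPS_PER_EU_PER_CLK[precision]


def peak_gflops(eu_count, clock_ghz, precision="single"):
    """Peak GFLOPS at a caller-supplied (hypothetical) clock in GHz."""
    return peak_flops_per_clk(eu_count, precision) * clock_ghz


if __name__ == "__main__":
    for eus in (12, 18, 24, 48, 72):
        print(
            eus,
            total_threads(eus),
            peak_flops_per_clk(eus, "half"),
            peak_flops_per_clk(eus, "single"),
            peak_flops_per_clk(eus, "double"),
        )
    # One EU fewer than GT2 reproduces the lower paired values in the table.
    print(total_threads(23), peak_flops_per_clk(23, "single"))
```

For 24 EUs this gives 168 threads and 768 / 384 / 96 FLOPs/clk, while 23 EUs gives 161 threads and 368 single-precision FLOPs/clk, in agreement with the GT2 column.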