Difference between revisions of "intel/microarchitectures/gen9"

	Gen9 µarch
	General Info
Arch Type	GPU
Designer	Intel
Manufacturer	Intel
Introduction	August 5, 2015
Process	14 nm
	Succession
Predecessor	Gen8
Successor	Gen9.5

Latest revision as of 17:41, 1 November 2018

Gen9 (Generation 9) is the microarchitecture for Intel's graphics processing unit utilized by Skylake-based microprocessors. Gen9 is the successor to Gen8 used by Broadwell. The Gen9 microarchitecture is designed separately by Intel and then integrated onto the same Skylake SoC die.

Codenames

Different models support different Graphics Tiers (GT), which provide different levels of performance. Some models also include an additional eDRAM side cache.

Code Name	Description
GT1	Contains 1 slice with 12 execution units.
GT2	Contains 1 slice with 24 execution units.
GT3	Contains 2 slices with 48 execution units.
GT3e	Contains 2 slices with 48 execution units. Has an additional eDRAM side cache.
Halo (GT4e)	Contains 3 slices with 72 execution units. Has an additional eDRAM side cache.

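Models

Gen9 IGP models and supported standards
Name	Execution Units	Tier	Series	eDRAM	Vulkan (Windows)	Vulkan (Linux)	Direct3D (Windows)	Direct3D (Linux)	HLSL	OpenGL (Windows)	OpenGL (Linux)	OpenCL (Windows)	OpenCL (Linux)	Metal (macOS)
HD Graphics (Skylake)	12	GT1	Y	-	1.0	1.0	12	N/A	5.1	4.5	4.5	2.0	2.0	2.1
HD Graphics 510	12	GT1	U, S	-	1.0	1.0	12	N/A	5.1	4.5	4.5	2.0	2.0	2.1
HD Graphics 515	24	GT2	Y	-	1.0	1.0	12	N/A	5.1	4.5	4.5	2.0	2.0	2.1
HD Graphics 520	24	GT2	U	-	1.0	1.0	12	N/A	5.1	4.5	4.5	2.0	2.0	2.1
HD Graphics 530	24	GT2	H, S	-	1.0	1.0	12	N/A	5.1	4.5	4.5	2.0	2.0	2.1
HD Graphics P530	24	GT2	H	-	1.0	1.0	12	N/A	5.1	4.5	4.5	2.0	2.0	2.1
Iris Graphics 540	48	GT3e	U	64 MiB	1.0	1.0	12	N/A	5.1	4.5	4.5	2.0	2.0	2.1
Iris Graphics 550	48	GT3e	U	64 MiB	1.0	1.0	12	N/A	5.1	4.5	4.5	2.0	2.0	2.1
Iris Pro Graphics P555	48	GT3e	H	128 MiB	1.0	1.0	12	N/A	5.1	4.5	4.5	2.0	2.0
Iris Pro Graphics 580	72	GT4e	H	128 MiB	1.0	1.0	12	N/A	5.1	4.5	4.5	2.0	2.0
Iris Pro Graphics P580	72	GT4e	H	128 MiB	1.0	1.0	12	N/A	5.1	4.5	4.5	2.0	2.0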

Model	SKU	EUs	CPU Stepping [devID 1]	GT Stepping [devID 2]	Device2 ID [devID 3]	GT Device2 Revision ID [devID 4]
HD Graphics 510	SKL 2+1F DT	12	S0	G0	0x1902	0x6
HD Graphics 510	SKL U - ULT 2+1F	12	D0	H0	0x1906	0x7
HD Graphics 510	SKL - H 4+1F	12	D0	H0	0x190B	0x7
HD Graphics 515	SKL Y - ULX 2+2	24	D0	H0	0x191E	0x7
HD Graphics 520	SKL U - ULT 2+2	24	D0	H0	0x1916	0x7
HD Graphics 530	SKL 4+2 DT	24	R0	G0	0x191B	0x6
HD Graphics 530	SKL 2+2 DT	24	S0	G0	0x1912	0x6
HD Graphics 530	SKL 4+2 DT	24	R0	G0	0x1912	0x6
HD Graphics P530	SKL WKS 4+2	24	R1	G1	0x191D	0x6
Iris Graphics 540	SKL U - ULT 2+3E (15W)	48	K1	L1	0x1926	0xA
Iris Graphics 550	SKL U - ULT 2+3E (28W)	48	K1	L1	0x1927	0xA
HD Graphics 535	SKL U - ULT 2+3	48	K1	L1	0x1923	0xA
Iris Pro Graphics P555	SKL Media Server 4+3FE	48	N0	J0	0x192D	0x9
Iris Pro Graphics 580	SKL H Halo 4+4E	72	N0	J0	0x193B	0x9
Iris Pro Graphics P580	SKL WKS 4+4E	72	N0	J0	0x193D	0x9

[devID 1] The CPU Stepping is the actual CPU design stepping.
[devID 2] The GT Stepping refers to the GT design stepping.
[devID 3] The Device2 ID is the PCI device ID that identifies the GT SKU for driver software.
[devID 4] The GT Device2 Revision ID identifies the silicon stepping for driver software.
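
Since the Device2 ID is what driver software uses to identify the GT SKU, the table above can be transcribed into a simple lookup. The following is only a minimal sketch: the names `GEN9_DEVICE_IDS`, `gen9_model`, and `gen9_model_from_hex` are hypothetical and do not belong to any Intel or driver API, and the mapping contains only the IDs listed in the table.

```python
from typing import Optional

# PCI device IDs transcribed from the table above (hypothetical helper table,
# not an official or exhaustive list).
GEN9_DEVICE_IDS = {
    0x1902: "HD Graphics 510",
    0x1906: "HD Graphics 510",
    0x190B: "HD Graphics 510",
    0x191E: "HD Graphics 515",
    0x1916: "HD Graphics 520",
    0x191B: "HD Graphics 530",
    0x1912: "HD Graphics 530",
    0x191D: "HD Graphics P530",
    0x1926: "Iris Graphics 540",
    0x1927: "Iris Graphics 550",
    0x1923: "HD Graphics 535",
    0x192D: "Iris Pro Graphics P555",
    0x193B: "Iris Pro Graphics 580",
    0x193D: "Iris Pro Graphics P580",
}


def gen9_model(device_id: int) -> Optional[str]:
    """Return the model name for a PCI device ID, or None if it is not listed."""
    return GEN9_DEVICE_IDS.get(device_id)


def gen9_model_from_hex(text: str) -> Optional[str]:
    """Same lookup, starting from a hex string such as "0x191b"."""
    return gen9_model(int(text, 16))


if __name__ == "__main__":
    print(gen9_model(0x191B))
    print(gen9_model_from_hex("0x193d"))
```

A dictionary keyed by the integer ID keeps the lookup independent of how the ID was formatted (upper or lower case hex).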

Performance

Frequency	Peak Performance
	Half Precision				Single Precision				Double Precision
Models	510	515, 520, 530, P530	540, 550, P555	580, P580	510	515, 520, 530, P530	540, 550, P555	580, P580	510	515, 520, 530, P530	540, 550, P555	580, P580
Tiers	GT1	GT2	GT3e	GT4e	GT1	GT2	GT3e	GT4e	GT1	GT2	GT3e	GT4e
Ref (FLOP/clk)	384/cycle	768/cycle	1536/cycle	2304/cycle	192/cycle	384/cycle	768/cycle	1152/cycle	48/cycle	96/cycle	192/cycle	288/cycle
Base (300 MHz)	115.2 GFLOPS	230.4 GFLOPS	460.8 GFLOPS	691.2 GFLOPS	57.6 GFLOPS	115.2 GFLOPS	230.4 GFLOPS	345.6 GFLOPS	14.4 GFLOPS	28.8 GFLOPS	57.6 GFLOPS	86.4 GFLOPS
Base (350 MHz)	134.4 GFLOPS	268.8 GFLOPS	537.6 GFLOPS	806.4 GFLOPS	67.2 GFLOPS	134.4 GFLOPS	268.8 GFLOPS	403.2 GFLOPS	16.8 GFLOPS	33.6 GFLOPS	67.2 GFLOPS	100.8 GFLOPS
Base (400 MHz)	153.6 GFLOPS	307.2 GFLOPS	614.4 GFLOPS	921.6 GFLOPS	76.8 GFLOPS	153.6 GFLOPS	307.2 GFLOPS	460.8 GFLOPS	19.2 GFLOPS	38.4 GFLOPS	76.8 GFLOPS	115.2 GFLOPS
Base (650 MHz)	249.6 GFLOPS	499.2 GFLOPS	998.4 GFLOPS	1497.6 GFLOPS	124.8 GFLOPS	249.6 GFLOPS	499.2 GFLOPS	748.8 GFLOPS	31.2 GFLOPS	62.4 GFLOPS	124.8 GFLOPS	187.2 GFLOPS
Boost (800 MHz)	307.2 GFLOPS	614.4 GFLOPS	1228.8 GFLOPS	1843.2 GFLOPS	153.6 GFLOPS	307.2 GFLOPS	614.4 GFLOPS	921.6 GFLOPS	38.4 GFLOPS	76.8 GFLOPS	153.6 GFLOPS	230.4 GFLOPS
Boost (850 MHz)	326.4 GFLOPS	652.8 GFLOPS	1305.6 GFLOPS	1958.4 GFLOPS	163.2 GFLOPS	326.4 GFLOPS	652.8 GFLOPS	979.2 GFLOPS	40.8 GFLOPS	81.6 GFLOPS	163.2 GFLOPS	244.8 GFLOPS
Boost (900 MHz)	345.6 GFLOPS	691.2 GFLOPS	1382.4 GFLOPS	2073.6 GFLOPS	172.8 GFLOPS	345.6 GFLOPS	691.2 GFLOPS	1036.8 GFLOPS	43.2 GFLOPS	86.4 GFLOPS	172.8 GFLOPS	259.2 GFLOPS
Boost (950 MHz)	364.8 GFLOPS	729.6 GFLOPS	1459.2 GFLOPS	2188.8 GFLOPS	182.4 GFLOPS	364.8 GFLOPS	729.6 GFLOPS	1094.4 GFLOPS	45.6 GFLOPS	91.2 GFLOPS	182.4 GFLOPS	273.6 GFLOPS
Boost (1,000 MHz)	384 GFLOPS	768 GFLOPS	1536 GFLOPS	2304 GFLOPS	192 GFLOPS	384 GFLOPS	768 GFLOPS	1152 GFLOPS	48 GFLOPS	96 GFLOPS	192 GFLOPS	288 GFLOPS
Boost (1,050 MHz)	403.2 GFLOPS	806.4 GFLOPS	1612.8 GFLOPS	2419.2 GFLOPS	201.6 GFLOPS	403.2 GFLOPS	806.4 GFLOPS	1209.6 GFLOPS	50.4 GFLOPS	100.8 GFLOPS	201.6 GFLOPS	302.4 GFLOPS
Boost (1,100 MHz)	422.4 GFLOPS	844.8 GFLOPS	1689.6 GFLOPS	2534.4 GFLOPS	211.2 GFLOPS	422.4 GFLOPS	844.8 GFLOPS	1267.2 GFLOPS	52.8 GFLOPS	105.6 GFLOPS	211.2 GFLOPS	316.8 GFLOPS
Boost (1,150 MHz)	441.6 GFLOPS	883.2 GFLOPS	1766.4 GFLOPS	2649.6 GFLOPS	220.8 GFLOPS	441.6 GFLOPS	883.2 GFLOPS	1324.8 GFLOPS	55.2 GFLOPS	110.4 GFLOPS	220.8 GFLOPS	331.2 GFLOPS
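
Each entry in the table is the reference FLOP/clk figure multiplied by the clock frequency. A minimal sketch of that arithmetic follows; the per-clock figures are copied from the "Ref (FLOP/clk)" row, while the names `FLOP_PER_CLK` and `peak_gflops` are made up for this illustration.

```python
# Peak FLOP per clock, copied from the "Ref (FLOP/clk)" row of the table above.
FLOP_PER_CLK = {
    "half":   {"GT1": 384, "GT2": 768, "GT3e": 1536, "GT4e": 2304},
    "single": {"GT1": 192, "GT2": 384, "GT3e": 768,  "GT4e": 1152},
    "double": {"GT1": 48,  "GT2": 96,  "GT3e": 192,  "GT4e": 288},
}


def peak_gflops(precision: str, tier: str, clock_mhz: int) -> float:
    """Peak GFLOPS = FLOP per clock * clock in GHz."""
    return FLOP_PER_CLK[precision][tier] * clock_mhz / 1000


if __name__ == "__main__":
    # GT2 (e.g. HD Graphics 530), single precision, 1,150 MHz boost clock.
    print(peak_gflops("single", "GT2", 1150))
```

These are theoretical peaks that assume every execution unit issues a multiply-add on every clock; sustained throughput depends on the workload.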

Hardware Accelerated Video

Skylake (Gen9) Hardware Accelerated Video Capabilities
Codec	Encode Profiles	Encode Levels	Encode Max Resolution	Decode Profiles	Decode Levels	Decode Max Resolution
MPEG-2 (H.262)	Main	High	1080p (FHD)	Main	Main, High	1080p (FHD)
MPEG-4 AVC (H.264)	High, Main	5.1	2160p (4K)	Main, High, SHP, MHP	5.1	2160p (4K)
JPEG/MJPEG	Baseline	-	16k x 16k	Baseline	Unified	16k x 16k
HEVC (H.265)	Main	5.1	2160p (4K)	Main, Main 10	5.1	2160p (4K)
VC-1	✘			Advanced, Main, Simple	3, High	3840x3840
VP8	Unified	Unified	-	0	Unified	1080p
VP9	✘			0	Unified	2160p (4K)

Process Technology

Main article: Broadwell § Process Technology

Gen9 is part of the Skylake SoC die, which uses the same 14 nm process as the Broadwell microarchitecture.

Architecture

Gen9 represents a large departure from Gen8 and previous architectures.

Key changes from Gen8

Architecture is drastically different
- Gen9 is composed of 3 truly independent major components: Display block, Unslice, and the Slice.
- Shared Virtual Memory (SVM) improvements
  - Improved cache coherency performance
Unslice
- Now sits on its own power gating/clock domain
  - Capable of running at higher speeds if the situation allows (irrespective of slice clock)
  - Allows pure fixed-function media workloads to run with the slices turned off
- Higher throughput
- Tessellator AutoStrip
- Fixed function video encoder in the Quick Sync engine
- Codec (decode and encode) support for HEVC, VP8, and MJPEG
- RAW imaging capabilities
Slice
- Floating point atomics (min/max/cmpexch)
- L3 Cache
  - Increased to 768 KiB/slice (up from 576 KiB/slice)
  - Request queue size was increased
Subslice
- Adaptive scalable texture compression (ASTC)
- 16x multi-sample anti-aliasing (MSAA)
- Post depth test coverage mask
- Multi-plane overlays
- Texture samplers now natively support the NV12 YUV format
- Min/max texture filtering
- Preemption of execution is now supported at the thread level
- Round-robin scheduling of threads within an execution unit
- New native support for the 32-bit float atomic operations min, max, and compare/exchange
- 16-bit floating-point capability is improved with native support for denormals and gradual underflow
L4$
- The eDRAM is now a side cache instead of an L4$ like it was in Gen8. (See Skylake §eDRAM architectural changes for the reason)
- Side-cache eDRAM was moved into the system agent adjacent to the display controller

Block Diagram

Entire SoC Overview

Gen9

This block diagram shows the most common configuration, GT2 with 24 execution units.

Individual Core

See Skylake#Individual_Core.

Unslice

The Unslice is one of Gen9's major components. It is responsible for the fixed-function geometry capabilities and the fixed-function media capabilities, and it provides the interface to the memory fabric. One of the big changes in Gen9 is that the Unslice now sits on its own power/clock domain. This allows the Unslice to operate at its own speed, providing higher on-demand performance when desired. It also has a number of other benefits, such as the ability to turn off one or more slices when they are not needed, for example when only fixed-function media is in use. Additionally, the Unslice can run at a higher clock while the slice runs at a lower clock when the scenario demands it (such as when fixed-function geometry or memory demands are high).

The Command Stream (CS) unit manages the flow of execution for the FF Pipeline (3D Pipeline) and the Media pipelines. The CS unit switches between pipelines and forwards command streams to the different stages. Data in the pipeline is passed to the next unit using a messaging network. Messages can be passed directly through registers or by using the URB. The Command Stream also manages the allocation of the URB and supports the Constant URB Entry (CURB) function. The Unified Return Buffer (URB) is globally shared and explicitly addressed. The pipeline's fixed-function blocks have both read and write access to the URB; the shader cores additionally have write access to the URB.

The media general-purpose pipeline consists of two fixed-function units: the Video Front End (VFE) and the Thread Spawner (TS). The VFE unit handles the interfacing with the Command Streamer, writes thread payload data into the Unified Return Buffer, and prepares threads to be dispatched through the TS unit. The VFE unit also contains the hardware Variable Length Decode (VLD) engine for MPEG-2 video decode. The TS unit is primarily responsible for interfacing with the Thread Dispatcher (TD) unit, which is responsible for spawning new root-node parent threads originating from the VFE unit and for spawning child threads (either leaf-node child threads or branch-node parent threads).

Fixed-Function

Multi-Format Codec (MFX)
- HEVC Decode
- Support for HEVC & VP8 in PAK (for encode)
- New fixed function within MFX for real-time AVC encoding usages

Video Quality Engine (VQE)
- 16-bit processing path
- 5x5 spatial denoise filter
- Local Adaptive Contrast Enhancement (LACE)
- Camera processing features to allow high-resolution raw camera processing

Scaler and Format Converter (SFC)
- Allows for inline format conversion and upscaling or downscaling of images
- Can be coupled with the decoder to allow high-quality video processing in the FF units in the unslice without using the media sampler in the slices themselves

3D Pipeline Stages

Pipeline Stage	Functions Performed
Command Stream (CS)	The Command Stream stage is responsible for managing the 3D pipeline and passing commands down the pipeline. In addition, the CS unit reads “constant data” from memory buffers and places it in the URB. Note that the CS stage is shared between the 3D, GPGPU and Media pipelines.
Vertex Fetch (VF)	The Vertex Fetch stage, in response to 3D Primitive Processing commands, is responsible for reading vertex data from memory, reformatting it, and writing the results into Vertex URB Entries. It then outputs primitives by passing references to the VUEs down the pipeline.
Vertex Shader (VS)	The Vertex Shader stage is responsible for processing (shading) incoming vertices by passing them to VS threads.
Hull Shader (HS)	The Hull Shader is responsible for processing (shading) incoming patch primitives as part of the tessellation process.
Tessellation Engine (TE)	The Tessellation Engine is responsible for using tessellation factors (computed in the HS stage) to tessellate U,V parametric domains into domain point topologies.
Domain Shader (DS)	The Domain Shader stage is responsible for processing (shading) the domain points (generated by the TE stage) into corresponding vertices.
Geometry Shader (GS)	The Geometry Shader stage is responsible for processing incoming objects by passing each object’s vertices to a GS thread.
Stream Output Logic (SOL)	The Stream Output Logic is responsible for outputting incoming object vertices into Stream Out Buffers in memory.
Clipper (CLIP)	The Clipper stage performs Clip Tests on incoming objects and clips objects if required. Objects are clipped using fixed-function hardware.
Strip/Fan (SF)	The Strip/Fan stage performs object setup. Object setup uses fixed-function hardware.
Windower/Masker (WM)	The Windower/Masker performs object rasterization and determines visibility coverage.

Slice

A slice is a cluster of subslices. For most Gen9 configurations, 3 subslices are aggregated into 1 slice for a total of 24 execution units (some low-end models have fewer). The slice incorporates the thread dispatch routine, level 3 cache (L3$), a highly banked shared local memory structure, fixed-function logic for atomics and barriers, and a number of fixed-function units for various media capabilities. The Global Thread Dispatcher (GTD) is responsible for load balancing thread distribution across the entire device. It works in concert with the local thread dispatchers in each subslice.

Execution Unit (EU)

The Execution Units (EUs) are the programmable shader units. Each one is an independent computational unit used for execution of 3D shaders, media, and GPGPU kernels. Internally, each unit is hardware multi-threaded and capable of executing multi-issue SIMD operations. Execution is multi-issue per clock to pipelines capable of integer, single- and double-precision floating-point operations, SIMD branch capability, logical operations, transcendental operations, and other miscellaneous operations. Communication between the EUs and the support units (shared function units that handle operations such as texture sampling or scatter/gather loads and stores) is done via programmatically constructed messages. Dependency hardware allows threads to sleep until the requested data is returned from those units.

Shared Functions are hardware units that provide a set of specialized supplemental functionality for the EUs. As their name implies, they implement functions with insufficient demand to justify the cost of being implemented in the individual EUs. Functionality in these units is shared among the EUs in the subslice. Communication between the EUs and a Shared Function is done via lightweight message passing. A message is a small self-contained packet of information created by a kernel and directed to a specific shared function. EU threads awaiting the return of a message from a Shared Function unit go into temporary sleep.

Each Execution Unit supports 7 hardware threads. Each thread has 128 SIMD-8 32-bit registers in a General-Purpose Register File (GRF), plus supporting architecture-specific registers (ARF). The EU can co-issue to four instruction processing units: two FPUs, a branch unit, and a message send unit.
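
The register-file figures above imply a fixed GRF capacity per thread and per EU. The short calculation below is only a sketch of that arithmetic; the constant names are illustrative, and it assumes each SIMD-8 32-bit register holds 8 lanes of 4 bytes.

```python
THREADS_PER_EU = 7          # hardware threads per EU (from the text above)
REGISTERS_PER_THREAD = 128  # GRF registers per thread (from the text above)
LANES_PER_REGISTER = 8      # SIMD-8
BYTES_PER_LANE = 4          # 32-bit elements

# One register is 8 lanes * 4 bytes = 32 bytes.
REGISTER_BYTES = LANES_PER_REGISTER * BYTES_PER_LANE
# GRF capacity available to a single thread.
GRF_BYTES_PER_THREAD = REGISTERS_PER_THREAD * REGISTER_BYTES
# GRF capacity across all threads of one EU.
GRF_BYTES_PER_EU = THREADS_PER_EU * GRF_BYTES_PER_THREAD

if __name__ == "__main__":
    print(f"GRF per thread: {GRF_BYTES_PER_THREAD // 1024} KiB")
    print(f"GRF per EU:     {GRF_BYTES_PER_EU // 1024} KiB")
```

Under these assumptions each thread has 4 KiB of general-purpose registers and each EU has 28 KiB in total.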

Preemption Granularity

Preemption in Gen9 was improved over Gen8 in a number of ways. Preemption is important for multi-tasking systems and especially for responsiveness (i.e., the ability to stop and start operations quickly with minimal latency for the end user). In Broadwell (Gen8), Intel added the ability to stop 3D workloads at the object level, such as on a triangle boundary (i.e., at the beginning of a triangle, between two triangles, or between two lines or points), and to preempt and later restore those operations. In Gen9, Intel added the ability to stop execution units on an instruction boundary and restore them later. Previously, such preemption was only possible at a kernel boundary, i.e., the entire kernel had to finish executing before preemption was possible. For compute workloads, Gen9 therefore moves from thread-group preemption (complete kernel execution) to mid-thread preemption (instruction boundary).

Example of responsiveness (Source: IDF15)

Application	Thread-Group Preemption (U Series)	Thread-Group Preemption (Y Series)	Mid-Thread Preemption (U Series)	Mid-Thread Preemption (Y Series)
Adobe Photoshop	4-6 ms	17-22 ms	300 µs	800 µs
Sample App1	200-500 ms	200-500 ms	300 µs	280-320 µs
Sample App2	17 ms	24 ms	240 µs	200-430 µs
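
To put the table in perspective, the latency figures can be turned into improvement ratios. The snippet below is a small sketch of that arithmetic for the Adobe Photoshop row on the U series; the function name `speedup_range` is hypothetical, and all latencies are converted to microseconds.

```python
from typing import Tuple


def speedup_range(before_us: Tuple[float, float],
                  after_us: Tuple[float, float]) -> Tuple[float, float]:
    """Return the (smallest, largest) ratio of old latency to new latency.

    Each argument is a (min, max) latency range in microseconds.
    """
    smallest = before_us[0] / after_us[1]  # best old case vs. worst new case
    largest = before_us[1] / after_us[0]   # worst old case vs. best new case
    return smallest, largest


if __name__ == "__main__":
    # Adobe Photoshop, U series: 4-6 ms thread-group vs. 300 us mid-thread.
    low, high = speedup_range((4000, 6000), (300, 300))
    print(f"{low:.1f}x to {high:.1f}x lower preemption latency")
```

For that row the ratio works out to roughly 13x to 20x.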

Display

The display block has a memory interface (supporting high memory bandwidth coming directly to the display sub-system), a front-end that is responsible for sorting and sequencing the requests (as well as handling things such as rotated displays), and display pipes. The display pipes perform input format conversion, multi-plane composition, color conversion, and scaling of the result. The final part of the display block is the port encoders, which convert the input from the display pipes to the appropriate standard (DP/HDMI/eDP). Gen9 made a number of improvements to the display pipes, specifically the ability to consume losslessly compressed surfaces directly without any extra conversion operations. Additionally, the pipes now support render-compressed surfaces, Y-tiled surfaces, and on-the-fly 90/270 degree rotations.

Multiple Display Planes in a Pipe

Three plane sources + background
- All 3 are independent
- Fixed order (highest priority is plane 3, lowest is background)
- Planes can be:
  - YUV video
  - RGB windows/desktop
- Fixed visual priority/blending order
- Color correction of result
- Two 7x5 scalers
  - Bind to individual planes or pipe output
Intended to support various OS features such as
- Microsoft's MPO (Multiplane overlay support)
- Android's SurfaceFlinger

Scalability

Gen9 can scale from 1 to 3 slices, producing SKUs ranging from 12 to 72 execution units (note that the 12-EU configuration is effectively formed from half a slice).

GT1 (ULP)

GT1 is the most compact configuration, offering two benefits: reduced cost and reduced power. GT1 is made of 1 slice containing 2 subslices with 6 EUs per subslice for a total of 12 EUs. With the scale-down, GT1 changes the EU:sampler ratio to 6:1. Note that this retains the same ratio of 12 texels/clock and 8 pixels/clock at the backend. This configuration is better suited to some low-power workloads (e.g. ASTC-LDR+HDR, ETC1/2 compression). Note that the software stack remains unchanged compared to the larger models.

GT1.5

GT1.5 offers 3 subslices of 6 EUs each for a total of 18 EUs.

GT2

GT2 is the standard configuration consisting of 1 slice with 3 subslices and 8 EU/subslice for a total of 24 EUs.

GT3

GT3 consists of 2 slices with 3 subslices in each and 8 EU/subslice for a total of 48 EUs.

Halo (GT4)

Codename Halo (GT4) is the most complex configuration, offering the highest execution unit count. Halo incorporates 3 slices with 3 subslices per slice and 8 EUs per subslice for a total of 72 EUs.
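
The EU totals quoted for each configuration follow directly from slices x subslices per slice x EUs per subslice. The sketch below restates that arithmetic; the `Gen9Config` class and the `CONFIGS` table are illustrative names only, and the thread totals assume 7 hardware threads per EU with every EU enabled.

```python
from dataclasses import dataclass

THREADS_PER_EU = 7  # hardware threads per execution unit


@dataclass(frozen=True)
class Gen9Config:
    slices: int
    subslices_per_slice: int
    eus_per_subslice: int

    @property
    def eu_count(self) -> int:
        """Total EUs = slices * subslices per slice * EUs per subslice."""
        return self.slices * self.subslices_per_slice * self.eus_per_subslice

    @property
    def thread_count(self) -> int:
        """Total hardware threads, assuming every EU is enabled."""
        return self.eu_count * THREADS_PER_EU


# Configurations as described in the Scalability section.
CONFIGS = {
    "GT1": Gen9Config(1, 2, 6),
    "GT1.5": Gen9Config(1, 3, 6),
    "GT2": Gen9Config(1, 3, 8),
    "GT3": Gen9Config(2, 3, 8),
    "GT4": Gen9Config(3, 3, 8),
}

if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(f"{name}: {cfg.eu_count} EUs, {cfg.thread_count} threads")
```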

Configuration

Configuration Attribute (Source: Intel's Programmer's Ref Manual)
Attribute	GT1F (1x2x6)	GT1.5F (1x3x6)	GT2 (1x3x8)	GT3 (2x3x8)	GT4 (3x3x8)
Global Attributes
Slice count	1	1	1	2	3
Subslice Count	2	3	3	6	9
EU/Subslice	6	6	8	8	8
EU count (total)	12	18	24	48	72
Thread Count	7	7	7	7	7
Thread Count (Total)	84	126	161 / 168	329 / 336	497 / 504
FLOPs/Clk - Half Precision, MAD (peak)	384	576	736 / 768	1504 / 1536	2272 / 2304
FLOPs/Clk - Single Precision, MAD (peak)	192	288	368 / 384	752 / 768	1136 / 1152
FLOPs/Clk - Double Precision, MAD (peak)	48	72	92 / 96	188 / 192	284 / 288
Unslice clocking (coupled/decoupled from Cr slice)	coupled	coupled	coupled	coupled	coupled
GTI / Ring Interfaces	1	1	1	1	1
GTI bandwidth (bytes/unslice-clk)	64: R 64: W	64: R 64: W	64: R 64: W	64: R 64: W	64: R 64: W
eDRAM Support	N/A	N/A	N/A	0, 64 MiB	0, 128 MiB
Graphics Virtual Address Range	48 bit	48 bit	48 bit	48 bit	48 bit
Graphics Physical Address Range	39 bit	39 bit	39 bit	39 bit	39 bit
Caches & Dedicated Memories
L3 Cache, total size (bytes)	384K	768K	768K	1536K	2304K
L3 Cache, bank count	2	4	4	8	12
L3 Cache, bandwidth (bytes/clk)	2x 64: R 2x 64: W	4x 64: R 4x 64: W	4x 64: R 4x 64: W	8x 64: R 8x 64: W	12x 64: R 12x 64: W
L3 Cache, D$ Size (Kbytes)	192K - 256K	512K	512K	1024K	1536K
URB Size (kbytes)	128K - 192K	384K	384K	768K	1008K
SLM Size (kbytes)	0, 128K	0, 192K	0, 192K	0, 384K	0, 576K
LLC/L4 size (bytes)	~2MiB/CPU core	~2MiB/CPU core	~2MiB/CPU core	~2MiB/CPU core	~2MiB/CPU core
Instruction Cache (IC, bytes)	2x 48K	3x 48K	3x 48K	6x 48K	9x 48K
Color Cache (RCC, bytes)	24K	24K	24K	2x 24K	3x 24K
MSC Cache (MSC, bytes)	16K	16K	16K	2x 16K	3x 16K
HiZ Cache (HZC, bytes)	12K	12K	12K	2x 12K	2x 12K
Z Cache (RCZ, bytes)	32K	32K	32K	2x 32K	3x 32K
Stencil Cache (STC, bytes)	8K	8K	8K	2x 8K	3x 8K
Instruction Issue Rates
FMAD, SP (ops/EU/clk)	8	8	8	8	8
FMUL, SP (ops/EU/clk)	8	8	8	8	8
FADD, SP (ops/EU/clk)	8	8	8	8	8
MIN,MAX, SP (ops/EU/clk)	8	8	8	8	8
CMP, SP (ops/EU/clk)	8	8	8	8	8
INV, SP (ops/EU/clk)	2	2	2	2	2
SQRT, SP (ops/EU/clk)	2	2	2	2	2
RSQRT, SP (ops/EU/clk)	2	2	2	2	2
LOG, SP (ops/EU/clk)	2	2	2	2	2
EXP, SP (ops/EU/clk)	2	2	2	2	2
POW, SP (ops/EU/clk)	1	1	1	1	1
IDIV, SP (ops/EU/clk)	1-6	1-6	1-6	1-6	1-6
TRIG, SP (ops/EU/clk)	2	2	2	2	2
FDIV, SP (ops/EU/clk)	1	1	1	1	1
Load/Store
Data Ports (HDC)	2	3	3	6	9
L3 Load/Store (bytes/clk)	2x 64	3x 64	3x 64	6x 64	9x 64
SLM Load/Store (bytes/clk)	2x 64	3x 64	3x 64	6x 64	9x 64
Atomic Inc, 32b - sequential addresses (bytes/clk)	2x 64	3x 64	3x 64	6x 64	9x 64
Atomic Inc, 32b - same address (bytes/clk)	2x 4	3x 4	3x 4	6x 4	9x 4
Atomic CmpWr, 32b - sequential addresses (bytes/clk)	2x 32	3x 32	3x 32	6x 32	9x 32
Atomic CmpWr, 32b - same address (bytes/clk)	2x 4	3x 4	3x 4	6x 4	9x 4
3D Attributes
Geometry pipes	1	1	1	1	1
Samplers (3D)	2	3	3	6	9
Texel Rate, point, 32b (tex/clk)	8	12	12	24	36
Texel Rate, point, 64b (tex/clk)	8	12	12	24	36
Texel Rate, point, 128b (tex/clk)	8	12	12	24	36
Texel Rate, bilinear, 32b (tex/clk)	8	12	12	24	36
Texel Rate, bilinear, 64b (tex/clk)	8	12	12	24	36
Texel Rate, bilinear, 128b (tex/clk)	2	3	3	6	9
Texel Rate, trilinear, 32b (tex/clk)	4	6	6	12	18
Texel Rate, trilinear, 64b (tex/clk)	2	3	3	6	9
Texel Rate, trilinear, 128b (tex/clk)	1	1.5	1.5	3	4.5
Texel Rate, aniso 2x, 32b (tex/clk)	2	3	3	6	9
Texel Rate, aniso 4x, 32b (tex/clk)	1	1.5	1.5	3	4.5
Texel Rate, aniso 8x, 32b (tex/clk)	0.5	0.75	0.75	1.5	2.25
Texel Rate, aniso 16x, 32b (tex/clk)	0.25	0.375	0.375	0.75	1.125
HiZ Rate, (ppc)	64	64	64	2x 64	3x 64
IZ Rate, (ppc)	16	16	16	2x 16	3x 16
Stencil Rate (ppc)	64	64	64	2x 64	3x 64
Pixel Rate, fill, 32bpp (pix/clk, RCC hit)	8	8	8	16	24
Pixel Rate, blend, 32bpp (p/clk, RCC hit)	8	8	8	16	24
Media Attributes
Samplers (media)	2	3	3	6	9
VDBox Instances	1	1	1	2	2
VEBox Instances	1	1	1	2	2
SFC Instances	1	1	1	1	1
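
The per-clock figures in the table can be converted into bandwidth at a given frequency. As a rough sketch, the snippet below does this for the "L3 Load/Store (bytes/clk)" row, which lists 64 bytes per data port per clock. The names are hypothetical, and it assumes the data ports run at the clock passed in and that 1 GB is 10^9 bytes.

```python
# Data port (HDC) counts, copied from the "Data Ports (HDC)" row above.
DATA_PORTS = {"GT1F": 2, "GT1.5F": 3, "GT2": 3, "GT3": 6, "GT4": 9}
BYTES_PER_PORT_PER_CLK = 64  # from the "L3 Load/Store (bytes/clk)" row


def l3_load_store_gb_per_s(model: str, clock_mhz: int) -> float:
    """Peak L3 load/store bandwidth in GB/s (decimal) at the given clock."""
    bytes_per_clk = DATA_PORTS[model] * BYTES_PER_PORT_PER_CLK
    # bytes/clk * (MHz * 1e6 clk/s) / 1e9 bytes per GB = bytes/clk * MHz / 1000
    return bytes_per_clk * clock_mhz / 1000


if __name__ == "__main__":
    print(l3_load_store_gb_per_s("GT2", 1000))
```

The same conversion applies to any other bytes/clk row, provided the clock used matches the domain that row is specified in.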

Datasheets

White Paper

The Compute Architecture of Intel Processor Graphics Gen9

Programmer's Reference Manual



@@ Line 1: / Line 1: @@
-{{intel title|Gen9 LP|arch}}
+{{intel title|Gen9|arch}}
 {{microarchitecture
 | atype            = GPU
-| name             = Gen9 LP
+| name             = Gen9
 | designer         = Intel
 | manufacturer     = Intel
@@ Line 10: / Line 10: @@
 | succession       = Yes
-| predecessor      = Gen8 LP
+| predecessor      = Gen8
-| predecessor link = intel/microarchitectures/gen8_lp
+| predecessor link = intel/microarchitectures/gen8
-| successor        = Gen9.5 LP
+| successor        = Gen9.5
-| successor link   = intel/microarchitectures/gen9.5_lp
+| successor link   = intel/microarchitectures/gen9.5
 }}
-'''Gen9 LP''' (''Generation 9 Low Power'') is the [[microarchitecture]] for [[Intel]]'s [[graphics processing unit]] utilized by {{\\|Skylake}}-based microprocessors. Gen9 LP is the successor to {{\\|Gen8 LP}} used by {{\\|Broadwell}}. The Gen9 microarchitecture is designed separately by Intel and then integrated onto the same Skylake SoC die.
+'''Gen9''' (''Generation 9'') is the [[microarchitecture]] for [[Intel]]'s [[graphics processing unit]] utilized by {{\\|Skylake}}-based microprocessors. Gen9 is the successor to {{\\|Gen8}} used by {{\\|Broadwell}}. The Gen9 microarchitecture is designed separately by Intel and then integrated onto the same Skylake SoC die.
 == Codenames ==
+[[File:iris graphics logo.svg|right|200px]][[File:iris pro graphics logo.svg|right|200px]]
 Various models support different Graphics Tiers (GT) which provides different levels of performance. Some models also support an additional [[eDRAM]] side cache.
 {| class="wikitable"
@@ Line 31: / Line 32: @@
 | GT3e || Contains 2 slices with 48 execution units. Has an additional [[eDRAM]] side cache.
 |-
-| Halo (GT4) || Contains 3 slices with 72 execution units.
+| Halo (GT4e) || Contains 3 slices with 72 execution units. Has an additional [[eDRAM]] side cache.
-|-
-| Halo+e (GT4e) || Contains 3 slices with 72 execution units. Has an additional [[eDRAM]] side cache.
 |}
@@ Line 39: / Line 38: @@
 {| class="wikitable tc2 tc3"
 |-
-! colspan="5" | Gen9 LP [[IGP]] Models !! colspan="9" | Standards
+! colspan="5" | Gen9 [[IGP]] Models !! colspan="10" | Standards
 |-
-! rowspan="2" | Name !! rowspan="2" | Execution Units !! rowspan="2" | Tier !!  rowspan="2" | Series !! rowspan="2" | eDRAM !! colspan="2" | [[Vulkan]] !! colspan="3" | [[Direct3D]] !! colspan="2" | [[OpenGL]] !! colspan="2" | [[OpenCL]]
+! rowspan="2" | Name !! rowspan="2" | Execution Units !! rowspan="2" | Tier !!  rowspan="2" | Series !! rowspan="2" | eDRAM !! colspan="2" | [[Vulkan]] !! colspan="3" | [[Direct3D]] !! colspan="2" | [[OpenGL]] !! colspan="2" | [[OpenCL]] !! colspan="1" | [[Metal]]
 |-
-| Windows || Linux || Windows || Linux || [[High Level Shading Language|HLSL]] || Windows || Linux || Windows || Linux
+| Windows || Linux || Windows || Linux || [[High Level Shading Language|HLSL]] || Windows || Linux || Windows || Linux || macOS
 |-
-| {{intel|HD Graphics (Skylake)}} || 12 || GT1 || {{intel|Skylake Y|Y|l=core}} || - || rowspan="9" colspan="2" style="text-align: center;" | '''1.0''' || rowspan="9" style="text-align: center;" | '''12''' || rowspan="9" style="text-align: center;" | '''N/A''' || rowspan="9" style="text-align: center;" | '''5.1''' || rowspan="9" style="text-align: center;" | '''4.4''' || rowspan="9" style="text-align: center;" | '''4.5''' || rowspan="9" style="text-align: center;" colspan="2" | '''2.0'''
+| {{intel|HD Graphics (Skylake)}} || 12 || GT1 || {{intel|Skylake Y|Y|l=core}} || - || rowspan="11" colspan="2" style="text-align: center;" | '''1.0''' || rowspan="11" style="text-align: center;" | '''12''' || rowspan="11" style="text-align: center;" | '''N/A''' || rowspan="11" style="text-align: center;" | '''5.1''' || rowspan="11" style="text-align: center;" | '''4.5''' || rowspan="11" style="text-align: center;" | '''4.5''' || rowspan="11" style="text-align: center;" colspan="2" | '''2.0''' || rowspan="8" style="text-align: center;" colspan="1" | '''2.1'''
 |-
 | {{intel|HD Graphics 510}} || 12 || GT1 || {{intel|Skylake U|U|l=core}}, {{intel|Skylake S|S|l=core}} || -
@@ Line 60: / Line 59: @@
 |-
 | {{intel|Iris Graphics 550}} || 48 || GT3e || {{intel|Skylake U|U|l=core}} || 64 MiB
+|-
+| {{intel|Iris Pro Graphics P555}} || 48 || GT3e || {{intel|Skylake H|H|l=core}} || 128 MiB
 |-
 | {{intel|Iris Pro Graphics 580}} || 72 || GT4e || {{intel|Skylake H|H|l=core}} || 128 MiB
+|-
+| {{intel|Iris Pro Graphics P580}} || 72 || GT4e || {{intel|Skylake H|H|l=core}} || 128 MiB
+|}
+{| class="wikitable" style="text-align: center;"
+! Model || SKU || EUs || CPU Stepping<ref group=devID>The CPU Stepping is the actual CPU design stepping.</ref> || GT Stepping<ref group=devID>The GT Stepping refers to the GT design stepping.</ref> || Device2 ID<ref group=devID>The Device2 ID is the PCI device ID that identifies the GT SKU for driver software</ref> || GT Device2 ID Revision<ref group=devID>The GT Device2 Revision ID identifies the silicon stepping for driver software.</ref>
+|-
+| rowspan="3" | {{intel|HD Graphics 510}} || SKL 2+1F DT || rowspan="3" | 12 || S0 || G0 || 0x1902 || 0x6
+|-
+| SKL U - ULT 2+1F || D0 || H0 || 0x1906 ||0x7
+|-
+| SKL - H 4+1F || D0 || H0 || 0x190B || 0x7
+|-
+| {{intel|HD Graphics 515}} || SKL Y – ULX 2+2 || rowspan="6" | 24 || D0 || H0 || 0x191E || 0x7
+|-
+| {{intel|HD Graphics 520}} || SKL U – ULT 2+2 || D0 || H0 || 0x1916 || 0x7
+|-
+| rowspan="3" | {{intel|HD Graphics 530}} || SKL 4+2 DT || R0 || G0 || 0x191B || 0x6
+|-
+| SKL 2+2 DT || S0 || G0 || 0x1912 || 0x6
+|-
+| SKL 4+2 DT || R0 || G0 || 0x1912 || 0x6
+|-
+| {{intel|HD Graphics P530}} || SKL WKS 4+2 || R1 || G1 || 0x191D || 0x6
+|-
+| {{intel|Iris Graphics 540}} || SKL U – ULT 2+3E (15W) || rowspan="4" | 48 ||  K1 || L1 || 0x1926 || 0xA
+|-
+| {{intel|Iris Graphics 550}} || SKL U - ULT 2+3E (28W) || K1 || L1 || 0x1927 || 0xA
+|-
+| {{intel|HD Graphics 535}} || SKL U - ULT 2+3 || K1 || L1 || 0x1923 || 0xA
+|-
+| {{intel|Iris Pro Graphics P555}} || SKL Media Server 4+3FE || N0 || J0 || 0x192D || 0x9
+|-
+| {{intel|Iris Pro Graphics 580}} || SKL H Halo 4+4E || rowspan="2" | 72 || N0 || J0 || 0x193B || 0x9
+|-
+| {{intel|Iris Pro Graphics P580}}  || SKL WKS 4+4E || N0 || J0 || 0x193D || 0x9
 |}
+<references group=devID />
+== Performance ==
+<div style="overflow-x: auto;">
+{| class="wikitable" style="text-align: center; white-space: nowrap;"
+! rowspan="2" | Frequency !! colspan="15" | Peak Performance
+|-
+! rowspan="16" | &nbsp; || colspan="4" | Half Precision || rowspan="16" | &nbsp; || colspan="4" | Single Precision || rowspan="16" | &nbsp; || colspan="4" | Double Precision
+|-
+| Models || {{intel|HD Graphics 510|510}} || {{intel|HD Graphics 515|515}}, {{intel|HD Graphics 520|520}}, {{intel|HD Graphics 530|530}}, {{intel|HD Graphics P530|P530}} || {{intel|Iris Graphics 540|540}}, {{intel|Iris Graphics 550|550}}, {{intel|Iris Pro Graphics P555|P555}} || {{intel|Iris Pro Graphics 580|580}}, {{intel|Iris Pro Graphics P580|P580}} || {{intel|HD Graphics 510|510}} || {{intel|HD Graphics 515|515}}, {{intel|HD Graphics 520|520}}, {{intel|HD Graphics 530|530}}, {{intel|HD Graphics P530|P530}} || {{intel|Iris Graphics 540|540}}, {{intel|Iris Graphics 550|550}}, {{intel|Iris Pro Graphics P555|P555}} || {{intel|Iris Pro Graphics 580|580}}, {{intel|Iris Pro Graphics P580|P580}} || {{intel|HD Graphics 510|510}} || {{intel|HD Graphics 515|515}}, {{intel|HD Graphics 520|520}}, {{intel|HD Graphics 530|530}}, {{intel|HD Graphics P530|P530}} || {{intel|Iris Graphics 540|540}}, {{intel|Iris Graphics 550|550}}, {{intel|Iris Pro Graphics P555|P555}} || {{intel|Iris Pro Graphics 580|580}}, {{intel|Iris Pro Graphics P580|P580}}
+|-
+| Tiers || GT1 || GT2 || GT3e || GT4e || GT1 || GT2 || GT3e || GT4e ||  GT1 || GT2 || GT3e || GT4e
+|-
+| Ref (FLOP/clk) || 384/cycle || 768/cycle || 1536/cycle || 2304/cycle || 192/cycle || 384/cycle || 768/cycle || 1152/cycle || 48/cycle || 96/cycle || 192/cycle || 288/cycle
+|-
+| Base (300 MHz) || {{#expr: 384*.3}} [[GFLOPS]] || {{#expr: 768*.3}} GFLOPS || {{#expr: 1536*.3}} GFLOPS || {{#expr: 2304*.3}} GFLOPS || {{#expr: 192*.3}} GFLOPS || {{#expr: 384*.3}} GFLOPS || {{#expr: 768*.3}} GFLOPS || {{#expr: 1152*.3}} GFLOPS || {{#expr: 48*.3}} GFLOPS || {{#expr: 96*.3}} GFLOPS || {{#expr: 192*.3}} GFLOPS || {{#expr: 288*.3}} GFLOPS
+|-
+| Base (350 MHz) || {{#expr: 384*.35}} GFLOPS || {{#expr: 768*.35}} GFLOPS || {{#expr: 1536*.35}} GFLOPS || {{#expr: 2304*.35}} GFLOPS || {{#expr: 192*.35}} GFLOPS || {{#expr: 384*.35}} GFLOPS || {{#expr: 768*.35}} GFLOPS || {{#expr: 1152*.35}} GFLOPS || {{#expr: 48*.35}} GFLOPS || {{#expr: 96*.35}} GFLOPS || {{#expr: 192*.35}} GFLOPS || {{#expr: 288*.35}} GFLOPS
+|-
+| Base (400 MHz) || {{#expr: 384*.4}} GFLOPS || {{#expr: 768*.4}} GFLOPS || {{#expr: 1536*.4}} GFLOPS || {{#expr: 2304*.4}} GFLOPS || {{#expr: 192*.4}} GFLOPS || {{#expr: 384*.4}} GFLOPS || {{#expr: 768*.4}} GFLOPS || {{#expr: 1152*.4}} GFLOPS || {{#expr: 48*.4}} GFLOPS || {{#expr: 96*.4}} GFLOPS || {{#expr: 192*.4}} GFLOPS || {{#expr: 288*.4}} GFLOPS
+|-
+| Base (650 MHz) || {{#expr: 384*.65}} GFLOPS || {{#expr: 768*.65}} GFLOPS || {{#expr: 1536*.65}} GFLOPS || {{#expr: 2304*.65}} GFLOPS || {{#expr: 192*.65}} GFLOPS || {{#expr: 384*.65}} GFLOPS || {{#expr: 768*.65}} GFLOPS || {{#expr: 1152*.65}} GFLOPS || {{#expr: 48*.65}} GFLOPS || {{#expr: 96*.65}} GFLOPS || {{#expr: 192*.65}} GFLOPS || {{#expr: 288*.65}} GFLOPS
+|-
+| Boost (800 MHz) || {{#expr: 384*.8}} GFLOPS || {{#expr: 768*.8}} GFLOPS || {{#expr: 1536*.8}} GFLOPS || {{#expr: 2304*.8}} GFLOPS || {{#expr: 192*.8}} GFLOPS || {{#expr: 384*.8}} GFLOPS || {{#expr: 768*.8}} GFLOPS || {{#expr: 1152*.8}} GFLOPS || {{#expr: 48*.8}} GFLOPS || {{#expr: 96*.8}} GFLOPS || {{#expr: 192*.8}} GFLOPS || {{#expr: 288*.8}} GFLOPS
+|-
+| Boost (850 MHz) || {{#expr: 384*.85}} GFLOPS || {{#expr: 768*.85}} GFLOPS || {{#expr: 1536*.85}} GFLOPS || {{#expr: 2304*.85}} GFLOPS || {{#expr: 192*.85}} GFLOPS || {{#expr: 384*.85}} GFLOPS || {{#expr: 768*.85}} GFLOPS || {{#expr: 1152*.85}} GFLOPS || {{#expr: 48*.85}} GFLOPS || {{#expr: 96*.85}} GFLOPS || {{#expr: 192*.85}} GFLOPS || {{#expr: 288*.85}} GFLOPS
+|-
+| Boost (900 MHz) || {{#expr: 384*.9}} GFLOPS || {{#expr: 768*.9}} GFLOPS || {{#expr: 1536*.9}} GFLOPS || {{#expr: 2304*.9}} GFLOPS || {{#expr: 192*.9}} GFLOPS || {{#expr: 384*.9}} GFLOPS || {{#expr: 768*.9}} GFLOPS || {{#expr: 1152*.9}} GFLOPS || {{#expr: 48*.9}} GFLOPS || {{#expr: 96*.9}} GFLOPS || {{#expr: 192*.9}} GFLOPS || {{#expr: 288*.9}} GFLOPS
+|-
+| Boost (950 MHz) || {{#expr: 384*.95}} GFLOPS || {{#expr: 768*.95}} GFLOPS || {{#expr: 1536*.95}} GFLOPS || {{#expr: 2304*.95}} GFLOPS || {{#expr: 192*.95}} GFLOPS || {{#expr: 384*.95}} GFLOPS || {{#expr: 768*.95}} GFLOPS || {{#expr: 1152*.95}} GFLOPS || {{#expr: 48*.95}} GFLOPS || {{#expr: 96*.95}} GFLOPS || {{#expr: 192*.95}} GFLOPS || {{#expr: 288*.95}} GFLOPS
+|-
+| Boost (1,000 MHz) || {{#expr: 384*1}} GFLOPS || {{#expr: 768*1}} GFLOPS || {{#expr: 1536*1}} GFLOPS || {{#expr: 2304*1}} GFLOPS || {{#expr: 192*1}} GFLOPS || {{#expr: 384*1}} GFLOPS || {{#expr: 768*1}} GFLOPS || {{#expr: 1152*1}} GFLOPS || {{#expr: 48*1}} GFLOPS || {{#expr: 96*1}} GFLOPS || {{#expr: 192*1}} GFLOPS || {{#expr: 288*1}} GFLOPS
+|-
+| Boost (1,050 MHz) || {{#expr: 384*1.05}} GFLOPS || {{#expr: 768*1.05}} GFLOPS || {{#expr: 1536*1.05}} GFLOPS || {{#expr: 2304*1.05}} GFLOPS || {{#expr: 192*1.05}} GFLOPS || {{#expr: 384*1.05}} GFLOPS || {{#expr: 768*1.05}} GFLOPS || {{#expr: 1152*1.05}} GFLOPS || {{#expr: 48*1.05}} GFLOPS || {{#expr: 96*1.05}} GFLOPS || {{#expr: 192*1.05}} GFLOPS || {{#expr: 288*1.05}} GFLOPS
+|-
+| Boost (1,100 MHz) || {{#expr: 384*1.1}} GFLOPS || {{#expr: 768*1.1}} GFLOPS || {{#expr: 1536*1.1}} GFLOPS || {{#expr: 2304*1.1}} GFLOPS || {{#expr: 192*1.1}} GFLOPS || {{#expr: 384*1.1}} GFLOPS || {{#expr: 768*1.1}} GFLOPS || {{#expr: 1152*1.1}} GFLOPS || {{#expr: 48*1.1}} GFLOPS || {{#expr: 96*1.1}} GFLOPS || {{#expr: 192*1.1}} GFLOPS || {{#expr: 288*1.1}} GFLOPS
+|-
+| Boost (1,150 MHz) || {{#expr: 384*1.15}} GFLOPS || {{#expr: 768*1.15}} GFLOPS || {{#expr: 1536*1.15}} GFLOPS || {{#expr: 2304*1.15}} GFLOPS || {{#expr: 192*1.15}} GFLOPS || {{#expr: 384*1.15}} GFLOPS || {{#expr: 768*1.15}} GFLOPS || {{#expr: 1152*1.15}} GFLOPS || {{#expr: 48*1.15}} GFLOPS || {{#expr: 96*1.15}} GFLOPS || {{#expr: 192*1.15}} GFLOPS || {{#expr: 288*1.15}} GFLOPS
+|}
+</div>
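Each cell in the table above is the reference FLOP/clk figure multiplied by the clock in GHz. The following is a minimal sketch of that arithmetic, not part of the original article. It assumes the three column groups are half, single, and double precision (which matches the FLOPs/Clk rows of the Configuration table); the names `FLOP_PER_CLK` and `peak_gflops` are illustrative only.

```python
# FLOP/clk values copied from the "Ref (FLOP/clk)" row of the table.
# Assumption: the column groups are half, single, and double precision.
FLOP_PER_CLK = {
    "half":   {"GT1": 384, "GT2": 768, "GT3e": 1536, "GT4e": 2304},
    "single": {"GT1": 192, "GT2": 384, "GT3e": 768,  "GT4e": 1152},
    "double": {"GT1": 48,  "GT2": 96,  "GT3e": 192,  "GT4e": 288},
}


def peak_gflops(flop_per_clk: int, clock_mhz: float) -> float:
    """Peak GFLOPS = FLOP per clock * clock in GHz."""
    return flop_per_clk * clock_mhz / 1000.0


# Print the theoretical peak for every tier at one of the listed boost clocks.
for precision, tiers in FLOP_PER_CLK.items():
    for tier, flops in tiers.items():
        print(f"{precision:>6} {tier:<4} @ 1150 MHz: {peak_gflops(flops, 1150):.1f} GFLOPS")
```

These are theoretical peaks; sustained throughput depends on the workload and on how long the part can hold the given clock.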
+== Hardware Accelerated Video ==
+{{skylake hardware accelerated video table}}
 == Process Technology ==
 {{main|intel/microarchitectures/broadwell#Process_Technology|l1=Broadwell § Process Technology}}
-Gen9 LP are part of the Skylake SoC die which uses the same [[14 nm process]] used for the Broadwell microarchitecture.
+Gen9 is part of the Skylake SoC die, which uses the same [[14 nm process]] used for the Broadwell microarchitecture.
 == Architecture ==
-Gen9 LP presents a large departure from the Gen8 LP and previous architectures.
+Gen9 presents a large departure from the Gen8 and previous architectures.
-=== Key changes from {{\\|Gen8 LP}} ===
+=== Key changes from {{\\|Gen8}} ===
 * Architecture is drastically different
-** Gen9 LP is composed of 3 truely independent major components: Display block, Unslice, and the Slice.
+** Gen9 is composed of 3 truly independent major components: Display block, Unslice, and the Slice.
+** Shared Virtual Memory (SVM) improvements
+*** Improved cache coherency performance
 * Unslice
 ** Now sits on its own power gating/clock domain
@@ Line 84: / Line 167: @@
 ** RAW imaging capabilities
 * Slice
+** Floating point atomics (min/max/cmpexch)
 ** L3 Cache
 *** Increased to 768 [[KiB]]/slice (up from 576 KiB/slice)
@@ Line 93: / Line 177: @@
 ** Multi-plane overlays
 ** Texture samplers now natively support the NV12 YUV format
+** Min/max texture filtering
 ** Preemption of execution is now supported at the thread level
 ** Round robin scheduling of threads within an execution unit.
@@ Line 98: / Line 183: @@
 ** 16-bit floating point capability is improved with native support for denormals and gradual underflow
 * L4$
-** The [[eDRAM]] is now a side cache instead of an L4$ like it was in {{\\|Gen8 LP}}. (See {{\\|Skylake#eDRAM architectural changes|Skylake §eDRAM architectural changes}} for the reason)
+** The [[eDRAM]] is now a side cache instead of an L4$ like it was in {{\\|Gen8}}. (See {{\\|Skylake#eDRAM architectural changes|Skylake §eDRAM architectural changes}} for the reason)
 ** Side-cache eDRAM was moved into the system agent adjacent to the display controller
@@ Line 104: / Line 189: @@
 ==== Entire SoC Overview ====
 [[File:skylake soc block diagram.svg|900px]]
-==== Gen9 LP ====
+==== Gen9 ====
 This block is for the most common setup, which is GT2 with 24 execution units.
@@ Line 110: / Line 195: @@
 ==== Individual Core ====
 See {{intel|Skylake#Individual_Core|l=arch}}.
-=== Display ===
-{{empty section}}
 === Unslice ===
+[[File:gen9 lp media pipeline.svg|500px|right]]
 The Unslice is one of Gen9's major components. It is responsible for the fixed-function geometry capabilities and the fixed-function media capabilities, and it provides the interface to the memory fabric. One of the big changes in Gen9 is that the Unslice now sits on its own power/clock domain. This change allows the Unslice to operate at its own speed, providing higher on-demand performance when desired. It also has a number of other benefits, such as being able to turn off one or more slices when they are not needed, for example when only fixed-function media is used. Additionally, the Unslice is now capable of running at a higher clock while the slice runs at a slower clock when the scenario demands it (such as in cases where higher fixed-function geometry or memory demands occur).
+The '''Command Stream''' ('''CS''') unit manages the flow of execution for the FF Pipeline (3D Pipeline) and the Media pipelines. The CS unit switches between pipelines and forwards command streams to the different stages. Data in the pipeline is passed to the next unit using a messaging network. Messages can be passed directly through registers or by using the URB. The Command Stream also manages the allocation of the URB and supports the Constant URB Entry (CURB) function. The '''Unified Return Buffer''' ('''URB''') is globally shared and is explicitly addressed. The pipeline's fixed-function blocks have both read and write access to the URB, while the shader cores have write access to the URB.
+The '''media general-purpose pipeline''' consists of two fixed-function units: the '''Video Front End''' ('''VFE''') and the '''Thread Spawner''' ('''TS'''). The VFE unit handles the interfacing with the Command Streamer, writes thread payload data into the Unified Return Buffer, and prepares threads to be dispatched through the TS unit. The VFE unit also contains the hardware '''Variable Length Decode''' ('''VLD''') engine for MPEG-2 video decode. The TS unit is primarily responsible for interfacing with the '''Thread Dispatcher''' ('''TD''') unit, which is responsible for spawning new root-node parent threads originating from the VFE unit and for spawning child threads (either leaf-node child threads or branch-node parent threads).
+==== Fixed-Function ====
+[[File:gen9 multi-format codec (mfx).svg|350px|right]]
+* '''Multi-Format Codec''' ('''MFX''')
+** HEVC Decode
+** Support for HEVC & VP8 in PAK (for encode)
+** New fixed function within MFX for real-time AVC encoding usages
+[[File:gen9 video quality engine (vqe).svg|300px|right]]
+* '''Video Quality Engine''' ('''VQE''')
+** 16-bit processing path
+** 5x5 spatial denoise filter
+** Local Adaptive Contrast Enhancement (LACE)
+** Camera processing features to allow high-resolution raw camera processing
+[[File:gen9 scalar and format conv (sfc).svg|250px|right]]
+* '''Scaler and Format Converter''' ('''SFC''')
+** Allows for inline format conversion & upscaling or downscaling of images
+** Can be coupled with decoder to allow high-quality video processing in the FF units in the unslice without utilizing the media sampler in the slices themselves.
+=== 3D Pipeline Stages ===
+{| class="wikitable"
+! Pipeline Stage !! Functions Performed
+|-
+| Command Stream (CS) || The Command Stream stage is responsible for managing the 3D pipeline and passing commands down the pipeline. In addition, the CS unit reads “constant data” from memory buffers and places it in the URB. Note that the CS stage is shared between the 3D, GPGPU and Media pipelines.
+|-
+| Vertex Fetch (VF) || The Vertex Fetch stage, in response to 3D Primitive Processing commands, is responsible for reading vertex data from memory, reformatting it, and writing the results into Vertex URB Entries. It then outputs primitives by passing references to the VUEs down the pipeline.
+|-
+| Vertex Shader (VS) || The Vertex Shader stage is responsible for processing (shading) incoming vertices by passing them to VS threads.
+|-
+| Hull Shader (HS) || The Hull Shader is responsible for processing (shading) incoming patch primitives as part of the tessellation process.
+|-
+| Tessellation Engine (TE) || The Tessellation Engine is responsible for using tessellation factors (computed in the HS stage) to tessellate U,V parametric domains into domain point topologies.
+|-
+| Domain Shader (DS) || The Domain Shader stage is responsible for processing (shading) the domain points (generated by the TE stage) into corresponding vertices.
+|-
+| Geometry Shader (GS) || The Geometry Shader stage is responsible for processing incoming objects by passing each object’s vertices to a GS thread.
+|-
+| Stream Output Logic (SOL) || The Stream Output Logic is responsible for outputting incoming object vertices into Stream Out Buffers in memory.
+|-
+| Clipper (CLIP) || The Clipper stage performs Clip Tests on incoming objects and clips objects if required. Objects are clipped using fixed-function hardware.
+|-
+| Strip/Fan (SF) || The Strip/Fan stage performs object setup. Object setup uses fixed-function hardware.
+|-
+| Windower/Masker (WM) || The Windower/Masker performs object rasterization and determines visibility coverage.
+|}
 === Slice ===
-{{empty section}}
+A slice is a cluster of subslices. For most configurations in Gen9, 3 subslices are aggregated into 1 slice to form a total of 24 execution units (depending on the model; some low-end models have fewer). The slice incorporates the thread dispatch routine, level 3 cache (L3$), a highly banked shared local memory structure, fixed-function logic for atomics and barriers, and a number of fixed-function units for various media capabilities. The '''Global Thread Dispatcher''' ('''GTD''') is responsible for load balancing thread distribution across the entire device. The global thread dispatcher works in concert with local thread dispatchers in each subslice.
+=== Execution Unit (EU) ===
+The '''Execution Units''' ('''EUs''') are the programmable [[shader units]] - each one is an independent computational unit used for execution of 3D shaders, media, and [[GPGPU]] kernels. Internally, each unit is hardware multi-threaded capable of executing multi-issue [[SIMD]] operations. Execution is multi-issue per clock to pipelines capable of integer, single and double precision floating point operations, SIMD branch capability, logical operations, transcendental operations, and other miscellaneous operations. Communication between the EUs and the support units (shared function units such as operations involving texture sampling or scatter/gather load/stores) is done via messages that were programmatically constructed. Dependency hardware allows threads to sleep until the requested data is returned from those units.
+'''Shared Functions''' are hardware units that provide a set of specialized supplemental functionality for the EUs. As their name implies, they implement functions with insufficient demand to justify the cost of being implemented in the individual EUs. Functionality in these units is shared among the EUs in the subslice. Communication between the EUs and the Shared Functions is done via lightweight message passing. A message is a small self-contained packet of information created by a kernel and directed to a specific shared function. EU threads awaiting the return of a message from a Shared Function unit go into temporary sleep.
+The Execution Unit is composed of 7 threads. Each thread has 128 SIMD-8 32-bit registers in a General-Purpose Register File (GRF) and supporting architecture specific registers (ARF). The EU can co-issue to four instruction processing units, including two FPUs, a branch unit, and a message send unit.
+[[File:gen9 eu.svg|600px]]
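The register file figures above imply a fixed amount of GRF storage per thread and per EU. The following is a minimal sketch of that arithmetic, not part of the original article; it uses only the numbers stated in the text (7 threads, 128 registers per thread, SIMD-8 registers of 32-bit elements), and all variable names are illustrative.

```python
THREADS_PER_EU = 7           # hardware threads per execution unit
REGISTERS_PER_THREAD = 128   # GRF registers available to each thread
SIMD_LANES = 8               # each register holds 8 lanes ...
BYTES_PER_LANE = 4           # ... of 32 bits (4 bytes) each

# Size of a single SIMD-8 32-bit register in bytes.
register_bytes = SIMD_LANES * BYTES_PER_LANE

# GRF storage owned by one thread, and by all threads of one EU.
grf_bytes_per_thread = REGISTERS_PER_THREAD * register_bytes
grf_bytes_per_eu = THREADS_PER_EU * grf_bytes_per_thread

print(f"register:       {register_bytes} bytes")
print(f"GRF per thread: {grf_bytes_per_thread // 1024} KiB")
print(f"GRF per EU:     {grf_bytes_per_eu // 1024} KiB")
```

Under these figures each thread owns 4 KiB of general-purpose registers, for 28 KiB per EU; the architecture-specific registers (ARF) are not counted here.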
+==== Preemption Granularity ====
+Preemption in Gen9 was improved over Gen8 in a number of ways. Preemption is important for multi-tasking systems and especially important for improving the responsiveness of operations (i.e. the ability to stop and start operations quickly with minimal latency for the end user). In {{\\|Broadwell}} ({{\\|Gen8}}) Intel added support for stopping operations at the object level for 3D workloads, such as on a triangle boundary (i.e. beginning of a triangle, between two triangles, between two lines or points), and for preempting and later restoring those operations. In Gen9 Intel added the ability to stop execution units on an instruction boundary and restore them afterwards (previously such preemption was only possible at the boundary of a kernel, i.e. the entire kernel execution had to take place before preemption was possible). For compute workloads, Gen9 therefore moves the preemption granularity from thread-group (complete kernel execution) to mid-thread (instruction boundary):
+Example of responsiveness (Source: IDF15)
+{| class="wikitable"
+! rowspan="2" | Application !! colspan="2" | Thread-Group Preemption !! colspan="2" | Mid-Thread Preemption
+|-
+| {{intel|Skylake U|U Series|l=core}} || {{intel|Skylake Y|Y Series|l=core}} || {{intel|Skylake U|U Series|l=core}} || {{intel|Skylake Y|Y Series|l=core}}
+|-
+| Adobe Photoshop || 4-6 ms || 17-22 ms || 300 µs || 800 µs
+|-
+| Sample App1 || 200-500 ms || 200-500 ms || 300 µs || 280-320 µs
+|-
+| Sample App2 || 17 ms || 24 ms || 240 µs || 200-430 µs
+|}
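To put the table in perspective, the latency reduction can be expressed as a ratio of thread-group latency to mid-thread latency. The following is a minimal sketch, not part of the original article; it uses only the U Series columns from the table above, and `U_SERIES_US` and `speedup_range` are illustrative names.

```python
# U Series latencies from the table, converted to microseconds.
# Each entry is ((thread-group min, max), (mid-thread min, max)).
U_SERIES_US = {
    "Adobe Photoshop": ((4_000, 6_000), (300, 300)),
    "Sample App1": ((200_000, 500_000), (300, 300)),
    "Sample App2": ((17_000, 17_000), (240, 240)),
}


def speedup_range(before, after):
    """Return the (smallest, largest) ratio of old latency to new latency."""
    return before[0] / after[1], before[1] / after[0]


for app, (thread_group, mid_thread) in U_SERIES_US.items():
    low, high = speedup_range(thread_group, mid_thread)
    print(f"{app}: {low:.0f}x to {high:.0f}x lower preemption latency")
```

The ratios only restate the published figures; they are not independent measurements.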
+=== Display ===
+The display has a memory interface (supporting high memory bandwidth coming directly to the display sub-system), a front-end that is responsible for sorting and sequencing the requests (as well as handling things such as rotated displays), and display pipes. The display pipes perform input format conversion, multi-plane composition, color conversion, and scaling of the result. The final part of the display block is the port encoders, which convert the input from the display pipes to the appropriate standard used (DP/HDMI/eDP). A number of improvements were made to the display pipes in Gen9, specifically the ability to consume losslessly compressed data directly without any unnecessary conversion operations. Additionally, the pipes now support render compressed surfaces, Y-tiled surfaces, and on-the-fly 90/270 rotations.
+[[File:gen9 display block.svg|650px]]
+==== Multiple Display Planes in a Pipe ====
+[[File:gen9 display planes pipe.svg|right|550px]]
+* Three plane sources + background
+** All 3 are independent
+** Fixed order (highest priority is plane3, lowest is background)
+** Planes can be:
+*** YUV video
+*** RGB windows/desktop
+** Fixed visual priority/blending order
+** Color correction of result
+** Two 7x5 scalers
+*** Bind to individual planes or pipe output
+* Intended to support various OS features such as
+** Microsoft's MPO (Multiplane overlay support)
+** Android's SurfaceFlinger
+== Scalability ==
+Gen9 can scale from 1 to 3 slices producing SKUs ranging from 12 to 72 execution units (note that the 12 EUs are formed from half a slice effectively).
+=== GT1 (ULP) ===
+GT1 is the most compact configuration, offering two benefits: reduced cost and reduced power. GT1 is made of 1 slice containing 2 subslices with 6 EUs/subslice for a total of 12 EUs. With the scale-down, GT1 changes the EU:sampler ratio to 6:1. Note that this retains the same ratio of 12 texels/clock and 8 pixels/clock at the backend. This configuration is better suited for some of the low power workloads (e.g. ASTC-LDR+HDR, ETC1/2 compression). Note that the software stack remains unchanged compared to the larger models.
+[[File:gen9 lp gt1 block diagram.svg|600px]]
+=== GT1.5 ===
+GT1.5 offers 3 subslices of 6 EUs each for a total of 18 EUs.
+[[File:gen9 lp gt1.5 block diagram.svg|600px]]
+=== GT2 ===
+GT2 is the standard configuration consisting of 1 slice with 3 subslices and 8 EU/subslice for a total of 24 EUs.
+[[File:gen9 lp gt2 block diagram.svg|600px]]
+=== GT3 ===
+GT3 consists of 2 slices with 3 subslices in each and 8 EU/subslice for a total of 48 EUs.
+[[File:gen9 lp gt3 block diagram.svg|700px]]
+=== Halo (GT4) ===
+Codename Halo (GT4) is the most complex configuration, offering the highest execution unit count. Halo incorporates 3 slices with 3 subslices/slice and 8 EU/subslice for a total of 72 EUs.
+[[File:gen9 lp gt4 block diagram.svg|800px]]
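The EU totals quoted for each tier follow directly from the slice, subslice, and EU/subslice counts. The following is a minimal sketch of that relationship, not part of the original article; `GtConfig` and `CONFIGS` are illustrative names, and the numbers are the ones given in the sections above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GtConfig:
    """Topology of one graphics tier (hypothetical helper type)."""
    slices: int
    subslices_per_slice: int
    eus_per_subslice: int

    @property
    def eu_count(self) -> int:
        # Total EUs = slices * subslices per slice * EUs per subslice.
        return self.slices * self.subslices_per_slice * self.eus_per_subslice


# Topologies as described in the text for each tier.
CONFIGS = {
    "GT1": GtConfig(1, 2, 6),
    "GT1.5": GtConfig(1, 3, 6),
    "GT2": GtConfig(1, 3, 8),
    "GT3": GtConfig(2, 3, 8),
    "GT4": GtConfig(3, 3, 8),
}

for name, cfg in CONFIGS.items():
    print(f"{name}: {cfg.slices}x{cfg.subslices_per_slice}x{cfg.eus_per_subslice} = {cfg.eu_count} EUs")
```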
+== Configuration ==
+{| class="wikitable"
+|-
+! colspan="7" | Configuration Attribute (Source: [[Intel]]'s Programmer's Ref Manual)
+|-
+! rowspan="2" | Attribute !! colspan="5" | Model
+|-
+! GT1F<br>(1x2x6) !! GT1.5F<br>(1x3x6) !! GT2<br>(1x3x8) !! GT3<br>(2x3x8) !! GT4<br>(3x3x8)
+|-
+| colspan="7" style="background:#f2f7ff; text-align: center;" | '''Global Attributes'''
+|-
+|Slice count || 1 || 1 || 1 || 2 || 3
+|-
+|Subslice Count || 2 || 3 || 3 || 6 || 9
+|-
+|EU/Subslice || 6 || 6 || 8 || 8 || 8
+|-
+|EU count (total) || 12 || 18 || 24 || 48 || 72
+|-
+|Thread Count || 7 || 7 || 7 || 7 || 7
+|-
+|Thread Count (Total) || 84 || 126 || 161 / 168 || 329 / 336 || 497 / 504
+|-
+|FLOPs/Clk - Half Precision, MAD (peak) || 384 || 576 || 736 / 768 || 1504 / 1536 || 2272 / 2304
+|-
+|FLOPs/Clk - Single Precision, MAD (peak) || 192 || 288 || 368 / 384 || 752 / 768 || 1136 / 1152
+|-
+|FLOPs/Clk - Double Precision, MAD (peak) || 48 || 72 || 92 / 96 || 188 / 192 || 284 / 288
+|-
+|Unslice clocking (coupled/decoupled from Cr slice) || coupled || coupled || coupled || coupled || coupled
+|-
+|GTI / Ring Interfaces || 1 || 1 || 1 || 1 || 1
+|-
+|GTI bandwidth (bytes/unslice-clk) || 64: R<br>64: W || 64: R<br>64: W || 64: R<br>64: W || 64: R<br>64: W || 64: R<br>64: W
+|-
+|eDRAM Support || N/A || N/A || N/A || 0, 64 MiB || 0, 128 MiB
+|-
+|Graphics Virtual Address Range || 48 bit || 48 bit || 48 bit || 48 bit || 48 bit
+|-
+|Graphics Physical Address Range || 39 bit || 39 bit || 39 bit || 39 bit || 39 bit
+|-
+| colspan="7" style="background:#f2f7ff; text-align: center;" | '''Caches & Dedicated Memories'''
+|-
+|L3 Cache, total size (bytes) || 384K || 768K || 768K || 1536K || 2304K
+|-
+|L3 Cache, bank count || 2 || 4 || 4 || 8 || 12
+|-
+|L3 Cache, bandwidth (bytes/clk) || 2x 64: R<br>2x 64: W || 4x 64: R<br>4x 64: W || 4x 64: R<br>4x 64: W || 8x 64: R<br>8x 64: W || 12x 64: R<br>12x 64: W
+|-
+|L3 Cache, D$ Size (Kbytes) || 192K - 256K || 512K || 512K || 1024K || 1536K
+|-
+|URB Size (kbytes) || 128K - 192K || 384K || 384K || 768K || 1008K
+|-
+|SLM Size (kbytes) || 0, 128K || 0, 192K || 0, 192K || 0, 384K || 0, 576K
+|-
+|LLC/L4 size (bytes)|| ~2MiB/CPU core || ~2MiB/CPU core || ~2MiB/CPU core || ~2MiB/CPU core || ~2MiB/CPU core
+|-
+|Instruction Cache (IC, bytes) || 2x 48K || 3x 48K || 3x 48K || 6x 48K || 9x 48K
+|-
+|Color Cache (RCC, bytes) || 24K || 24K || 24K || 2x 24K || 3x 24K
+|-
+|MSC Cache (MSC, bytes) || 16K || 16K || 16K || 2x 16K || 3x 16K
+|-
+|HiZ Cache (HZC, bytes) || 12K || 12K || 12K || 2x 12K || 2x 12K
+|-
+|Z Cache (RCZ, bytes) || 32K || 32K || 32K || 2x 32K || 3x 32K
+|-
+|Stencil Cache (STC, bytes) || 8K || 8K || 8K || 2x 8K || 3x 8K
+|-
+| colspan="7" style="background:#f2f7ff; text-align: center;" | '''Instruction Issue Rates'''
+|-
+|FMAD, SP (ops/EU/clk) || 8 || 8 || 8 || 8 || 8
+|-
+|FMUL, SP (ops/EU/clk) || 8 || 8 || 8 || 8 || 8
+|-
+|FADD, SP (ops/EU/clk) || 8 || 8 || 8 || 8 || 8
+|-
+|MIN,MAX, SP (ops/EU/clk) || 8 || 8 || 8 || 8 || 8
+|-
+|CMP, SP (ops/EU/clk) || 8 || 8 || 8 || 8 || 8
+|-
+|INV, SP (ops/EU/clk) || 2 || 2 || 2 || 2 || 2
+|-
+|SQRT, SP (ops/EU/clk) || 2 || 2 || 2 || 2 || 2
+|-
+|RSQRT, SP (ops/EU/clk) || 2 || 2 || 2 || 2 || 2
+|-
+|LOG, SP (ops/EU/clk) || 2 || 2 || 2 || 2 || 2
+|-
+|EXP, SP (ops/EU/clk) || 2 || 2 || 2 || 2 || 2
+|-
+|POW, SP (ops/EU/clk) || 1 || 1 || 1 || 1 || 1
+|-
+|IDIV, SP (ops/EU/clk) || 1-6 || 1-6 || 1-6 || 1-6 || 1-6
+|-
+|TRIG, SP (ops/EU/clk) || 2 || 2 || 2 || 2 || 2
+|-
+|FDIV, SP (ops/EU/clk) || 1 || 1 || 1 || 1 || 1
+|-
+| colspan="7" style="background:#f2f7ff; text-align: center;" | '''Load/Store'''
+|-
+|Data Ports (HDC) || 2 || 3 || 3 || 6 || 9
+|-
+|L3 Load/Store (bytes/clk) || 2x 64 || 3x 64 || 3x 64 || 6x 64 || 9x 64
+|-
+|SLM Load/Store (bytes/clk) || 2x 64 || 3x 64 || 3x 64 || 6x 64 || 9x 64
+|-
+|Atomic Inc, 32b - sequential addresses (bytes/clk) || 2x 64 || 3x 64 || 3x 64 || 6x 64 || 9x 64
+|-
+|Atomic Inc, 32b - same address (bytes/clk) || 2x 4 || 3x 4 || 3x 4 || 6x 4 || 9x 4
+|-
+|Atomic CmpWr, 32b - sequential addresses (bytes/clk) || 2x 32 || 3x 32 || 3x 32 || 6x 32 || 9x 32
+|-
+|Atomic CmpWr, 32b - same address (bytes/clk) || 2x 4 || 3x 4 || 3x 4 || 6x 4 || 9x 4
+|-
+| colspan="7" style="background:#f2f7ff; text-align: center;" | '''3D Attributes'''
+|-
+|Geometry pipes || 1 || 1 || 1 || 1 || 1
+|-
+|Samplers (3D) || 2 || 3 || 3 || 6 || 9
+|-
+|Texel Rate, point, 32b (tex/clk) || 8 || 12 || 12 || 24 || 36
+|-
+|Texel Rate, point, 64b (tex/clk) || 8 || 12 || 12 || 24 || 36
+|-
+|Texel Rate, point, 128b (tex/clk) || 8 || 12 || 12 || 24 || 36
+|-
+|Texel Rate, bilinear, 32b (tex/clk) || 8 || 12 || 12 || 24 || 36
+|-
+|Texel Rate, bilinear, 64b (tex/clk) || 8 || 12 || 12 || 24 || 36
+|-
+|Texel Rate, bilinear, 128b (tex/clk) || 2 || 3 || 3 || 6 || 9
+|-
+|Texel Rate, trilinear, 32b (tex/clk) || 4 || 6 || 6 || 12 || 18
+|-
+|Texel Rate, trilinear, 64b (tex/clk) || 2 || 3 || 3 || 6 || 9
+|-
+|Texel Rate, trilinear, 128b (tex/clk) || 1 || 1.5 || 1.5 || 3 || 4.5
+|-
+|Texel Rate, aniso 2x, 32b (tex/clk) || 2 || 3 || 3 || 6 || 9
+|-
+|Texel Rate, aniso 4x, 32b (tex/clk) || 1 || 1.5 || 1.5 || 3 || 4.5
+|-
+|Texel Rate, aniso 8x, 32b (tex/clk) || 0.5 || 0.75 || 0.75 || 1.5 || 2.25
+|-
+|Texel Rate, aniso 16x, 32b (tex/clk) || 0.25 || 0.375 || 0.375 || 0.75 || 1.125
+|-
+|HiZ Rate, (ppc) || 64 || 64 || 64 || 2x 64 || 3x 64
+|-
+|IZ Rate, (ppc) || 16 || 16 || 16 || 2x 16 || 3x 16
+|-
+|Stencil Rate (ppc) || 64 || 64 || 64 || 2x 64 || 3x 64
+|-
+|Pixel Rate, fill, 32bpp (pix/clk, RCC hit) || 8 || 8 || 8 || 16 || 24
+|-
+|Pixel Rate, blend, 32bpp (p/clk, RCC hit) || 8 || 8 || 8 || 16 || 24
+|-
+| colspan="7" style="background:#f2f7ff; text-align: center;" | '''Media Attributes'''
+|-
+|Samplers (media) || 2 || 3 || 3 || 6 || 9
+|-
+|VDBox Instances || 1 || 1 || 1 || 2 || 2
+|-
+|VEBox Instances || 1 || 1 || 1 || 2 || 2
+|-
+|SFC Instances || 1 || 1 || 1 || 1 || 1
+|}
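Several of the global attributes in the table scale linearly with the EU count. The following is a minimal sketch of that scaling, not part of the original article. The per-EU factors are inferred by dividing the table values by the EU count (for example 384 / 12 = 32 half-precision FLOPs/clk per EU); they are an assumption of this sketch rather than figures quoted from the reference manual, and `PER_EU` and `derived_attributes` are illustrative names.

```python
# Per-EU factors inferred from the table (assumption, see lead-in).
PER_EU = {
    "threads": 7,         # "Thread Count" row
    "half_flops": 32,     # half precision, MAD, FLOPs/clk
    "single_flops": 16,   # 8 FMAD ops/EU/clk * 2 FLOPs per MAD
    "double_flops": 4,    # double precision, MAD, FLOPs/clk
}


def derived_attributes(eu_count: int) -> dict:
    """Scale every per-EU factor by the number of execution units."""
    return {name: factor * eu_count for name, factor in PER_EU.items()}


# EU totals from the "EU count (total)" row.
for label, eus in [("GT1F", 12), ("GT1.5F", 18), ("GT2", 24), ("GT3", 48), ("GT4", 72)]:
    print(label, derived_attributes(eus))
```

The lower of the paired values in the table (such as 161 / 168 threads for GT2) matches the same factors applied to one fewer EU, e.g. `derived_attributes(23)`; the table itself does not state the reason for the paired values.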
+== Datasheets ==
+=== White Paper ===
+* [[:File:The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf|The Compute Architecture of Intel Processor Graphics Gen9]]
+===Programmer's Reference Manual===
+* [[:File:intel-gfx-prm-osrc-skl-vol01-preface.pdf|Volume 1: Preface]]
+* [[:File:intel-gfx-prm-osrc-skl-vol02a-commandreference-instructions-huc.pdf|Volume 2a Addendum: Command Reference: Instructions (Command Opcodes) for the HEVC Micro-Controller (HuC)]]
+* [[:File:intel-gfx-prm-osrc-skl-vol02a-commandreference-instructions.pdf|Volume 2a: Command Reference: Instructions (Command Opcodes)]]
+* [[:File:intel-gfx-prm-osrc-skl-vol02c-commandreference-registers-part1.pdf|Volume 2c: Command Reference: Registers Part 1 – Registers A through L]]
+* [[:File:intel-gfx-prm-osrc-skl-vol02c-commandreference-registers-part2.pdf|Volume 2c: Command Reference: Registers Part 2 – Registers M through Z]]
+* [[:File:intel-gfx-prm-osrc-skl-vol02d-commandreference-structures.pdf|Volume 2d: Command Reference: Structures]]
+* [[:File:intel-gfx-prm-osrc-skl-vol03-gpu overview.pdf|Volume 3: GPU Overview]]
+* [[:File:intel-gfx-prm-osrc-skl-vol04-configurations.pdf|Volume 4: Configurations]]
+* [[:File:intel-gfx-prm-osrc-skl-vol05-memory views.pdf|Volume 5: Memory Views]]
+* [[:File:intel-gfx-prm-osrc-skl-vol06-command stream programming.pdf|Volume 6: Command Stream Programming]]
+* [[:File:intel-gfx-prm-osrc-skl-vol07-3d media gpgpu.pdf|Volume 7: 3D-Media-GPGPU]]
+* [[:File:intel-gfx-prm-osrc-skl-vol08-media vdbox.pdf|Volume 8: Media VDBOX]]
+* [[:File:intel-gfx-prm-osrc-skl-vol09-media vebox.pdf|Volume 9: Media Video Enhancement (VEBOX) Engine]]
+* [[:File:intel-gfx-prm-osrc-skl-vol10-hevc.pdf|Volume 10: HEVC Codec Pipeline (HCP)]]
+* [[:File:intel-gfx-prm-osrc-skl-vol11-blitter.pdf|Volume 11: Blitter]]
+* [[:File:intel-gfx-prm-osrc-skl-vol12-display.pdf|Volume 12: Display]]
+* [[:File:intel-gfx-prm-osrc-skl-vol13-mmio.pdf|Volume 13: Memory-mapped Input/Output (MMIO)]]
+* [[:File:intel-gfx-prm-osrc-skl-vol14-observability.pdf|Volume 14: Observability]]
+* [[:File:intel-gfx-prm-osrc-skl-vol15-sfc.pdf|Volume 15: Scaler Format Converter (SFC)]]
+* [[:File:intel-gfx-prm-osrc-skl-vol16-workarounds 0.pdf|Volume 16: Workarounds]]
