Difference between revisions of "x86/amx"

Latest revision as of 20:40, 30 June 2020

Advanced Matrix Extension (AMX) is an x86 extension that introduces a matrix register file and new instructions for operating on matrices.

Overview

Figure: AMX architecture.

The Advanced Matrix Extension (AMX) is an x86 extension that introduces a new programming framework for working with matrices (rank-2 tensors). The extension introduces two new components: a 2-dimensional register file whose registers are called 'tiles', and a set of accelerators that operate on those tiles. Each tile represents a sub-array of a larger 2-dimensional memory image. AMX instructions are synchronous in the instruction stream, and tile loads and stores are coherent with the host's memory accesses. AMX instructions may be freely interleaved with traditional x86 code and execute in parallel with other extensions (e.g., AVX-512); tile loads, tile stores, and accelerator commands are sent to the accelerator for execution.
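
The idea of a tile as a sub-array of a larger 2-dimensional memory image can be sketched in software. The following is a minimal Python model, not the hardware behaviour itself: the function names `load_tile` and `store_tile` and the explicit `stride` argument (the distance in bytes between consecutive rows of the image) are assumptions made for illustration.

```python
def load_tile(image, base, stride, rows, bytes_per_row):
    """Copy a rows x bytes_per_row sub-array out of a flat byte image.

    image  : bytes-like object holding the full 2-D memory image
    base   : byte offset of the tile's first row
    stride : byte distance between consecutive rows in the image
    """
    tile = []
    for r in range(rows):
        start = base + r * stride
        tile.append(bytes(image[start:start + bytes_per_row]))
    return tile


def store_tile(image, base, stride, tile):
    """Write the rows of a tile back into a mutable flat byte image."""
    for r, row in enumerate(tile):
        start = base + r * stride
        image[start:start + len(row)] = row
```

For example, with an 8 x 8 byte image stored row by row (stride 8), `load_tile(image, 9, 8, 2, 3)` picks out a 2 x 3 tile starting at row 1, column 1.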

Palettes

Software determines which tile configurations a given implementation supports by enumerating a palette of options.

Currently, two palettes exist:

Palette 0 - initialized state
Palette 1 - an 8-tile register file in which each tile is 16 rows x 64 bytes (1 KiB), for a total register file of 8 KiB.

A programmer can configure tiles of smaller dimensions to suit their algorithm. Each tile is configured with a number of rows and bytes_per_row, which are stored as metadata for the accelerator to operate on. The palettes supported by the hardware are enumerated through the palette_table in CPUID leaf 1DH, while the active palette and tile dimensions are held in the tile control register (TILECFG). TILECFG is programmed using the LDTILECFG instruction.
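
As an illustration of the per-tile metadata, the sketch below assembles a 64-byte configuration image of the kind LDTILECFG reads. The byte offsets (palette at byte 0, start_row at byte 1, 16-bit bytes_per_row fields starting at byte 16, 8-bit row fields starting at byte 48) are an assumption based on Intel's published layout and should be checked against the programming reference. The function name `build_tilecfg` is hypothetical.

```python
import struct

# Palette 1 limits as described above.
NUM_TILES = 8
MAX_ROWS = 16
MAX_BYTES_PER_ROW = 64


def build_tilecfg(tiles, palette=1):
    """Return a 64-byte tile configuration image.

    tiles: list of (rows, bytes_per_row) pairs, one per tile, in tile order.
    Offsets are assumed, see the note above.
    """
    if len(tiles) > NUM_TILES:
        raise ValueError("palette 1 provides only 8 tiles")
    cfg = bytearray(64)              # unused fields stay zero
    cfg[0] = palette                 # palette id
    cfg[1] = 0                       # start_row (used to restart loads/stores)
    for i, (rows, bytes_per_row) in enumerate(tiles):
        if not 0 <= rows <= MAX_ROWS:
            raise ValueError("rows out of range")
        if not 0 <= bytes_per_row <= MAX_BYTES_PER_ROW:
            raise ValueError("bytes_per_row out of range")
        struct.pack_into("<H", cfg, 16 + 2 * i, bytes_per_row)
        cfg[48 + i] = rows
    return bytes(cfg)
```

With every tile at its maximum size, the register file holds 8 x 16 x 64 = 8192 bytes, matching the 8 KiB figure for palette 1.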

Accelerators

AMX supports a set of accelerators that can operate on tiles. Currently, just one accelerator is defined.

Tile matrix multiply unit (TMUL)

The Tile Matrix Multiply (TMUL) unit is an AMX accelerator comprising a grid of fused multiply-add units capable of operating on tiles. Its presence is indicated by the AMX-INT8 and AMX-BF16 sub-extensions. The TMUL instructions compute Tile_C[M][N] += Tile_A[M][K] * Tile_B[K][N].
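
The multiply-accumulate relation above can be written out as a plain reference loop. This is only a sketch of the arithmetic on nested Python lists, and the name `tmul_reference` is hypothetical. It says nothing about how the hardware schedules the work or packs elements inside a tile.

```python
def tmul_reference(c, a, b):
    """Compute C[m][n] += A[m][k] * B[k][n] in place and return C.

    a is M x K, b is K x N, c is M x N (lists of lists of numbers).
    """
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0])
    for m in range(m_dim):
        for n in range(n_dim):
            for k in range(k_dim):
                c[m][n] += a[m][k] * b[k][n]
    return c
```

Because the operation accumulates into Tile_C rather than overwriting it, a large matrix product can be built up by repeating the operation over successive K-blocks of A and B.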

The TMUL unit exposes a number of parameters, including the maximum height (tmul_maxk) and the maximum SIMD dimension (tmul_maxn). These parameters are read dynamically by the TMUL unit upon execution.

Instructions

Figure: 2 x 3 dot product of tiles.

AMX introduces 12 new instructions:

Configuration:

LDTILECFG - Load tile configuration, loads the tile configuration from the 64-byte memory location specified.
STTILECFG - Store tile configuration, stores the tile configuration in the 64-byte memory location specified.

Data:

TILELOADD/TILELOADDT1 - Load tile
TILESTORED - Store tile
TILERELEASE - Release tile, returns TILECFG and TILEDATA to the INIT state
TILEZERO - Zero tile, zeroes the destination tile

Operation:

TDPBF16PS - Perform a dot-product of BF16 tiles and accumulate the result. Packed Single Accumulation.
TDPB[XX]D - Perform a dot-product of Int8 tiles and accumulate the result. Dword Accumulation.
- Where XX can be: SU = Signed/Unsigned, US = Unsigned/Signed, SS = Signed/Signed, and UU = Unsigned/Unsigned pairs.
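
The Int8 dot-product family can be modelled in a few lines. The sketch below assumes the packing in which each 32-bit element of a tile row holds four consecutive bytes, so that each dword of C accumulates four byte products per step. This packing and the 32-bit wraparound are assumptions to be checked against the instruction reference, and the function name `tdpb_d` is hypothetical.

```python
def _as_signed8(byte):
    """Reinterpret a raw byte (0..255) as a signed 8-bit value."""
    return byte - 256 if byte >= 128 else byte


def _wrap32(value):
    """Wrap an integer to signed 32-bit, modelling dword accumulation."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def tdpb_d(c, a, b, a_signed, b_signed):
    """Model of TDPB[XX]D on raw byte tiles.

    c: M x N   list of int32 accumulators
    a: M x 4K  raw bytes (0..255)
    b: K x 4N  raw bytes (0..255)
    a_signed / b_signed select the SS, SU, US and UU variants.
    """
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0]) // 4
    for m in range(m_dim):
        for k in range(k_dim):
            for n in range(n_dim):
                acc = c[m][n]
                for j in range(4):          # four bytes per dword
                    x = a[m][4 * k + j]
                    y = b[k][4 * n + j]
                    if a_signed:
                        x = _as_signed8(x)
                    if b_signed:
                        y = _as_signed8(y)
                    acc += x * y
                c[m][n] = _wrap32(acc)
    return c
```

Calling `tdpb_d(c, a, b, True, False)` corresponds to the SU variant (signed bytes from the first source, unsigned from the second). The other three variants follow by changing the two flags.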

Feature set

Not all hardware implementations support all operations. The AMX extension comprises three sub-extensions: AMX-TILE, AMX-INT8, and AMX-BF16.

Instruction	AMX-TILE (Base)	AMX-INT8 (TMUL)	AMX-BF16 (TMUL)
`LDTILECFG`	✔
`STTILECFG`	✔
`TILELOADD` `TILELOADDT1`	✔
`TILESTORED`	✔
`TILERELEASE`	✔
`TILEZERO`	✔
`TDPBSSD` `TDPBSUD` `TDPBUSD` `TDPBUUD`		✔
`TDPBF16PS`			✔

Detection

CPUID input	CPUID output	Instruction set
EAX=07H, ECX=0	EDX[bit 22]	AMX-BF16
EAX=07H, ECX=0	EDX[bit 24]	AMX-TILE
EAX=07H, ECX=0	EDX[bit 25]	AMX-INT8
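
The bit positions in the table can be decoded as follows. This is a minimal sketch that takes the EDX value returned by CPUID (EAX=07H, ECX=0) as an argument. How that value is obtained is left to the caller, since standard Python cannot execute CPUID directly. The name `decode_amx_features` is hypothetical.

```python
# Bit positions in EDX of CPUID leaf 07H, sub-leaf 0, as listed above.
AMX_FEATURE_BITS = {
    "AMX-BF16": 22,
    "AMX-TILE": 24,
    "AMX-INT8": 25,
}


def decode_amx_features(edx):
    """Map each AMX sub-extension name to True if its EDX bit is set."""
    return {name: bool((edx >> bit) & 1)
            for name, bit in AMX_FEATURE_BITS.items()}
```

A caller would typically require AMX-TILE first and then check AMX-INT8 or AMX-BF16 depending on the data type it intends to use.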

Microarchitecture support

AMX was first planned for Sapphire Rapids.

Microarchitecture	AMX-TILE	AMX-INT8	AMX-BF16
Sapphire Rapids	✔	✔	✔

Intrinsic functions

Bibliography

Intel Architecture Instruction Set Extensions and Future Features Programming Reference, Revision 40. (Ref #319433-040)

@@ Line 1: / Line 1: @@
 {{x86 title|Advanced Matrix Extension (AMX)}}{{x86 isa main}}
-'''Advanced Matrix Extension''' ('''AMX''') is an [[x86]] extension that introduces an accelerator framework for operating on matrices.
+'''Advanced Matrix Extension''' ('''AMX''') is an [[x86]] {{x86|extension}} that introduces a matrix register file and new instructions for operating on matrices.
 == Overview ==
-The Advanced Matrix Extension (AMX) is an [[x86]] extension that introduces a new programming framework for working with matrices. AMX introduces two new components - a 2-dimensional register file with registers called 'tiles' and a set of [[accelerators]] that are able to operate on those tiles. The tiles represent a sub-array portion from a large 2-dimensional memory image. AMX instructions synchronous in the instructions stream and memory loads and stores by tiles are coherent with the host's memory accesses. AMX instructions may be freely interleaved with traditional x86 code and parallel with other extensions (e.g., [[AVX512]]) with special tile loads and stores and accelerator commands being sent over to the accelerator for execution.
+[[File:amx architecture.svg|thumb|right|AMX Architecture]]
+The Advanced Matrix Extension (AMX) is an [[x86]] extension that introduces a new programming framework for working with matrices (rank-2 tensors). The extensions introduce two new components: a 2-dimensional [[register file]] with registers called 'tiles' and a set of [[accelerators]] that are able to operate on those tiles. The tiles represent a sub-array portion from a large 2-dimensional memory image. AMX instructions are synchronous in the [[instruction stream]] with memory load/store operations by tiles being coherent with the host's memory accesses. AMX instructions may be freely interleaved with traditional x86 code and execute in parallel with other extensions (e.g., [[AVX512]]) with special tile loads and stores and accelerator commands being sent over to the accelerator for execution.
 === Palettes ===
@@ Line 18: / Line 19: @@
 AMX supports a set of accelerators that can operate on tiles. Currently, just one accelerator is defined.
 ==== Tile matrix multiply unit (TMUL) ====
-The '''Tile Matrix Multiply''' ('''TMUL''') Unit is an accelerator as part of AMX comprising a grid of fused multiply-add units capable of operating on tiles. The TMUL unit comes with a number of parameters supported including the maximum height (<code>tmul_maxk</code>) and maximum SIMD dimension (<code>tmul_maxn</code>). Those parameters are dynamically read by the TMUL unit upon execution.
+The '''Tile Matrix Multiply''' ('''TMUL''') Unit is an accelerator as part of AMX comprising a grid of fused multiply-add units capable of operating on tiles. Its existence is defined by the ''AMX-INT8'' and ''AMX-BF16'' sub-extensions. The TMUL unit instruction set computes Tile<sub>C</sub>[M][N] += Tile<sub>A</sub>[M][K] * Tile<sub>B</sub>[K][N].
+The TMUL unit comes with a number of parameters supported including the maximum height (<code>tmul_maxk</code>) and maximum SIMD dimension (<code>tmul_maxn</code>). Those parameters are dynamically read by the TMUL unit upon execution.
 == Instructions ==
+[[File:amx dot product of tiles.svg|thumb|right|2 x 3 Dot Product]]
 AMX introduces 12 new instructions:
@@ Line 34: / Line 38: @@
 Operation:
-* <code>TDPBF16PS</code> - Dot product of [[BF16]] tiles, performs a set of SIMD dot-products of two BF16 elements and accumulates the results into a packed single-precision tile.
+* <code>TDPBF16PS</code> - Perform a dot-product of [[BF16]] tiles and accumulate the result. Packed Single Accumulation.
-* <code>TDPBSSD</code>/<code>TDPBSUD</code>/<code>TDPBUSD</code>/<code>TDPBUUD</code> - Dot product of [[Int8]] tiles, performs a set of SIMD dot-products on two bytes and accumulates the results. ''SU'' = Signed/Unsigned, ''US'' = Unsigned/Signed, ''SS'' = Signed/Signed, and ''UU'' = Unsigned/Unsigned pairs.
+* <code>TDPB[XX]D</code> - Perform a dot-product of [[Int8]] tiles and accumulate the result. Dword Accumulation.
+** Where ''XX'' can be: ''SU'' = Signed/Unsigned, ''US'' = Unsigned/Signed, ''SS'' = Signed/Signed, and ''UU'' = Unsigned/Unsigned pairs.
 === Feature set ===
@@ Line 42: / Line 47: @@
 {| class="wikitable"
 |-
-! rowspan="2" | Instruction !! colspan="3" | Feature Set
+! rowspan="3" | Instruction !! colspan="3" | Feature Set
+|-
+|-
+! Base || colspan="2" | [[#TMUL|TMUL]]
 |-
 ! AMX-TILE !! AMX-INT8 !! AMX-BF16
@@ Line 57: / Line 65: @@
 |-
 | <code>TILEZERO</code> || {{tchk|yes}} || ||
-|-
-| <code>TDPBF16PS</code> || || {{tchk|yes}} ||
 |-
 | <code>TDPBSSD</code><br>
@@ Line 64: / Line 70: @@
 <code>TDPBUSD</code><br>
 <code>TDPBUUD</code>
-| || || {{tchk|yes}}
+| || {{tchk|yes}} ||
+|-
+| <code>TDPBF16PS</code> || || || {{tchk|yes}}
 |}
+== Detection ==
+{| class="wikitable"
+! colspan="2" | {{x86|CPUID}} !! rowspan="2" | Instruction Set
+|-
+! Input !! Output
+|-
+| rowspan="3" | EAX=07H, ECX=0 || EDX[bit 22] || AMX-BF16
+|-
+| EDX[bit 24] || AMX-TILE
+|-
+| EDX[bit 25] || AMX-INT8
+|}
 == Microarchitecture support ==
+[[File:intel server roadmap (2020) with amx.png|thumb|right|AMX was first planned for {{intel|Sapphire Rapids|l=arch}}.]]
 {| class="wikitable"
 |-
-! Instructions !! Introduction
+! Microarchitecture !! AMX-TILE !! AMX-INT8 !! AMX-BF16
 |-
-| AMX || {{intel|Sapphire Rappids|l=arch}} (server)
+| {{intel|Sapphire Rapids|l=arch}} || {{tchk|yes}} || {{tchk|yes}} || {{tchk|yes}}
 |}
