From WikiChip
Advanced Matrix Extension (AMX) - x86
Revision as of 09:59, 28 June 2020 by David (Feature set)

Advanced Matrix Extension (AMX) is an x86 extension that introduces an accelerator framework for operating on matrices.

Overview

The Advanced Matrix Extension (AMX) is an x86 extension that introduces a new programming framework for working with matrices. AMX introduces two new components: a 2-dimensional register file whose registers are called 'tiles', and a set of accelerators that operate on those tiles. A tile represents a sub-array of a larger 2-dimensional memory image. AMX instructions are synchronous in the instruction stream, and tile loads and stores are coherent with the host's memory accesses. AMX instructions may be freely interleaved with traditional x86 code and can execute in parallel with other extensions (e.g., AVX512); the special tile loads, tile stores, and accelerator commands are sent to the accelerator for execution.

Palettes

Software determines the kind of operations available on specific hardware by enumerating a palette of options.

Currently, 2 palettes exist:

  • Palette 0 - initialized state
  • Palette 1 - an 8-tile register file in which each register holds 16 rows x 64 bytes (1 KiB), for a total register file of 8 KiB.

A programmer can shrink the register file by configuring tiles with smaller dimensions to suit their algorithm. Each tile is configured with a number of rows and a number of bytes per row (rows and bytes_per_row), which are stored as metadata for the accelerator to operate on. The selected palette and the tile dimensions are held in a tile control register (TILECFG), which is programmed using the LDTILECFG instruction. The parameters of each palette are enumerated through the palette_table in CPUID leaf 1DH.
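As a rough illustration of the configuration described above, the following C sketch models the 64-byte memory image that LDTILECFG consumes and adds up the tile storage a configuration describes. The byte layout (palette_id, start_row, 16 bytes-per-row entries, 16 row entries) is taken from the Intel reference listed in the Bibliography, not from this article, and the names amx_tilecfg, amx_set_tile, and amx_config_bytes are hypothetical. It is plain C and does not execute any AMX instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Palette 1 limits as stated in the text: 8 tiles, 16 rows, 64 bytes per row. */
enum { AMX_P1_TILES = 8, AMX_P1_MAX_ROWS = 16, AMX_P1_MAX_COLSB = 64 };

/* Hypothetical model of the 64-byte LDTILECFG memory image.
   Layout assumed from the Intel programming reference. */
typedef struct {
    uint8_t  palette_id;    /* byte 0: 0 = initialized state, 1 = palette 1 */
    uint8_t  start_row;     /* byte 1: restart row for interrupted loads/stores */
    uint8_t  reserved[14];  /* bytes 2-15: reserved, must be zero */
    uint16_t colsb[16];     /* bytes 16-47: bytes per row of each tile */
    uint8_t  rows[16];      /* bytes 48-63: number of rows of each tile */
} amx_tilecfg;

_Static_assert(sizeof(amx_tilecfg) == 64, "tile configuration must be 64 bytes");

/* Set the shape of one tile and select palette 1.
   Returns 0 on success, -1 if the request exceeds the palette 1 limits. */
int amx_set_tile(amx_tilecfg *cfg, int tile, int rows, int colsb)
{
    if (tile < 0 || tile >= AMX_P1_TILES) return -1;
    if (rows < 1 || rows > AMX_P1_MAX_ROWS) return -1;
    if (colsb < 1 || colsb > AMX_P1_MAX_COLSB) return -1;
    cfg->palette_id = 1;
    cfg->rows[tile] = (uint8_t)rows;
    cfg->colsb[tile] = (uint16_t)colsb;
    return 0;
}

/* Total bytes of tile data described by the configuration. */
size_t amx_config_bytes(const amx_tilecfg *cfg)
{
    size_t total = 0;
    for (int t = 0; t < AMX_P1_TILES; t++)
        total += (size_t)cfg->rows[t] * cfg->colsb[t];
    return total;
}
```

Configuring all eight tiles at 16 rows x 64 bytes describes the full 8 KiB register file; smaller shapes describe proportionally less.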

Accelerators

AMX supports a set of accelerators that can operate on tiles. Currently, just one accelerator is defined.

Tile matrix multiply unit (TMUL)

The Tile Matrix Multiply (TMUL) unit is an AMX accelerator comprising a grid of fused multiply-add units capable of operating on tiles. Its presence is indicated by the AMX-INT8 and AMX-BF16 sub-extensions. The TMUL unit exposes a number of parameters, including the maximum height (tmul_maxk) and the maximum SIMD dimension (tmul_maxn). Those parameters are read dynamically by the TMUL unit upon execution.

Instructions

AMX introduces 12 new instructions:

Configuration:

  • LDTILECFG - Load tile configuration, loads the tile configuration from the 64-byte memory location specified.
  • STTILECFG - Store tile configuration, stores the tile configuration in the 64-byte memory location specified.

Data:

  • TILELOADD/TILELOADDT1 - Load tile
  • TILESTORED - Store tile
  • TILERELEASE - Release tile, returns TILECFG and TILEDATA to the INIT state
  • TILEZERO - Zero tile, zeroes the destination tile

Operation:

  • TDPBF16PS - Dot product of BF16 tiles, performs a set of SIMD dot-products of two BF16 elements and accumulates the results into a packed single-precision tile.
  • TDPBSSD/TDPBSUD/TDPBUSD/TDPBUUD - Dot product of Int8 tiles, performs a set of SIMD dot-products on byte elements from two source tiles and accumulates the results. SS = Signed/Signed, SU = Signed/Unsigned, US = Unsigned/Signed, and UU = Unsigned/Unsigned pairs.
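To make the Int8 dot-product description above more concrete, here is a scalar C sketch of the signed/signed variant. It is a reference model under assumptions, not the hardware algorithm: the loop structure and the grouping of four bytes per 32-bit accumulator follow the operation described in the Intel reference in the Bibliography, the function name tile_dot_s8s8 and its flat row-major arrays are hypothetical, and the model assumes the sums do not overflow a 32-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a signed-by-signed byte dot-product over tiles.
     a: M rows x (4*K) int8 values, row-major
     b: K rows x (4*N) int8 values, row-major; each group of 4 bytes in b
        pairs with 4 consecutive bytes of a row of a
     c: M rows x N int32 accumulators, row-major
   Dimensions are limited to 16, matching a 16-row x 64-byte tile. */
void tile_dot_s8s8(int32_t *c, const int8_t *a, const int8_t *b,
                   int M, int N, int K)
{
    assert(M >= 1 && M <= 16);
    assert(N >= 1 && N <= 16);
    assert(K >= 1 && K <= 16);

    for (int m = 0; m < M; m++) {
        for (int k = 0; k < K; k++) {
            for (int n = 0; n < N; n++) {
                int32_t sum = 0;
                /* 4-element dot product of sign-extended bytes */
                for (int i = 0; i < 4; i++) {
                    sum += (int32_t)a[m * 4 * K + 4 * k + i] *
                           (int32_t)b[k * 4 * N + 4 * n + i];
                }
                /* accumulate into the destination element */
                c[m * N + n] += sum;
            }
        }
    }
}
```

The unsigned and mixed variants would differ only in how the source bytes are extended before the multiply.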

Feature set

Not all hardware implementations support all operations. The AMX extension comprises three sub-extensions: AMX-TILE, AMX-INT8, and AMX-BF16.

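Feature set by instruction (AMX-TILE is the base feature set; AMX-INT8 and AMX-BF16 are the TMUL feature sets):

Instruction    AMX-TILE    AMX-INT8    AMX-BF16
LDTILECFG      Yes
STTILECFG      Yes
TILELOADD      Yes
TILELOADDT1    Yes
TILESTORED     Yes
TILERELEASE    Yes
TILEZERO       Yes
TDPBSSD                    Yes
TDPBSUD                    Yes
TDPBUSD                    Yes
TDPBUUD                    Yes
TDPBF16PS                              Yes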

Detection

CPUID input       Output         Instruction set
EAX=07H, ECX=0    EDX[bit 22]    AMX-BF16
EAX=07H, ECX=0    EDX[bit 24]    AMX-TILE
EAX=07H, ECX=0    EDX[bit 25]    AMX-INT8
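The bit positions above can be checked with a few lines of C. The sketch below separates decoding (a pure function over the EDX value) from querying the CPU; the type amx_features and the functions amx_decode_edx and amx_query are hypothetical names, and the query path assumes a GCC- or Clang-compatible compiler on x86 that provides __get_cpuid_count in <cpuid.h>. On other targets amx_query simply reports no features.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define AMX_HAVE_CPUID 1
#endif

typedef struct {
    bool amx_tile;
    bool amx_int8;
    bool amx_bf16;
} amx_features;

/* Decode the AMX bits from the EDX value returned by CPUID.(EAX=07H, ECX=0). */
amx_features amx_decode_edx(uint32_t edx)
{
    amx_features f;
    f.amx_bf16 = (edx >> 22) & 1u;
    f.amx_tile = (edx >> 24) & 1u;
    f.amx_int8 = (edx >> 25) & 1u;
    return f;
}

/* Query the running CPU. Reports no features if CPUID leaf 7 is
   unavailable or the build target is not x86. */
amx_features amx_query(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
#ifdef AMX_HAVE_CPUID
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        edx = 0;
#endif
    (void)eax; (void)ebx; (void)ecx;
    return amx_decode_edx((uint32_t)edx);
}
```

This only reports what CPUID advertises. It does not check whether the operating system has enabled the tile state, which software would also need to confirm before executing AMX instructions.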


Microarchitecture support

Microarchitecture    AMX-TILE    AMX-INT8    AMX-BF16
Sapphire Rapids      Yes         Yes         Yes

Intrinsic functions

Bibliography

  • Intel Architecture Instruction Set Extensions and Future Features Programming Reference, Revision 40. (Ref #319433-040)