Difference between revisions of "x86/avx512 4vnniw"

Latest revision as of 15:28, 15 March 2023

x86
Instruction Set Architecture

AVX-512 Vector Neural Network Instructions Word Variable Precision (AVX512_4VNNIW) is an x86 extension and part of the AVX-512 SIMD instruction set.

Overview[edit]

VP4DPWSSD, VP4DPWSSDS: Dot product of signed 16-bit words, accumulated in doublewords, four iterations.

These instructions use two 512-bit vector operands. The first one is a vector register, the second one is obtained by reading a 128-bit vector from memory and broadcasting it to the four 128-bit lanes of a 512-bit vector. The instructions multiply the 32 corresponding signed words in the source operands, then add the signed 32-bit products from the even lanes, odd lanes, and the 16 signed doublewords in the 512-bit destination vector register and store the sums in the destination. Finally the instructions increment the number of the source register by one modulo four, and repeat these operations three more times, reading four vector registers total in a 4-aligned block, e.g. ZMM12 ... ZMM15. Exceptions can occur in each iteration. Write masking is supported.

VP4DPWSSD can be replaced by 16 VPDPWSSD instructions (from the later AVX512_VNNI extension) working on 128-bit vectors, or four 512-bit instructions and a memory load with broadcast.

VP4DPWSSDS performs the same operations except the 33-bit intermediate sum is stored in the destination with signed saturation:

dest = min(max(-2³¹, even + odd + dest), 2³¹ - 1)

Its VNNI equivalent is VPDPWSSDS.

Motivation[edit]

Intel introduced this extension on their Knights Mill microarchitecture (Xeon Phi many-core products) to accelerate convolutional neural network-based algorithms. It was not implemented on other chips but partially revived with AVX512_VNNI on Cascade Lake and later microarchitectures.

Detection[edit]

CPUID		Instruction Set
Input	Output	Instruction Set
EAX=07H, ECX=0	EDX[bit 02]	AVX512_4VNNIW

Microarchitecture support[edit]

Designer	Microarchitecture	Year	Support Level
Designer	Microarchitecture	Year	F	CD	ER	PF	BW	DQ	VL	FP16	IFMA	VBMI	VBMI2	BITALG	VPOPCNTDQ	VP2INTERSECT	4VNNIW	4FMAPS	VNNI	BF16
Intel	Knights Landing	2016	✔	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘
	Knights Mill	2017	✔	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✔	✘	✔	✔	✘	✘
	Skylake (server)	2017	✔	✔	✘	✘	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘
	Cannon Lake	2018	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘
	Cascade Lake	2019	✔	✔	✘	✘	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✔	✘
	Cooper Lake	2020	✔	✔	✘	✘	✔	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✔	✔
	Tiger Lake	2020	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✔	✘	✘	✔	✘
	Rocket Lake	2021	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✘	✘	✘	✔	✘
	Alder Lake	2021	✔	✔	✘	✘	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✘	✘	✔	✔
	Ice Lake (server)	2021	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✘	✘	✘	✔	✘
	Sapphire Rapids	2023	✔	✔	✘	✘	✔	✔	✔	✔	✔	✔	✔	✔	✔	✘	✘	✘	✔	✔
AMD	Zen 4	2022	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✘	✘	✘	✔	✔
Centaur	CHA		✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘

Intrinsic functions[edit]

// VP4DPWSSD
__m512i _mm512_4dpwssd_epi32(__m512i, __m512ix4, __m128i *);
__m512i _mm512_mask_4dpwssd_epi32(__m512i, __mmask16, __m512ix4, __m128i *);
__m512i _mm512_maskz_4dpwssd_epi32(__mmask16, __m512i, __m512ix4, __m128i *);
// VP4DPWSSDS
__m512i _mm512_4dpwssds_epi32(__m512i, __m512ix4, __m128i *);
__m512i _mm512_mask_4dpwssds_epi32(__m512i, __mmask16, __m512ix4, __m128i *);
__m512i _mm512_maskz_4dpwssds_epi32(__mmask16, __m512i, __m512ix4, __m128i *);

Bibliography[edit]

"Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B, 2C & 2D): Instruction Set Reference, A-Z", Intel Order Nr. 325383, Rev. 078US, December 2022

WikiChip

The Fuse Coverage

Social Media

Companies

Microarchitectures

Technology Nodes

Intel

AMD

ARM

Cavium

Samsung

Intel

AMD

Ampere

Apple

Cavium

HiSilicon

MediaTek

NXP

Qualcomm

Renesas