AVX-512 Vector Neural Network Instructions Word Variable Precision (4VNNIW) - x86

AVX-512 Vector Neural Network Instructions Word Variable Precision (AVX512_4VNNIW) is an x86 extension and part of the AVX-512 SIMD instruction set.

Overview

VP4DPWSSD, VP4DPWSSDS
Dot product of signed 16-bit words, accumulated in doublewords, four iterations.
These instructions take two 512-bit vector operands per iteration. The first is a vector register. The second is derived from a 128-bit memory operand holding four doublewords (four pairs of signed words): in each iteration one of these doublewords is broadcast to all 16 doubleword lanes of a 512-bit vector. The instructions multiply the 32 corresponding signed words of the two operands, then, for each doubleword lane, add the signed 32-bit products of the even and odd words to the signed doubleword in the same lane of the 512-bit destination register and store the sum there. The instructions then increment the source register number by one modulo four, advance to the next doubleword of the memory operand, and repeat these operations three more times, reading four vector registers in total from a 4-aligned block, e.g. ZMM12 ... ZMM15. Exceptions can occur in each iteration. Write masking is supported.
VP4DPWSSD can be replaced by four 512-bit VPDPWSSD instructions (from the later AVX512_VNNI extension), each taking its second operand as a doubleword broadcast from memory, or by 16 VPDPWSSD instructions working on 128-bit vectors.
VP4DPWSSDS performs the same operations except the 33-bit intermediate sum is stored in the destination with signed saturation:
dest = min(max(-2^31, even + odd + dest), 2^31 - 1)
Its VNNI equivalent is VPDPWSSDS.
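
The operation described above can be sketched as a scalar reference model in C. This is a minimal sketch and not Intel's definition: the function names, the array-based register representation and the `saturate` flag are hypothetical, and it assumes that iteration m pairs source register m of the block with word pair m of the 128-bit memory operand, with merge masking for disabled lanes. The non-saturating path wraps modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16 /* doubleword lanes in a 512-bit vector */
#define ITERS 4  /* source registers in the 4-aligned block */

/* Clamp a wide intermediate sum to the signed 32-bit range. */
static int32_t saturate_i32(int64_t v) {
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/*
 * Hypothetical scalar model of VP4DPWSSD (saturate = 0) and
 * VP4DPWSSDS (saturate = 1).
 *   dest: 16 doubleword accumulators, updated in place
 *   src:  four source registers, each viewed as 32 signed words
 *   mem:  128-bit memory operand viewed as 8 signed words (4 word pairs)
 *   mask: write mask; bit i enables lane i, disabled lanes keep their value
 */
static void vp4dpwssd_model(int32_t dest[LANES],
                            int16_t src[ITERS][2 * LANES],
                            const int16_t mem[2 * ITERS],
                            uint16_t mask, int saturate) {
    assert(dest && src && mem);
    for (int m = 0; m < ITERS; m++) {          /* one pass per source register */
        for (int i = 0; i < LANES; i++) {
            if (!((mask >> i) & 1)) continue;  /* lane masked off */
            int64_t even = (int64_t)src[m][2 * i] * mem[2 * m];
            int64_t odd = (int64_t)src[m][2 * i + 1] * mem[2 * m + 1];
            int64_t sum = even + odd + dest[i];
            /* The unsigned cast wraps modulo 2^32 on mainstream compilers. */
            dest[i] = saturate ? saturate_i32(sum) : (int32_t)(uint32_t)sum;
        }
    }
}
```

For example, with every word of source register m set to m + 1 and each memory word pair set to (1, 2), every enabled lane grows by 3 × (1 + 2 + 3 + 4) = 30. Clamping after every iteration follows the description of VP4DPWSSDS above.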

Motivation

Intel introduced this extension on its Knights Mill microarchitecture (Xeon Phi many-core products) to accelerate convolutional neural network-based algorithms. It was not implemented on any other chip, but it was partially revived with AVX512_VNNI on Cascade Lake and later microarchitectures.

Detection

CPUID input        CPUID output    Instruction set
EAX=07H, ECX=0     EDX[bit 02]     AVX512_4VNNIW
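
The CPUID check above could be written roughly as follows. This is a sketch that assumes a GCC or Clang toolchain, whose `<cpuid.h>` header provides `__get_cpuid_count`; the helper names `edx_has_avx512_4vnniw` and `cpu_has_avx512_4vnniw` are hypothetical. On non-x86 builds the query simply reports no support.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helper header (toolchain assumption) */
#endif

/* CPUID leaf 07H, subleaf 0: EDX bit 2 reports AVX512_4VNNIW. */
#define AVX512_4VNNIW_EDX_BIT 2u

/* Pure bit test on an EDX value returned by CPUID.(EAX=07H, ECX=0). */
static int edx_has_avx512_4vnniw(uint32_t edx) {
    return (int)((edx >> AVX512_4VNNIW_EDX_BIT) & 1u);
}

/* Returns 1 if CPUID reports the extension, 0 otherwise. */
static int cpu_has_avx512_4vnniw(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* __get_cpuid_count returns 0 when leaf 7 is not available. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return edx_has_avx512_4vnniw(edx);
#else
    return 0; /* not an x86 target */
#endif
}
```

A complete check before executing AVX-512 code would also confirm that the operating system has enabled ZMM state (via XGETBV), which this sketch leaves out.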

Microarchitecture support

Designer    Microarchitecture    Year
Intel       Knights Landing      2016
Intel       Knights Mill         2017
Intel       Skylake (server)     2017
Intel       Cannon Lake          2018
Intel       Cascade Lake         2019
Intel       Cooper Lake          2020
Intel       Tiger Lake           2020
Intel       Rocket Lake          2021
Intel       Alder Lake           2021
Intel       Ice Lake (server)    2021
Intel       Sapphire Rapids      2023
AMD         Zen 4                2022
Centaur     CHA
Support level columns: F, CD, ER, PF, BW, DQ, VL, FP16, IFMA, VBMI, VBMI2, BITALG, VPOPCNTDQ, VP2INTERSECT, 4VNNIW, 4FMAPS, VNNI, BF16

Intrinsic functions

// __m512ix4 denotes the block of four __m512i source registers;
// src is the accumulator, k the write mask, b the 128-bit memory operand.
// VP4DPWSSD
__m512i _mm512_4dpwssd_epi32(__m512i src, __m512ix4 a, __m128i *b);
__m512i _mm512_mask_4dpwssd_epi32(__m512i src, __mmask16 k, __m512ix4 a, __m128i *b);
__m512i _mm512_maskz_4dpwssd_epi32(__mmask16 k, __m512i src, __m512ix4 a, __m128i *b);
// VP4DPWSSDS (same operation with signed saturation)
__m512i _mm512_4dpwssds_epi32(__m512i src, __m512ix4 a, __m128i *b);
__m512i _mm512_mask_4dpwssds_epi32(__m512i src, __mmask16 k, __m512ix4 a, __m128i *b);
__m512i _mm512_maskz_4dpwssds_epi32(__mmask16 k, __m512i src, __m512ix4 a, __m128i *b);

Bibliography