AVX-512 Fused Multiply-Accumulate Packed Single Precision (4FMAPS) - x86

AVX-512 Fused Multiply-Accumulate Packed Single Precision (AVX512_4FMAPS) is an x86 extension and part of the AVX-512 SIMD instruction set.

Overview

V4FMADDPS, V4FNMADDPS
Parallel fused multiply-accumulate of single precision values, four iterations.
In each iteration the instructions read 16 multiplicands from a 512-bit source vector register and one multiplier from memory, which is broadcast to all 16 elements of a second vector. They add the 16 products and the 16 values in the corresponding elements of the 512-bit destination register, round the sums as desired, and store them in the destination. Finally the instructions increment the source register number by one, wrapping within the aligned block of four, and advance the memory address by four bytes. Exceptions can occur in each iteration. Write masking is supported.
In total these instructions perform 64 multiply-accumulate operations: they read 64 single precision multiplicands from four source registers forming a 4-aligned block, e.g. ZMM12 ... ZMM15, and four consecutive single precision multipliers from memory, accumulating 16 single precision results four times over and rounding four times.
V4FNMADDPS performs the same operation as V4FMADDPS, except that it negates each product before the addition, as sketched below.
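The per-iteration arithmetic can be sketched in plain C. This is a minimal model, not the architectural definition: write masking, rounding-mode control, and exception behavior are omitted, and all names are illustrative.

#include <math.h>

/* Illustrative model of V4FMADDPS zmm1, zmm2+3, m128 (simplified).
   dst - the 16-element accumulator (e.g. ZMM1)
   src - the 4-aligned block of source registers (e.g. ZMM12 ... ZMM15)
   mem - four consecutive single precision multipliers in memory        */
static void v4fmaddps_model(float dst[16], const float src[4][16],
                            const float mem[4])
{
    for (int i = 0; i < 4; i++)           /* four iterations             */
        for (int e = 0; e < 16; e++)      /* mem[i] broadcast to 16 lanes */
            dst[e] = fmaf(src[i][e], mem[i], dst[e]);  /* fused, rounds once */
    /* V4FNMADDPS computes dst[e] = fmaf(-src[i][e], mem[i], dst[e]) instead. */
}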
V4FMADDSS, V4FNMADDSS
These "scalar" variants perform the same operations but yield only a single result in the lowest element of the 128-bit destination vector, leaving the three higher elements unchanged. As usual if the vector size is less than 512 bits the instructions zero the unused higher bits in the destination register to avoid a dependency on earlier instructions writing those bits.
In total these instructions sequentially perform four multiply-accumulate operations, read a single precision multiplicand from four source registers, four single precision multipliers from memory, and accumulate one single precision result four times in the destination register, also rounding four times.
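The scalar forms reduce to a four-step running sum in one lane; under the same simplifying assumptions as above:

#include <math.h>

/* Illustrative model of V4FMADDSS xmm1, xmm2+3, m128 (masking omitted).
   acc - element 0 of the destination register
   s0  - element 0 of each of the four source registers
   mem - four consecutive single precision multipliers in memory         */
static float v4fmaddss_model(float acc, const float s0[4], const float mem[4])
{
    for (int i = 0; i < 4; i++)
        acc = fmaf(s0[i], mem[i], acc);   /* rounds once per iteration */
    return acc;   /* elements 1 ... 3 of the destination are unchanged */
}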

Motivation

Intel introduced this extension on their Knights Mill microarchitecture (Xeon Phi many-core products) to accelerate convolutional neural network-based algorithms. It was not implemented on any other chip. Plain parallel single precision fused multiply-add instructions became available earlier with the FMA extension (AVX instructions with 128- and 256-bit vectors) and the AVX-512 Foundation extension (128-, 256-, and 512-bit vectors).

4FMAPS has an integer counterpart, AVX512_4VNNIW.

Detection

CPUID input      CPUID output  Instruction set
EAX=07H, ECX=0   EDX[bit 03]   AVX512_4FMAPS
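With GCC or Clang the feature bit can be queried through the <cpuid.h> helpers; a minimal sketch:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Structured extended feature flags: EAX=07H, ECX=0 */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)
        && (edx & (1u << 3)))                    /* EDX[bit 03] */
        puts("AVX512_4FMAPS supported");
    else
        puts("AVX512_4FMAPS not supported");
    return 0;
}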

Microarchitecture support

Designer  Microarchitecture  Year  4VNNIW  4FMAPS
Intel     Knights Landing    2016  ✘       ✘
Intel     Knights Mill       2017  ✔       ✔
Intel     Skylake (server)   2017  ✘       ✘
Intel     Cannon Lake        2018  ✘       ✘
Intel     Cascade Lake       2019  ✘       ✘
Intel     Cooper Lake        2020  ✘       ✘
Intel     Tiger Lake         2020  ✘       ✘
Intel     Rocket Lake        2021  ✘       ✘
Intel     Alder Lake         2021  ✘       ✘
Intel     Ice Lake (server)  2021  ✘       ✘
Intel     Sapphire Rapids    2023  ✘       ✘
AMD       Zen 4              2022  ✘       ✘
Centaur   CHA                      ✘       ✘

Intrinsic functions

// V4FMADDPS, V4FNMADDPS
__m512 _mm512_4fmadd_ps(__m512, __m512x4, __m128 *);
__m512 _mm512_mask_4fmadd_ps(__m512, __mmask16, __m512x4, __m128 *);
__m512 _mm512_maskz_4fmadd_ps(__mmask16, __m512, __m512x4, __m128 *);
__m512 _mm512_4fnmadd_ps(__m512, __m512x4, __m128 *);
__m512 _mm512_mask_4fnmadd_ps(__m512, __mmask16, __m512x4, __m128 *);
__m512 _mm512_maskz_4fnmadd_ps(__mmask16, __m512, __m512x4, __m128 *);
// V4FMADDSS, V4FNMADDSS
__m128 _mm_4fmadd_ss(__m128, __m128x4, __m128 *);
__m128 _mm_mask_4fmadd_ss(__m128, __mmask8, __m128x4, __m128 *);
__m128 _mm_maskz_4fmadd_ss(__mmask8, __m128, __m128x4, __m128 *);
__m128 _mm_4fnmadd_ss(__m128, __m128x4, __m128 *);
__m128 _mm_mask_4fnmadd_ss(__m128, __mmask8, __m128x4, __m128 *);
__m128 _mm_maskz_4fnmadd_ss(__mmask8, __m128, __m128x4, __m128 *);
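The __m512x4 and __m128x4 aggregates in the listing are notational, standing for the 4-aligned block of source registers; the GCC and Clang headers instead flatten them into four separate vector arguments, with the accumulator passed first. A usage sketch under that assumption (the wrapper name is illustrative; build with -mavx5124fmaps, which only targets Knights Mill):

#include <immintrin.h>

/* Sketch: per lane, acc += a0*m[0] + a1*m[1] + a2*m[2] + a3*m[3].
   Assumes the GCC/Clang prototype with the accumulator first and four
   separate __m512 sources; the compiler allocates the sources to a
   4-aligned register block as the instruction requires.               */
__m512 dot4_accumulate(__m512 acc, __m512 a0, __m512 a1,
                       __m512 a2, __m512 a3, float m[4])
{
    return _mm512_4fmadd_ps(acc, a0, a1, a2, a3, (__m128 *)m);
}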
