AVX-512 Fused Multiply-Accumulate Packed Single Precision (4FMAPS) - x86

x86
Instruction Set Architecture

History
Families

AMD's CPUIDs
Intel's CPUIDs

AVX-512 Fused Multiply-Accumulate Packed Single Precision (AVX512_4FMAPS) is an x86 extension and part of the AVX-512 SIMD instruction set.

Overview[edit]

V4FMADDPS, V4FNMADDPS: Parallel fused multiply-accumulate of single precision values, four iterations.

In each iteration the instructions source 16 multiplicands from a 512-bit vector register, and one multiplier from memory which is broadcast to all 16 elements of a second vector. They add the 16 products and the 16 values in the corresponding elements of the 512-bit destination register, round the sums as desired, and store them in the destination. Finally the instructions increment the number of the source register by one modulo four, and the memory address by four bytes. Exceptions can occur in each iteration. Write masking is supported.

In total these instructions perform 64 multiply-accumulate operations, reading 64 single precision multiplicands from four source registers in a 4-aligned block, e.g. ZMM12 ... ZMM15, four single precision multipliers consecutive in memory, and accumulate 16 single precision results four times, also rounding four times.

V4FNMADD performs the same operation as V4FMADD except this instruction also negates the product.

V4FMADDSS, V4FNMADDSS: These "scalar" variants perform the same operations but yield only a single result in the lowest element of the 128-bit destination vector, leaving the three higher elements unchanged. As usual if the vector size is less than 512 bits the instructions zero the unused higher bits in the destination register to avoid a dependency on earlier instructions writing those bits.

In total these instructions sequentially perform four multiply-accumulate operations, read a single precision multiplicand from four source registers, four single precision multipliers from memory, and accumulate one single precision result four times in the destination register, also rounding four times.

Motivation[edit]

Intel introduced this extension on their Knights Mill microarchitecture (Xeon Phi many-core products) to accelerate convolutional neural network-based algorithms. It was not implemented on other chips. Plain parallel, single precision floating point, fused multiply-add instructions became available with the FMA extension (AVX instructions with 128- and 256-bit vector size) and AVX-512 Foundation extension (128-, 256-, 512-bit vectors).

4FMAPS has an integer counterpart AVX512_4VNNIW.

Detection[edit]

CPUID		Instruction Set
Input	Output	Instruction Set
EAX=07H, ECX=0	EDX[bit 03]	AVX512_4FMAPS

Microarchitecture support[edit]

Designer	Microarchitecture	Year	Support Level
Designer	Microarchitecture	Year	F	CD	ER	PF	BW	DQ	VL	FP16	IFMA	VBMI	VBMI2	BITALG	VPOPCNTDQ	VP2INTERSECT	4VNNIW	4FMAPS	VNNI	BF16
Intel	Knights Landing	2016	✔	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘
	Knights Mill	2017	✔	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✔	✘	✔	✔	✘	✘
	Skylake (server)	2017	✔	✔	✘	✘	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘
	Cannon Lake	2018	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘
	Cascade Lake	2019	✔	✔	✘	✘	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✔	✘
	Cooper Lake	2020	✔	✔	✘	✘	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✔	✔
	Tiger Lake	2020	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✔	✘	✘	✔	✘
	Rocket Lake	2021	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✘	✘	✘	✔	✘
	Alder Lake	2021	✔	✔	✘	✘	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✘	✘	✔	✔
	Ice Lake (server)	2021	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✘	✘	✘	✔	✘
	Sapphire Rapids	2023	✔	✔	✘	✘	✔	✔	✔	✔	✔	✔	✔	✔	✔	✘	✘	✘	✔	✔
AMD	Zen 4	2022	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✘	✘	✘	✔	✔
AMD	Zen 5	2024	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✔	✘	✘	✔	✔
Centaur	CHA		✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘

Intrinsic functions[edit]

// V4FMADDPS
__m512 _mm512_4fmadd_ps( __m512, __m512x4, __m128 *);
__m512 _mm512_mask_4fmadd_ps(__m512, __mmask16, __m512x4, __m128 *);
__m512 _mm512_maskz_4fmadd_ps(__mmask16, __m512, __m512x4, __m128 *);
__m512 _mm512_4fnmadd_ps(__m512, __m512x4, __m128 *);
__m512 _mm512_mask_4fnmadd_ps(__m512, __mmask16, __m512x4, __m128 *);
__m512 _mm512_maskz_4fnmadd_ps(__mmask16, __m512, __m512x4, __m128 *);
// V4FMADDSS
__m128 _mm_4fmadd_ss(__m128, __m128x4, __m128 *);
__m128 _mm_mask_4fmadd_ss(__m128, __mmask8, __m128x4, __m128 *);
__m128 _mm_maskz_4fmadd_ss(__mmask8, __m128, __m128x4, __m128 *);
__m128 _mm_4fnmadd_ss(__m128, __m128x4, __m128 *);
__m128 _mm_mask_4fnmadd_ss(__m128, __mmask8, __m128x4, __m128 *);
__m128 _mm_maskz_4fnmadd_ss(__mmask8, __m128, __m128x4, __m128 *);

Bibliography[edit]

"Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B, 2C & 2D): Instruction Set Reference, A-Z", Intel Order Nr. 325383, Rev. 078US, December 2022

WikiChip

The Fuse Coverage

Social Media

Companies

Microarchitectures

Technology Nodes

Intel

AMD

ARM

Cavium

Samsung

Intel

AMD

Ampere

Apple

Cavium

HiSilicon

MediaTek

NXP

Qualcomm

Renesas