From WikiChip
AVX-512 Fused Multiply-Accumulate Packed Single Precision (4FMAPS) - x86
x86
Instruction Set Architecture
Instruction Set Architecture
General
Variants
Topics
- Instructions
- Addressing Modes
- Registers
- Model-Specific Register
- Assembly
- Interrupts
- Micro-Ops
- Timer
- Calling Convention
- Microarchitectures
- CPUID
CPUIDs
Modes
Extensions(all)
AVX-512 Fused Multiply-Accumulate Packed Single Precision (AVX512_4FMAPS) is an x86 extension and part of the AVX-512 SIMD instruction set.
Contents
Overview[edit]
-
V4FMADDPS
,V4FNMADDPS
- Parallel fused multiply-accumulate of single precision values, four iterations.
- In each iteration the instructions source 16 multiplicands from a 512-bit vector register, and one multiplier from memory which is broadcast to all 16 elements of a second vector. They add the 16 products and the 16 values in the corresponding elements of the 512-bit destination register, round the sums as desired, and store them in the destination. Finally the instructions increment the number of the source register by one modulo four, and the memory address by four bytes. Exceptions can occur in each iteration. Write masking is supported.
- In total these instructions perform 64 multiply-accumulate operations, reading 64 single precision multiplicands from four source registers in a 4-aligned block, e.g. ZMM12 ... ZMM15, four single precision multipliers consecutive in memory, and accumulate 16 single precision results four times, also rounding four times.
-
V4FNMADD
performs the same operation asV4FMADD
except this instruction also negates the product.
-
V4FMADDSS
,V4FNMADDSS
- These "scalar" variants perform the same operations but yield only a single result in the lowest element of the 128-bit destination vector, leaving the three higher elements unchanged. As usual if the vector size is less than 512 bits the instructions zero the unused higher bits in the destination register to avoid a dependency on earlier instructions writing those bits.
- In total these instructions sequentially perform four multiply-accumulate operations, read a single precision multiplicand from four source registers, four single precision multipliers from memory, and accumulate one single precision result four times in the destination register, also rounding four times.
Motivation[edit]
Intel introduced this extension on their Knights Mill microarchitecture (Xeon Phi many-core products) to accelerate convolutional neural network-based algorithms. It was not implemented on other chips. Plain parallel, single precision floating point, fused multiply-add instructions became available with the FMA extension (AVX instructions with 128- and 256-bit vector size) and AVX-512 Foundation extension (128-, 256-, 512-bit vectors).
4FMAPS has an integer counterpart AVX512_4VNNIW.
Detection[edit]
CPUID | Instruction Set | |
---|---|---|
Input | Output | |
EAX=07H, ECX=0 | EDX[bit 03] | AVX512_4FMAPS |
Microarchitecture support[edit]
Designer | Microarchitecture | Year | Support Level | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
F | CD | ER | PF | BW | DQ | VL | FP16 | IFMA | VBMI | VBMI2 | BITALG | VPOPCNTDQ | VP2INTERSECT | 4VNNIW | 4FMAPS | VNNI | BF16 | ||||
Intel | Knights Landing | 2016 | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | |
Knights Mill | 2017 | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✘ | ✔ | ✔ | ✘ | ✘ | ||
Skylake (server) | 2017 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ||
Cannon Lake | 2018 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ||
Cascade Lake | 2019 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✘ | ||
Cooper Lake | 2020 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✔ | ||
Tiger Lake | 2020 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✔ | ✘ | ||
Rocket Lake | 2021 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✔ | ✘ | ||
Alder Lake | 2021 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ||
Ice Lake (server) | 2021 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✔ | ✘ | ||
Sapphire Rapids | 2023 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✔ | ✔ | ||
AMD | Zen 4 | 2022 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✔ | ✔ | |
Centaur | CHA | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
Intrinsic functions[edit]
// V4FMADDPS
__m512 _mm512_4fmadd_ps( __m512, __m512x4, __m128 *);
__m512 _mm512_mask_4fmadd_ps(__m512, __mmask16, __m512x4, __m128 *);
__m512 _mm512_maskz_4fmadd_ps(__mmask16, __m512, __m512x4, __m128 *);
__m512 _mm512_4fnmadd_ps(__m512, __m512x4, __m128 *);
__m512 _mm512_mask_4fnmadd_ps(__m512, __mmask16, __m512x4, __m128 *);
__m512 _mm512_maskz_4fnmadd_ps(__mmask16, __m512, __m512x4, __m128 *);
// V4FMADDSS
__m128 _mm_4fmadd_ss(__m128, __m128x4, __m128 *);
__m128 _mm_mask_4fmadd_ss(__m128, __mmask8, __m128x4, __m128 *);
__m128 _mm_maskz_4fmadd_ss(__mmask8, __m128, __m128x4, __m128 *);
__m128 _mm_4fnmadd_ss(__m128, __m128x4, __m128 *);
__m128 _mm_mask_4fnmadd_ss(__m128, __mmask8, __m128x4, __m128 *);
__m128 _mm_maskz_4fnmadd_ss(__mmask8, __m128, __m128x4, __m128 *);
Bibliography[edit]
- "Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B, 2C & 2D): Instruction Set Reference, A-Z", Intel Order Nr. 325383, Rev. 078US, December 2022