From WikiChip
AVX-512 Doubleword and Quadword Instructions (AVX512DQ) - x86
Revision as of 15:09, 15 March 2023 by QuietRub

AVX-512 Doubleword and Quadword Instructions (AVX512DQ) is an x86 extension, part of the AVX-512 SIMD instruction set, and complements the AVX-512 Foundation and AVX512BW (byte and word) extensions.

Overview

The AVX512DQ extension adds new and supplementary vector instructions operating on 32-bit doublewords and 64-bit quadwords, and some floating point instructions.

Integer instructions

New AVX-512 instructions

VPMULLQ
Parallel multiplication of quadwords, providing the lower half (64 bits) of each 128-bit product. The lower half is the same for signed and unsigned operands.
VPMOVD2M, VPMOVQ2M
These instructions set the bits in a mask register, copying the most significant bit in the corresponding doubleword or quadword of the source vector register.
VPMOVM2D, VPMOVM2Q
These instructions set the bits in each doubleword or quadword of the destination vector to all ones or zeros, copying the corresponding bit in a mask register.
VEXTRACTI32X8, VEXTRACTI64X2
These instructions extract eight doublewords (a 256-bit half) or a pair of quadwords (a 128-bit lane) from a vector register, selected by a constant index, and store the data in memory or in the lowest 256 or 128 bits of a destination vector register.
VINSERTI32X8, VINSERTI64X2
These instructions insert eight doublewords (256 bits) or a pair of quadwords (128 bits) into the 256-bit half or 128-bit lane of a vector register selected by a constant index, loading the data from memory or from the lowest 256 or 128 bits of a source vector register.
VBROADCASTI32X2, VBROADCASTI32X8, VBROADCASTI64X2
These instructions broadcast a pair of doublewords, eight doublewords, or a pair of quadwords from memory or the lowest lane of a vector register to all lanes of the same width of the destination vector.

The 32X8 instructions above support only 512-bit vectors; the 64X2 instructions support only 256- and 512-bit vectors.
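The conversions between mask registers and vectors described for VPMOVD2M/VPMOVQ2M and VPMOVM2D/VPMOVM2Q can be illustrated with a small reference model. This is a sketch in plain Python, not intrinsics code: the function names, the `width` parameter, and the representation of a vector as a list of unsigned integers are assumptions made for illustration.

```python
def vpmov_to_mask(elements, width=32):
    """Model of VPMOVD2M (width=32) / VPMOVQ2M (width=64).

    Bit i of the returned mask is the most significant bit of element i.
    Elements are assumed to be unsigned integers below 2**width.
    """
    mask = 0
    for i, value in enumerate(elements):
        msb = (value >> (width - 1)) & 1
        mask |= msb << i
    return mask


def vpmov_from_mask(mask, count, width=32):
    """Model of VPMOVM2D (width=32) / VPMOVM2Q (width=64).

    Element i becomes all ones if mask bit i is set, otherwise all zeros.
    """
    all_ones = (1 << width) - 1
    return [all_ones if (mask >> i) & 1 else 0 for i in range(count)]
```

Under these assumptions the two functions are inverses of each other on masks: expanding a mask to a vector and collecting the top bits again yields the original mask.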

Instructions promoted from SSE and AVX to AVX-512

VPEXTRD, VPEXTRQ
These instructions extract a doubleword or quadword using a constant index to select an element from the lowest 128-bit lane of a vector register and store it in a general purpose register or in memory.
VPINSRD, VPINSRQ
These instructions insert a doubleword or quadword taken from the lowest bits of a general purpose register or from memory, into the lowest 128-bit lane of the destination vector register using a constant index to select the element. Bits 128 ... 511 of the destination vector register are zeroed. Write masking is not supported.

Floating point instructions

New AVX-512 instructions

VRANGE(PS/PD), VRANGE(SD/SS)
These instructions perform a parallel minimum or maximum operation on single or double precision values, either on their original or absolute values. They optionally change the sign of all results to positive or negative, or copy the sign of the corresponding element in the first source operand. The operation is selected by an immediate byte operand.
For instance, a saturation operation like min(max(-limit, value), +limit) can be expressed as the minimum of absolute values with sign copying.
VFPCLASS(PS/PD), VFPCLASS(SS/SD)
These instructions test whether the single or double precision values in the source operand belong to certain classes and set the bit corresponding to each element in the destination mask register to 1 (true) or 0 (false). The "packed" instructions (PS/PD) operate on all elements; the "scalar" instructions (SS/SD) operate only on the lowest element and set a single mask bit. Unused higher bits of the 64-bit mask register are cleared. The instructions support write masking, which means they optionally perform a bitwise AND of the result with a second mask register. The classes are selected by an immediate byte operand and can be QNaN, +0, -0, +∞, -∞, denormal, negative, SNaN, or any combination.
VREDUCE(PS/PD), VREDUCE(SS/SD)
Parallel reduce transformation on single or double precision values. The operation is
dest = source - round(2^M * source) * 2^(-M)
with the desired rounding mode and M a constant in the range 0 ... 15. These instructions can be used to accelerate transcendental functions.
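The reduce transformation can be checked numerically with a one-element reference model. The sketch below assumes Python floats stand in for double precision elements; the function name `vreduce` and the `rounding` parameter are hypothetical, and special values such as NaN and infinity are not modelled.

```python
import math


def vreduce(source, m, rounding=round):
    """One element of dest = source - round(2^M * source) * 2^(-M).

    `rounding` selects the rounding mode: the built-in round (nearest, ties
    to even), math.floor, math.ceil, or math.trunc.
    """
    if not 0 <= m <= 15:
        raise ValueError("M must be in the range 0 ... 15")
    scaled = math.ldexp(source, m)          # 2^M * source
    rounded = rounding(scaled)              # integer after rounding
    return source - math.ldexp(rounded, -m)  # subtract round(...) * 2^(-M)
```

With M = 0 and rounding toward negative infinity the result is the fractional part of the input; larger M keeps progressively smaller remainders, which is the property range-reduction steps in transcendental function code rely on.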

The "scalar" variants (SS/SD) of the instructions above yield only a single result. For VRANGE and VREDUCE the result is placed in the lowest element of the 128-bit destination vector, and the higher elements are copied from the first source operand. These variants do not support 256- and 512-bit vectors, and bits 128 ... 511 of the destination vector register are zeroed.
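The saturation identity mentioned for VRANGE, min(max(-limit, value), +limit) expressed as a minimum of absolute values with sign copying, can be verified on a single element. This is a sketch rather than a model of the full instruction: the function names are hypothetical, `limit` is assumed to be non-negative, and NaN handling is left out.

```python
import math


def range_min_abs_copy_sign(src1, src2):
    """One VRANGE-style element: minimum of absolute values, sign from src1."""
    magnitude = min(abs(src1), abs(src2))
    return math.copysign(magnitude, src1)


def clamp(value, limit):
    """Conventional saturation to the interval [-limit, +limit]."""
    return min(max(-limit, value), limit)
```

For example, `range_min_abs_copy_sign(-7.5, 2.0)` and `clamp(-7.5, 2.0)` both give -2.0, while values already inside the interval pass through unchanged.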

VCVT(PS/PD)2(QQ/UQQ)
VCVTT(PS/PD)2(QQ/UQQ)
VCVT(QQ/UQQ)2(PS/PD)
Parallel conversion with the desired rounding mode of signed (QQ) or unsigned (UQQ) quadwords to single precision (PS) or double precision (PD) values, or vice versa. The VCVTT variants always round with truncation, i.e. toward zero.
VEXTRACTF32X8, VEXTRACTF64X2
VINSERTF32X8, VINSERTF64X2
VBROADCASTF32X2, VBROADCASTF32X8, VBROADCASTF64X2
These instructions perform the same operation as their integer counterparts.
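The difference between the rounding conversions (VCVT...) and the truncating conversions (VCVTT...) is easiest to see on individual values. The sketch below models single elements in Python under the assumption that round-to-nearest-even is the selected rounding mode and that all inputs fit in a signed quadword; the function names are hypothetical and out-of-range inputs are not modelled.

```python
import math


def cvt_pd2qq(value):
    """VCVTPD2QQ-style element: round to nearest, ties to even."""
    return round(value)


def cvtt_pd2qq(value):
    """VCVTTPD2QQ-style element: always truncate toward zero."""
    return math.trunc(value)


def cvt_qq2pd(value):
    """VCVTQQ2PD-style element: signed quadword to double precision.

    Quadwords with more than 53 significant bits are rounded (Python's
    int-to-float conversion rounds to nearest, ties to even).
    """
    return float(value)
```

Note that the quadword-to-floating-point direction can also round: a double has 53 significant bits, so not every 64-bit integer is representable exactly.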

Instructions promoted from SSE and AVX to AVX-512

VAND(PS/PD), VANDN(PS/PD), VOR(PS/PD), VXOR(PS/PD)
Parallel bitwise logical operations on single or double precision values. The ANDN operation is (not source1) and source2.
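A common use of these bitwise operations on floating point data is sign manipulation. The sketch below models one double precision element in Python by reinterpreting its bits; the helper names are hypothetical. With a first operand of -0.0 (only the sign bit set), ANDN clears the sign bit of the second operand and XOR flips it.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def to_bits(x):
    """Reinterpret a double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(b):
    """Reinterpret an unsigned 64-bit integer as a double."""
    return struct.unpack("<d", struct.pack("<Q", b & MASK64))[0]


def andn_pd(src1, src2):
    """VANDNPD-style element: (not src1) and src2."""
    return from_bits(~to_bits(src1) & to_bits(src2))


def xor_pd(src1, src2):
    """VXORPD-style element: src1 xor src2."""
    return from_bits(to_bits(src1) ^ to_bits(src2))
```

Here `andn_pd(-0.0, x)` behaves like an absolute value and `xor_pd(-0.0, x)` like a negation.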

Mask register instructions

Most of the instructions above support write masking. A mask register supplied as an additional source operand controls each destination element: where the mask bit is one, the result is written; where it is zero, the element is either left unchanged (merge masking) or set to zero (zeroing masking). The masking mode is encoded in the instruction.
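The two masking modes can be summarised by a small reference model. This is a sketch with hypothetical names: vectors are represented as Python lists, `result` is what the unmasked instruction would produce, and `old_dest` is the previous content of the destination register.

```python
def apply_write_mask(result, old_dest, mask, zeroing=False):
    """Combine a computed result with the old destination under a write mask.

    Mask bit i set:    element i takes the new result.
    Mask bit i clear:  element i keeps its old value, or becomes zero if
                       `zeroing` is True.
    """
    out = []
    for i, (new, old) in enumerate(zip(result, old_dest)):
        if (mask >> i) & 1:
            out.append(new)
        elif zeroing:
            out.append(0)
        else:
            out.append(old)
    return out
```

For instance, with result `[10, 20, 30, 40]`, old destination `[1, 2, 3, 4]`, and mask `0b0101`, the model gives `[10, 2, 30, 4]` when elements are left unchanged and `[10, 0, 30, 0]` when they are zeroed.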

The AVX-512 Foundation defines instructions operating on 16-bit masks which are used e.g. with 512-bit vectors containing 16 single precision elements. AVX512DQ adds support for 8-bit operations. The AVX512BW extension completes this set with support for 32- and 64-bit masks.

KADDB, KADDW
Add two masks. KADDW was not defined by AVX512F.
KANDB, KANDNB, KNOTB, KORB, KXNORB, KXORB
Bitwise logical operations. ANDN is (not source1) and source2, XNOR is not (source1 xor source2).
KTESTB, KTESTW
Performs the bitwise operations temp1 = source1 and source2 and temp2 = (not source1) and source2, then sets ZF if temp1 is all zeros and CF if temp2 is all zeros, for use by conditional branch instructions.
KORTESTB
Performs the bitwise operation temp = source1 or source2, then sets ZF if the result is all zeros and CF if it is all ones.
KSHIFTLB, KSHIFTRB
Bitwise logical shift left/right by a constant.
KMOVB
Copies a bit mask from a mask register to another mask register, a 32- or 64-bit GPR, or memory, or from a GPR or memory to a mask register. The mask is zero-extended if the destination register is wider.
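The logical and test operations on masks listed above can be modelled directly on integers. The sketch below uses hypothetical function names and a `width` parameter (8 for the B variants, 16 for the W variants); the flag results are returned as a (ZF, CF) pair of booleans instead of being written to a flags register.

```python
def kandn(src1, src2, width=8):
    """KANDN: (not source1) and source2."""
    return ~src1 & src2 & ((1 << width) - 1)


def kxnor(src1, src2, width=8):
    """KXNOR: not (source1 xor source2)."""
    return ~(src1 ^ src2) & ((1 << width) - 1)


def ktest(src1, src2, width=8):
    """KTEST: returns (ZF, CF) computed from the two temporary results."""
    full = (1 << width) - 1
    temp1 = src1 & src2 & full
    temp2 = ~src1 & src2 & full
    return (temp1 == 0, temp2 == 0)


def kortest(src1, src2, width=8):
    """KORTEST: ZF if the OR is all zeros, CF if it is all ones."""
    full = (1 << width) - 1
    temp = (src1 | src2) & full
    return (temp == 0, temp == full)
```

In this model `kortest(k, k)` reports through ZF whether a mask is empty and through CF whether it is completely set, which is how a single mask is typically tested before a branch.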

Detection

Support for these instructions is indicated by the AVX512DQ feature flag. Except as noted they operate on 512-bit vectors. Instruction variants operating on 128- and 256-bit vectors are supported if the AVX512VL flag is set as well.

CPUID Input       CPUID Output   Instruction Set
EAX=07H, ECX=0    EBX[bit 17]    AVX512DQ
EAX=07H, ECX=0    EBX[bit 31]    AVX512VL
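The bit positions in the table can be decoded as follows. This sketch only interprets an EBX value that has already been obtained by executing CPUID with EAX=07H, ECX=0; how that value is read is not shown, and the function name and result keys are hypothetical. A complete check would also confirm that the operating system has enabled the AVX-512 register state, which is outside the scope of this example.

```python
AVX512DQ_BIT = 17  # CPUID.(EAX=07H, ECX=0):EBX[bit 17]
AVX512VL_BIT = 31  # CPUID.(EAX=07H, ECX=0):EBX[bit 31]


def decode_leaf7_ebx(ebx):
    """Interpret the AVX512DQ and AVX512VL feature flags in an EBX value."""
    dq = bool((ebx >> AVX512DQ_BIT) & 1)
    vl = bool((ebx >> AVX512VL_BIT) & 1)
    return {
        "avx512dq": dq,                 # 512-bit AVX512DQ instructions
        "avx512dq_128_256": dq and vl,  # 128- and 256-bit variants as well
    }
```

The 128- and 256-bit variants are reported only when both flags are set, matching the rule that they require AVX512VL in addition to AVX512DQ.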

Microarchitecture support

Designer   Microarchitecture    Year   Support Level
                                       F CD ER PF BW DQ VL FP16 IFMA VBMI VBMI2 BITALG VPOPCNTDQ VP2INTERSECT 4VNNIW 4FMAPS VNNI BF16
Intel      Knights Landing      2016
           Knights Mill         2017
           Skylake (server)     2017
           Cannon Lake          2018
           Cascade Lake         2019
           Cooper Lake          2020
           Tiger Lake           2020
           Rocket Lake          2021
           Alder Lake           2021
           Ice Lake (server)    2021
           Sapphire Rapids      2023
AMD        Zen 4                2022
Centaur    CHA
