AVX-512 Doubleword and Quadword Instructions (AVX512DQ) - x86


AVX-512 Doubleword and Quadword Instructions (AVX512DQ) is an x86 extension, part of the AVX-512 SIMD instruction set, and complements the AVX-512 Foundation and AVX512BW (byte and word) extensions.

Overview

The AVX512DQ extension adds new and supplementary vector instructions operating on 32-bit doublewords and 64-bit quadwords, and some floating point instructions.

Integer instructions

New AVX-512 instructions

VPMULLQ: Parallel multiplication of quadwords, providing the lower 64 bits of each 128-bit product. The lower half is the same for signed and unsigned operands.
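The per-element arithmetic can be sketched with a small scalar model. The helper names `vpmullq_lane` and `vpmullq` are hypothetical and only illustrate the semantics; they do not execute the instruction.

```python
MASK64 = (1 << 64) - 1  # a quadword holds 64 bits


def vpmullq_lane(a: int, b: int) -> int:
    """Model one quadword lane: keep the low 64 bits of the 128-bit product."""
    return ((a & MASK64) * (b & MASK64)) & MASK64


def vpmullq(src1, src2):
    """Apply the lane model to each pair of quadwords of two equal-length vectors."""
    return [vpmullq_lane(a, b) for a, b in zip(src1, src2)]
```

Negative Python integers are reduced modulo 2^64 before the multiplication, so the same model covers two's complement operands.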

VPMOVD2M, VPMOVQ2M: These instructions set the bits in a mask register, copying the most significant bit in the corresponding doubleword or quadword of the source vector register.

VPMOVM2D, VPMOVM2Q: These instructions set the bits in each doubleword or quadword of the destination vector to all ones or zeros, copying the corresponding bit in a mask register.
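As a rough illustration of both directions, the following sketch models a vector as a list of unsigned integers and a mask as a Python integer. The function names are hypothetical.

```python
def vpmov_to_mask(elements, width):
    """Model VPMOVD2M (width=32) or VPMOVQ2M (width=64).

    Bit i of the result is the most significant bit of element i.
    """
    mask = 0
    for i, value in enumerate(elements):
        msb = (value >> (width - 1)) & 1
        mask |= msb << i
    return mask


def vpmov_from_mask(mask, count, width):
    """Model VPMOVM2D / VPMOVM2Q: expand each mask bit to all ones or all zeros."""
    all_ones = (1 << width) - 1
    return [all_ones if (mask >> i) & 1 else 0 for i in range(count)]
```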

VEXTRACTI32X8, VEXTRACTI64X2: These instructions extract eight doublewords (a 256-bit half) or a pair of quadwords (a 128-bit lane) from a vector register, selected by a constant index, and store the data in memory or in the lowest 256 or 128 bits of a destination vector register.

VINSERTI32X8, VINSERTI64X2: These instructions insert eight doublewords or a pair of quadwords into a 256-bit half or a 128-bit lane of a vector register, selected by a constant index, loading the data from memory or from the lowest 256 or 128 bits of a source vector register.

VBROADCASTI32X2, VBROADCASTI32X8, VBROADCASTI64X2: These instructions broadcast a pair of doublewords, eight doublewords, or a pair of quadwords from memory or the lowest lane of a vector register to all lanes of the same width of the destination vector.

The 32X8 instructions above support only 512-bit vectors; the 64X2 instructions support only 256- and 512-bit vectors.
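The extract, insert, and broadcast operations above can be sketched on vectors modeled as Python lists. The helper names and the list representation are assumptions made for illustration; masking and memory operands are not modeled.

```python
def extract_group(vec, group_len, index):
    """Return the group of group_len elements selected by a constant index."""
    return vec[index * group_len:(index + 1) * group_len]


def insert_group(vec, group, index):
    """Return a copy of vec with the selected group replaced by the new elements."""
    n = len(group)
    return vec[:index * n] + list(group) + vec[(index + 1) * n:]


def broadcast_group(group, total_len):
    """Repeat the group until a destination of total_len elements is filled."""
    return list(group) * (total_len // len(group))
```

For example, a 32X8 extract from a 512-bit vector corresponds to `extract_group(v, 8, index)` with 16 doublewords in `v`, and a 64X2 extract to `extract_group(v, 2, index)` with 8 quadwords.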

Instructions promoted from SSE and AVX to AVX-512

VPEXTRD, VPEXTRQ: These instructions extract a doubleword or quadword using a constant index to select an element from the lowest 128-bit lane of a vector register and store it in a general purpose register or in memory.

VPINSRD, VPINSRQ: These instructions insert a doubleword or quadword taken from the lowest bits of a general purpose register or from memory, into the lowest 128-bit lane of the destination vector register using a constant index to select the element. Bits 128 ... 511 of the destination vector register are zeroed. Write masking is not supported.

Floating point instructions

New AVX-512 instructions

VRANGE(PS/PD), VRANGE(SS/SD): These instructions perform a parallel minimum or maximum operation on single or double precision values, using either their original or their absolute values. They optionally force the sign of all results to positive or negative, or copy the sign of the corresponding element in the first source operand. The operation is selected by an immediate byte that follows the opcode.

A saturation operation like min(max(-limit, value), +limit), for instance, can be expressed as a minimum of absolute values with sign copying.
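The equivalence can be checked with a scalar sketch. The function names are hypothetical, `limit` is assumed to be non-negative, and NaN handling is not modeled.

```python
import math


def range_min_abs_copy_sign(value: float, limit: float) -> float:
    """Minimum of the absolute values, with the sign copied from the first operand."""
    magnitude = min(abs(value), abs(limit))
    return math.copysign(magnitude, value)


def clamp(value: float, limit: float) -> float:
    """Reference saturation: min(max(-limit, value), +limit)."""
    return min(max(-limit, value), limit)
```

Both functions return the same result for every finite `value`, which is why a single range operation can replace the min/max pair.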

VFPCLASS(PS/PD), VFPCLASS(SS/SD): These instructions test whether the single or double precision values in the source operand belong to certain classes and set the bit corresponding to each element in the destination mask register to 1 (true) or 0 (false). The "packed" instructions (PS/PD) operate on all elements; the "scalar" instructions (SS/SD) operate only on the lowest element and set a single mask bit. Unused higher bits of the 64-bit mask register are cleared. The instructions support write masking, which means they optionally perform a bitwise 'and' on the destination using a second mask register. The classes are selected by an immediate byte that follows the opcode and can be: QNaN, +0, -0, +∞, -∞, denormal, negative, SNaN, or any combination.
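A minimal sketch of the classification for one double precision element follows. It assumes that the immediate bits are assigned in the order of the class list above (bit 0 = QNaN up to bit 7 = SNaN) and that "negative" means negative, finite, and nonzero. The names are hypothetical, and the effect of the denormals-are-zero control bit is not modeled.

```python
import struct

# Assumed bit assignment of the immediate, in the order of the class list.
QNAN, POS_ZERO, NEG_ZERO, POS_INF, NEG_INF, DENORMAL, NEGATIVE, SNAN = (
    1 << i for i in range(8))


def double_bits(x: float) -> int:
    """Return the raw 64-bit IEEE 754 pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def fpclass_pd(bits: int, imm8: int) -> bool:
    """Test one double precision bit pattern against the classes selected by imm8."""
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    is_nan = exponent == 0x7FF and fraction != 0
    is_inf = exponent == 0x7FF and fraction == 0
    is_zero = exponent == 0 and fraction == 0
    classes = 0
    if is_nan:
        # The top fraction bit distinguishes quiet from signaling NaNs.
        classes |= QNAN if (fraction >> 51) & 1 else SNAN
    if is_zero:
        classes |= NEG_ZERO if sign else POS_ZERO
    if is_inf:
        classes |= NEG_INF if sign else POS_INF
    if exponent == 0 and fraction != 0:
        classes |= DENORMAL
    if sign and not (is_nan or is_inf or is_zero):
        classes |= NEGATIVE
    return (classes & imm8) != 0
```

Combining flags, such as `POS_INF | NEG_INF`, tests for several classes at once. The function takes raw bit patterns so that signaling NaNs can be passed without being quieted.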

VREDUCE(PS/PD), VREDUCE(SS/SD): Parallel reduce transformation on single or double precision values. The operation is

dest = source – round(2^M * source) * 2^-M

with the desired rounding mode and M a constant in the range 0 ... 15. The result is the part of the source that remains after rounding it to M fraction bits. These instructions can be used to accelerate transcendental functions.
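The formula can be tried out with a scalar sketch. The function name `reduce_value` is hypothetical, the rounding mode is passed in as a Python function, and special values such as infinities and NaNs are not modeled.

```python
import math


def reduce_value(source: float, m: int, rounding=round) -> float:
    """Model dest = source - round(2^M * source) * 2^-M for one element.

    The default, Python's round(), rounds to nearest with ties to even.
    math.floor, math.ceil, and math.trunc model the other rounding modes.
    """
    if not 0 <= m <= 15:
        raise ValueError("M must be in the range 0 ... 15")
    scale = 2.0 ** m
    return source - rounding(scale * source) / scale
```

With M = 1 and the source value 1.375, rounding to nearest gives 1.5 and a result of -0.125, while rounding down gives 1.0 and a result of 0.375.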

The "scalar" variants (SS/SD) of the instructions above yield only a single result. Those that write a vector place it in the lowest element of the 128-bit destination vector and copy the higher elements from the first source operand. These variants do not support 256- and 512-bit vectors; bits 128 ... 511 of the destination vector register are zeroed.

VCVT(PS/PD)2(QQ/UQQ), VCVTT(PS/PD)2(QQ/UQQ), VCVT(QQ/UQQ)2(PS/PD): Parallel conversion with desired rounding of signed (QQ) or unsigned (UQQ) quadwords to single precision (PS) or double precision (PD) values, or vice versa. The VCVTT variants always round with truncation, i.e. toward zero.
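The difference between the rounding and the truncating conversions can be sketched for one element. The function names are hypothetical, round to nearest with ties to even is assumed as the selected rounding mode, and out-of-range values and NaNs are not modeled.

```python
import math


def cvt_pd2qq(x: float) -> int:
    """Model a rounding conversion, assuming round to nearest, ties to even."""
    return round(x)


def cvtt_pd2qq(x: float) -> int:
    """Model a truncating (VCVTT) conversion: always round toward zero."""
    return math.trunc(x)
```

The two conversions differ, for instance, for -1.5: the rounding variant yields -2 under the assumed mode, the truncating variant yields -1.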

VEXTRACTF32X8, VEXTRACTF64X2, VINSERTF32X8, VINSERTF64X2, VBROADCASTF32X2, VBROADCASTF32X8, VBROADCASTF64X2: These instructions perform the same operation as their integer counterparts.

Instructions promoted from SSE and AVX to AVX-512

VAND(PS/PD), VANDN(PS/PD), VOR(PS/PD), VXOR(PS/PD): Parallel bitwise logical operations on single or double precision values. The ANDN operation is (not source1) and source2.
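Bitwise operations on floating point values typically manipulate the sign bit. The following sketch shows the ANDN and XOR operations on one single precision element; the helper names are hypothetical.

```python
import struct

SIGN_MASK = 0x80000000  # sign bit of a single precision value


def f32_bits(x: float) -> int:
    """Return the raw 32-bit pattern of a value stored as single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def andn(source1: int, source2: int) -> int:
    """ANDN as defined above: (not source1) and source2, on 32 bits."""
    return ~source1 & source2 & 0xFFFFFFFF


def fabs_via_andn(x: float) -> float:
    """Clear the sign bit: ANDN with the sign mask as the first source."""
    return bits_f32(andn(SIGN_MASK, f32_bits(x)))


def negate_via_xor(x: float) -> float:
    """Flip the sign bit with XOR."""
    return bits_f32(f32_bits(x) ^ SIGN_MASK)
```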

Mask register instructions

Most of the instructions above support write masking. That means an element of the destination vector is written only if the corresponding bit in a mask register, supplied as an additional source operand, is set. If the bit is zero, the element is either left unchanged or zeroed, depending on the masking mode encoded in the instruction. Without a mask, all elements are written unconditionally.
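The two masking modes can be sketched as follows. The function name `apply_write_mask` and the list representation of vectors are assumptions made for illustration.

```python
def apply_write_mask(results, old_dest, mask, zeroing=False):
    """Combine newly computed elements with the old destination under a write mask."""
    out = []
    for i, (new, old) in enumerate(zip(results, old_dest)):
        if (mask >> i) & 1:
            out.append(new)   # mask bit set: write the new element
        elif zeroing:
            out.append(0)     # mask bit clear, zeroing mode: clear the element
        else:
            out.append(old)   # mask bit clear, merging mode: keep the old element
    return out
```

With the mask `0b0101`, only elements 0 and 2 receive new values; the others keep their old content or become zero.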

The AVX-512 Foundation defines instructions operating on 16-bit masks which are used e.g. with 512-bit vectors containing 16 single precision elements. AVX512DQ adds support for 8-bit operations. The AVX512BW extension completes this set with support for 32- and 64-bit masks.

KADDB, KADDW: Add two masks. KADDW was not defined by AVX512F.

KANDB, KANDNB, KNOTB, KORB, KXNORB, KXORB: Bitwise logical operations. ANDN is (not source1) and source2, XNOR is not (source1 xor source2).
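A few of the 8-bit mask operations can be modeled on Python integers. The function names mirror the mnemonics but are only an illustrative sketch; results are truncated to 8 bits.

```python
M8 = 0xFF  # an 8-bit mask


def kaddb(a: int, b: int) -> int:
    """Add two masks, keeping the low 8 bits of the sum."""
    return (a + b) & M8


def kandnb(source1: int, source2: int) -> int:
    """ANDN: (not source1) and source2."""
    return ~source1 & source2 & M8


def kxnorb(source1: int, source2: int) -> int:
    """XNOR: not (source1 xor source2)."""
    return ~(source1 ^ source2) & M8


def knotb(source: int) -> int:
    """Bitwise complement of an 8-bit mask."""
    return ~source & M8
```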

KTESTB, KTESTW: Perform the bitwise operations temp1 = source1 and source2 and temp2 = (not source1) and source2, then set ZF if temp1 is all zeros and CF if temp2 is all zeros, so that branch instructions can test the result.

KORTESTB: Performs the bitwise operation temp = source1 or source2, then sets ZF if the result is all zeros and CF if it is all ones.
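The flag results of the two test instructions can be sketched as follows. The function names are hypothetical; each returns the pair (ZF, CF) as integers.

```python
def ktest(source1: int, source2: int, width: int = 8):
    """Model KTESTB (width=8) or KTESTW (width=16); returns (ZF, CF)."""
    full = (1 << width) - 1
    temp1 = source1 & source2 & full
    temp2 = ~source1 & source2 & full
    return int(temp1 == 0), int(temp2 == 0)


def kortest(source1: int, source2: int, width: int = 8):
    """Model KORTESTB; returns (ZF, CF)."""
    full = (1 << width) - 1
    temp = (source1 | source2) & full
    return int(temp == 0), int(temp == full)
```

For example, `ktest(0xFF, 0x0F)` yields CF = 1 because every bit set in the second source is also set in the first.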

KSHIFTLB, KSHIFTRB: Bitwise logical shift left/right by a constant.

KMOVB: Copies a bit mask from a mask register to another mask register, a 32- or 64-bit GPR, or memory, or from a GPR or memory to a mask register. The mask is zero extended if the destination register is wider.

Detection

Support for these instructions is indicated by the AVX512DQ feature flag. Except as noted, they operate on 512-bit vectors. Instruction variants operating on 128- and 256-bit vectors are supported if the AVX512VL flag is set as well.

CPUID Input	CPUID Output	Instruction Set
EAX=07H, ECX=0	EBX[bit 17]	AVX512DQ
EAX=07H, ECX=0	EBX[bit 31]	AVX512VL
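Decoding the two flags from the table can be sketched as follows. The function `decode_leaf7_ebx` is hypothetical and takes an EBX value that was obtained elsewhere; it does not execute CPUID itself and does not check whether the operating system has enabled the vector state.

```python
AVX512DQ_BIT = 17  # CPUID (EAX=07H, ECX=0): EBX bit 17
AVX512VL_BIT = 31  # CPUID (EAX=07H, ECX=0): EBX bit 31


def decode_leaf7_ebx(ebx: int) -> dict:
    """Decode the AVX512DQ and AVX512VL flags from a leaf 7, subleaf 0 EBX value."""
    dq = bool((ebx >> AVX512DQ_BIT) & 1)
    vl = bool((ebx >> AVX512VL_BIT) & 1)
    return {
        "avx512dq": dq,
        "avx512vl": vl,
        # The 128- and 256-bit variants need both flags.
        "avx512dq_128_256": dq and vl,
    }
```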

Microarchitecture support

Designer	Microarchitecture	Year	F	CD	ER	PF	BW	DQ	VL	FP16	IFMA	VBMI	VBMI2	BITALG	VPOPCNTDQ	VP2INTERSECT	4VNNIW	4FMAPS	VNNI	BF16
Intel	Knights Landing	2016	✔	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘
	Knights Mill	2017	✔	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✔	✘	✔	✔	✘	✘
	Skylake (server)	2017	✔	✔	✘	✘	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘
	Cannon Lake	2018	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘
	Cascade Lake	2019	✔	✔	✘	✘	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✔	✘
	Cooper Lake	2020	✔	✔	✘	✘	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✔	✔
	Tiger Lake	2020	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✔	✘	✘	✔	✘
	Rocket Lake	2021	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✘	✘	✘	✔	✘
	Alder Lake	2021	✔	✔	✘	✘	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✘	✘	✔	✔
	Ice Lake (server)	2021	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✘	✘	✘	✔	✘
	Sapphire Rapids	2023	✔	✔	✘	✘	✔	✔	✔	✔	✔	✔	✔	✔	✔	✘	✘	✘	✔	✔
AMD	Zen 4	2022	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✘	✘	✘	✔	✔
Centaur	CHA		✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘

Bibliography

"Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B, 2C & 2D): Instruction Set Reference, A-Z", Intel Order Nr. 325383, Rev. 078US, December 2022
