AVX-512 Vector Bit Manipulation Instructions 2 (VBMI2) - x86

x86
Instruction Set Architecture

History
Families

AMD's CPUIDs
Intel's CPUIDs

AVX-512 Vector Bit Manipulation Instructions 2 (AVX512_VBMI2) is an x86 extension, part of the AVX-512 SIMD instruction set, and complements the earlier VBMI extension.

Overview[edit]

VPCOMPRESSB, VPCOMPRESSW: These instructions copy bytes or words from a vector register to memory or another vector register. They copy each element in the source vector if the corresponding bit in a mask register is set, and only then increment the memory address or destination register element number for the next store. Remaining elements if the destination is a register are left unchanged or zeroed depending on the instruction variant.

VPEXPANDB, VPEXPANDW: These instructions copy bytes or words from memory or a vector register to another vector register. They load each element of the destination register, if the corresponding bit in a mask register is set, from the source and only then increment the memory address or source register element number for the next load. Destination elements where the mask bit is cleared are left unchanged or zeroed depending on the instruction variant.

It should be noted that the AVX-512 Foundation extension already provides instructions to compress and expand vectors of doublewords, quadwords, and single and double precision floating point values.

VPSHLDW, VPSHLDD, VPSHLDQ
VPSHRDW, VPSHRDD, VPSHRDQ: Parallel bitwise funnel shift left or right by a constant amount. The instructions concatenate each word, doubleword, or quadword of the first and second source operand, perform a bitwise logical left or right shift by a constant amount modulo element width in bits taken from an immediate byte which is part of the opcode, and store the upper (L) or lower (R) half of the result in the corresponding elements of the destination vector.

VPSHLDVW, VPSHLDVD, VPSHLDVQ
VPSHRDVW, VPSHRDVD, VPSHRDVQ: Parallel bitwise funnel shift left or right by a per-element variable amount. The instructions concatenate each word, doubleword, or quadword of the destination and first source operand, perform a bitwise logical left or right shift by a variable amount modulo element width in bits taken from the corresponding element of the second source operand, and store the upper (L) or lower (R) half of the result in the corresponding elements of the destination vector.

As usual these instructions support write masking. That means they can write individual elements in the destination vector unconditionally, leave them unchanged, or zero them if the corresponding bit in a mask register supplied as an additional source operand is zero. The masking mode is encoded in the instruction opcode.

All these instructions can operate on 128-, 256-, or 512-bit wide vectors. If the vector size is less than 512 bits the instructions zero the unused higher bits in the destination register to avoid a dependency on earlier instructions writing those bits.

Detection[edit]

Support for these instructions is indicated by the AVX512_VBMI2 feature flag. 128- and 256-bit vectors are supported if the AVX512VL flag is set as well.

CPUID		Instruction Set
Input	Output	Instruction Set
EAX=07H, ECX=0	EBX[bit 31]	AVX512VL
EAX=07H, ECX=0	ECX[bit 06]	AVX512_VBMI2

Microarchitecture support[edit]

Designer	Microarchitecture	Year	Support Level
Designer	Microarchitecture	Year	F	CD	ER	PF	BW	DQ	VL	FP16	IFMA	VBMI	VBMI2	BITALG	VPOPCNTDQ	VP2INTERSECT	4VNNIW	4FMAPS	VNNI	BF16
Intel	Knights Landing	2016	✔	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘
	Knights Mill	2017	✔	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✔	✘	✔	✔	✘	✘
	Skylake (server)	2017	✔	✔	✘	✘	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘
	Cannon Lake	2018	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘
	Cascade Lake	2019	✔	✔	✘	✘	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✔	✘
	Cooper Lake	2020	✔	✔	✘	✘	✔	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘	✘	✔	✔
	Tiger Lake	2020	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✔	✘	✘	✔	✘
	Rocket Lake	2021	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✘	✘	✘	✔	✘
	Alder Lake	2021	✔	✔	✘	✘	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✘	✘	✔	✔
	Ice Lake (server)	2021	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✘	✘	✘	✔	✘
	Sapphire Rapids	2023	✔	✔	✘	✘	✔	✔	✔	✔	✔	✔	✔	✔	✔	✘	✘	✘	✔	✔
AMD	Zen 4	2022	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✘	✘	✘	✔	✔
AMD	Zen 5	2024	✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✔	✔	✔	✔	✘	✘	✔	✔
Centaur	CHA		✔	✔	✘	✘	✔	✔	✔	✘	✔	✔	✘	✘	✘	✘	✘	✘	✘	✘

Intrinsic functions[edit]

// VPCOMPRESSB
__m128i _mm_mask_compress_epi8(__m128i, __mmask16, __m128i);
__m128i _mm_maskz_compress_epi8(__mmask16, __m128i);
__m256i _mm256_mask_compress_epi8(__m256i, __mmask32, __m256i);
__m256i _mm256_maskz_compress_epi8(__mmask32, __m256i);
__m512i _mm512_mask_compress_epi8(__m512i, __mmask64, __m512i);
__m512i _mm512_maskz_compress_epi8(__mmask64, __m512i);
void _mm_mask_compressstoreu_epi8(void*, __mmask16, __m128i);
void _mm256_mask_compressstoreu_epi8(void*, __mmask32, __m256i);
void _mm512_mask_compressstoreu_epi8(void*, __mmask64, __m512i);
// VPCOMPRESSW
__m128i _mm_mask_compress_epi16(__m128i, __mmask8, __m128i);
__m128i _mm_maskz_compress_epi16(__mmask8, __m128i);
__m256i _mm256_mask_compress_epi16(__m256i, __mmask16, __m256i);
__m256i _mm256_maskz_compress_epi16(__mmask16, __m256i);
__m512i _mm512_mask_compress_epi16(__m512i, __mmask32, __m512i);
__m512i _mm512_maskz_compress_epi16(__mmask32, __m512i);
void _mm_mask_compressstoreu_epi16(void*, __mmask8, __m128i);
void _mm256_mask_compressstoreu_epi16(void*, __mmask16, __m256i);
void _mm512_mask_compressstoreu_epi16(void*, __mmask32, __m512i);
// VPEXPANDB
__m128i _mm_mask_expand_epi8(__m128i, __mmask16, __m128i);
__m128i _mm_maskz_expand_epi8(__mmask16, __m128i);
__m128i _mm_mask_expandloadu_epi8(__m128i, __mmask16, const void*);
__m128i _mm_maskz_expandloadu_epi8(__mmask16, const void*);
__m256i _mm256_mask_expand_epi8(__m256i, __mmask32, __m256i);
__m256i _mm256_maskz_expand_epi8(__mmask32, __m256i);
__m256i _mm256_mask_expandloadu_epi8(__m256i, __mmask32, const void*);
__m256i _mm256_maskz_expandloadu_epi8(__mmask32, const void*);
__m512i _mm512_mask_expand_epi8(__m512i, __mmask64, __m512i);
__m512i _mm512_maskz_expand_epi8(__mmask64, __m512i);
__m512i _mm512_mask_expandloadu_epi8(__m512i, __mmask64, const void*);
__m512i _mm512_maskz_expandloadu_epi8(__mmask64, const void*);
// VPEXPANDW
__m128i _mm_mask_expand_epi16(__m128i, __mmask8, __m128i);
__m128i _mm_maskz_expand_epi16(__mmask8, __m128i);
__m128i _mm_mask_expandloadu_epi16(__m128i, __mmask8, const void*);
__m128i _mm_maskz_expandloadu_epi16(__mmask8, const void *);
__m256i _mm256_mask_expand_epi16(__m256i, __mmask16, __m256i);
__m256i _mm256_maskz_expand_epi16(__mmask16, __m256i);
__m256i _mm256_mask_expandloadu_epi16(__m256i, __mmask16, const void*);
__m256i _mm256_maskz_expandloadu_epi16(__mmask16, const void*);
__m512i _mm512_mask_expand_epi16(__m512i, __mmask32, __m512i);
__m512i _mm512_maskz_expand_epi16(__mmask32, __m512i);
__m512i _mm512_mask_expandloadu_epi16(__m512i, __mmask32, const void*);
__m512i _mm512_maskz_expandloadu_epi16(__mmask32, const void*);
// VPSHLDW
__m128i _mm_shldi_epi16(__m128i, __m128i, int);
__m128i _mm_mask_shldi_epi16(__m128i, __mmask8, __m128i, __m128i, int);
__m128i _mm_maskz_shldi_epi16(__mmask8, __m128i, __m128i, int);
__m256i _mm256_shldi_epi16(__m256i, __m256i, int);
__m256i _mm256_mask_shldi_epi16(__m256i, __mmask16, __m256i, __m256i, int);
__m256i _mm256_maskz_shldi_epi16(__mmask16, __m256i, __m256i, int);
__m512i _mm512_shldi_epi16(__m512i, __m512i, int);
__m512i _mm512_mask_shldi_epi16(__m512i, __mmask32, __m512i, __m512i, int);
__m512i _mm512_maskz_shldi_epi16(__mmask32, __m512i, __m512i, int);
// VPSHLDD
__m128i _mm_shldi_epi32(__m128i, __m128i, int);
__m128i _mm_mask_shldi_epi32(__m128i, __mmask8, __m128i, __m128i, int);
__m128i _mm_maskz_shldi_epi32(__mmask8, __m128i, __m128i, int);
__m256i _mm256_shldi_epi32(__m256i, __m256i, int);
__m256i _mm256_mask_shldi_epi32(__m256i, __mmask8, __m256i, __m256i, int);
__m256i _mm256_maskz_shldi_epi32(__mmask8, __m256i, __m256i, int);
__m512i _mm512_shldi_epi32(__m512i, __m512i, int);
__m512i _mm512_mask_shldi_epi32(__m512i, __mmask16, __m512i, __m512i, int);
__m512i _mm512_maskz_shldi_epi32(__mmask16, __m512i, __m512i, int);
// VPSHLDQ
__m128i _mm_shldi_epi64(__m128i, __m128i, int);
__m128i _mm_mask_shldi_epi64(__m128i, __mmask8, __m128i, __m128i, int);
__m128i _mm_maskz_shldi_epi64(__mmask8, __m128i, __m128i, int);
__m256i _mm256_shldi_epi64(__m256i, __m256i, int);
__m256i _mm256_mask_shldi_epi64(__m256i, __mmask8, __m256i, __m256i, int);
__m256i _mm256_maskz_shldi_epi64(__mmask8, __m256i, __m256i, int);
__m512i _mm512_shldi_epi64(__m512i, __m512i, int);
__m512i _mm512_mask_shldi_epi64(__m512i, __mmask8, __m512i, __m512i, int);
__m512i _mm512_maskz_shldi_epi64(__mmask8, __m512i, __m512i, int);
// VPSHLDVW
__m128i _mm_shldv_epi16(__m128i, __m128i, __m128i);
__m128i _mm_mask_shldv_epi16(__m128i, __mmask8, __m128i, __m128i);
__m128i _mm_maskz_shldv_epi16(__mmask8, __m128i, __m128i, __m128i);
__m256i _mm256_shldv_epi16(__m256i, __m256i, __m256i);
__m256i _mm256_mask_shldv_epi16(__m256i, __mmask16, __m256i, __m256i);
__m256i _mm256_maskz_shldv_epi16(__mmask16, __m256i, __m256i, __m256i);
__m512i _mm512_shldv_epi16(__m512i, __m512i, __m512i);
__m512i _mm512_mask_shldv_epi16(__m512i, __mmask32, __m512i, __m512i);
__m512i _mm512_maskz_shldv_epi16(__mmask32, __m512i, __m512i, __m512i);
// VPSHLDVD
__m128i _mm_shldv_epi32(__m128i, __m128i, __m128i);
__m128i _mm_mask_shldv_epi32(__m128i, __mmask8, __m128i, __m128i);
__m128i _mm_maskz_shldv_epi32(__mmask8, __m128i, __m128i, __m128i);
__m256i _mm256_shldv_epi32(__m256i, __m256i, __m256i);
__m256i _mm256_mask_shldv_epi32(__m256i, __mmask8, __m256i, __m256i);
__m256i _mm256_maskz_shldv_epi32(__mmask8, __m256i, __m256i, __m256i);
__m512i _mm512_shldv_epi32(__m512i, __m512i, __m512i);
__m512i _mm512_mask_shldv_epi32(__m512i, __mmask16, __m512i, __m512i);
__m512i _mm512_maskz_shldv_epi32(__mmask16, __m512i, __m512i, __m512i);
// VPSHLDVQ
__m128i _mm_shldv_epi64(__m128i, __m128i, __m128i);
__m128i _mm_mask_shldv_epi64(__m128i, __mmask8, __m128i, __m128i);
__m128i _mm_maskz_shldv_epi64(__mmask8, __m128i, __m128i, __m128i);
__m256i _mm256_shldv_epi64(__m256i, __m256i, __m256i);
__m256i _mm256_mask_shldv_epi64(__m256i, __mmask8, __m256i, __m256i);
__m256i _mm256_maskz_shldv_epi64(__mmask8, __m256i, __m256i, __m256i);
__m512i _mm512_shldv_epi64(__m512i, __m512i, __m512i);
__m512i _mm512_mask_shldv_epi64(__m512i, __mmask8, __m512i, __m512i);
__m512i _mm512_maskz_shldv_epi64(__mmask8, __m512i, __m512i, __m512i);
// VPSHRDW
__m128i _mm_shrdi_epi16(__m128i, __m128i, int);
__m128i _mm_mask_shrdi_epi16(__m128i, __mmask8, __m128i, __m128i, int);
__m128i _mm_maskz_shrdi_epi16(__mmask8, __m128i, __m128i, int);
__m256i _mm256_shrdi_epi16(__m256i, __m256i, int);
__m256i _mm256_mask_shrdi_epi16(__m256i, __mmask16, __m256i, __m256i, int);
__m256i _mm256_maskz_shrdi_epi16(__mmask16, __m256i, __m256i, int);
__m512i _mm512_shrdi_epi16(__m512i, __m512i, int);
__m512i _mm512_mask_shrdi_epi16(__m512i, __mmask32, __m512i, __m512i, int);
__m512i _mm512_maskz_shrdi_epi16(__mmask32, __m512i, __m512i, int);
// VPSHRDD
__m128i _mm_shrdi_epi32(__m128i, __m128i, int);
__m128i _mm_mask_shrdi_epi32(__m128i, __mmask8, __m128i, __m128i, int);
__m128i _mm_maskz_shrdi_epi32(__mmask8, __m128i, __m128i, int);
__m256i _mm256_shrdi_epi32(__m256i, __m256i, int);
__m256i _mm256_mask_shrdi_epi32(__m256i, __mmask8, __m256i, __m256i, int);
__m256i _mm256_maskz_shrdi_epi32(__mmask8, __m256i, __m256i, int);
__m512i _mm512_shrdi_epi32(__m512i, __m512i, int);
__m512i _mm512_mask_shrdi_epi32(__m512i, __mmask16, __m512i, __m512i, int);
__m512i _mm512_maskz_shrdi_epi32(__mmask16, __m512i, __m512i, int);
__m128i _mm_shrdi_epi64(__m128i, __m128i, int);
// VPSHRDQ
__m128i _mm_shrdi_epi64(__m128i, __m128i, int);
__m128i _mm_mask_shrdi_epi64(__m128i, __mmask8, __m128i, __m128i, int);
__m128i _mm_maskz_shrdi_epi64(__mmask8, __m128i, __m128i, int);
__m256i _mm256_shrdi_epi64(__m256i, __m256i, int);
__m256i _mm256_mask_shrdi_epi64(__m256i, __mmask8, __m256i, __m256i, int);
__m256i _mm256_maskz_shrdi_epi64(__mmask8, __m256i, __m256i, int);
__m512i _mm512_shrdi_epi64(__m512i, __m512i, int);
__m512i _mm512_mask_shrdi_epi64(__m512i, __mmask8, __m512i, __m512i, int);
__m512i _mm512_maskz_shrdi_epi64(__mmask8, __m512i, __m512i, int);
// VPSHRDVW
__m128i _mm_shrdv_epi16(__m128i, __m128i, __m128i);
__m128i _mm_mask_shrdv_epi16(__m128i, __mmask8, __m128i, __m128i);
__m128i _mm_maskz_shrdv_epi16(__mmask8, __m128i, __m128i, __m128i);
__m256i _mm256_shrdv_epi16(__m256i, __m256i, __m256i);
__m256i _mm256_mask_shrdv_epi16(__m256i, __mmask16, __m256i, __m256i);
__m256i _mm256_maskz_shrdv_epi16(__mmask16, __m256i, __m256i, __m256i);
__m512i _mm512_shrdv_epi16(__m512i, __m512i, __m512i);
__m512i _mm512_mask_shrdv_epi16(__m512i, __mmask32, __m512i, __m512i);
__m512i _mm512_maskz_shrdv_epi16(__mmask32, __m512i, __m512i, __m512i);
// VPSHRDVD
__m128i _mm_shrdv_epi32(__m128i, __m128i, __m128i);
__m128i _mm_mask_shrdv_epi32(__m128i, __mmask8, __m128i, __m128i);
__m128i _mm_maskz_shrdv_epi32(__mmask8, __m128i, __m128i, __m128i);
__m256i _mm256_shrdv_epi32(__m256i, __m256i, __m256i);
__m256i _mm256_mask_shrdv_epi32(__m256i, __mmask8, __m256i, __m256i);
__m256i _mm256_maskz_shrdv_epi32(__mmask8, __m256i, __m256i, __m256i);
__m512i _mm512_shrdv_epi32(__m512i, __m512i, __m512i);
__m512i _mm512_mask_shrdv_epi32(__m512i, __mmask16, __m512i, __m512i);
__m512i _mm512_maskz_shrdv_epi32(__mmask16, __m512i, __m512i, __m512i);
// VPSHRDVQ
__m128i _mm_shrdv_epi64(__m128i, __m128i, __m128i);
__m128i _mm_mask_shrdv_epi64(__m128i, __mmask8, __m128i, __m128i);
__m128i _mm_maskz_shrdv_epi64(__mmask8, __m128i, __m128i, __m128i);
__m256i _mm256_shrdv_epi64(__m256i, __m256i, __m256i);
__m256i _mm256_mask_shrdv_epi64(__m256i, __mmask8, __m256i, __m256i);
__m256i _mm256_maskz_shrdv_epi64(__mmask8, __m256i, __m256i, __m256i);
__m512i _mm512_shrdv_epi64(__m512i, __m512i, __m512i);
__m512i _mm512_mask_shrdv_epi64(__m512i, __mmask8, __m512i, __m512i);
__m512i _mm512_maskz_shrdv_epi64(__mmask8, __m512i, __m512i, __m512i);

Bibliography[edit]

"Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B, 2C & 2D): Instruction Set Reference, A-Z", Intel Order Nr. 325383, Rev. 078US, December 2022

WikiChip

The Fuse Coverage

Social Media

Companies

Microarchitectures

Technology Nodes

Intel

AMD

ARM

Cavium

Samsung

Intel

AMD

Ampere

Apple

Cavium

HiSilicon

MediaTek

NXP

Qualcomm

Renesas