Instruction Set Architecture
- Instructions
- Addressing Modes
- Registers
- Model-Specific Register
- Assembly
- Interrupts
- Micro-Ops
- Timer
- Calling Convention
- Microarchitectures
- CPUID
AVX-512 Bit Algorithms (AVX512_BITALG) is an x86 extension and part of the AVX-512 SIMD instruction set.
Overview
-
VPOPCNTB
,VPOPCNTW
- Parallel population count instructions, an operation also known as sideways sum, bit summation, or Hamming weight. They count the number of set bits in each byte or 16-bit word in the source operand, a vector register or vector in memory, and store the result in the corresponding element of the destination vector register.
-
VPOPCNTD
,VPOPCNTQ
- These instructions were added by the AVX512_VPOPCNTDQ extension, not BITALG. They count the set bits in each 32-bit doubleword or 64-bit quadword of the source operand. They can optionally read a single doubleword or quadword from memory and broadcast the result to all elements of the destination vector.
As usual these instructions support write masking. That means they can write individual elements in the destination vector unconditionally, leave them unchanged, or zero them if the corresponding bit in a mask register supplied as an additional source operand is zero. The masking mode is encoded in the instruction opcode.
-
VPSHUFBITQMB
- Shuffles the bits in the first source operand, a vector register, using bit indices. For each bit of the 16/32/64-bit mask in a destination mask register, the instruction obtains a source index modulo 64 from the corresponding byte in a second source operand, a 128/256/512-bit vector in a vector register or in memory. The operation is confined to quadwords so the indices can only select a bit from the same 64-bit lane where the index byte resides. For instance bytes 8 ... 15 can address bits 64 ... 127. The instruction supports write masking which means it optionally performs a bitwise 'and' on the destination using a second mask register.
All these instructions can operate on 128-, 256-, or 512-bit wide vectors. If the vector size is less than 512 bits the instructions zero the unused higher bits in the destination register to avoid a dependency on earlier instructions writing those bits.
Detection
Support for these instructions is indicated by the feature flags below. 128- and 256-bit vectors are supported if the AVX512VL flag is set as well.
CPUID | Instruction Set | |
---|---|---|
Input | Output | |
EAX=07H, ECX=0 | EBX[bit 31] | AVX512VL |
EAX=07H, ECX=0 | ECX[bit 12] | AVX512_BITALG |
EAX=07H, ECX=0 | ECX[bit 14] | AVX512_VPOPCNTDQ |
Microarchitecture support
Designer | Microarchitecture | Year | Support Level | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
F | CD | ER | PF | BW | DQ | VL | FP16 | IFMA | VBMI | VBMI2 | BITALG | VPOPCNTDQ | VP2INTERSECT | 4VNNIW | 4FMAPS | VNNI | BF16 | ||||
Intel | Knights Landing | 2016 | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | |
Knights Mill | 2017 | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✘ | ✔ | ✔ | ✘ | ✘ | ||
Skylake (server) | 2017 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ||
Cannon Lake | 2018 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ||
Cascade Lake | 2019 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✘ | ||
Cooper Lake | 2020 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✔ | ||
Tiger Lake | 2020 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✔ | ✘ | ||
Rocket Lake | 2021 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✔ | ✘ | ||
Alder Lake | 2021 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ||
Ice Lake (server) | 2021 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✔ | ✘ | ||
Sapphire Rapids | 2023 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✔ | ✔ | ||
AMD | Zen 4 | 2022 | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✔ | ✔ | |
Centaur | CHA | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ | ✔ | ✘ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
Intrinsic functions
// VPOPCNTB
__m128i _mm_popcnt_epi8(__m128i);
__m128i _mm_mask_popcnt_epi8(__m128i, __mmask16, __m128i);
__m128i _mm_maskz_popcnt_epi8(__mmask16, __m128i);
__m256i _mm256_popcnt_epi8(__m256i);
__m256i _mm256_mask_popcnt_epi8(__m256i, __mmask32, __m256i);
__m256i _mm256_maskz_popcnt_epi8(__mmask32, __m256i);
__m512i _mm512_popcnt_epi8(__m512i);
__m512i _mm512_mask_popcnt_epi8(__m512i, __mmask64, __m512i);
__m512i _mm512_maskz_popcnt_epi8(__mmask64, __m512i);
// VPOPCNTW
__m128i _mm_popcnt_epi16(__m128i);
__m128i _mm_mask_popcnt_epi16(__m128i, __mmask8, __m128i);
__m128i _mm_maskz_popcnt_epi16(__mmask8, __m128i);
__m256i _mm256_popcnt_epi16(__m256i);
__m256i _mm256_mask_popcnt_epi16(__m256i, __mmask16, __m256i);
__m256i _mm256_maskz_popcnt_epi16(__mmask16, __m256i);
__m512i _mm512_popcnt_epi16(__m512i);
__m512i _mm512_mask_popcnt_epi16(__m512i, __mmask32, __m512i);
__m512i _mm512_maskz_popcnt_epi16(__mmask32, __m512i);
// VPOPCNTD
__m128i _mm_popcnt_epi32(__m128i);
__m128i _mm_mask_popcnt_epi32(__m128i, __mmask8, __m128i);
__m128i _mm_maskz_popcnt_epi32(__mmask8, __m128i);
__m256i _mm256_popcnt_epi32(__m256i);
__m256i _mm256_mask_popcnt_epi32(__m256i, __mmask8, __m256i);
__m256i _mm256_maskz_popcnt_epi32(__mmask8, __m256i);
__m512i _mm512_popcnt_epi32(__m512i);
__m512i _mm512_mask_popcnt_epi32(__m512i, __mmask16, __m512i);
__m512i _mm512_maskz_popcnt_epi32(__mmask16, __m512i);
// VPOPCNTQ
__m128i _mm_popcnt_epi64(__m128i);
__m128i _mm_mask_popcnt_epi64(__m128i, __mmask8, __m128i);
__m128i _mm_maskz_popcnt_epi64(__mmask8, __m128i);
__m256i _mm256_popcnt_epi64(__m256i);
__m256i _mm256_mask_popcnt_epi64(__m256i, __mmask8, __m256i);
__m256i _mm256_maskz_popcnt_epi64(__mmask8, __m256i);
__m512i _mm512_popcnt_epi64(__m512i);
__m512i _mm512_mask_popcnt_epi64(__m512i, __mmask8, __m512i);
__m512i _mm512_maskz_popcnt_epi64(__mmask8, __m512i);
// VPSHUFBITQMB
__mmask16 _mm_bitshuffle_epi64_mask(__m128i, __m128i);
__mmask16 _mm_mask_bitshuffle_epi64_mask(__mmask16, __m128i, __m128i);
__mmask32 _mm256_bitshuffle_epi64_mask(__m256i, __m256i);
__mmask32 _mm256_mask_bitshuffle_epi64_mask(__mmask32, __m256i, __m256i);
__mmask64 _mm512_bitshuffle_epi64_mask(__m512i, __m512i);
__mmask64 _mm512_mask_bitshuffle_epi64_mask(__mmask64, __m512i, __m512i);
Bibliography
- "Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B, 2C & 2D): Instruction Set Reference, A-Z", Intel Order Nr. 325383, Rev. 078US, December 2022