

### Intel<sup>®</sup> Advanced Vector Extensions 2015/2016 Support in GNU Compiler Collection

**GNU Tools Cauldron 2014** 

Presented by Kirill Yukhin of Intel, July 2014 (kirill.yukhin@intel.com)



### Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

### New ISA: What Is Where?





Small & efficient cores

(intel) inside

Xeon<sup>®</sup>Phi

- •Big focus on throughput and many-threads
- •State-of-the-art SIMD support for HPC: AVX-512 F + CDI + ERI + PFI
- Industry performance-per-watt leadership

### Intel<sup>®</sup> Advanced Vector Extensions



### Introducing AVX-512

### Intel<sup>®</sup> AVX Technology



| AVX                | AVX2               |
|--------------------|--------------------|
| 256-bit basic FP   | Float16 (IVB 2012) |
| 16 registers       | 256-bit FP FMA     |
| NDS (and AVX128)   | 256-bit integer    |
| Improved blend     | PERMD              |
| MASKMOV            | Gather             |
| Implicit unaligned |                    |

#### AVX-512

512-bit FP/Integer 32 registers 8 mask registers Embedded rounding Embedded broadcast Scalar/SSE/AVX "promotions" HPC additions Transcendental support Gather/Scatter

#### Future Processors (KNL & SKX) in planning, subject to change

**SNB** 

2011

**HSW** 

2013

### AVX-512 Mask Registers

#### 8 Mask registers of size 64-bits

- k1-k7 can be used for predication
  - k0 can be used as a destination or source for mask manipulation operations

#### 4 different mask granularities. For instance, at 512b:

- Packed Integer Byte use mask bits [63:0]
  - VPADDB zmm1 {k1}, zmm2, zmm3
- Packed Integer Word use mask bits [31:0]
  - VPADDW zmm1 {k1}, zmm2, zmm3
- Packed IEEE FP32 and Integer Dword use mask bits [15:0]
  - VADDPS zmm1 {k1}, zmm2, zmm3
- Packed IEEE FP64 and Integer Qword use mask bits [7:0]
  - VADDPD zmm1 {k1}, zmm2, zmm3



|         |           | Vector Length |     |     |
|---------|-----------|---------------|-----|-----|
|         |           | 128           | 256 | 512 |
|         | Byte      | 16            | 32  | 64  |
|         | Word      | 8             | 16  | 32  |
| element | Dw ord/SP | 4             | 8   | 16  |
| size    | Qw ord/DP | 2             | 4   | 8   |

## AVX-512 Features (II): Masking VADDPS ZMM0 {k1}, ZMM3, [mem]

- Mask bits used to:
  - 1. Suppress individual elements read from memory
    - hence not signaling any memory fault
  - 2. Avoid actual independent operations within an instruction happening
    - and hence not signaling any FP fault
  - 3. Avoid the individual destination elements being updated,
    - or alternatively, force them to zero (zeroing)

Caveat: vector shuffles do not suppress memory fault Exceptions as mask refers to "output" not to "input"



**Optimization Notice** 

### Embedded Broadcasts

#### VFMADD231PS zmm1, zmm2, C {1to16}

- Scalars from memory are first class citizens
  - Broadcast one scalar from memory into all vector elements before operation
- Memory fault suppression avoids fetching the scalar if no mask bit is set to 1

#### Other "tuples" supported

- Memory only touched if at least one consumer lane needs the data
- For instance, when broadcast a tuple of 4 elements, the semantics check for every element being really used
  - E.g.: element 1 checks for mask bits 1, 5, 9, 13, ...

float32 A[N], B[N], C; for(i=0; i<8; i++) if(A[i]!=0.0) A[i] = A[i] + **C**\* B[i];

VBROADCASTSS zmm1 {k1}, [rax] VBROADCASTF64X2 zmm2 {k1}, [rax] VBROADCASTF32X4 zmm3 {k1}, [rax] VBROADCASTF32X8 zmm4, {k1}, [rax]



### AVX-512 Features: Embedded Rounding Control & SAE (Suppress All Exceptions)

#### **Embedded Rounding Control :**

- MXCSR.RC can be overridden on all FP instructions
  - VADDPS ZMM1 {k1}, ZMM2, [mem]  $\{1 \rightarrow 16\}$  {rne-sae}
- "Suspend All Exceptions"
  - Always implied by using embedded RC
    - NO MXCSR updates / exception reporting for <u>any</u> lane
    - Changes to RC without SAE via LDMXCSR
      - Not needed for most common case (truncating FP convert to int)

#### Only available for reg-reg mode and 512b operands

#### Main application:

- Saving, modifying and restoring MXCSR is usually slow and cumbersome
- Being able to avoid suppressions and set the rounding-mode on a per instruction basis simplifies development of high performance math software sequences (math libs)
  - E.g.: avoid spurious overflow/underflow reporting in intermediate computations
  - E.g: make sure that RM=rne regardless of the contents of MXCSR

### AVX-512F, CDI, ERI & PRI





### AVX-512F Designed for HPC

- Promotions of many AVX and AVX2 instructions to AVX-512
  - 32-bit and 64-bit floating-point instructions from AVX
    - Scalar and 512-bit
  - 32-bit and 64-bit integer instructions from AVX2
- Many new instructions to speedup HPC workloads



intel

### AVX-512{VL,DQ,BW}: Complements AVX-512F

Complete vector ISA extension shows up in Skylake Xeon

- Main focus on simplifying the task of auto-vectorization for \*any\* compiler
- Support for all data types: including 8-bit (byte) and 16-bit (word) integers
  - Useful for media and other workloads
- Support for all vector lengths
- Some instructions to speedup HPC workloads: closing KNL's AVX-512 gaps



### AVX-512VL: Vector Length Orthogonality

Some algorithms are "natural" at certain element counts

- Scalar = 1 element count
- float4 = 128-bit
- word 4x4 (media) = 256-bit
- 32 registers / broadcast / masking cannot be retroactively added to AVX

Auto-vectorization of loops with mixed datatypes

- Choose target for number of elements per iteration
  - 16 Single Precisions is one ZMM register, but...
  - 16 Words is a half a ZMM register aka YMM

But... why not just use the mask?

- potential mask bookkeeping overhead
- potential performance pitfalls now and in the future

Solution: Add vector length support for all AVX-512 packed instructions

- Every instruction is supported at 128-bit, 256-bit and 512-bit vector length
  - Ex: VADDPS xmm1 {k1}{z}, xmm2, xmm3 {1toN}

### AVX-512DQ: New HPC ISA (vs AVX512F)

64

#### AVX-512 HPC

VBROADCAST{F32X8,F64X2,I32X8,I64X2}

VBROADCAST{I32X2}

VEXTRACT{F32X8,F64X2,I32X8,I64X2}

VINSERT{F32X8,F64X2,I32X8,I64X2}

VCVT{,T}{PS,PD}2{QQ,UQQ}

VCVT{QQ,UQQ}2{PS,PD}

VCVT{,T}{PS,PD}2{QQ,UQQ}

VFPCLASS{PS,PD}

VRANGE{PS,PD}

VREDUCE{PS,PD}

VPMULLQ

K{AND,ANDN,OR,XNOR,XOR,NOT}B

K{MOV,ORTEST,SHIFR,SHIFTL}B

K{ADD,TEST}{B,W}

VPMOV{D2M,Q2M}, VPMOV{M2D,M2Q}

Extended Tuple support: 32X8, 64X2, 32X2

Int64 ⇔ FP conversions

Transcendental package enhancements

INT64 arithmetic support

Byte support for mask instructions

Expanded mask functionality

### AVX-512BW: Byte and Word Support

| AVX-512BW            | AVX-512BW             |                          |                          |
|----------------------|-----------------------|--------------------------|--------------------------|
| VPBROADCAST{B,W}     | VMOVDQU{8,16}         | KADD{D,Q}                | VPMOV{SX,ZX}BW           |
| VPSRLDQ, VPSLLDQ     | VPBLENDM{B,W}         | VPMOV{B2M,W2M,M2B,M2W}   | VPMUL{HRS,H,L}W          |
| VP{SRL,SRA,SLL}{V}W  | {KAND,KANDN}{D,Q}     | VPCMP{,EQ,GT}{B,W,UB,UW} | VPSADBW                  |
| VPMOV{WB,SWB, USWB}  | {KOR,KXNOR,KXOR}{D,Q} | VP{ABS,AVG}{B,W}         | VPSHUFB, VPSHUF{H,L}W    |
| VPTESTM{B,W}         | KNOT{D,Q}             | VP{ADD,SUB}{,S,US}{B,W}  | VP{SRA,SRL,SLL}{,V}{B,W} |
| VPMADW               | KORTEST{D,Q}          | VPALIGNR                 | VPUNPCK{H,L}{BW,WD}      |
| VPABSDIFFW           | KTEST{D,Q}            | VP{EXTR,INSR}{B,W}       |                          |
| VDBPSADBW            | KSHIFT{L,R}{D,Q}      | VPMADD{UBSW,WD}          |                          |
| VPERMW, VPERM{I,T}2W | KUNPACK{WD,DQ}        | VP{max,min}{s,u}{B,w}    |                          |

#### VPABSW zmm1 {k1}, zmm2





**Optimization Notice** 

### Enabling of AVX-512 in GNU toolchain



KNL support in GNU toolchain overview Support in binutils (gas/objdump) available from v2.24 glibc tuning not done so far

- memcpy, memset etc.
- Use of transcendental instructions from AVX-512ERI

Basic support in GCC available from GCC 4.9.x (see next slides)

Embedded rounding control autogeneration is not going to be supported in GCC

fe[get|set]round () is not acting as FP barrier in GCC

Usage of advanced encoding features supported in back-end only

- New meta-pattern called `define\_subst' introduced from GCC 4.8.x
- Using `subst' embedded masking, broadcasting and embedded rounding control were easily described in the backend

### AVX-512: New Patterns

Specific new patterns (est.)

# of instructions: 651 (w/ masking: 500, w/ rounding: 114, w/ msk and rnd: 100)

Total

(400 × 2) + (100 × 3) + (14 × 2) + (651 – 514) ≈ **1300** 

~ 5000 new intrinsics

### Solution: introduce `define\_subst'

Generate new pattern from existing

E.g. add masking and rounding

### Example (original pattern)

(define\_insn "\*<plusminus\_insn><mode>3"
 [(set (match\_operand:VF\_AVX512 0 "register\_operand" "=x,v")
 (plusminus:VF\_AVX512 (match\_operand:VF\_AVX512 1 "nonimmediate\_operand" "%0,v")
 (match\_operand:VF\_AVX512 2 "nonimmediate\_operand" "xm,vm"))



### Example (+mask)

(define\_insn "\*<plusminus\_insn><mode>3"
 [(set (match\_operand:VF\_AVX512 0 "register\_operand" "=x,v")
 (plusminus:VF\_AVX512 (match\_operand:VF\_AVX512 1 "nonimmediate\_operand" "%0,v")
 (match\_operand:VF\_AVX512 2 "nonimmediate\_operand" "xm,vm"))

(define\_insn "\*<plusminus\_insn><mode>3\_mask"
[(set (match\_operand:VF\_AVX512 0 "register\_operand" "=x,v")
 (vec\_merge:VF\_AVX512
 (plusminus:VF\_AVX512 (match\_operand:VF\_AVX512 1 "nonimmediate\_operand" "%0,v")
 (match\_operand:VF\_AVX512 2 "nonimmediate\_operand" "xm,vm"))
 (match\_operand:VF\_AVX512 3 "nonimmediate\_or\_const0\_operand" "0C,0C")
 (match\_operand:DI 4 "register\_operand" "k,k")

### Example(+rounding)

(define\_insn "\*<plusminus\_insn><mode>3"
 [(set (match\_operand:VF\_AVX512 0 "register\_operand" "=x,v")
 (plusminus:VF\_AVX512 (match\_operand:VF\_AVX512 1 "nonimmediate\_operand" "%0,v")
 (match\_operand:VF\_AVX512 2 "nonimmediate\_operand" "xm,vm"))

(define\_insn "\*<plusminus\_insn><mode>3\_mask"
[(set (match\_operand:VF\_AVX512 0 "register\_operand" "=x,v")
 (vec\_merge:VF\_AVX512
 (plusminus:VF\_AVX512 (match\_operand:VF\_AVX512 1 "nonimmediate\_operand" "%0,v")
 (match\_operand:VF\_AVX512 2 "nonimmediate\_operand" "xm,vm"))
 (match\_operand:VF\_AVX512 3 "nonimmediate\_or\_const0\_operand" "0C,0C")
 (match\_operand:DI 4 "register\_operand" "k,k")



### Example (+both)

(define\_insn "\*<plusminus\_insn><mode>3"
 [(set (match\_operand:VF\_AVX512 0 "register\_operand" "=x,v")
 (plusminus:VF\_AVX512 (match\_operand:VF\_AVX512 1 "nonimmediate\_operand" "%0,v")
 (match\_operand:VF\_AVX512 2 "nonimmediate\_operand" "xm,vm"))



### Example (using subst)

(define\_insn "\*<plusminus\_insn><mode>3<mask\_name>"
 [(set (match\_operand:VF\_AVX512 0 "register\_operand" "=x,v")
 (plusminus:VF\_AVX512
 (match\_operand:VF\_AVX512 1 "nonimmediate\_operand" "%0,v")
 (match\_operand:VF\_AVX512 2 "nonimmediate\_operand" "xm,vm"))

(define\_insn "\*<plusminus\_insn><mode>3"
 [(set (match\_operand:VF\_AVX512 0 "register\_operand" "=x,v")
 (plusminus:VF\_AVX512
 (match\_operand:VF\_AVX512 1 "nonimmediate\_operand" "%0,v")
 (match\_operand:VF\_AVX512 2 "nonimmediate\_operand" "xm,vm")))]

(define\_insn "\*<plusminus\_insn><mode>3\_mask"
[(set (match\_operand:VF\_AVX512 0 "register\_operand" "=x,v")
 (vec\_merge VF\_AVX512
 (plusminus:VF\_AVX512
 (match\_operand:VF\_AVX512 1 "nonimmediate\_operand" "%0,v")
 (match\_operand:VF\_AVX512 2 "nonimmediate\_operand" "xm,vm"))
 (match\_operand VF\_AVX512 3 "nonimmediate\_or\_const0\_operand" "0C,0C")
 (match\_operand.DI 4 "register\_operand" "Yk,Yk")

**Optimization Notice** 

### AVX-512: Embedded broadcasting



### Embedded broadcasting support in GCC



#### Implementation

#### Use substs to generate rtx patterns and rely on combiner

```
(define_subst "emb_bcst2"
 [(set (match_operand:BCST_V 0)
          (any_operator2:BCST_V
              (match_operand:BCST_V 1)
               (match_operand:BCST_V 2)))]
"TARGET_AVX512F"
 [(set (match_dup 0)
          (any_operator2:BCST_V
               (vec_duplicate:BCST_V
                    (match_operand:<ssescalarmode> 2 "memory_operand" "m"))
        (match_dup 1)))])
```



### Results

- Internal implementation done
- Performance gain is 0%
- Combiner can't eliminate broadcasts that are
  - Have multiple destinations
  - Reside in different BBs
- State of the art embedded broadcasting (icc) shows little icount gain
- Impact on icache can't be measured without hardware

#### Conclusion

• Patch not submitted - no performance gain (for now?)

# Support of new `scatter' instruction family

### AVX-512: Scatters overview

Allows vectorization of loops with stores which addresses can be represented as: Address [i] = BaseAddress + Index[i] \* Scale



#### VPSCATTERDD zmm0, ([rax], zmm1, 4) {k1}

Stores up to 16 elements (controlled by mask) to the memory location pointed by base address, index vector and scale. For successfully stored elements corresponding mask bits will be set to zero.



### Scatter Support in GCC

Patch which adds autogeneration of scatter instructions in case when one array is indexed by another array is ready

- Need to add autogeneration of scatter instructions for strided stores
- Expectations of performance improve from strided stores using scatters based on ICC 14:
  - SPEC 2006
    - 434.zeusmp >1.5%
  - NPB 3.3.1-SER.ClassW more than 1% of all executed instructions are scatters, so few percent of performance improve can be expected







### 512-bit auto-vectorization in GCC

7% icounts decrease (vs. AVX2 on SPECfp2006, -Ofast, ref data)

- New registers ZMM16–ZMM31 0.6% on SPECfp2006
- Extended standard patterns for the vectorizer:
  - Arithmetic and logic 4.0% on SPECfp2006
  - expand\_vector\_init 1.0% on SPECfp2006
  - FP division 0.6% on SPECfp2006
  - FMA 0.4% on SPECfp2006
  - copysign 0.4% on SPECfp2006
  - cmp, vec\_perm, unpack, extract, sqrt, rcp, floor, ceil, round, gather, reduction, etc.
- (No hotspots in SPEC CPU2006 benchmarks, where GCC can vectorize with VL=256, but can't vectorize with VL=512)

### Embedded masking autogeneration

Most promising feature of new encoding (EVEX)

- Preliminary investigation performed
  - Not-for-trunk proof-of-concept patch implemented, which shows about 1.5% of icount decrease on average in SpecFP2006
  - Vectorized loop tails
  - Vectorization of loop heads looks promising as well
- Applicable for if-conv optimization
- Masking of operation, not result, hence no redundant side effects, exceptions, memory accesses etc.



### Enabling of SKX in GNU toolchain



### Enabling of SKX in GNU toolchain

New ISA were published on July'18 2014

- Patch set for support new ISA in binutils (gas/objdump) was submitted
- Branch with support for GCC was created (avx512-skx)
  - Extended existing patterns (i386/sse.md) to support AVX-512VL,BW,DQ
  - Set of intrinsics covering new ISA was implemented
  - ... Covered by corresponding testsuite
  - Target for GCC 4.10.x
- No performance work was done so far
- glibc work was not performed

### Backup



**Optimization Notice** 

Copyright © 2014, Intel Corporation. All rights reserved. \*Other names and brands may be claimed as the property of others.

#### 8 in 32-bit mode } 256-bit width 8 x FP32 4 x FP64

#### AVX-512: VADDPS ZMM0, ZMM24, [mem]

AVX: VADDPS YMM0, YMM3, [mem]

Up to 32 AVX registers

Registers

Up to 16 AVX registers

- 8 in 32-bit mode
- 512-bit width
  - 16 x FP32
  - 8 x FP64

But you need many more features to use all that real estate effectively...





## Why Separate Mask Registers?

Don't waste away real vector registers for vector of booleans

Separate control flow from data flow

Boolean operations on logical predicates consume less energy (separate functional unit)

Tight encoding allows orthogonal operand

Every instruction now has an extra mask operand

# Why True Masking?

#### Memory fault suppression

- Vectorize code without touching memory that the correspondent scalar code would not touch
  - Typical examples are if-conditional statements or loop remainders
  - AVX is forced to use VMASKMOV\*

#### MXCSR flag updates and fault handlers

 Avoid spurious floating-point exceptions without having to inject neutral data

#### Zeroing/merging

- Use zeroing to avoid false dependencies in OOO architecture
- Use merging to avoid extra blends in if-then-else clauses (predication) for greater code density

float32 A[N], B[N], C[N]; for(i=0; i<16; i++) if(B[i] != 0) { A[i] = A[i] / B[i];else { A[i] = A[i] / C[i];VMOVUPS zmm2, A VCMPPS k1, zmm0, B, 4 VDIVPS zmm1 {k1}{z}, zmm2, B KNOT k2, k1 VDIVPS zmm1 {k2}, zmm2, C VMOVUPS A, zmm1



# AVX-512 Features: Compressed Displacement

#### VADDPS zmm1, zmm2, [rax+256]

- Observation is that displacement in generated vector code is a multiple of the actual operand size
  - An obvious side effect of unrolling
- Unfortunately, regular IA 8-bit displacement format have limited scope for 512-bit vector sizes (unrolling look-ahead of +/-2 at most)
  - So we would end up using 32-bit displacement formats too often

#### AVX-512 disp8\*N compressed displacement

- AVX-512 implicitly encodes a 8-bit displacement as a multiple of the actual size of the memory operand
  - VADDPD zmm1 {k1}, zmm2, [rax] memory size operand is 512bits
  - VADDPD xmm1 {k1}, xmm2, [rax] memory size operand is 128bits
  - VADDPD zmm1 {k1}, zmm2, [rax] {1toN} memory size operand is 64 bits
- Assembler/compiler reverts to 32-bit displacement when the real displacement is not a multiple



AVX-512 F: Common Xeon Phi (KNL) and Skylake Xeon Vector ISA Extension

> AVX-512 Foundation is the common SIMD foundation for HPC software development First on KNL Planned on SKX (Skylake Xeon)



## Quadword Integer Arithmetic

Useful for pointer manipulation 64-bit becomes a first class citizen Removes the need for expensive SW emulation sequences Note: VPMULQ and int64 <-> FP converts not in AVX-512 F

| Instruction                                        | Description                 |
|----------------------------------------------------|-----------------------------|
| <pre>VPADDQ zmm1 {k1}, zmm2, zmm3</pre>            | INT64 addition              |
| <pre>VPSUBQ zmm1 {k1}, zmm2, zmm3</pre>            | INT64 subtraction           |
| <pre>VP{SRA,SRL,SLL}Q zmm1 {k1}, zmm2, imm8</pre>  | INT64 shift (imm8)          |
| <pre>VP{SRA,SRL,SLL}VQ zmm1 {k1}, zmm2, zmm3</pre> | INT64 shift (variable)      |
| <pre>VP{MAX,MIN}Q zmm1 {k1}, zmm2, zmm3</pre>      | INT64 max, min              |
| <pre>VP{MAX,MIN}UQ zmm1 {k1}, zmm2, zmm3</pre>     | UINT64 max, min             |
| <pre>VPABSQ zmm1 {k1}, zmm2, zmm3</pre>            | INT64 absolute value        |
| <pre>VPMUL{DQ,UDQ} zmm1 {k1}, zmm2, zmm3</pre>     | 32x32 = 64 integer multiply |

## Math Support

| Instruction                        | • Good value u<br>• Available in F | with Math library writing<br>pside in financial applications<br>PS, PD, SS and SD data types<br>bination with embedded RC |
|------------------------------------|------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| VGETXEXP <sub>{PS,PD,SS,SD}</sub>  | zmm1 {k1}, zmm2                    | Obtain exponent in FP format                                                                                              |
| VGETMANT <sub>{PS,PD,SS,SD}</sub>  | zmm1 {k1}, zmm2                    | Obtain normalized mantissa                                                                                                |
| VRNDSCALE <sub>{PS,PD,SS,SD}</sub> | zmm1 {k1}, zmm2, imm8              | Round to scaled integral number                                                                                           |
| VSCALEF {PS,PD,SS,SD}              | zmm1 {k1}, zmm2, zmm3              | X*2 <sup>y</sup> , X <= getmant, Y <= getexp                                                                              |
| VFIXUPIMM <sub>{PS,PD,SS,SD}</sub> | zmm1, zmm2, zmm3, imm8             | Patch output numbers based on inputs                                                                                      |
| VRCP14 <sub>{PS,PD,SS,SD}</sub>    | zmm1 {k1}, zmm2                    | Approx. reciprocal() with rel. error 2 <sup>-14</sup>                                                                     |
| VRSQRT14 <sub>{PS,PD,SS,SD}</sub>  | zmm1 {k1}, zmm2                    | Approx. rsqrt() with rel. error 2 <sup>-14</sup>                                                                          |
| VDIV <sub>{PS,PD,SS,SD}</sub>      | zmm1 {k1}, zmm2, zmm3              | IEEE division                                                                                                             |
| $VSQRT_{\{PS,PD,SS,SD\}}$          | zmm1 {k1}, zmm2                    | IEEE square root                                                                                                          |



## New 2-Source Shuffles

#### 2-Src Shuffles VSHUF{PS,PD} Long standing customer request VPUNPCK{H,L}{DQ,QDQ} • 16/32-entry table lookup (transcendental support) VUNPCK{H,L}{PS,PD} • AOS $\Leftrightarrow$ SOA support, matrix transpose VPERM{I,D}2{D,Q,PS,PD} Variable VALIGN emulation VSHUF{F,I}32X4 EVEX.U1.512.NDS.66.0F38.W1 AVX3.1 Permute double-precision values Α V/V77 /r on floating-point in zmm3/mV and zmm2 using indexes in VPERMI2PD zmm1 {k1}{z}, zmm1 and store the result in $zmm2, zmm3/B_{64}(mV)$ zmm1 using writemask k1. zmm1 8 3 10 9 7 6 5 4 2 1 0 2 9 15 10 11 2 0 0 zmm2 zmm3 G' F' E' D' C' B' A' Е H' н G D С В F Α

H'

zmm1

C'

Α

D'

 $\boldsymbol{\cap}$ 

Β'

# Expand & Compress

#### Allows vectorization of conditional loops

- Opposite operation (compress) in AXV-512
- Similar to FORTRAN pack/unpack intrinsics
- Provides mem fault suppression
- Faster than alternative gather/scatter

#### VEXPANDPS zmm0 {k2}, [rax]

Moves compressed (consecutive) elements in register or memory to sparse elements in register (controlled by mask), with merging or zeroing

for(j=0, i=0; i<N; i++)

B[i] = A[i] \* C[i++];

if(C[i] != 0.0)



## **Bit Manipulation**

Basic bit manipulation operations on mask and vector operands

- Useful to manipulate mask registers
- Have uses in cryptography algorithms

| Instruction                                      | Description                        |
|--------------------------------------------------|------------------------------------|
| KUNPCKBW k1, k2, k3                              | Interleave bytes in k2 and k3      |
| KSHIFT{L,R}W k1, k2, imm8                        | Shift bits left/right using imm8   |
| <pre>VPROR{D,Q} zmm1 {k1}, zmm2, imm8</pre>      | Rotate bits right using imm8       |
| <pre>VPROL{D,Q} zmm1 {k1}, zmm2, imm8</pre>      | Rotate bits left using imm8        |
| <pre>VPRORV{D,Q} zmm1 {k1}, zmm2, zmm3/mem</pre> | Rotate bits right w/ variable ctrl |
| <pre>VPROLV{D,Q} zmm1 {k1}, zmm2, zmm3/mem</pre> | Rotate bits left w/ variable ctrl  |

inte

# VPTERNLOG – Ternary Logic Instruction

- Take every bit of three sources to obtain a 3-bit index N
  - Obtain Nth bit from imm8

#### VPTERNLOGD zmm0 {k2}, zmm15, zmm3/[rax], imm8

Any arbitrary truth table of 3 values can be implemented *andor, andxor, vote, parity, bitwise-cmov,* etc each column in the right table corresponds to imm8

| S1 | <b>S2</b> | <b>S</b> 3 | ANDOR      | VOTE (S1)?S3:S2 |
|----|-----------|------------|------------|-----------------|
| 0  | 0         | 0          | 00         | 0               |
| 0  | 0         | 1          | 10         | 1               |
| 0  | 1         | 0          | 00         | 0               |
| 0  | 1         | 1          | 1 1        | 1               |
| 1  | 0         | 0          | <b>0</b> 0 | 0               |
| 1  | 0         | 1          | 1 1        | 0               |
| 1  | 1         | 0          | 11         | 1               |
| 1  | 1         | 1          | 11         | 1               |
|    |           |            |            |                 |



# AVX-512 CDI: Conflict Detection Instructions



# Motivation for Conflict Detection

Sparse computations are common in HPC, but hard to vectorize due to race conditions

Consider the "histogram" problem:



- Code above is wrong if any values within B[i] are duplicated
  - Only one update from the repeated index would be registered!
- A solution to the problem would be to avoid executing the sequence gather-opscatter with vector of indexes that contain conflicts

# Conflict Detection Instructions in AVX512

VPCONFLICT instruction detects elements with previous conflicts in a vector of indexes

- Allows to generate a mask with a subset of elements that are guaranteed to be conflict free
- The computation loop can be re-executed with the remaining elements until all the indexes have been operated upon

#### **CDI instr.**

VPCONFLICT{D,Q} zmm1{k1}, zmm2/mem VPBROADCASTM{W2D,B2Q} zmm1, k2 VPTESTNM{D,Q} k2{k1}, zmm2, zmm3/mem VPLZCNT{D,Q} zmm1 {k1}, zmm2/mem

```
index = vload &B[i] // Load 16 B[i]
pending_elem = 0xFFFF; // all still remaining
do {
    curr_elem = get_conflict_free_subset(index, pending_elem)
    old_val = vgather {curr_elem} A, index // Grab A[B[i]]
    new_val = vadd old_val, +1.0 // Compute new values
    vscatter A {curr_elem}, index, new_val // Update A[B[i]]
    pending_elem = pending_elem ^ curr_elem // remove done idx
} while (pending_elem)
```

This not even the fastest version: see backup for details

# AVX-512 ERI & AVX-512 PRI: Xeon Phi Only



**Optimization Notice** 

Copyright © 2014, Intel Corporation. All rights reserved. \*Other names and brands may be claimed as the property of others.

## Xeon Phi Only Instructions

#### Set of segment-specific instruction extensions

- First appear on KNL
- Will be supported in all future Xeon Phi processors
- May or may not show up on a later Xeon processor

#### Address two HPC customer requests

- Ability to maximize memory bandwidth
  - Hardware prefetching is too restrictive
  - Conventional software prefetching results in instructions overhead
- Competitive support for transcendental sequences
  - Mostly division and square root
  - Differentiating factor in HPC/TPT

## AVX-512 ERI & PRI Description

| CPUID       | Instructions           | Description                                                                                  |
|-------------|------------------------|----------------------------------------------------------------------------------------------|
| AVX-512 PRI | PREFETCHWT1            | Prefetch cache line into the L2 cache with intent to write (RFO ring request)                |
|             | VGATHERPF{D,Q}{0,1}PS  | Prefetch vector of D/Qword indexes into the L1/L2 cache                                      |
|             | VSCATTERPF{D,Q}{0,1}PS | Prefetch vector of D/Qword indexes into the L1/L2 cache with intent to write                 |
| AVX-512 ERI | VEXP2{PS,PD}           | Computes approximation of 2 <sup>x</sup> with maximum relative error of 2 <sup>-23</sup>     |
|             | VRCP28{PS,PD}          | Computes approximation of reciprocal with max relative error of 2 <sup>-28</sup>             |
|             | VRSQRT28{PS,PD}        | Computes approximation of reciprocal square root with max relative error of 2 <sup>-28</sup> |

## AVX-512 ERI & PRI Motivation

| CPUID       | Instructions           | Motivation                                                                                                                                                                                                  |  |
|-------------|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| AVX-512 PRI | PREFETCHWT1            | Reduce ring traffic in core-to-core data communication                                                                                                                                                      |  |
|             | VGATHERPF{D,Q}{0,1}PS  | Reduce overhead of software prefetching:<br>dedicate side engine to prefetch sparse structures<br>while devoting the main CPU to pure raw flops                                                             |  |
|             | VSCATTERPF{D,Q}{0,1}PS |                                                                                                                                                                                                             |  |
| AVX-512 ERI | VEXP2{PS,PD}           | Speed-up key FSI workloads: Black-Scholes,<br>Montecarlo                                                                                                                                                    |  |
|             | VRCP28{PS,PD}          | Key building block to speed up most<br>transcendental sequences (in particular, division<br>and square root):<br>Increasing precision from 14=>28 allows to<br>reduce one complete Newton-Raphson iteration |  |
|             | VRSQRT28{PS,PD}        |                                                                                                                                                                                                             |  |







**Optimization Notice** 

## Summary

AVX512: new 512-bit vector ISA extension

Common between Xeon (SKL) and Xeon Phi (KNL)

#### AVX512VL, AVX512DQ, AVX512BW: complements AVX512

- Shows up first on Skylake Xeon
- Provides support for all data types and vector lengths

#### Conflict detection new instructions

- Improves autovectorization
- Common to Xeon and Xeon Phi

#### **TVX new instructions**

- 28-bit transcendentals and new prefetch instructions
- On Xeon Phi only

## Authors

- Mark Charney
- Jesus Corbal
- Roger Espasa
- Milind Girkar
- Moustapha Ould-ahmed-vall
- Ilya Tocar
- Bret Toll
- Bob Valentine
- Ilya Verbin
- Kirill Yukhin

inte

# Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

