(intel Look Inside"

## Intel ${ }^{\circ}$ Advanced Vector Extensions <br> 2015/2016 <br> Support in GNU Compiler Collection

GNU Tools Cauldron 2014

Presented by Kirill Yukhin of Intel, July 2014
(kirill.yukhin@intel.com)

# Legal Disclaimer \& Optimization Notice 

> INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

## Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision \#20110804

New ISA: What Is Where?


## Intel ${ }^{\circ}$ Advanced Vector Extensions



## Introducing AVX-512

## Intel® AVX Technology

| $\begin{gathered} 256 \mathrm{~b} \text { AVX1 } \\ 16 \text { SP / } 8 \text { DP } \\ \text { Flops/Cycle } \\ \hline \end{gathered}$ | 256b AVX2 <br> 32 SP / 16 DP Flops/Cycle (FMA) |
| :---: | :---: |
| AVX | AVX2 |
| 256-bit basic FP <br> 16 registers <br> NDS (and AVX128) <br> Improved blend <br> MASKMOV <br> Implicit unaligned | Float16 (IVB 2012) <br> 256-bit FP FMA <br> 256-bit integer <br> PERMD <br> Gather |

SNB
2011
HSW
2013

## AVX-512

512-bit FP/Integer
32 registers
8 mask registers
Embedded rounding
Embedded broadcast
Scalar/SSE/AVX "promotions"
HPC additions
Transcendental support
Gather/Scatter
Future Processors (KNL \& SKX)
in planning, subject to change

## AVX-512 Mask Registers

8 Mask registers of size 64-bits

- k1-k7 can be used for predication
- kO can be used as a destination or source for mask manipulation operations

4 different mask granularities.
For instance, at 512b:

- Packed Integer Byte use mask bits [63:0]
- VPADDB zmm1 \{k1\}, zmm2, zmm3

VADDPD zmm1 \{k1\}, zmm2, zmm3


- Packed Integer Word use mask bits [31:0]
- VPADDW zmm1 \{k1\}, zmm2, zmm3
- Packed IEEE FP32 and Integer Dword use mask bits [15:0]
- VADDPS zmm1 \{k1\}, zmm2, zmm3
- Packed IEEE FP64 and Integer Qword use mask bits [7:0]

| element size |  | Vector Length |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  |  | 128 | 256 | 512 |
|  | Byte | 16 | 32 | 64 |
|  | Word | 8 | 16 | 32 |
|  | Dw ord/SP | 4 | 8 | 16 |
|  | Qw ord/DP | 2 | 4 | 8 |

- VADDPD zmm1 \{k1\}, zmm2, zmm3


## AVX-512 Features (II): Masking VADDPS ZMMO \{k1\}, ZMM3, [mem]

- Mask bits used to:

1. Suppress individual elements read from memory

- hence not signaling any memory fault

2. Avoid actual independent operations within an instruction happening

- and hence not signaling any FP fault

3. Avoid the individual destination elements
```
for (I in vector length)
{
    if (no_masking or mask[I]) {
        dest[I] = OP(src2, src3)
    } else {
        if (zeroing_masking)
            dest[I] = 0
        else
            // dest[I] is preserved
    }
}
``` being updated,
- or alternatively, force them to zero (zeroing)

Caveat: vector shuffles do not suppress memory fault Exceptions as mask refers to "output" not to "input"

\section*{Embedded Broadcasts}

VFMADD231PS zmm1, zmm2, C \{1to16\}
- Scalars from memory are first class citizens
- Broadcast one scalar from memory into all vector elements before operation
- Memory fault suppression avoids fetching the scalar if no mask bit is set to 1

\section*{Other "tuples" supported}
- Memory only touched if at least one consumer lane needs the data
- For instance, when broadcast a tuple of 4 elements, the semantics check for every element being really used
- E.g.: element 1 checks for mask bits 1, 5, 9, 13, ...
...
...
\[
\begin{aligned}
& \text { VBROADCASTSS zmm1 \{k1\}, [rax] } \\
& \text { VBROADCASTF64X2 zmm2 }\{\mathrm{k} 1\} \text {, [rax] } \\
& \text { VBROADCASTF32X4 } \mathrm{zmm} 3\{\mathrm{k} 1\} \text {, [rax] } \\
& \text { VBROADCASTF32X8 } \mathrm{zmm} 4, \text {, } k 1\} \text {, [rax] }
\end{aligned}
\]
\[
\text { float32 } A[N], B[N], C ;
\]
\}
```

for(i=0; i<8; i++)

```
for(i=0; i<8; i++)
```

for(i=0; i<8; i++)
\{
\{
\{
$i f(A[i]!=0.0)$
$i f(A[i]!=0.0)$
$i f(A[i]!=0.0)$
$A[i]=A[i]+C^{*} B[i] ;$

```
        \(A[i]=A[i]+C^{*} B[i] ;\)
```

        \(A[i]=A[i]+C^{*} B[i] ;\)
    ```
\(\qquad\)

\section*{AVX-512 Features: Embedded Rounding Control \& SAE (Suppress All Exceptions)}

\section*{Embedded Rounding Control :}
- MXCSR.RC can be overridden on all FP instructions
- VADDPS ZMM1 \{k1\}, ZMM2, [mem] \{1 \(\rightarrow 16\}\) \{rne-sae\}
- "Suspend All Exceptions"
- Always implied by using embedded RC
- NO MXCSR updates / exception reporting for any lane
- Changes to RC without SAE via LDMXCSR
- Not needed for most common case (truncating FP convert to int)

Only available for reg-reg mode and 512b operands

\section*{Main application:}
- Saving, modifying and restoring MXCSR is usually slow and cumbersome
- Being able to avoid suppressions and set the rounding-mode on a per instruction basis simplifies development of high performance math software sequences (math libs)
- E.g.: avoid spurious overflow/underflow reporting in intermediate computations
- E.g: make sure that RM=rne regardless of the contents of MXCSR

\section*{AVX-512F, CDI, ERI \& PRI}

\section*{AVX-512 F: 512-bit instructions common between Xeon Phi and Xeon}
- Comprehensive vector extension for HPC and enterprise
\(\square\) All the key AVX-512 features: masking, broadcast...
32-bit and 64-bit integer and floating-point instructions
- Promotion of many AVX and AVX2 instructions to AVX-512
\(\square\) Many new instructions added to accelerate HPC workloads

\section*{AVX-512 CDI (Conflict Detection): Available on Xeon Phi first}
\(\square\) Allow vectorization of loops with possible address conflict Will show up on Xeon in SKL or CNL (follow up to SKL)

\section*{AVX-512 ERI \& PRI: Available on Xeon Phi only}

28-bit precision RCP, RSQRT and EXP transcendentals
\(\square\) New prefetch instructions: gather/scatter prefetches and PREFETCHWT1

\section*{AVX-512F Designed for HPC}
- Promotions of many AVX and AVX2 instructions to AVX-512
- 32-bit and 64-bit floating-point instructions from AVX
- Scalar and 512-bit
- 32-bit and 64-bit integer instructions from AVX2
- Many new instructions to speedup HPC workloads


\section*{AVX-512\{VL,DQ,BW\}: Complements AVX-512F}

Complete vector ISA extension shows up in Skylake Xeon
- Main focus on simplifying the task of auto-vectorization for *any* compiler
- Support for all data types: including 8-bit (byte) and 16-bit (word) integers
- Useful for media and other workloads
- Support for all vector lengths
- Some instructions to speedup HPC workloads: closing KNL's AVX-512 gaps


\section*{AVX-512VL: Vector Length Orthogonality}

Some algorithms are "natural" at certain element counts
- Scalar = 1 element count
- float4 = 128-bit
- word \(4 \times 4\) (media) \(=256\)-bit
- 32 registers / broadcast / masking cannot be retroactively added to AVX

Auto-vectorization of loops with mixed datatypes
- Choose target for number of elements per iteration
- 16 Single Precisions is one ZMM register, but...
- 16 Words is a half a ZMM register aka YMM

But... why not just use the mask?
- potential mask bookkeeping overhead
- potential performance pitfalls now and in the future

Solution: Add vector length support for all AVX-512 packed instructions
- Every instruction is supported at 128-bit, 256-bit and 512-bit vector length
- Ex: VADDPS xmm1 \{k1\}\{z\}, xmm2, xmm3 \{1toN\}

\section*{AVX-512DQ: New HPC ISA (vs AVX512F)}

AVX-512 HPC


\section*{Int64 \(\Leftrightarrow\) FP conversions}

\section*{Transcendental package enhancements}

\section*{INT64 arithmetic support}

Byte support for mask instructions
Expanded mask functionality

\section*{AVX-512BW: Byte and Word Support}
\begin{tabular}{|c|c|c|}
\hline AVX-512BW & AVX-512BW & \\
\hline VPBROADCAST\{B,W\} & VMOVDQU\{8,16\} & KADD \(\{\mathrm{D}, \mathrm{Q}\}\) \\
\hline VPSRLDQ, VPSLLDQ & VPBLENDM\{B,W\} & VPMOV\{B2M,W2M,M2B,M2W\} \\
\hline VP\{SRL,SRA,SLL\} \({ }^{\text {V }}\) \}W & \{KAND,KANDN\}D,Q\} & VPCMP\{,EQ,GT\} \({ }^{\text {, }}\),W,UB,UW\} \\
\hline VPMOV\{WB,SWB, USWB & \{KOR,KXNOR,KXOR\}\{D,Q\} & VP\{ABS,AVG\}B,W\} \\
\hline VPTESTM \(\{B, W\}\) & KNOT\{D,Q\} & VP\{ADD,SUBX,S,US\}\{B,W\} \\
\hline VPMADW & KORTEST\{D,Q\} & VPALIGNR \\
\hline VPABSDIFFW & KTEST\{D,Q\} & VP\{EXTR,INSR\}\{B,W\} \\
\hline VDBPSADBW & KSHIFT\{L,R\}\{D,Q\} & VPMADD\{UBSW,WD\} \\
\hline VPERMW, VPERM\{I,T\}2W & KUNPACK\{WD,DQ\} & VP\{MAX,MIN\}SS,U\}B,W\} \\
\hline
\end{tabular}

VPABSW zmm1 \{k1\}, zmm2

(inter

\section*{Enabling of AVX-512 in GNU toolchain}

\section*{KNL support in GNU toolchain overview} Support in binutils (gas/objdump) available from v2.24 glibc tuning not done so far
- memcpy, memset etc.
- Use of transcendental instructions from AVX-512ERI

Basic support in GCC available from GCC 4.9.x (see next slides) Embedded rounding control autogeneration is not going to be supported in GCC
- fe[get|set]round () is not acting as FP barrier in GCC

Usage of advanced encoding features supported in back-end only
- New meta-pattern called `define_subst' introduced from GCC 4.8.x
- Using `subst' embedded masking, broadcasting and embedded rounding control were easily described in the backend

\section*{AVX-512: New Patterns}

Specific new patterns (est.)
\# of instructions: 651 (w/ masking: 500, w/ rounding: 114, w/ msk and rnd: 100)

Total
\[
(400 \times 2)+(100 \times 3)+(14 \times 2)+(651-514) \approx 1300
\]
~ 5000 new intrinsics

\section*{Solution: introduce `define_subst’}

Generate new pattern from existing
E.g. add masking and rounding

\section*{Example (original pattern)}
```

(define_insn "*<plusminus_insn><mode>3"
[(set (match_operand:VF_AVX512 0 "register_operand" "=x,v")
(plusminus:VF_AVX512 (match_operand:VF_AVX512 1 "nonimmediate_operand" "%0,v")
(match_operand:VF_AVX512 2 "nonimmediate_operand" "xm,vm"))

```

\section*{Example (+mask)}
```

(define_insn "*<plusminus_insn><mode>3"
[(set (match_operand:VF_AVX512 0 "register_operand" "=x,v")
(plusminus:VF_AVX512 (match_operand:VF_AVX512 1 "nonimmediate_operand" "%0,v")
(match_operand:VF_AVX512 2 "nonimmediate_operand" "xm,vm"))

```
```

(define_insn "*<plusminus_insn><mode>3_mask"
[(set (match_operand:VF_AVX512 0 "register_operand" "=x,v")
(vec_merge:VF_AVX512
(plusminus:VF_AVX512 (match_operand:VF_AVX512 1 "nonimmediate_operand" "%0,v")
(match_operand:VF_AVX512 2 "nonimmediate_operand" "xm,vm"))
(match_operand:VF_AVX512 3 "nonimmediate_or_const0_operand" "OC,OC")
(match_operand:DI 4 "register_operand" "k,k")

```

\section*{Example(+rounding)}
```

(define_insn "*<plusminus_insn><mode>3"
[(set (match_operand:VF_AVX512 0 "register_operand" "=x,v")
(plusminus:VF_AVX512 (match_operand:VF_AVX512 1 "nonimmediate_operand" "%0,v")
(match_operand:VF_AVX512 2 "nonimmediate_operand" "xm,vm"))

```
(define_insn "*<plusminus_insn><mode>3_mask"
[(set (match_operand:VF_AVX512 0 "register_operand" "=x,v")
    (vec_merge:VF_AVX512
        (plusminus:VF_AVX512 (match_operand:VF_AVX512 1 "nonimmediate_operand" "\%0,v")
                                (match_operand:VF_AVX512 2 "nonimmediate_operand" "xm,vm"))
    (match_operand:VF_AVX512 3 "nonimmediate_or_const0_operand" "OC,OC")
    (match_operand:DI 4 "register_operand" "k,k")
```

(define_insn "*<plusminus_insn><mode>3_round"
[(parallel [(set (match_operand:VF_AVX512 0 "register_operand" "=x,x")
(plusminus:VF_AVX512 (match_operand:VF_AVX512 1 "nonimmediate_operand" "%0,v")
(match_operand:VF_AVX512 2 "nonimmediate_operand" "xm,vm"))
(match_operand:SI 3 "const_4_to_8_operand" "n,n")

```

\section*{Example (+both)}
(define_insn "*<plusminus_insn><mode>3"
[(set (match_operand:VF_AVX512 0 "register_operand" " \(=\mathrm{x}, \mathrm{v}\) ")
(plusminus:VF_AVX512 (match_operand:VF_AVX512 1 "nonimmediate_operand" "\%0,v")
(match_operand:VF_AVX512 2 "nonimmediate_operand" "xm,vm"))
(define_insn "*<plusminus_insn><mode>3_mask"
[(set (match_operand:VF_AVX512 0 "register_operand" "=x,v")
(vec_merge:VF_AVX512
(plusminus:VF_AVX512 (match_operand:VF_AVX512 1 "nonimmediate_operand" "\%0,v")
(match_operand:VF_AVX512 2 "nonimmediate_operand" "xm,vm"))
(match_operand:VF_AVX512 3 "nonimmediate_or_const0_operand" "OC,OC")
(match_operand:DI 4 "register_operand" "k,k")
```

(define_insn "*<plusminus_insn><mode>3_round"
[(parallel [(set (match_operand:VF_AVX512 0 "register_operand" "=x,x")
(plusminus:VF_AVX512 (match_operand:VF_AVX512 1 "nonimmediate_operand" "%0,v")
(match_operand:VF_AVX512 2 "nonimmediate_operand" "xm, vm"))
(match_operand:SI 3 "const_4_to_8_operand" "n,n")

```
(define_insn "*<plusminus_insn><mode>3_mask_round"
    [(parallel (set (match_operand:VF_AVX512 0 "register_operand" "=x, v")
            (vec_merge:VF AVX512
                (plusminus:VF_AVX512 (match_operand:VF_AVX512 1 "nonimmediate_operand" "\%0,v")
                    (match_operand:VF_AVX512 2 "nonimmediate_operand" "xm, vm"))
            (match_operand:VF_AVX512 3 "nonimmediate_or_const0_operand" "0C,0C")
            (match_operand:DI 4 "register_operand" "k,k")
    (match_operand:SI 3 "const_4_to_8_operand" "n,n")

\section*{define_subst (example)}
(define_subst "mask"
[(set (match_operand 0)
(match_operand 1))]
"TARGET_MASK"
[(set (match_dup 0)

Added to condition of new pattern
(vec_merge:VF_512 (match_dup 1)
 (match_operand:VF_512 2 "register_operand" "0C") (match_operand:<at> 3 "register_operand" "Yk")))])
```

iterator_attribute

```

\section*{Example (using subst)}
(define_insn "*<plusminus_insn><mode>3<mask_name>"
[(set (match_operand:VF_AVX512 0 "register_operand" "=x,v")
(plusminus:VF_AVX512
(match_operand:VF_AVX512 1 "nonimmediate_operand" "\%0,v")
(match_operand:VF_AVX512 2 "nonimmediate_operand" "xm,vm"))
(define_insn "*<plusminus_insn><mode>3"
[(set (match_operand:VF_AVX512 0 "register_operand" "=x,v")
(plusminus:VF_AVX512
(match_operand:VF_AVX512 1 "nonimmediate_operand" "\%0,v")
(match_operand:VF_AVX512 2 "nonimmediate_operand" "xm,vm")))]
(define_insn "*<plusminus_insn><mode>3_mask"
[(set (match_operand:VF_AVX512 0 "register_operand" "=x,v")
(vec_merge:VF_AVX512
(plusminus:VF_AVX512
(match_operand:VF_AVX512 1 "nonimmediate_operand" "\%0,v")
(match_operand:VF_AVX512 2 "nonimmediate_operand" "xm,vm"))
(match_operand:VF_AVX512 3 "nonimmediate_or_constO_operand" "OC,OC")
(match_operand:DI 4 "register_operand" "Yk,Yk")

\section*{AVX-512: Embedded broadcasting}

\section*{Embedded broadcasting support in GCC}

\section*{GOAL}

VBROADCAST (\%rax),zmm3
VADDPD zmm3,zmm2,zmm1


VADDPD (\%rax)\{1to8\},zmm2,zmm1

\section*{Implementation}

Use substs to generate rtx patterns and rely on combiner
```

(define_subst "emb_bcst2"
[(set (match_operand:BCST_V 0)
(any_operator2:BCST_V
(match_operand:BCST_V 1)
(match_operand:BCST_V 2)))]
"TARGET AVX512F"
[(set (match_dup 0)
(any_operator2:BCST_V
(vec_duplicate:BCST_V
(match_operand:<ssescalarmode> 2 "memory_operand" "m"))
(match_dup 1)))])

```

\section*{Results}
- Internal implementation done
- Performance gain is 0\%
- Combiner can't eliminate broadcasts that are
- Have multiple destinations
- Reside in different BBs
- State of the art embedded broadcasting (icc) shows little icount gain
- Impact on icache can't be measured without hardware

Conclusion
- Patch not submitted - no performance gain (for now?)

\section*{Support of new `scatter’ instruction family}

\section*{AVX-512: Scatters overview}

\section*{Allows vectorization of loops with stores} which addresses can be represented as: Address [i] = BaseAddress + Index[i] * Scale
\[
\begin{aligned}
& \text { for }(i=0 ; i<N ; i++) \\
& \left\{\begin{array}{l}
\text { A } B[i]+3]=C[i] ; \\
\}
\end{array}\right.
\end{aligned}
\]

\section*{VPSCATTERDD zmm0, ([rax], zmm1, 4) \{k1\}}

Stores up to 16 elements (controlled by mask) to the memory location pointed by base address, index vector and scale. For successfully stored elements corresponding mask bits will be set to zero.


\section*{Scatter Support in GCC}

Patch which adds autogeneration of scatter instructions in case when one array is indexed by another array is ready
- Need to add autogeneration of scatter instructions for strided stores
- Expectations of performance improve from strided stores using scatters based on ICC 14:
- SPEC 2006
- 434.zeusmp - >1.5\%
- NPB 3.3.1-SER.ClassW - more than 1\% of all executed instructions are scatters, so few percent of performance improve can be expected

Array indexing another array:
for ( \(i=0 ; i<N ; i++\) )
\{
\[
A[B[i]+3]=C[i] ;
\]
\}

Strided store:
for ( \(i=0 ; i<N ; i++\) )
\{
\[
A\left[5^{\star} i+3\right]=B[i] ;
\]
\}

\section*{512-bit auto-vectorization in GCC}

7\% icounts decrease (vs. AVX2 on SPECfp2006, -Ofast, ref data)
- New registers ZMM16-ZMM31 - 0.6\% on SPECfp2006
- Extended standard patterns for the vectorizer:
- Arithmetic and logic - 4.0\% on SPECfp2006
- expand_vector_init - \(1.0 \%\) on SPECfp2006
- FP division - \(0.6 \%\) on SPECfp2006
- FMA - 0.4\% on SPECfp2006
- copysign - \(0.4 \%\) on SPECfp2006
- cmp, vec_perm, unpack, extract, sqrt, rcp, floor, ceil, round, gather, reduction, etc.
- (No hotspots in SPEC CPU2006 benchmarks, where GCC can vectorize with VL=256, but can't vectorize with VL=512)

\section*{Embedded masking autogeneration}

Most promising feature of new encoding (EVEX)
- Preliminary investigation performed
- Not-for-trunk proof-of-concept patch implemented, which shows about 1.5\% of icount decrease on average in SpecFP2006
- Vectorized loop tails
- Vectorization of loop heads looks promising as well
- Applicable for if-conv optimization
- Masking of operation, not result, hence no redundant side effects, exceptions, memory accesses etc.

\section*{Enabling of SKX in GNU toolchain}

\section*{Enabling of SKX in GNU toolchain}

New ISA were published on July'18 2014
- Patch set for support new ISA in binutils (gas/objdump) was submitted
- Branch with support for GCC was created (avx512-skx)
- Extended existing patterns (i386/sse.md) to support AVX-512VL,BW,DQ
- Set of intrinsics covering new ISA was implemented
- ... Covered by corresponding testsuite
- Target for GCC 4.10.x
- No performance work was done so far
- glibc work was not performed

\section*{Backup}

\section*{AVX-512 features (I): More \& Bigger}

\section*{Registers}

AVX: VADDPS YMMO, YMM3, [mem]
- Up to 16 AVX registers
- 8 in 32-bit mode
- 256-bit width
- \(8 \times\) FP32
- 4 x FP64

AVX-512: VADDPS ZMMO, ZMM24, [mem]
- Up to 32 AVX registers
- 8 in 32-bit mode
- 512-bit width
- \(16 \times\) FP32
- \(8 \times\) FP64

But you need many more features to use all that real estate effectively...
float32 A[N], B[N];
for(i=0; i<8; i++)
\{
\[
A[i]=A[i]+B[i] ;
\]
\}
float32 A[N], B[N];
for(i=0; i<16; i++)
\{
\[
A[i]=A[i]+B[i] ;
\]
\}
\[
\text { \} }
\]

\section*{Why Separate Mask Registers?}

Don't waste away real vector registers for vector of booleans

Separate control flow from data flow

Boolean operations on logical predicates consume less energy (separate functional unit)

Tight encoding allows orthogonal operand
- Every instruction now has an extra mask operand

\section*{Why True Masking?}

\section*{Memory fault suppression}
- Vectorize code without touching memory that the correspondent scalar code would not touch
- Typical examples are if-conditional statements or loop remainders
- AVX is forced to use VMASKMOV*

\section*{MXCSR flag updates and fault handlers}
- Avoid spurious floating-point exceptions without having to inject neutral data
Zeroing/merging
- Use zeroing to avoid false dependencies in OOO architecture
- Use merging to avoid extra blends in if-then-else clauses (predication) for greater code density
```

float32 A[N], B[N], C[N];

```
for (i=0; i<16; i++)
\{
    if(B[i] != 0) \{
        \(A[i]=A[i] / B[i] ;\)
    else \{
        \(A[i]=A[i] / C[i] ;\)
    \}
\}

VMOVUPS zmm2, A VCMPPS k1, zmm0, B, 4 VDIVPS zmm1 \{k1\}\{z\}, zmm2, B KNOT k2, k1
VDIVPS zmm1 \{k2\}, zmm2, C VMOVUPS A, zmm1

\section*{AVX-512 Features: Compressed Displacement \\ VADDPS zmm1, zmm2, [rax+256]}
- Observation is that displacement in generated vector code is a multiple of the actual operand size
- An obvious side effect of unrolling
- Unfortunately, regular IA 8-bit displacement format have limited scope for 512-bit vector sizes (unrolling look-ahead of \(+/-2\) at most)
- So we would end up using 32-bit displacement formats too often

\section*{AVX-512 disp8*N compressed displacement}
- AVX-512 implicitly encodes a 8-bit displacement as a multiple of the actual size of the memory operand
- VADDPD zmm1 \{k1\}, zmm2, [rax] memory size operand is 512bits
- VADDPD xmm1 \{k1\}, xmm2, [rax] memory size operand is 128bits
- VADDPD zmm1 \{k1\}, zmm2, [rax] \{1toN\} memory size operand is 64 bits
- Assembler/compiler reverts to 32-bit displacement when the real displacement is not a multiple

\section*{AVX-512 F: Common Xeon Phi (KNL) and Skylake Xeon Vector ISA Extension}

AVX-512 Foundation is the common SIMD foundation for HPC software development
First on KNL
Planned on SKX (Skylake Xeon)

\section*{Quadword Integer Arithmetic}

\section*{Useful for pointer manipulation 64-bit becomes a first class citizen Removes the need for expensive SW emulation sequences Note: VPMULQ and int64 <-> FP converts not in AVX-512 F}
\begin{tabular}{|l|l|}
\hline Instruction & Description \\
\hline VPADDQ zmm1 \{k1\}, zmm2, zmm3 & INT64 addition \\
\hline VPSUBQ zmm1 \{k1\}, zmm2, zmm3 & INT64 subtraction \\
\hline VP\{SRA, SRL, SLL\}Q zmm1 \{k1\}, zmm2, imm8 & INT64 shift (imm8) \\
\hline VP\{SRA, SRL, SLL\}VQ zmm1 \{k1\}, zmm2, zmm3 & INT64 shift (variable) \\
\hline VP\{MAX, MIN\}Q zmm1 \{k1\}, zmm2, zmm3 & INT64 max, min \\
\hline VP\{MAX, MIN\}UQ zmm1 \{k1\}, zmm2, zmm3 & UINT64 max, min \\
\hline VPABSQ zmm1 \{k1\}, zmm2, zmm3 & INT64 absolute value \\
\hline VPMUL\{DQ,UDQ\} zmm1 \{k1\}, zmm2, zmm3 & \(32 \times 32=64\) integer multiply \\
\hline
\end{tabular}

\section*{Math Support}

\section*{Package to aid with Math library writing}
- Good value upside in financial applications
\begin{tabular}{|c|}
\hline Instruction \\
\hline \(\mathrm{VGETXEXP}_{\text {\{PS,PD,SS,SD }\}}\) \\
\hline VGETMANT \({ }_{\text {\{PS, PD,SS,SD }}\) \\
\hline VRNDSCALE \(_{\text {[PS,PD,SS,SD }}\) \\
\hline \(\mathrm{VSCALEF}_{\text {\{PS,PD,SS,SD }}\) \\
\hline VFIXUPIMM \({ }_{\text {\{PS,PD,SS,SD }}\) \\
\hline VRCP14 \({ }_{\text {\{PS, PD, SS,SD\} }}\) \\
\hline VRSQRT14 \({ }_{\text {[PS, PD, SS, SD }\}}\) \\
\hline \(\mathrm{VDIV}_{\text {\{PS,PD, SS,SD\} }}\) \\
\hline \(\mathrm{VSQRT}_{\text {\{PS, PD, SS,SD\} }}\) \\
\hline
\end{tabular}
\begin{tabular}{|l|l|}
\hline zmm1 \{k1\}, zmm2 & Obtain exponent in FP format \\
\hline zmm1 \{k1\}, zmm2 & Obtain normalized mantissa \\
\hline zmm1 \{k1\}, zmm2, imm8 & Round to scaled integral number \\
\hline zmm1 \{k1\}, zmm2, zmm3 & \(\mathrm{X}^{*} 2^{y}, \mathrm{X}\) < = getmant, Y <= getexp \\
\hline zmm1, zmm2, zmm3, imm8 & Patch output numbers based on inputs \\
\hline zmm1 \{k1\}, zmm2 & Approx. reciprocal() with rel. error 2-14 \\
\hline zmm1 \{k1\}, zmm2 & Approx. rsqrt() with rel. error 2-14 \\
\hline zmm1 \{k1\}, zmm2, zmm3 & IEEE division \\
\hline zmm1 \{k1\}, zmm2 & IEEE square root \\
\hline
\end{tabular}

\section*{New 2-Source Shuffles}

\section*{2-Src Shuffles}
\begin{tabular}{|l|}
\hline VSHUF\{PS,PD \(\}\) \\
\hline VPUNPCK\{H,L\}\{DQ,QDQ \(\}\) \\
\hline VUNPCK\{H,L\}\{PS,PD \(\}\) \\
\hline VPERM\{I,D\}2\{D,Q,PS,PD \(\}\) \\
\hline VSHUF\{F,I\}32X4 \\
\hline
\end{tabular}

\section*{Long standing customer request}
- 16/32-entry table lookup (transcendental support)
- AOS \(\Leftrightarrow\) SOA support, matrix transpose
- Variable VALIGN emulation


\section*{Expand \& Compress}

Allows vectorization of conditional loops
- Opposite operation (compress) in AXV-512
- Similar to FORTRAN pack/unpack intrinsics
- Provides mem fault suppression
- Faster than alternative gather/scatter
for \((j=0, i=0 ; i<N ; i++)\)
\{
    \(i f(C[i]!=0.0)\)
    \{
        \(B[i]=A[i] * C\left[j^{++}\right] ;\)
    \}
\}

\section*{VEXPANDPS zmm0 \{k2\}, [rax]}

Moves compressed (consecutive) elements in register or memory to sparse elements in register (controlled by mask), with merging or zeroing


\section*{Bit Manipulation}

\section*{Basic bit manipulation operations on mask and vector operands \\ - Useful to manipulate mask registers \\ - Have uses in cryptography algorithms}
\begin{tabular}{|l|l|}
\hline Instruction & Description \\
\hline KUNPCKBW k1, k2, k3 & Interleave bytes in k2 and k3 \\
\hline KSHIFT\{L, R\}W k1, k2, imm8 & Shift bits left/right using imm8 \\
\hline VPROR\{D,Q\} zmm1 \{k1\}, zmm2, imm8 & Rotate bits right using imm8 \\
\hline VPROL\{D,Q\} zmm1 \{k1\}, zmm2, imm8 & Rotate bits left using imm8 \\
\hline VPRORV\{D, Q\} zmm1 \{k1\}, zmm2, zmm3/mem & Rotate bits right w/ variable ctrl \\
\hline VPROLV\{D, Q\} zmm1 \{k1\}, zmm2, zmm3/mem & Rotate bits left w/ variable ctrl \\
\hline
\end{tabular}

\section*{VPTERNLOG - Ternary Logic Instruction Mimics a FPGA cell}
- Take every bit of three sources to obtain a 3-bit index N
- Obtain Nth bit from imm8

\section*{VPTERNLOGD zmm0 \{k2\}, zmm15, zmm3/[rax], imm8}

Any arbitrary truth table of 3 values can be implemented andor, andxor, vote, parity, bitwise-cmov, etc each column in the right table corresponds to imm8
\begin{tabular}{|c|c|c|c|c|}
\hline S1 & S2 & S3 & ANDOR & VOTE (S1)?S3:S2 \\
\hline 0 & 0 & 0 & 00 & 0 \\
\hline 0 & 0 & 1 & 10 & 1 \\
\hline 0 & 1 & 0 & 00 & 0 \\
\hline 0 & 1 & 1 & 11 & 1 \\
\hline 1 & 0 & 0 & 00 & 0 \\
\hline 1 & 0 & 1 & 11 & 0 \\
\hline 1 & 1 & 0 & 11 & 1 \\
\hline 1 & 1 & 1 & 11 & 1 \\
\hline
\end{tabular}


Dest[i]

\section*{AVX-512 CDI: Conflict Detection Instructions}

\section*{Motivation for Conflict Detection}

Sparse computations are common in HPC, but hard to vectorize due to race conditions

Consider the "histogram" problem:

- Code above is wrong if any values within B[i] are duplicated
- Only one update from the repeated index would be registered!
- A solution to the problem would be to avoid executing the sequence gather-opscatter with vector of indexes that contain conflicts

\section*{Conflict Detection Instructions in}

\section*{AVX512}

\section*{VPCONFLICT instruction detects elements with}

\section*{previous conflicts in a vector of indexes}
- Allows to generate a mask with a subset of elements that are guaranteed to be conflict free
- The computation loop can be re-executed with the remaining elements until all the indexes have been

\section*{CDI instr.}

VPCONFLICT\{D,Q\} zmm1\{k1\}, zmm2/mem VPBROADCASTM\{W2D,B2Q\} zmm1, k2

VPTESTNM\{D,Q\} k2\{k1\}, zmm2, zmm3/mem VPLZCNT\{D,Q\} zmm1 \{k1\}, zmm2/mem operated upon
```

index = vload \&B[i] // Load 16 B[i]
pending_elem = 0xFFFF;
do {
curr_elem = get_conflict_free_subset(index, pending_elem)
old_val = vgather {curr_elem} A, index // Grab A[B[i]]
new_val = vadd old_val, +1.0 // Compute new values
vscatter A {curr_elem}, index, new_val // Update A[B[i]]
pending_elem = pending_elem ^ curr_elem // remove done idx
} while (pending_elem)

```

This not even the fastest version: see backup for details

\section*{AVX-512 ERI \& AVX-512 PRI: Xeon Phi Only}

\section*{Xeon Phi Only Instructions}

\section*{Set of segment-specific instruction extensions}
- First appear on KNL
- Will be supported in all future Xeon Phi processors
- May or may not show up on a later Xeon processor

\section*{Address two HPC customer requests}
- Ability to maximize memory bandwidth
- Hardware prefetching is too restrictive
- Conventional software prefetching results in instructions overhead
- Competitive support for transcendental sequences
- Mostly division and square root
- Differentiating factor in HPC/TPT

\section*{AVX-512 ERI \& PRI Description}
\begin{tabular}{|c|c|c|}
\hline CPUID & Instructions & Description \\
\hline \multirow{3}{*}{\[
\begin{aligned}
& \bar{\alpha} \\
& \underset{\sim}{N} \\
& \underset{\sim}{\grave{N}} \\
& \underset{~}{㐅}
\end{aligned}
\]} & PREFETCHWT1 & Prefetch cache line into the L2 cache with intent to write (RFO ring request) \\
\hline & VGATHERPF\{D,Q\}\{0,1\}PS & Prefetch vector of \(D / Q w o r d\) indexes into the L1/L2 cache \\
\hline & VSCATTERPF\{D,Q\}\{0,1\}PS & Prefetch vector of D/Qword indexes into the L1/L2 cache with intent to write \\
\hline \multirow[b]{3}{*}{\[
\begin{aligned}
& \bar{\sim} \\
& \stackrel{\sim}{\sim} \\
& \stackrel{N}{N} \\
& \underset{~}{\gtrless}
\end{aligned}
\]} & VEXP2\{PS,PD\} & Computes approximation of \(2^{\mathrm{x}}\) with maximum relative error of \(2^{-23}\) \\
\hline & VRCP28\{PS, PD \} & Computes approximation of reciprocal with max relative error of \(2^{-28}\) \\
\hline & VRSQRT28\{PS,PD\} & Computes approximation of reciprocal square root with max relative error of \(2^{-28}\) \\
\hline
\end{tabular}

\section*{AVX-512 ERI \& PRI Motivation}
\begin{tabular}{|c|c|c|}
\hline CPUID & Instructions & Motivation \\
\hline \multirow{3}{*}{} & PREFETCHWT1 & Reduce ring traffic in core-to-core data communication \\
\hline & VGATHERPF\{D,Q\}\{0,1\}PS & \multirow[t]{2}{*}{Reduce overhead of software prefetching: dedicate side engine to prefetch sparse structures while devoting the main CPU to pure raw flops} \\
\hline & VSCATTERPF\{D,Q\}\{0,1\}PS & \\
\hline \multirow{3}{*}{} & VEXP2\{PS,PD\} & Speed-up key FSI workloads: Black-Scholes, Montecarlo \\
\hline & VRCP28\{PS, PD \} & \multirow[t]{2}{*}{\begin{tabular}{l}
Key building block to speed up most transcendental sequences (in particular, division and square root): \\
Increasing precision from 14=>28 allows to reduce one complete Newton-Raphson iteration
\end{tabular}} \\
\hline & VRSQRT28\{PS,PD\} & \\
\hline
\end{tabular}

\section*{define_subst_attr}

(define_subst_attr "mask_name" "mask" "" "_mask")


\section*{Summary}

AVX512: new 512-bit vector ISA extension
- Common between Xeon (SKL) and Xeon Phi (KNL)

AVX512VL, AVX512DQ, AVX512BW: complements AVX512
- Shows up first on Skylake Xeon
- Provides support for all data types and vector lengths

Conflict detection new instructions
- Improves autovectorization
- Common to Xeon and Xeon Phi

TVX new instructions
- 28-bit transcendentals and new prefetch instructions
- On Xeon Phi only

\section*{Authors}
- Mark Charney
- Jesus Corbal
- Roger Espasa
- Milind Girkar
- Moustapha Ould-ahmed-vall
- Ilya Tocar
- Bret Toll
- Bob Valentine
- Ilya Verbin
- Kirill Yukhin

\title{
Legal Disclaimer \& Optimization Notice
}

> INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

\section*{Optimization Notice}

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision \#20110804

\section*{(intel)}```

