

# New Instruction Set Extensions

Instruction Set Innovation in Intels Processor Code Named Haswell

bob.valentine@intel.com

## Agenda

- Introduction Overview of ISA Extensions
- Haswell New Instructions
  - New Instructions Overview
  - Intel® AVX2 (256-bit Integer Vectors)
  - Gather
  - FMA: Fused Multiply-Add
  - Bit Manipulation Instructions
  - TSX/HLE/RTM
- Tools Support for New Instruction Set Extensions
- Summary/References



### Instruction Set Architecture (ISA) Extensions

| 199x | MMX, CMOV,<br>PAUSE,<br>XCHG, | Multiple new instruction sets added to the initial 32bit instruction set of the Intel® 386 processor |
|------|-------------------------------|------------------------------------------------------------------------------------------------------|
| 1999 | Intel® SSE                    | 70 new instructions for 128-bit single-precision FP support                                          |
| 2001 | Intel® SSE2                   | 144 new instructions adding 128-bit integer and double-precision FP support                          |
| 2004 | Intel® SSE3                   | 13 new 128-bit DSP-oriented math instructions and thread synchronization instructions                |
| 2006 | Intel SSSE3                   | 16 new 128-bit instructions including fixed-point multiply and horizontal instructions               |
| 2007 | Intel® SSE4.1                 | 47 new instructions improving media, imaging and 3D workloads                                        |
| 2008 | Intel® SSE4.2                 | 7 new instructions improving text processing and CRC                                                 |
| 2010 | Intel® AES-NI                 | 7 new instructions to speedup AES                                                                    |
| 2011 | Intel® AVX                    | 256-bit FP support, non-destructive (3-operand)                                                      |
| 2012 | Ivy Bridge NI                 | RNG, 16 Bit FP                                                                                       |
| 2013 | Haswell NI                    | AVX2, TSX, FMA, Gather, Bit NI                                                                       |

## A long history of ISA Extensions!



#### Instruction Set Architecture (ISA) Extensions

- Why new instructions?
  - Higher absolute performance
  - More energy efficient performance
  - New application domains
  - Customer requests
  - Fill gaps left from earlier extensions
- For a historical overview see
   <a href="http://en.wikipedia.org/wiki/X86\_instruction\_listings">http://en.wikipedia.org/wiki/X86\_instruction\_listings</a>



### Intel Tick-Tock Model



Tick-tock delivers leadership through technology innovation on a reliable and predictable timeline



## New Instructions in Haswell

| Group                             |                                                         | Description                                                                                      | Count *      |
|-----------------------------------|---------------------------------------------------------|--------------------------------------------------------------------------------------------------|--------------|
| <b>2</b>                          | SIMD Integer<br>Instructions<br>promoted to 256<br>bits | Adding vector integer operations to 256-bit                                                      |              |
| AVX2                              | Gather                                                  | Load elements from vector of indices vectorization enabler                                       | 170 /<br>124 |
|                                   | Shuffling / Data<br>Rearrangement                       | Blend, element shift and permute instructions                                                    |              |
| FMA                               |                                                         | Fused Multiply-Add operation forms (FMA-3)                                                       | 96 / 60      |
| Bit Manipulation and Cryptography |                                                         | Improving performance of bit stream manipulation and decode, large integer arithmetic and hashes | 15 / 15      |
| TSX=RTM+HLE                       |                                                         | M+HLE Transactional Memory                                                                       |              |
| Others                            |                                                         | Others MOVBE: Load and Store of Big Endian forms INVPCID: Invalidate processor context ID        |              |

<sup>\*</sup> Total instructions / different mnemonics



## Intel® AVX2: 256-bit Integer Vector

Extends Intel® AVX to cover integer operations

Uses same AVX (256-bit) register set

Nearly all 128-bit integer vector instructions are 'promoted' to 256

Including Intel® SSE2, Intel® SSSE3, Intel® SSE4

Exceptions: GPR moves (MOVD/Q); Insert and Extracts <32b, Specials (STTNI instructions, AES, PCLMULQDQ)

New 256b Integer vector operations (not present in Intel® SSE)

- Cross-lane element permutes
- Element variable shifts
- Gather
- •Haswell implementation doubles the cache bandwidth
- Two 256-bit loads per cycle, fill rate and split line improvements
- Helps both Intel® AVX2 and legacy Intel® AVX performance

Intel® AVX2 'completes' the 256-bit extensions started with Intel® AVX



## Integer Instructions Promoted to 256

| VMOVNTDQA | VPHADDSW        | VPSUBQ           | VPMOVZXDQ         | VPMULHW   |
|-----------|-----------------|------------------|-------------------|-----------|
| VPABSB    | VPHADDW         | VPSUBSB          | VPMOVZXWD         | VPMULLD   |
| VPABSD    | VPHSUBD         | <b>VPSUBSW</b>   | <b>VPMOVZXWQ</b>  | VPMULLW   |
| VPABSW    | <b>VPHSUBSW</b> | VPSUBUSB         | VPSHUFB           | VPMULUDQ  |
| VPADDB    | VPHSUBW         | VPSUBUSW         | VPSHUFD           | VPSADBW   |
| VPADDD    | VPMAXSB         | VPSUBW           | VPSHUFHW          | VPSLLD    |
| VPADDQ    | <b>VPMAXSD</b>  | <b>VPACKSSDW</b> | VPSHUFLW          | VPSLLDQ   |
| VPADDSB   | <b>VPMAXSW</b>  | <b>VPACKSSWB</b> | <b>VPUNPCKHBW</b> | VPSLLQ    |
| VPADDSW   | VPMAXUB         | VPACKUSDW        | VPUNPCKHDQ        | VPSLLW    |
| VPADDUSB  | VPMAXUD         | VPACKUSWB        | VPUNPCKHQDQ       | VPSRAD    |
| VPADDUSW  | <b>VPMAXUW</b>  | <b>VPALIGNR</b>  | <b>VPUNPCKHWD</b> | VPSRAW    |
| VPADDW    | VPMINSB         | <b>VPBLENDVB</b> | VPUNPCKLBW        | VPSRLD    |
| VPAVGB    | VPMINSD         | VPBLENDW         | VPUNPCKLDQ        | VPSRLDQ   |
| VPAVGW    | VPMINSW         | VPMOVSXBD        | VPUNPCKLQDQ       | VPSRLQ    |
| VPCMPEQB  | <b>VPMINUB</b>  | <b>VPMOVSXBQ</b> | <b>VPUNPCKLWD</b> | VPSRLW    |
| VPCMPEQD  | VPMINUD         | <b>VPMOVSXBW</b> | VMPSADBW          | VPAND     |
| VPCMPEQQ  | VPMINUW         | VPMOVSXDQ        | VPCMPGTQ          | VPANDN    |
| VPCMPEQW  | VPSIGNB         | VPMOVSXWD        | VPMADDUBSW        | VPOR      |
| VPCMPGTB  | VPSIGND         | <b>VPMOVSXWQ</b> | <b>VPMADDWD</b>   | VPXOR     |
| VPCMPGTD  | VPSIGNW         | VPMOVZXBD        | VPMULDQ           | VPMOVMSKB |
| VPCMPGTW  | VPSUBB          | VPMOVZXBQ        | VPMULHRSW         |           |
| VPHADDD   | VPSUBD          | VPMOVZXBW        | VPMULHUW          |           |



#### Gather

"Gather" is a fundamental building block for vectorizing indirect memory accesses.

```
int a[] = {1,2,3,4,.....,99,100};
int b[] = {0,4,8,12,16,20,24,28}; // indices

for (i=0; i<n; i++) {
    x[i] = a[b[i]];
}</pre>
```

Haswell introduces set of GATHER instructions to allow automatic vectorization of similar loops with non-adjacent, indirect memory accesses

HSW GATHER not always faster – software can make optimizing assumptions



## Gather Instruction -Sample

VPGATHERDQ ymm1,[xmm9\*8 + eax+22], ymm2

| P    | integer data (no FP data)                                     |
|------|---------------------------------------------------------------|
| D    | indicate index size; here double word (32bit)                 |
| Q    | to indicate data size; here quad word (64 bit) integer        |
| ymm1 | destination register                                          |
| ymm9 | index register; here 4-byte size indices                      |
| 8    | scale factor                                                  |
| eax  | general register containing base address                      |
| 22   | offset added to base address                                  |
| ymm2 | mask register; only most significant bit of each element used |

Gather: Fundamental building block for nonadjacent, indirect memory accesses for either integer or floating point data



## Sample – How it Works





#### Gather - The whole Set

- Destination register either XMM or YMM
  - Mask register matches destination
  - No difference in instruction name
- Load of FP data or int data
  - 4 or 8 byte
- Index size 4 or 8 bytes

This results in 2x2x2x2=16 instructions

| 1 V I      |   |                          |                          |  |  |
|------------|---|--------------------------|--------------------------|--|--|
|            |   | 4                        | 8                        |  |  |
| Size       | 4 | VGATHERDPS<br>VPGATHERDD | VGATHERDPD<br>VPGATHERDQ |  |  |
| Index Size | 8 | VGATHERQPS<br>VPGATHERQD | VGATHERQPD<br>VPGATHERQQ |  |  |

Data Size

- Gather is fully deterministic
  - Same behavior from run to run even in case of exceptions
  - Mask bits are set to 0 after processing
    - Gather is complete when mask is all 0
    - The instruction can be restarted in the middle



## New Data Movement Instructions

| VPERMQ/PD imm                                                                                       |                                                                      |  |
|-----------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|--|
| VPERMD/PS var                                                                                       | Permutes & Blends                                                    |  |
| VPBLENDD imm                                                                                        |                                                                      |  |
| VPSLLVQ & VPSRLVQ<br>Quadword Variable Vector Shift (Left<br>& Right Logical)                       |                                                                      |  |
| VPSLLVD, VPSRLVD, VPSRAVD Doubleword Variable Vector Shift (Left & Right Logical, Right Arithmetic) | Element Based Vector Shifts                                          |  |
| VPBROADCASTB/W/D/Q XMM & mem                                                                        |                                                                      |  |
| VBROADCASTSS/SD XMM                                                                                 | New Broadcasts  Register broadcasts requested by software developers |  |



## Shuffling / Data Rearrangement

- Traditional IA focus was lowest possible latency
- Many very specializes shuffles:
  - Unpacks, Packs, in-lane shuffles
- Shuffle controls were not data driven
- New Strategy
- Any-to-any, data driven shuffles at slightly higher latency



| Element Width | Vector Width | Instruction       | Launch    |
|---------------|--------------|-------------------|-----------|
| BYTE          | 128          | PSHUFB            | SSE4      |
| DWORD         | 256          | VPERMD<br>VPERMPS | AVX2 New! |
| QWORD         | 256          | VPERMQ<br>VPERMPD | AVX2 New! |



#### **VPERMD**

#### Instruction

VPERMD ymm1, ymm2, ymm3/m256

#### **Intrinsics**

b = \_mm256\_permutevar8x32\_epi32(\_\_m256i a, \_\_m256i idx)





#### Variable Bit Shift

- Different control for each element
  - Previous shifts had one control for ALL elements



| Element with | <<      | >>(Logical) | >> (Arithmetic)   |
|--------------|---------|-------------|-------------------|
| DWORD        | VPSLLVD | VPSRLVD     | VPSRAVD           |
| QWORD        | VPSLLVQ | VPSRLVQ     | (not implemented) |

<sup>\*</sup> If the controls are greater than data width, then the destination data element are written with "0".



## FMA: Fused Multiply-Add

Computes  $(a \times b) \pm c$  with only one round

a × b intermediate result is not rounded before add/sub

Can speed up and improve the accuracy of many FP computations, e.g.,

- Matrix multiplication (SGEMM, DGEMM, etc.)
- Dot product
- Polynomial evaluation

Can perform 8 single-precision FMA operations or 4 double-precision FMA operations with 256-bit vectors per FMA unit

- Increases FLOPS capacity over Intel® AVX
- Maximum throughput of two FMA operations per cycle

FMA can provide improved accuracy and performance



## 20 FMA Varieties Supported

| vFMAdd    | a×b + c                                                                            | fused mul-add              | ps, pd, ss,<br>sd |
|-----------|------------------------------------------------------------------------------------|----------------------------|-------------------|
| vFMSub    | a×b - c                                                                            | fused mul-sub              | ps, pd, ss,<br>sd |
| vFNMAdd   | - (a×b + c)                                                                        | fused negative mul-<br>add | ps, pd, ss,<br>sd |
| vFNMSub   | – (a×b – c)                                                                        | fused negative mul-<br>sub | ps, pd, ss,<br>sd |
| vFMAddSub | $a[i] \times b[i] + c[i]$ on odd indices $a[i] \times b[i] - c[i]$ on even indices |                            | ps, pd            |
| vFMSubAdd | $a[i] \times b[i] - c[i]$ on odd indices $a[i] \times b[i] + c[i]$ on even indices |                            | ps, pd            |

Multiple FMA varieties support both data types and eliminate additional negation instructions



## Haswell Bit Manipulation Instructions

15 overall, operate on general purpose registers (GPR):

- Leading and trailing zero bits counts
- Trailing bit manipulations and masks
- Random bit fields extract/pack
- Improved long precision multiplies and rotates

Narrowly focused instruction set arch (ISA) extension

- Partly driven by direct customers' requests
- Allows for additional performance on highly-optimized codes

Beneficial for applications with:

- hot spots doing bit-level operations, universal (de)coding (Golomb, Rice, Elias Gamma codes), bit fields pack/extract
- crypto algorithms using arbitrary precision arithmetic or rotates e.g.: RSA, SHA family of hashes



## Bit Manipulation Instructions

| CPUID bit                                            | Name   | Operation                                                                                                                |  |
|------------------------------------------------------|--------|--------------------------------------------------------------------------------------------------------------------------|--|
| CPUID.EAX=080000001H:<br>ECX. <b>LZCNT</b> [bit 5] * | LZCNT  | Leading Zero Count                                                                                                       |  |
|                                                      | TZCNT  | Trailing Zero Count                                                                                                      |  |
|                                                      | ANDN   | Logical And Not $Z = \sim X \& Y$                                                                                        |  |
| CPUID.(EAX=07H,                                      | BLSR   | Reset Lowest Set Bit                                                                                                     |  |
| ECX=0H):<br>EBX.BMI1[bit 3] *                        | BLSMSK | Get Mask Up to Lowest Set Bit                                                                                            |  |
|                                                      | BLSI   | Isolate Lowest Set Bit                                                                                                   |  |
|                                                      | BEXTR  | Bit Field Extract (can also be done with SHRX + BZHI)                                                                    |  |
|                                                      | BZHI   | Zero High Bits Starting with Specified Position                                                                          |  |
|                                                      | SHLX   | Variable Shifts (non-destructive, have Load+Operation forms, no implicit CL dependency and no flags effect)  Z = X << Y, |  |
|                                                      | SHRX   |                                                                                                                          |  |
| CPUID.(EAX=07H,                                      | SARX   | Z = X >> Y (for signed and unsigned X)                                                                                   |  |
| ECX=0H):<br>EBX.BMI2[bit 8]                          | PDEP   | Parallel Bit Deposit                                                                                                     |  |
|                                                      | PEXT   | Parallel Bit Extract                                                                                                     |  |
|                                                      | RORX   | Rotate Without Affecting Flags                                                                                           |  |
|                                                      | MULX   | Unsigned Multiply Without Affecting Flags                                                                                |  |

Note: Software needs to check for CPUID bits of all groups it uses instructions from



# Intel® Transactional Synchronization Extensions (Intel® TSX)

Intel® TSX = HLE + RTM

HLE (Hardware Lock Elision) is a hint inserted in front of a LOCK operation to indicate a region is a candidate for lock elision

- XACQUIRE (0xF2) and XRELEASE (0xF3) prefixes
- Don't actually acquire lock, but execute region speculatively
- Hardware buffers loads and stores, checkpoints registers
- Hardware attempts to commit atomically without locks
- If cannot do without locks, restart, execute non-speculatively

RTM (Restricted Transactional Memory) is three new instructions (XBEGIN, XEND, XABORT)

- Similar operation as HLE (except no locks, new ISA)
- If cannot commit atomically, go to handler indicated by XBEGIN
- Provides software additional capabilities over HLE



## Typical Lock Use: Thread Safe Hash Table



Focus on data conflicts, not lock contention



#### Lock Contention vs. Data Conflict



Data conflicts truly limit concurrency, not lock contention

Focus on data conflicts, not lock contention



## **HLE Execution**



No serialization if no data conflicts



#### **HLE: Two New Prefixes**

mov eax, 0 mov ebx, 1

F2 lock cmpxchg \$semaphore, ebx

Add a prefix hint to those instructions (lock inc, dec, cmpxchg, etc.) that identify the start of a critical section. Prefix is ignored on non-HLE systems.

. . .

mov eax, \$value1

mov \$value2, eax

mov \$value3, eax

. . .

Speculate

F3 mov \$semaphore, 0

Add a prefix hint to those instructions (e.g., stores) that identify the end of a critical section.



#### RTM: Three New Instructions

| Name               | Operation         |
|--------------------|-------------------|
| XBEGIN < rel16/32> | Transaction Begin |
| XEND               | Transaction End   |
| XABORT arg8        | Transaction Abort |

XBEGIN: Starts a transaction in the pipeline. Causes a checkpoint of register state and all following memory transactions to be buffered. Rel16/32 is the fallback handler. Nested depth support up to 7 XBEGINS. Abort if depth is 8. All abort roll-back is to outermost region.

XEND: Causes all buffered state to be atomically committed to memory. LOCK ordering semantics (even for empty transactions)

XABORT: Causes all buffered state to be discarded and register checkpoint to be recovered. Will jump to the XBEGIN labeled fallback handler. Takes imm8 argument.

There is also an XTEST instruction which can be used both for HLE and RTM to query whether execution takes place in a transaction region



# SW visibility for HLE and RTM

#### HLE

- Execute exact same code path if HLE aborts
- Legacy compatible: transparent to software

#### RTM

- Execute alternative code path if RTM aborts
- Visible to software
  - Provides 'return code' with reason tra

#### XTEST: New instruction to check if in HLE or RTM

- Can be used inside HLE and RTM
- Software can determine if hardware in HLE/RTM execution



## Intel Compilers - Haswell Support

- Compiler options
  - -Qxcore-avx2, -Qaxcore-avx2 (Windows\*)
  - -xcore-avx2 and -axcore-avx2 (Linux)
  - -march=core-avx2 Intel compiler
  - Separate options for FMA:
    - -Qfma, -Qfma- (Windows)
    - -fma, -no-fma (Linux)
- AVX2 integer 256-bit instructions
  - Asm, intrinsics, automatic vectorization
- Gather
  - Asm, intrinsics, automatic vectorization
- BMI Bit manipulation instructions
  - All supported through asm/intrinsics
  - Some through automatic code generation
- INVPCID Invalidate processor context
  - Asm only



## Other Compilers - Haswell Support

- Windows: Microsoft Visual Studio\* 11 will have similar support as Intel compiler
- GNU Compiler (4.7 experimental, 4.8 final) will have similar support as Intel compiler
- In particular switch -march=core-avx2 and same intrinsics



#### References

New Intel® AVX (AVX2) instructions specification

http://software.intel.com/file/36945

Forum on Haswell New Instructions

 http://software.intel.com/en-us/blogs/2011/06/13/haswellnew-instruction-descriptions-now-available/

Article from Oracle engineers on how to use hardwaresupported transactional memory on user level code ( not Intel® TSX specific )

 http://labs.oracle.com/scalable/pubs/HTM-algs-SPAA-2010.pdf



## Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference <a href="https://www.intel.com/software/products">www.intel.com/software/products</a>.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. \*Other names and brands may be claimed as the property of others.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804





#### **VGATHERDD**





#### **VGATHERQQ**





#### **VGATHERQD**





## **All Gather Instructions**

|       |   |     | Data S                                                                                                                                                                                | Size                                                                                                                                                                                   |
|-------|---|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|       |   |     | 4                                                                                                                                                                                     | 8                                                                                                                                                                                      |
| Size  | 4 | 단   | VGATHERDPS  Gather Packed SP FP using signed Dword Indicesm128i _mm_i32gather_ps()m128i _mm_mask_i32gather_ps()m256i _mm256_i32gather_ps()m256i _mm256_mask_i32gather_ps()            | VGATHERDPD  Gather Packed DP FP using signed Dword Indicesm128i _mm_i32gather_pd()m128i _mm_mask_i32gather_pd()m256i _mm256_i32gather_pd()m256i _mm256_mask_i32gather_pd()             |
|       |   | Int | VGATHERDD  Gather Packed Dword using signed Dword Indicesm128i _mm_i32gather_epi32()m128i _mm_mask_i32gather_epi32()m256i _mm256_i32gather_epi32()m256i _mm256_mask_i32gather_epi32() | VPGATHERDQ  Gather Packed Qword using signed Dword Indicesm128i _mm_i32gather_epi64()m128i _mm_mask_i32gather_epi64()m256i _mm256_i32gather_epi64()m256i _mm256_mask_i32gather_epi64() |
| Index | 8 | НЪ  | VGATHERQPS  Gather Packed SP FP using signed Qword Indicesm128i _mm_i64gather_ps()m128i _mm_mask_i64gather_ps()m256i _mm256_i64gather_ps()m256i _mm256_mask_i64gather_ps()            | VGATHERQPD  Gather Packed DP FP using signed Qword Indicesm128i _mm_i64gather_pd()m128i _mm_mask_i64gather_pd()m256i _mm256_i64gather_pd()m256i _mm256_mask_i64gather_pd()             |
|       | 0 | Int | VGATHERQD  Gather Packed Dword using signed Qword Indicesm128i _mm_i64gather_epi32()m128i _mm_mask_i64gather_epi32()m256i _mm256_i64gather_epi32()m256i _mm256_mask_i64gather_epi32() | VPGATHERQQ Gather Packed Qword using signed Qword Indicesm128i _mm_i64gather_epi64()m128i _mm_mask_i64gather_epi64()m256i _mm256_i64gather_epi64()m256i _mm256_mask_i64gather_epi64()  |



## FMA Operand Ordering Convention

3 orders: VFMADD132, VFMADD213, VFMADD231

VFMADDabc: Srca= (Srca× Srcb) + Srcc

- Srca and Srcb are numerically symmetric (interchangeable)
- Srca must be a register & is always the destination
- Srcb must be a register
- Srcc can be memory or register
- All 3 can be used in multiply or add

```
Example: Srca Srcb Srcc

VFMADD231 xmm8, xmm9, mem256

xmm8 = (xmm9 × mem256) + xmm8
```

#### Notes

Combination of 20 varieties with 3 operand orderings result in 60 new mnemonics

FMA operand ordering allows complete flexibility in selection of memory operand and destination



# FMA Latency

Due to out-of-order, FMA latency is not always better than separate multiply and add instructions

- Add latency: 3 cycles
- Multiply and FMA latencies: 5 cycles
- Will likely differ in later CPUs



FMA can improve or reduce performance due to various factors



## **Fused Multiply-Add –All Combinations**

|                                                                                               | Double Precision<br>Packed FP                                                                       | Single Precision<br>Packed FP                                                                       | Double Precision<br>Scalar FP                                                         | Single Precision<br>Scalar FP                                                             |
|-----------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| Fused Multiply-Add<br>A = A x B + C<br>C += A x B                                             | VFMADD132PD<br>VFMADD213PD<br>VFMADD231PD<br>_mm_fmadd_pd()<br>_mm256_fmadd_pd()                    | VFMADD132PS<br>VFMADD213PS<br>VFMADD231PS<br>_mm_fmadd_ps()<br>_mm256_fmadd_ps()                    | VFMADD132SD<br>VFMADD213SD<br>VFMADD231SD<br>_mm_fmadd_sd()<br>_mm256_fmadd_sd()      | VFMADD132SS,<br>FMADD213SS<br>VFMADD231SS<br>_mm_fmadd_ss()<br>_mm256_fmadd_ss()          |
| Fused Multiply-Alternating Add/Subtract A = A x B + C   A = A x B - C C += A x B   C -= A x B | VFMADDSUB132PD<br>VFMADDSUB213PD<br>VFMADDSUB231PD<br>_mm_fmaddsub_pd()<br>_mm256_fmaddsub_pd(<br>) | VFMADDSUB132PS<br>VFMADDSUB213PS<br>VFMADDSUB231PS<br>_mm_fmaddsub_ps()<br>_mm256_fmaddsub_p<br>s() |                                                                                       |                                                                                           |
| Fused Multiply-Alternating Subtract/Add A = A x B - C   A = A x B + C C -= A x B   C += A x B | VFMSUBADD132PD<br>VFMSUBADD213PD<br>VFMSUBADD231PD<br>_mm_fmsubadd_pd()<br>_mm256_fmsubadd_pd(<br>) | VFMSUBADD132PS<br>VFMSUBADD213PS<br>VFMSUBADD231PS<br>_mm_fmsubadd_pd()<br>_mm256_fmsubadd_p<br>d() |                                                                                       |                                                                                           |
| Fused Multiply-Subtract A = A x B - C C -= A x B                                              | VFMSUB132PD<br>VFMSUB213PD<br>VFMSUB231PD<br>_mm_fmsub_pd()<br>_mm256_fmsub_pd()                    | VFMSUB132PS<br>VFMSUB213PS<br>VFMSUB231PS<br>_mm_fmsub_ps()<br>_mm256_fmsub_ps()                    | VFMSUB132SD<br>VFMSUB213SD<br>VFMSUB231SD<br>_mm_fmsub_sd()<br>_mm256_fmsub_sd()      | VFMSUB132SS<br>VFMSUB213SS<br>VFMSUB231SS<br>_mm_fmsub_ss()<br>_mm256_fmsub_ss()          |
| Fused Negative Multiply-Add<br>A = -A x B + C<br>C += -A x B                                  | VFNMADD132PD<br>VFNMADD213PD<br>VFNMADD231PD<br>_mm_fnmadd_pd()<br>_mm256_fnmadd_pd()               | VFNMADD132PS<br>VFNMADD213PS<br>VFNMADD231PS<br>_mm_fnmadd_ps()<br>_mm256_fnmadd_ps()               | VFNMADD132SD<br>VFNMADD213SD<br>VFNMADD231SD<br>_mm_fnmadd_sd()<br>_mm256_fnmadd_sd() | VFNMADD132SS<br>VFNMADD213SS<br>VFNMADD231SS<br>_mm_fnmadd_ss()<br>_mm256_fnmadd_ss<br>() |
| Fused Negative Multiply-Subtract A = -A x B - C C -= -A x B                                   | VFNMSUB132PD<br>VFNMSUB213PD<br>VFNMSUB231PD<br>_mm_fnmsub_pd()<br>_mm256_fnmsub_pd()               | VFNMSUB132PS<br>VFNMSUB213PS<br>VFNMSUB231PS<br>_mm_fnmsub_ps()<br>_mm256_fnmsub_ps()               | VFNMSUB132SD<br>VFNMSUB213SD<br>VFNMSUB231SD<br>_mm_fnmsub_sd()<br>_mm256_fnmsub_sd() | VFNMSUB132SS VFNMSUB213SS VFNMSUB231SS _mm_fnmsub_ss() _mm256_fnmsub_ss ()                |

