From WikiChip
Difference between revisions of "flops"

(a note on SMP and MP)
(Minor Edit)
 
(26 intermediate revisions by 7 users not shown)
Line 1: Line 1:
 
{{title|Floating-Point Operations Per Second (FLOPS)}}
 
{{title|Floating-Point Operations Per Second (FLOPS)}}
'''Floating-point operations per second''' ('''FLOPS''') is a microprocessor performance unit used to quantify the number of [[floating-point]] [[floating-point operations|operations]] a [[physical core|core]], machine, or system is capable of in a one second.
+
'''Floating-point operations per second''' ('''FLOPS''') is a measure of [[compute performance]] used to quantify the number of [[floating-point]] [[floating-point operations|operations]] a [[physical core|core]], machine, or system is capable of in a one second.
  
 
== Overview ==
 
== Overview ==
Line 8: Line 8:
  
 
With the advent of [[multi-socket]] and [[multi-core]] architectures, additional levels of explicit parallelism have been introduced resulting in the following modified equation:
 
With the advent of [[multi-socket]] and [[multi-core]] architectures, additional levels of explicit parallelism have been introduced resulting in the following modified equation:
 +
 +
:<math>\text{FLOPS}_\text{node} = \frac{\text{FLOPs}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}}</math>
 +
 +
and,
  
 
:<math>\text{FLOPS}_\text{system} = \frac{\text{FLOPs}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}} \times \frac{\text{nodes}}{\text{system}}</math>
 
:<math>\text{FLOPS}_\text{system} = \frac{\text{FLOPs}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}} \times \frac{\text{nodes}}{\text{system}}</math>
 +
 +
Modern microprocessors exploit [[data parallelism]] further through the introduction of various vector extensions such as [[x86]]'s {{x86|AVX}} and [[ARM]]'s {{arm|SVE}}. With those extensions, it's possible to perform multiple floating-point operations within a single instruction. For example, a typical [[fused multiply-accumulate]] (FMAC) operation can perform two floating-point operations at once. For a single core, this can be expressed as
 +
 +
:<math>\text{FLOPS}_\text{core} = \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{FLOPs}}{\text{operation}} \times \frac{\text{cycles}}{\text{second}}</math>
 +
 +
And for a full system, this can be extended to:
 +
 +
:<math>\text{FLOPS}_\text{system} = \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{FLOPs}}{\text{operation}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}} \times \frac{\text{nodes}}{\text{system}}</math>
 +
 +
=== Nomenclature ===
 +
* KiloFLOPS / KFLOPS: 10<sup>3</sup> FLOPS
 +
* MegaFLOPS / MFLOPS: 10<sup>6</sup> FLOPS
 +
* GigaFLOPS / GFLOPS: 10<sup>9</sup> FLOPS
 +
* TeraFLOPS / TFLOPS: 10<sup>12</sup> FLOPS
 +
* PetaFLOPS / PFLOPS: 10<sup>15</sup> FLOPS
 +
* ExaFLOPS / EFLOPS: 10<sup>18</sup> FLOPS
 +
* ZettaFLOPS / ZFLOPS: 10<sup>21</sup> FLOPS
 +
* YottaFLOPS / YFLOPS: 10<sup>24</sup> FLOPS
 +
 +
== FLOPs by microarchitecture ==
 +
=== x86 ===
 +
{| class="wikitable"
 +
|-
 +
! Microarchitecture !! colspan="3" | FLOPs !! ISA
 +
|-
 +
! colspan="5" | [[Intel]] Microarchitectures
 +
|-
 +
| rowspan="3" | {{intel|Core|l=arch}}<br>{{intel|Penryn|l=arch}}<br>{{intel|Nehalem|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit Multiplication + 1 × 128-bit Addition || rowspan="3" | {{x86|SSE}} (128-bit)
 +
|-
 +
| '''DP''' || 4 FLOPs/cycle || 2 FLOPs + 2 FLOPs
 +
|-
 +
| '''SP''' || 8 FLOPs/cycle || 4 FLOPs + 4 FLOPs
 +
|-
 +
| rowspan="3" | {{intel|Sandy Bridge|l=arch}}<br>{{intel|Ivy Bridge|l=arch}} || '''EUs''' || colspan="2" | 1 × 256-bit Multiplication + 1 × 256-bit Addition || rowspan="3" | {{x86|AVX}} (256-bit)
 +
|-
 +
| '''DP''' || 8 FLOPs/cycle || 4 FLOPs + 4 FLOPs
 +
|-
 +
| '''SP''' || 16 FLOPs/cycle || 8 FLOPs + 8 FLOPs
 +
|-
 +
| rowspan="3" | {{intel|Haswell|l=arch}}<br>{{intel|Broadwell|l=arch}}<br>{{intel|Skylake|l=arch}}<br>{{intel|Kaby Lake|l=arch}}<br>{{intel|Amber Lake|l=arch}}<br>{{intel|Coffee Lake|l=arch}}<br>{{intel|Whiskey Lake|l=arch}} || '''EUs''' || colspan="2" | 2 × 256-bit FMA || rowspan="3" | {{x86|AVX2}} & FMA (256-bit)
 +
|-
 +
| '''DP''' || 16 FLOPs/cycle || 2 × 8 FLOPs
 +
|-
 +
| '''SP''' || 32 FLOPs/cycle || 2 × 16 FLOPs
 +
|-
 +
| rowspan="3" | {{intel|Skylake (server)|l=arch}} || '''EUs''' || colspan="2" | 2 × 512-bit FMA (varies by SKU) || rowspan="3" | {{x86|AVX-512}} & FMA (512-bit)
 +
|-
 +
| '''DP''' || 32 FLOPs/cycle || 2 × 16 FLOPs
 +
|-
 +
| '''SP''' || 64 FLOPs/cycle || 2 × 32 FLOPs
 +
|-
 +
| rowspan="3" | {{intel|Rocket Lake|l=arch}}<br>{{intel|Ice Lake|l=arch}}<br>{{intel|Tiger Lake|l=arch}} || '''EUs''' || colspan="2" | 2 × 512-bit FMA || rowspan="3" | {{x86|AVX-512}} & FMA (512-bit)
 +
|-
 +
| '''DP''' || 32 FLOPs/cycle || 2 × 16 FLOPs
 +
|-
 +
| '''SP''' || 64 FLOPs/cycle || 2 × 32 FLOPs
 +
|-
 +
! colspan="5" | [[Intel]] {{intel|MIC}} Microarchitectures
 +
|-
 +
| rowspan="3" | {{intel|Knights Landing|l=arch}} || '''EUs''' || colspan="2" | 2 × 512-bit FMA || rowspan="3" | {{x86|AVX-512}} & FMA (512-bit)
 +
|-
 +
| '''DP''' || 32 FLOPs/cycle || 2 × 16 FLOPs
 +
|-
 +
| '''SP''' || 64 FLOPs/cycle || 2 × 32 FLOPs
 +
|-
 +
! colspan="5" | [[AMD]] Microarchitectures
 +
|-
 +
| rowspan="3" | {{amd|K10|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit Multiplication + 1 × 128-bit Addition || rowspan="3" | {{x86|SSE}} (128-bit)
 +
|-
 +
| '''DP''' || 4 FLOPs/cycle || 2 FLOPs + 2 FLOPs
 +
|-
 +
| '''SP''' || 8 FLOPs/cycle || 4 FLOPs + 4 FLOPs
 +
|-
 +
| rowspan="3" | {{amd|Bulldozer|l=arch}}<br>{{amd|Piledriver|l=arch}}<br>{{amd|Steamroller|l=arch}}<br>{{amd|Excavator|l=arch}} || '''EUs''' || colspan="2" | 2 × 128-bit FMA (per two cores) || rowspan="3" | {{x86|AVX}} & {{x86|FMA4|FMA}} (128-bit)
 +
|-
 +
| '''DP''' || 8 FLOPs/cycle || 2 x 4 FLOPs
 +
|-
 +
| '''SP''' || 16 FLOPs/cycle || 2 x 8 FLOPs
 +
|-
 +
| rowspan="3" | {{amd|Zen|l=arch}}<br>{{amd|Zen+|l=arch}} || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | {{x86|AVX2}} & FMA (256-bit)
 +
|-
 +
| '''DP''' || 8 FLOPs/cycle || 2 x 4 FLOPs
 +
|-
 +
| '''SP''' || 16 FLOPs/cycle || 2 x 8 FLOPs
 +
|-
 +
| rowspan="3" | {{amd|Zen 2|l=arch}}<br>{{amd|Zen 3|l=arch}} || '''EUs''' || colspan="2" | 2 × 256-bit FMA || rowspan="3" | {{x86|AVX2}} & FMA (256-bit)
 +
|-
 +
| '''DP''' || 16 FLOPs/cycle || 2 x 8 FLOPs
 +
|-
 +
| '''SP''' || 32 FLOPs/cycle || 2 x 16 FLOPs
 +
|-
 +
! colspan="5" | [[Centaur]] Microarchitectures
 +
|-
 +
| rowspan="3" | {{centtech|CHA|l=arch}} || '''EUs''' || colspan="2" | 2 × 256-bit FMA || rowspan="3" | {{x86|AVX-512}} & FMA (512-bit)
 +
|-
 +
| '''DP''' || 16 FLOPs/cycle || 2 x 8 FLOPs
 +
|-
 +
| '''SP''' || 32 FLOPs/cycle || 2 x 16 FLOPs
 +
|}
 +
 +
=== ARM ===
 +
{| class="wikitable"
 +
|-
 +
! Microarchitecture !! colspan="3" | FLOPs !! ISA
 +
|-
 +
! colspan="5" | [[ARM]] Microarchitectures
 +
|-
 +
| rowspan="3" | {{armh|Cortex-A57|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit)
 +
|-
 +
| '''DP''' || 4 FLOPs/cycle || 4 FLOPs
 +
|-
 +
| '''SP''' || 8 FLOPs/cycle || 8 FLOPs
 +
|-
 +
| rowspan="3" | {{armh|Cortex-A76|l=arch}}<br>{{armh|Cortex-A77|l=arch}}<br>{{armh|Cortex-A78|l=arch}}<br>{{armh|Neoverse N1|l=arch}} || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit)
 +
|-
 +
| '''DP''' || 8 FLOPs/cycle || 2 x 4 FLOPs
 +
|-
 +
| '''SP''' || 16 FLOPs/cycle || 2 x 8 FLOPs
 +
|-
 +
| rowspan="3" | {{armh|Neoverse N2|l=arch}} || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | {{arm|ARMv9}} {{arm|SVE2}} (128-bit)
 +
|-
 +
| '''DP''' || 8 FLOPs/cycle || 2 x 4 FLOPs
 +
|-
 +
| '''SP''' || 16 FLOPs/cycle || 2 x 8 FLOPs
 +
|-
 +
| rowspan="3" | {{armh|Neoverse V1|l=arch}} || '''EUs''' || colspan="2" | 2 × 256-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|SVE}} (256-bit)
 +
|-
 +
| '''DP''' || 16 FLOPs/cycle || 2 x 8 FLOPs
 +
|-
 +
| '''SP''' || 32 FLOPs/cycle || 2 x 16 FLOPs
 +
|-
 +
| rowspan="3" | {{armh|Cortex-A510|l=arch}} || '''EUs''' || colspan="2" | 1-2 × 128-bit FMA || rowspan="3" | {{arm|ARMv9}} {{arm|SVE2}} (128-bit)
 +
|-
 +
| '''DP''' || 2-4 FLOPs/cycle || 2-4 FLOPs
 +
|-
 +
| '''SP''' || 4-8 FLOPs/cycle || 4-8 FLOPs
 +
|-
 +
! colspan="5" | [[AppliedMicro]]/[[Ampere Computing]] Microarchitectures
 +
|-
 +
| rowspan="3" | {{apm|Storm|l=arch}}<br>{{apm|Shadowcat|l=arch}}<br>{{apm|Skylark|l=arch}} || '''EUs''' || colspan="2" | 1 × 64-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit)
 +
|-
 +
| '''DP''' || 2 FLOPs/cycle || 2 FLOPs
 +
|-
 +
| '''SP''' || 4 FLOPs/cycle || 4 FLOPs
 +
|-
 +
! colspan="5" | [[Cavium]] Microarchitectures
 +
|-
 +
| rowspan="3" | {{cavium|Vulcan|l=arch}} || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit)
 +
|-
 +
| '''DP''' || 8 FLOPs/cycle || 2 x 4 FLOPs
 +
|-
 +
| '''SP''' || 16 FLOPs/cycle || 2 x 8 FLOPs
 +
|-
 +
! colspan="5" | [[Samsung]] Microarchitectures
 +
|-
 +
| rowspan="3" | {{samsung|M1|l=arch}}<br>{{samsung|M2|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit FMA + 1 × 128-bit Addition || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit)
 +
|-
 +
| '''DP''' || 6 FLOPs/cycle || 1 x 4 FLOPs + 1 x 2 FLOPs
 +
|-
 +
| '''SP''' || 12 FLOPs/cycle || 1 x 8 FLOPs + 1 x 4 FLOPs
 +
|-
 +
| rowspan="3" | {{samsung|M3|l=arch}} || '''EUs''' || colspan="2" | 3 × 128-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit)
 +
|-
 +
| '''DP''' || 12 FLOPs/cycle || 3 x 4 FLOPs
 +
|-
 +
| '''SP''' || 24 FLOPs/cycle || 3 x 8 FLOPs
 +
|-
 +
! colspan="5" | [[Phytium]] Microarchitectures
 +
|-
 +
| rowspan="3" | {{phytium|Xiaomi|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit)
 +
|-
 +
| '''DP''' || 4 FLOPs/cycle || 1 x 4 FLOPs
 +
|-
 +
| '''SP''' || 8 FLOPs/cycle || 1 x 8 FLOPs
 +
|-
 +
! colspan="5" | [[HiSilicon]] Microarchitectures
 +
|-
 +
| rowspan="3" | {{hisilicon|TaiShan v110|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit)
 +
|-
 +
| '''DP''' || 4 FLOPs/cycle || 1 x 4 FLOPs
 +
|-
 +
| '''SP''' || 8 FLOPs/cycle || 1 x 8 FLOPs
 +
|}
 +
 +
== See also ==
 +
* [[bytes per FLOP]]
 +
* [[floating point]]
 +
* [[floating point operation]]
 +
* [[operations per second]] (OPS)
 +
 +
[[category:floating point]]
 +
[[Category:computer performance]]

Latest revision as of 15:03, 25 January 2023

Floating-point operations per second (FLOPS) is a measure of compute performance used to quantify the number of floating-point operations a core, machine, or system is capable of in one second.

Overview

FLOPS is a measure used to compare the peak theoretical floating-point performance of a core, microprocessor, or system. The unit is widely used in high-performance computing (e.g., supercomputers) to evaluate the peak theoretical performance of systems running scientific workloads. Traditionally, the FLOPS of a microprocessor could be calculated using the following equation:

:<math>\text{FLOPS}_\text{core} = \frac{\text{FLOPs}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}}</math>

With the advent of multi-socket and multi-core architectures, additional levels of explicit parallelism have been introduced resulting in the following modified equation:

:<math>\text{FLOPS}_\text{node} = \frac{\text{FLOPs}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}}</math>

and,

:<math>\text{FLOPS}_\text{system} = \frac{\text{FLOPs}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}} \times \frac{\text{nodes}}{\text{system}}</math>
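
As a rough illustration, the node and system equations can be multiplied out directly. The sketch below is only that: the function names and the example machine (two nodes of 64 cores each, 2.0 GHz, 16 FLOPs/cycle per core) are hypothetical values chosen for the arithmetic, not figures taken from this article.

```python
def node_flops(flops_per_cycle: float, clock_hz: float, cores_per_node: int) -> float:
    """Peak theoretical FLOPS of one node:
    FLOPs/cycle x cycles/second x cores/node."""
    return flops_per_cycle * clock_hz * cores_per_node


def system_flops(flops_per_cycle: float, clock_hz: float,
                 cores_per_node: int, nodes: int) -> float:
    """Peak theoretical FLOPS of a system: the node figure x nodes/system."""
    return node_flops(flops_per_cycle, clock_hz, cores_per_node) * nodes


# Hypothetical machine (assumed values, for illustration only).
FLOPS_PER_CYCLE = 16      # per core
CLOCK_HZ = 2.0e9          # 2.0 GHz
CORES_PER_NODE = 64
NODES = 2

node_peak = node_flops(FLOPS_PER_CYCLE, CLOCK_HZ, CORES_PER_NODE)
system_peak = system_flops(FLOPS_PER_CYCLE, CLOCK_HZ, CORES_PER_NODE, NODES)

print(f"node:   {node_peak / 1e12:.3f} TFLOPS")
print(f"system: {system_peak / 1e12:.3f} TFLOPS")
```

These are peak theoretical figures; a real workload would normally sustain less.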

Modern microprocessors exploit data parallelism further through the introduction of various vector extensions such as x86's AVX and ARM's SVE. With those extensions, it is possible to perform multiple floating-point operations within a single instruction. For example, a typical fused multiply-accumulate (FMAC) operation can perform two floating-point operations at once. For a single core, this can be expressed as:

:<math>\text{FLOPS}_\text{core} = \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{FLOPs}}{\text{operation}} \times \frac{\text{cycles}}{\text{second}}</math>

And for a full system, this can be extended to:

:<math>\text{FLOPS}_\text{system} = \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{FLOPs}}{\text{operation}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}} \times \frac{\text{nodes}}{\text{system}}</math>
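
To make the vector terms concrete, here is a minimal sketch of the per-core equation for a core with two 256-bit FMA units working on double-precision (64-bit) values. The function name and the 3.0 GHz clock are assumptions for illustration; the mapping of "operations per instruction" to vector lanes is one plausible reading of the equation, not something the article spells out.

```python
def core_flops(instructions_per_cycle: int, operations_per_instruction: int,
               flops_per_operation: int, clock_hz: float) -> float:
    """Peak theoretical FLOPS of one core:
    instructions/cycle x operations/instruction x FLOPs/operation x cycles/second."""
    return (instructions_per_cycle * operations_per_instruction
            * flops_per_operation * clock_hz)


# Assumed example: two 256-bit FMA units, double precision.
VECTOR_BITS = 256
ELEMENT_BITS = 64                      # double precision
lanes = VECTOR_BITS // ELEMENT_BITS    # operations per instruction (4)
FMA_FLOPS = 2                          # one multiply + one add per lane
CLOCK_HZ = 3.0e9                       # hypothetical 3.0 GHz clock

peak = core_flops(2, lanes, FMA_FLOPS, CLOCK_HZ)
print(f"{2 * lanes * FMA_FLOPS} FLOPs/cycle, {peak / 1e9:.1f} GFLOPS per core")
```

Multiplying the result by cores per node and nodes per system gives the full-system figure.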

Nomenclature

  • KiloFLOPS / KFLOPS: 10<sup>3</sup> FLOPS
  • MegaFLOPS / MFLOPS: 10<sup>6</sup> FLOPS
  • GigaFLOPS / GFLOPS: 10<sup>9</sup> FLOPS
  • TeraFLOPS / TFLOPS: 10<sup>12</sup> FLOPS
  • PetaFLOPS / PFLOPS: 10<sup>15</sup> FLOPS
  • ExaFLOPS / EFLOPS: 10<sup>18</sup> FLOPS
  • ZettaFLOPS / ZFLOPS: 10<sup>21</sup> FLOPS
  • YottaFLOPS / YFLOPS: 10<sup>24</sup> FLOPS
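
The prefixes above lend themselves to a small formatting helper. The following is a sketch under assumed names (`PREFIXES`, `format_flops`) and an arbitrary choice of two decimal places; it simply picks the largest prefix that does not exceed the value.

```python
# (scale, unit) pairs from the nomenclature list, largest first.
PREFIXES = [
    (1e24, "YFLOPS"), (1e21, "ZFLOPS"), (1e18, "EFLOPS"), (1e15, "PFLOPS"),
    (1e12, "TFLOPS"), (1e9, "GFLOPS"), (1e6, "MFLOPS"), (1e3, "KFLOPS"),
]


def format_flops(value: float) -> str:
    """Render a raw FLOPS figure with the largest fitting decimal prefix."""
    for scale, unit in PREFIXES:
        if value >= scale:
            return f"{value / scale:.2f} {unit}"
    return f"{value:.2f} FLOPS"


print(format_flops(4.8e10))
print(format_flops(2.048e12))
```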

FLOPs by microarchitecture

x86

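{| class="wikitable"
|-
! Microarchitecture !! colspan="3" | FLOPs !! ISA
|-
! colspan="5" | Intel Microarchitectures
|-
| rowspan="3" | Core<br>Penryn<br>Nehalem || '''EUs''' || colspan="2" | 1 × 128-bit Multiplication + 1 × 128-bit Addition || rowspan="3" | SSE (128-bit)
|-
| '''DP''' || 4 FLOPs/cycle || 2 FLOPs + 2 FLOPs
|-
| '''SP''' || 8 FLOPs/cycle || 4 FLOPs + 4 FLOPs
|-
| rowspan="3" | Sandy Bridge<br>Ivy Bridge || '''EUs''' || colspan="2" | 1 × 256-bit Multiplication + 1 × 256-bit Addition || rowspan="3" | AVX (256-bit)
|-
| '''DP''' || 8 FLOPs/cycle || 4 FLOPs + 4 FLOPs
|-
| '''SP''' || 16 FLOPs/cycle || 8 FLOPs + 8 FLOPs
|-
| rowspan="3" | Haswell<br>Broadwell<br>Skylake<br>Kaby Lake<br>Amber Lake<br>Coffee Lake<br>Whiskey Lake || '''EUs''' || colspan="2" | 2 × 256-bit FMA || rowspan="3" | AVX2 & FMA (256-bit)
|-
| '''DP''' || 16 FLOPs/cycle || 2 × 8 FLOPs
|-
| '''SP''' || 32 FLOPs/cycle || 2 × 16 FLOPs
|-
| rowspan="3" | Skylake (server) || '''EUs''' || colspan="2" | 2 × 512-bit FMA (varies by SKU) || rowspan="3" | AVX-512 & FMA (512-bit)
|-
| '''DP''' || 32 FLOPs/cycle || 2 × 16 FLOPs
|-
| '''SP''' || 64 FLOPs/cycle || 2 × 32 FLOPs
|-
| rowspan="3" | Rocket Lake<br>Ice Lake<br>Tiger Lake || '''EUs''' || colspan="2" | 2 × 512-bit FMA || rowspan="3" | AVX-512 & FMA (512-bit)
|-
| '''DP''' || 32 FLOPs/cycle || 2 × 16 FLOPs
|-
| '''SP''' || 64 FLOPs/cycle || 2 × 32 FLOPs
|-
! colspan="5" | Intel MIC Microarchitectures
|-
| rowspan="3" | Knights Landing || '''EUs''' || colspan="2" | 2 × 512-bit FMA || rowspan="3" | AVX-512 & FMA (512-bit)
|-
| '''DP''' || 32 FLOPs/cycle || 2 × 16 FLOPs
|-
| '''SP''' || 64 FLOPs/cycle || 2 × 32 FLOPs
|-
! colspan="5" | AMD Microarchitectures
|-
| rowspan="3" | K10 || '''EUs''' || colspan="2" | 1 × 128-bit Multiplication + 1 × 128-bit Addition || rowspan="3" | SSE (128-bit)
|-
| '''DP''' || 4 FLOPs/cycle || 2 FLOPs + 2 FLOPs
|-
| '''SP''' || 8 FLOPs/cycle || 4 FLOPs + 4 FLOPs
|-
| rowspan="3" | Bulldozer<br>Piledriver<br>Steamroller<br>Excavator || '''EUs''' || colspan="2" | 2 × 128-bit FMA (per two cores) || rowspan="3" | AVX & FMA (128-bit)
|-
| '''DP''' || 8 FLOPs/cycle || 2 × 4 FLOPs
|-
| '''SP''' || 16 FLOPs/cycle || 2 × 8 FLOPs
|-
| rowspan="3" | Zen<br>Zen+ || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | AVX2 & FMA (256-bit)
|-
| '''DP''' || 8 FLOPs/cycle || 2 × 4 FLOPs
|-
| '''SP''' || 16 FLOPs/cycle || 2 × 8 FLOPs
|-
| rowspan="3" | Zen 2<br>Zen 3 || '''EUs''' || colspan="2" | 2 × 256-bit FMA || rowspan="3" | AVX2 & FMA (256-bit)
|-
| '''DP''' || 16 FLOPs/cycle || 2 × 8 FLOPs
|-
| '''SP''' || 32 FLOPs/cycle || 2 × 16 FLOPs
|-
! colspan="5" | Centaur Microarchitectures
|-
| rowspan="3" | CHA || '''EUs''' || colspan="2" | 2 × 256-bit FMA || rowspan="3" | AVX-512 & FMA (512-bit)
|-
| '''DP''' || 16 FLOPs/cycle || 2 × 8 FLOPs
|-
| '''SP''' || 32 FLOPs/cycle || 2 × 16 FLOPs
|}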
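
The per-cycle figures in the x86 table follow from the execution-unit (EU) descriptions: unit count, times vector lanes, times two when the unit is an FMA. The sketch below reproduces a few rows under that reading; the function name and its parameters are hypothetical, and treating a separate multiplier and adder as two non-FMA units is an assumption of this example.

```python
def flops_per_cycle(units: int, width_bits: int, element_bits: int, fma: bool) -> int:
    """FLOPs/cycle for `units` identical vector units of `width_bits` each.

    An FMA unit counts as 2 FLOPs per lane (multiply + add); a plain
    multiplier or adder counts as 1 FLOP per lane.
    """
    lanes = width_bits // element_bits
    flops_per_lane = 2 if fma else 1
    return units * lanes * flops_per_lane


DP, SP = 64, 32  # element widths in bits

# Sandy Bridge: 1 x 256-bit multiplier + 1 x 256-bit adder (two non-FMA units).
sandy_bridge_dp = flops_per_cycle(2, 256, DP, fma=False)
# Haswell: 2 x 256-bit FMA.
haswell_dp = flops_per_cycle(2, 256, DP, fma=True)
haswell_sp = flops_per_cycle(2, 256, SP, fma=True)
# Skylake (server): 2 x 512-bit FMA.
skylake_server_sp = flops_per_cycle(2, 512, SP, fma=True)

print(sandy_bridge_dp, haswell_dp, haswell_sp, skylake_server_sp)
```

The printed values match the corresponding DP and SP entries in the table.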

ARM

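{| class="wikitable"
|-
! Microarchitecture !! colspan="3" | FLOPs !! ISA
|-
! colspan="5" | ARM Microarchitectures
|-
| rowspan="3" | Cortex-A57 || '''EUs''' || colspan="2" | 1 × 128-bit FMA || rowspan="3" | ARMv8 NEON (128-bit)
|-
| '''DP''' || 4 FLOPs/cycle || 4 FLOPs
|-
| '''SP''' || 8 FLOPs/cycle || 8 FLOPs
|-
| rowspan="3" | Cortex-A76<br>Cortex-A77<br>Cortex-A78<br>Neoverse N1 || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | ARMv8 NEON (128-bit)
|-
| '''DP''' || 8 FLOPs/cycle || 2 × 4 FLOPs
|-
| '''SP''' || 16 FLOPs/cycle || 2 × 8 FLOPs
|-
| rowspan="3" | Neoverse N2 || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | ARMv9 SVE2 (128-bit)
|-
| '''DP''' || 8 FLOPs/cycle || 2 × 4 FLOPs
|-
| '''SP''' || 16 FLOPs/cycle || 2 × 8 FLOPs
|-
| rowspan="3" | Neoverse V1 || '''EUs''' || colspan="2" | 2 × 256-bit FMA || rowspan="3" | ARMv8 SVE (256-bit)
|-
| '''DP''' || 16 FLOPs/cycle || 2 × 8 FLOPs
|-
| '''SP''' || 32 FLOPs/cycle || 2 × 16 FLOPs
|-
| rowspan="3" | Cortex-A510 || '''EUs''' || colspan="2" | 1-2 × 128-bit FMA || rowspan="3" | ARMv9 SVE2 (128-bit)
|-
| '''DP''' || 2-4 FLOPs/cycle || 2-4 FLOPs
|-
| '''SP''' || 4-8 FLOPs/cycle || 4-8 FLOPs
|-
! colspan="5" | AppliedMicro/Ampere Computing Microarchitectures
|-
| rowspan="3" | Storm<br>Shadowcat<br>Skylark || '''EUs''' || colspan="2" | 1 × 64-bit FMA || rowspan="3" | ARMv8 NEON (128-bit)
|-
| '''DP''' || 2 FLOPs/cycle || 2 FLOPs
|-
| '''SP''' || 4 FLOPs/cycle || 4 FLOPs
|-
! colspan="5" | Cavium Microarchitectures
|-
| rowspan="3" | Vulcan || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | ARMv8 NEON (128-bit)
|-
| '''DP''' || 8 FLOPs/cycle || 2 × 4 FLOPs
|-
| '''SP''' || 16 FLOPs/cycle || 2 × 8 FLOPs
|-
! colspan="5" | Samsung Microarchitectures
|-
| rowspan="3" | M1<br>M2 || '''EUs''' || colspan="2" | 1 × 128-bit FMA + 1 × 128-bit Addition || rowspan="3" | ARMv8 NEON (128-bit)
|-
| '''DP''' || 6 FLOPs/cycle || 1 × 4 FLOPs + 1 × 2 FLOPs
|-
| '''SP''' || 12 FLOPs/cycle || 1 × 8 FLOPs + 1 × 4 FLOPs
|-
| rowspan="3" | M3 || '''EUs''' || colspan="2" | 3 × 128-bit FMA || rowspan="3" | ARMv8 NEON (128-bit)
|-
| '''DP''' || 12 FLOPs/cycle || 3 × 4 FLOPs
|-
| '''SP''' || 24 FLOPs/cycle || 3 × 8 FLOPs
|-
! colspan="5" | Phytium Microarchitectures
|-
| rowspan="3" | Xiaomi || '''EUs''' || colspan="2" | 1 × 128-bit FMA || rowspan="3" | ARMv8 NEON (128-bit)
|-
| '''DP''' || 4 FLOPs/cycle || 1 × 4 FLOPs
|-
| '''SP''' || 8 FLOPs/cycle || 1 × 8 FLOPs
|-
! colspan="5" | HiSilicon Microarchitectures
|-
| rowspan="3" | TaiShan v110 || '''EUs''' || colspan="2" | 1 × 128-bit FMA || rowspan="3" | ARMv8 NEON (128-bit)
|-
| '''DP''' || 4 FLOPs/cycle || 1 × 4 FLOPs
|-
| '''SP''' || 8 FLOPs/cycle || 1 × 8 FLOPs
|}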

See also