(→x86: intel and amd big cores) |
(Minor Edit) |
||
(21 intermediate revisions by 7 users not shown) | |||
Line 1: | Line 1: | ||
{{title|Floating-Point Operations Per Second (FLOPS)}} | {{title|Floating-Point Operations Per Second (FLOPS)}} | ||
− | '''Floating-point operations per second''' ('''FLOPS''') is a | + | '''Floating-point operations per second''' ('''FLOPS''') is a measure of [[compute performance]] used to quantify the number of [[floating-point]] [[floating-point operations|operations]] a [[physical core|core]], machine, or system is capable of in a one second. |
== Overview == | == Overview == | ||
Line 15: | Line 15: | ||
:<math>\text{FLOPS}_\text{system} = \frac{\text{FLOPs}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}} \times \frac{\text{nodes}}{\text{system}}</math> | :<math>\text{FLOPS}_\text{system} = \frac{\text{FLOPs}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}} \times \frac{\text{nodes}}{\text{system}}</math> | ||
− | Modern microprocessors exploit [[data parallelism]] further through the introduction of various vector extensions such as [[x86]]'s {{x86|AVX}} and [[ARM]]'s {{arm|SVE}}. With those extensions, it's possible to | + | Modern microprocessors exploit [[data parallelism]] further through the introduction of various vector extensions such as [[x86]]'s {{x86|AVX}} and [[ARM]]'s {{arm|SVE}}. With those extensions, it's possible to perform multiple floating-point operations within a single instruction. For example, a typical [[fused multiply-accumulate]] (FMAC) operation can perform two floating-point operations at once. For a single core, this can be expressed as |
:<math>\text{FLOPS}_\text{core} = \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{FLOPs}}{\text{operation}} \times \frac{\text{cycles}}{\text{second}}</math> | :<math>\text{FLOPS}_\text{core} = \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{FLOPs}}{\text{operation}} \times \frac{\text{cycles}}{\text{second}}</math> | ||
Line 23: | Line 23: | ||
:<math>\text{FLOPS}_\text{system} = \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{FLOPs}}{\text{operation}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}} \times \frac{\text{nodes}}{\text{system}}</math> | :<math>\text{FLOPS}_\text{system} = \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{FLOPs}}{\text{operation}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}} \times \frac{\text{nodes}}{\text{system}}</math> | ||
− | == FLOPS by microarchitecture == | + | === Nomenclature === |
+ | * KiloFLOPS / KFLOPS: 10<sup>3</sup> FLOPS | ||
+ | * MegaFLOPS / MFLOPS: 10<sup>6</sup> FLOPS | ||
+ | * GigaFLOPS / GFLOPS: 10<sup>9</sup> FLOPS | ||
+ | * TeraFLOPS / TFLOPS: 10<sup>12</sup> FLOPS | ||
+ | * PetaFLOPS / PFLOPS: 10<sup>15</sup> FLOPS | ||
+ | * ExaFLOPS / EFLOPS: 10<sup>18</sup> FLOPS | ||
+ | * ZettaFLOPS / ZFLOPS: 10<sup>21</sup> FLOPS | ||
+ | * YottaFLOPS / YFLOPS: 10<sup>24</sup> FLOPS | ||
+ | |||
+ | == FLOPs by microarchitecture == | ||
=== x86 === | === x86 === | ||
{| class="wikitable" | {| class="wikitable" | ||
|- | |- | ||
− | ! Microarchitecture !! colspan="3" | | + | ! Microarchitecture !! colspan="3" | FLOPs !! ISA |
|- | |- | ||
! colspan="5" | [[Intel]] Microarchitectures | ! colspan="5" | [[Intel]] Microarchitectures | ||
Line 33: | Line 43: | ||
| rowspan="3" | {{intel|Core|l=arch}}<br>{{intel|Penryn|l=arch}}<br>{{intel|Nehalem|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit Multiplication + 1 × 128-bit Addition || rowspan="3" | {{x86|SSE}} (128-bit) | | rowspan="3" | {{intel|Core|l=arch}}<br>{{intel|Penryn|l=arch}}<br>{{intel|Nehalem|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit Multiplication + 1 × 128-bit Addition || rowspan="3" | {{x86|SSE}} (128-bit) | ||
|- | |- | ||
− | | '''DP''' || 4 | + | | '''DP''' || 4 FLOPs/cycle || 2 FLOPs + 2 FLOPs |
|- | |- | ||
− | | '''SP''' || 8 | + | | '''SP''' || 8 FLOPs/cycle || 4 FLOPs + 4 FLOPs |
|- | |- | ||
− | | rowspan="3" | {{intel|Sandy Bridge|l=arch}}<br>{{intel|Ivy Bridge|l=arch}} || '''EUs''' || colspan="2" | 1 × 256-bit Multiplication + 1 × 256-bit Addition || rowspan="3" | {{x86|AVX}} ( | + | | rowspan="3" | {{intel|Sandy Bridge|l=arch}}<br>{{intel|Ivy Bridge|l=arch}} || '''EUs''' || colspan="2" | 1 × 256-bit Multiplication + 1 × 256-bit Addition || rowspan="3" | {{x86|AVX}} (256-bit) |
|- | |- | ||
− | | '''DP''' || 8 | + | | '''DP''' || 8 FLOPs/cycle || 4 FLOPs + 4 FLOPs |
|- | |- | ||
− | | '''SP''' || 16 | + | | '''SP''' || 16 FLOPs/cycle || 8 FLOPs + 8 FLOPs |
|- | |- | ||
− | | rowspan="3" | {{intel|Haswell|l=arch}}<br>{{intel|Broadwell|l=arch}}<br>{{intel|Skylake|l=arch}}<br>{{intel|Kaby Lake|l=arch}}<br>{{intel| | + | | rowspan="3" | {{intel|Haswell|l=arch}}<br>{{intel|Broadwell|l=arch}}<br>{{intel|Skylake|l=arch}}<br>{{intel|Kaby Lake|l=arch}}<br>{{intel|Amber Lake|l=arch}}<br>{{intel|Coffee Lake|l=arch}}<br>{{intel|Whiskey Lake|l=arch}} || '''EUs''' || colspan="2" | 2 × 256-bit FMA || rowspan="3" | {{x86|AVX2}} & FMA (256-bit) |
|- | |- | ||
− | | '''DP''' || 16 | + | | '''DP''' || 16 FLOPs/cycle || 2 × 8 FLOPs |
|- | |- | ||
− | | '''SP''' || 32 | + | | '''SP''' || 32 FLOPs/cycle || 2 × 16 FLOPs |
|- | |- | ||
| rowspan="3" | {{intel|Skylake (server)|l=arch}} || '''EUs''' || colspan="2" | 2 × 512-bit FMA (varies by SKU) || rowspan="3" | {{x86|AVX-512}} & FMA (512-bit) | | rowspan="3" | {{intel|Skylake (server)|l=arch}} || '''EUs''' || colspan="2" | 2 × 512-bit FMA (varies by SKU) || rowspan="3" | {{x86|AVX-512}} & FMA (512-bit) | ||
|- | |- | ||
− | | '''DP''' || 32 | + | | '''DP''' || 32 FLOPs/cycle || 2 × 16 FLOPs |
+ | |- | ||
+ | | '''SP''' || 64 FLOPs/cycle || 2 × 32 FLOPs | ||
+ | |- | ||
+ | | rowspan="3" | {{intel|Rocket Lake|l=arch}}<br>{{intel|Ice Lake|l=arch}}<br>{{intel|Tiger Lake|l=arch}} || '''EUs''' || colspan="2" | 2 × 512-bit FMA || rowspan="3" | {{x86|AVX-512}} & FMA (512-bit) | ||
+ | |- | ||
+ | | '''DP''' || 32 FLOPs/cycle || 2 × 16 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 64 FLOPs/cycle || 2 × 32 FLOPs | ||
+ | |- | ||
+ | ! colspan="5" | [[Intel]] {{intel|MIC}} Microarchitectures | ||
+ | |- | ||
+ | | rowspan="3" | {{intel|Knights Landing|l=arch}} || '''EUs''' || colspan="2" | 2 × 512-bit FMA || rowspan="3" | {{x86|AVX-512}} & FMA (512-bit) | ||
|- | |- | ||
− | | '''SP''' || 64 | + | | '''DP''' || 32 FLOPs/cycle || 2 × 16 FLOPs |
+ | |- | ||
+ | | '''SP''' || 64 FLOPs/cycle || 2 × 32 FLOPs | ||
|- | |- | ||
! colspan="5" | [[AMD]] Microarchitectures | ! colspan="5" | [[AMD]] Microarchitectures | ||
Line 59: | Line 83: | ||
| rowspan="3" | {{amd|K10|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit Multiplication + 1 × 128-bit Addition || rowspan="3" | {{x86|SSE}} (128-bit) | | rowspan="3" | {{amd|K10|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit Multiplication + 1 × 128-bit Addition || rowspan="3" | {{x86|SSE}} (128-bit) | ||
|- | |- | ||
− | | '''DP''' || 4 | + | | '''DP''' || 4 FLOPs/cycle || 2 FLOPs + 2 FLOPs |
|- | |- | ||
− | | '''SP''' || 8 | + | | '''SP''' || 8 FLOPs/cycle || 4 FLOPs + 4 FLOPs |
|- | |- | ||
| rowspan="3" | {{amd|Bulldozer|l=arch}}<br>{{amd|Piledriver|l=arch}}<br>{{amd|Steamroller|l=arch}}<br>{{amd|Excavator|l=arch}} || '''EUs''' || colspan="2" | 2 × 128-bit FMA (per two cores) || rowspan="3" | {{x86|AVX}} & {{x86|FMA4|FMA}} (128-bit) | | rowspan="3" | {{amd|Bulldozer|l=arch}}<br>{{amd|Piledriver|l=arch}}<br>{{amd|Steamroller|l=arch}}<br>{{amd|Excavator|l=arch}} || '''EUs''' || colspan="2" | 2 × 128-bit FMA (per two cores) || rowspan="3" | {{x86|AVX}} & {{x86|FMA4|FMA}} (128-bit) | ||
|- | |- | ||
− | | '''DP''' || 8 | + | | '''DP''' || 8 FLOPs/cycle || 2 x 4 FLOPs |
|- | |- | ||
− | | '''SP''' || 16 | + | | '''SP''' || 16 FLOPs/cycle || 2 x 8 FLOPs |
|- | |- | ||
− | | rowspan="3" | {{amd|Zen|l=arch}}<br>{{amd|Zen+|l=arch}} || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | {{x86|AVX2}} & FMA ( | + | | rowspan="3" | {{amd|Zen|l=arch}}<br>{{amd|Zen+|l=arch}} || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | {{x86|AVX2}} & FMA (256-bit) |
|- | |- | ||
− | | '''DP''' || 8 | + | | '''DP''' || 8 FLOPs/cycle || 2 x 4 FLOPs |
|- | |- | ||
− | | '''SP''' || 16 | + | | '''SP''' || 16 FLOPs/cycle || 2 x 8 FLOPs |
+ | |- | ||
+ | | rowspan="3" | {{amd|Zen 2|l=arch}}<br>{{amd|Zen 3|l=arch}} || '''EUs''' || colspan="2" | 2 × 256-bit FMA || rowspan="3" | {{x86|AVX2}} & FMA (256-bit) | ||
+ | |- | ||
+ | | '''DP''' || 16 FLOPs/cycle || 2 x 8 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 32 FLOPs/cycle || 2 x 16 FLOPs | ||
+ | |- | ||
+ | ! colspan="5" | [[Centaur]] Microarchitectures | ||
+ | |- | ||
+ | | rowspan="3" | {{centtech|CHA|l=arch}} || '''EUs''' || colspan="2" | 2 × 256-bit FMA || rowspan="3" | {{x86|AVX-512}} & FMA (512-bit) | ||
+ | |- | ||
+ | | '''DP''' || 16 FLOPs/cycle || 2 x 8 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 32 FLOPs/cycle || 2 x 16 FLOPs | ||
|} | |} | ||
=== ARM === | === ARM === | ||
− | {{ | + | {| class="wikitable" |
+ | |- | ||
+ | ! Microarchitecture !! colspan="3" | FLOPs !! ISA | ||
+ | |- | ||
+ | ! colspan="5" | [[ARM]] Microarchitectures | ||
+ | |- | ||
+ | | rowspan="3" | {{armh|Cortex-A57|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit) | ||
+ | |- | ||
+ | | '''DP''' || 4 FLOPs/cycle || 4 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 8 FLOPs/cycle || 8 FLOPs | ||
+ | |- | ||
+ | | rowspan="3" | {{armh|Cortex-A76|l=arch}}<br>{{armh|Cortex-A77|l=arch}}<br>{{armh|Cortex-A78|l=arch}}<br>{{armh|Neoverse N1|l=arch}} || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit) | ||
+ | |- | ||
+ | | '''DP''' || 8 FLOPs/cycle || 2 x 4 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 16 FLOPs/cycle || 2 x 8 FLOPs | ||
+ | |- | ||
+ | | rowspan="3" | {{armh|Neoverse N2|l=arch}} || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | {{arm|ARMv9}} {{arm|SVE2}} (128-bit) | ||
+ | |- | ||
+ | | '''DP''' || 8 FLOPs/cycle || 2 x 4 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 16 FLOPs/cycle || 2 x 8 FLOPs | ||
+ | |- | ||
+ | | rowspan="3" | {{armh|Neoverse V1|l=arch}} || '''EUs''' || colspan="2" | 2 × 256-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|SVE}} (256-bit) | ||
+ | |- | ||
+ | | '''DP''' || 16 FLOPs/cycle || 2 x 8 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 32 FLOPs/cycle || 2 x 16 FLOPs | ||
+ | |- | ||
+ | | rowspan="3" | {{armh|Cortex-A510|l=arch}} || '''EUs''' || colspan="2" | 1-2 × 128-bit FMA || rowspan="3" | {{arm|ARMv9}} {{arm|SVE2}} (128-bit) | ||
+ | |- | ||
+ | | '''DP''' || 2-4 FLOPs/cycle || 2-4 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 4-8 FLOPs/cycle || 4-8 FLOPs | ||
+ | |- | ||
+ | ! colspan="5" | [[AppliedMicro]]/[[Ampere Computing]] Microarchitectures | ||
+ | |- | ||
+ | | rowspan="3" | {{apm|Storm|l=arch}}<br>{{apm|Shadowcat|l=arch}}<br>{{apm|Skylark|l=arch}} || '''EUs''' || colspan="2" | 1 × 64-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit) | ||
+ | |- | ||
+ | | '''DP''' || 2 FLOPs/cycle || 2 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 4 FLOPs/cycle || 4 FLOPs | ||
+ | |- | ||
+ | ! colspan="5" | [[Cavium]] Microarchitectures | ||
+ | |- | ||
+ | | rowspan="3" | {{cavium|Vulcan|l=arch}} || '''EUs''' || colspan="2" | 2 × 128-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit) | ||
+ | |- | ||
+ | | '''DP''' || 8 FLOPs/cycle || 2 x 4 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 16 FLOPs/cycle || 2 x 8 FLOPs | ||
+ | |- | ||
+ | ! colspan="5" | [[Samsung]] Microarchitectures | ||
+ | |- | ||
+ | | rowspan="3" | {{samsung|M1|l=arch}}<br>{{samsung|M2|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit FMA + 1 × 128-bit Addition || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit) | ||
+ | |- | ||
+ | | '''DP''' || 6 FLOPs/cycle || 1 x 4 FLOPs + 1 x 2 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 12 FLOPs/cycle || 1 x 8 FLOPs + 1 x 4 FLOPs | ||
+ | |- | ||
+ | | rowspan="3" | {{samsung|M3|l=arch}} || '''EUs''' || colspan="2" | 3 × 128-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit) | ||
+ | |- | ||
+ | | '''DP''' || 12 FLOPs/cycle || 3 x 4 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 24 FLOPs/cycle || 3 x 8 FLOPs | ||
+ | |- | ||
+ | ! colspan="5" | [[Phytium]] Microarchitectures | ||
+ | |- | ||
+ | | rowspan="3" | {{phytium|Xiaomi|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit) | ||
+ | |- | ||
+ | | '''DP''' || 4 FLOPs/cycle || 1 x 4 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 8 FLOPs/cycle || 1 x 8 FLOPs | ||
+ | |- | ||
+ | ! colspan="5" | [[HiSilicon]] Microarchitectures | ||
+ | |- | ||
+ | | rowspan="3" | {{hisilicon|TaiShan v110|l=arch}} || '''EUs''' || colspan="2" | 1 × 128-bit FMA || rowspan="3" | {{arm|ARMv8}} {{arm|NEON}} (128-bit) | ||
+ | |- | ||
+ | | '''DP''' || 4 FLOPs/cycle || 1 x 4 FLOPs | ||
+ | |- | ||
+ | | '''SP''' || 8 FLOPs/cycle || 1 x 8 FLOPs | ||
+ | |} | ||
== See also == | == See also == | ||
+ | * [[bytes per FLOP]] | ||
* [[floating point]] | * [[floating point]] | ||
* [[floating point operation]] | * [[floating point operation]] |
Latest revision as of 14:03, 25 January 2023
Floating-point operations per second (FLOPS) is a measure of compute performance used to quantify the number of floating-point operations a core, machine, or system is capable of performing in one second.
Overview
FLOPS is a measure used to compare the peak theoretical performance of a core, microprocessor, or system in terms of floating-point operations. The unit is often used in high-performance computing (e.g., supercomputers) to evaluate the peak theoretical performance of systems running scientific workloads. Traditionally, the FLOPS of a microprocessor could be calculated using the following equation:
:<math>\text{FLOPS}_\text{core} = \frac{\text{FLOPs}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}}</math>
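With the advent of multi-socket and multi-core architectures, additional levels of explicit parallelism have been introduced, resulting in the following modified equation:
:<math>\text{FLOPS}_\text{node} = \frac{\text{FLOPs}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}}</math>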
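and,
:<math>\text{FLOPS}_\text{system} = \frac{\text{FLOPs}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}} \times \frac{\text{nodes}}{\text{system}}</math>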
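Modern microprocessors exploit data parallelism further through the introduction of various vector extensions such as x86's AVX and ARM's SVE. With those extensions, it's possible to perform multiple floating-point operations within a single instruction. For example, a typical fused multiply-accumulate (FMAC) operation can perform two floating-point operations at once. For a single core, this can be expressed as:
:<math>\text{FLOPS}_\text{core} = \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{FLOPs}}{\text{operation}} \times \frac{\text{cycles}}{\text{second}}</math>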
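And for a full system, this can be extended to:
:<math>\text{FLOPS}_\text{system} = \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{FLOPs}}{\text{operation}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{node}} \times \frac{\text{nodes}}{\text{system}}</math>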
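As a rough illustration of the full-system product, the sketch below multiplies the six factors together. The function name `peak_flops`, its parameter names, and the example machine (2 FMA instructions per cycle, 4 double-precision lanes, 3.0 GHz, 8 cores, 1 node) are assumptions made for this example and do not describe any specific product.

```python
def peak_flops(instructions_per_cycle, operations_per_instruction,
               flops_per_operation, cycles_per_second,
               cores_per_node=1, nodes_per_system=1):
    """Peak theoretical FLOPS as the product of the six factors.

    All arguments are per-unit rates: e.g. cycles_per_second is the
    clock frequency in Hz, operations_per_instruction is the number
    of vector lanes processed by one instruction.
    """
    return (instructions_per_cycle
            * operations_per_instruction
            * flops_per_operation
            * cycles_per_second
            * cores_per_node
            * nodes_per_system)


# Hypothetical machine: 2 FMA instructions/cycle, 4 DP lanes per
# instruction, 2 FLOPs per fused multiply-add, 3.0 GHz, 8 cores, 1 node.
example = peak_flops(2, 4, 2, 3_000_000_000, cores_per_node=8)
print(example)  # 384000000000, i.e. 384 GFLOPS
```

This is a peak figure only: it assumes every execution unit issues a fused multiply-add on every cycle, which real workloads rarely sustain.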
Nomenclature
- KiloFLOPS / KFLOPS: 10<sup>3</sup> FLOPS
- MegaFLOPS / MFLOPS: 10<sup>6</sup> FLOPS
- GigaFLOPS / GFLOPS: 10<sup>9</sup> FLOPS
- TeraFLOPS / TFLOPS: 10<sup>12</sup> FLOPS
- PetaFLOPS / PFLOPS: 10<sup>15</sup> FLOPS
- ExaFLOPS / EFLOPS: 10<sup>18</sup> FLOPS
- ZettaFLOPS / ZFLOPS: 10<sup>21</sup> FLOPS
- YottaFLOPS / YFLOPS: 10<sup>24</sup> FLOPS
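The prefixes listed above can be applied mechanically to a raw FLOPS figure. The following is a minimal sketch; the helper name `format_flops` and the two-decimal output format are choices made for this example.

```python
# Decimal prefixes, largest first, so the first match is the best fit.
PREFIXES = [
    (10**24, "YFLOPS"),
    (10**21, "ZFLOPS"),
    (10**18, "EFLOPS"),
    (10**15, "PFLOPS"),
    (10**12, "TFLOPS"),
    (10**9, "GFLOPS"),
    (10**6, "MFLOPS"),
    (10**3, "KFLOPS"),
]


def format_flops(value):
    """Return value scaled to the largest prefix that does not exceed it."""
    for scale, unit in PREFIXES:
        if value >= scale:
            return f"{value / scale:.2f} {unit}"
    return f"{value:.2f} FLOPS"


print(format_flops(384_000_000_000))  # 384.00 GFLOPS
print(format_flops(1.5e15))           # 1.50 PFLOPS
```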
FLOPs by microarchitecture
x86
Microarchitecture | Execution units (EUs) | Double precision (DP) | Single precision (SP) | ISA
---|---|---|---|---
Intel Microarchitectures | | | |
Core, Penryn, Nehalem | 1 × 128-bit Multiplication + 1 × 128-bit Addition | 4 FLOPs/cycle (2 FLOPs + 2 FLOPs) | 8 FLOPs/cycle (4 FLOPs + 4 FLOPs) | SSE (128-bit)
Sandy Bridge, Ivy Bridge | 1 × 256-bit Multiplication + 1 × 256-bit Addition | 8 FLOPs/cycle (4 FLOPs + 4 FLOPs) | 16 FLOPs/cycle (8 FLOPs + 8 FLOPs) | AVX (256-bit)
Haswell, Broadwell, Skylake, Kaby Lake, Amber Lake, Coffee Lake, Whiskey Lake | 2 × 256-bit FMA | 16 FLOPs/cycle (2 × 8 FLOPs) | 32 FLOPs/cycle (2 × 16 FLOPs) | AVX2 & FMA (256-bit)
Skylake (server) | 2 × 512-bit FMA (varies by SKU) | 32 FLOPs/cycle (2 × 16 FLOPs) | 64 FLOPs/cycle (2 × 32 FLOPs) | AVX-512 & FMA (512-bit)
Rocket Lake, Ice Lake, Tiger Lake | 2 × 512-bit FMA | 32 FLOPs/cycle (2 × 16 FLOPs) | 64 FLOPs/cycle (2 × 32 FLOPs) | AVX-512 & FMA (512-bit)
Intel MIC Microarchitectures | | | |
Knights Landing | 2 × 512-bit FMA | 32 FLOPs/cycle (2 × 16 FLOPs) | 64 FLOPs/cycle (2 × 32 FLOPs) | AVX-512 & FMA (512-bit)
AMD Microarchitectures | | | |
K10 | 1 × 128-bit Multiplication + 1 × 128-bit Addition | 4 FLOPs/cycle (2 FLOPs + 2 FLOPs) | 8 FLOPs/cycle (4 FLOPs + 4 FLOPs) | SSE (128-bit)
Bulldozer, Piledriver, Steamroller, Excavator | 2 × 128-bit FMA (per two cores) | 8 FLOPs/cycle (2 × 4 FLOPs) | 16 FLOPs/cycle (2 × 8 FLOPs) | AVX & FMA (128-bit)
Zen, Zen+ | 2 × 128-bit FMA | 8 FLOPs/cycle (2 × 4 FLOPs) | 16 FLOPs/cycle (2 × 8 FLOPs) | AVX2 & FMA (256-bit)
Zen 2, Zen 3 | 2 × 256-bit FMA | 16 FLOPs/cycle (2 × 8 FLOPs) | 32 FLOPs/cycle (2 × 16 FLOPs) | AVX2 & FMA (256-bit)
Centaur Microarchitectures | | | |
CHA | 2 × 256-bit FMA | 16 FLOPs/cycle (2 × 8 FLOPs) | 32 FLOPs/cycle (2 × 16 FLOPs) | AVX-512 & FMA (512-bit)
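The per-cycle figures in the x86 table follow from the number of execution units, the vector width, and whether each unit performs a fused multiply-add (two FLOPs per lane) or a plain multiply or add (one FLOP per lane). The sketch below reproduces a few rows under that assumption; the function `flops_per_cycle` and its parameters are hypothetical names introduced for this example.

```python
def flops_per_cycle(units, vector_bits, element_bits, fused=True):
    """FLOPs per cycle for `units` identical vector execution units.

    element_bits is 64 for double precision (DP) and 32 for single
    precision (SP). A fused multiply-add counts as 2 FLOPs per lane;
    a separate multiplier or adder counts as 1 FLOP per lane.
    """
    lanes = vector_bits // element_bits
    flops_per_lane = 2 if fused else 1
    return units * lanes * flops_per_lane


# Haswell row: 2 x 256-bit FMA
print(flops_per_cycle(2, 256, 64))               # 16 (DP)
print(flops_per_cycle(2, 256, 32))               # 32 (SP)

# Sandy Bridge row: one 256-bit multiplier plus one 256-bit adder,
# modelled here as two non-fused units.
print(flops_per_cycle(2, 256, 64, fused=False))  # 8 (DP)

# Skylake (server) row: 2 x 512-bit FMA
print(flops_per_cycle(2, 512, 64))               # 32 (DP)
```

Rows that mix unit types, or that share units between cores, need the units summed or divided accordingly rather than passed as a single count.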
ARM
Microarchitecture | Execution units (EUs) | Double precision (DP) | Single precision (SP) | ISA
---|---|---|---|---
ARM Microarchitectures | | | |
Cortex-A57 | 1 × 128-bit FMA | 4 FLOPs/cycle (4 FLOPs) | 8 FLOPs/cycle (8 FLOPs) | ARMv8 NEON (128-bit)
Cortex-A76, Cortex-A77, Cortex-A78, Neoverse N1 | 2 × 128-bit FMA | 8 FLOPs/cycle (2 × 4 FLOPs) | 16 FLOPs/cycle (2 × 8 FLOPs) | ARMv8 NEON (128-bit)
Neoverse N2 | 2 × 128-bit FMA | 8 FLOPs/cycle (2 × 4 FLOPs) | 16 FLOPs/cycle (2 × 8 FLOPs) | ARMv9 SVE2 (128-bit)
Neoverse V1 | 2 × 256-bit FMA | 16 FLOPs/cycle (2 × 8 FLOPs) | 32 FLOPs/cycle (2 × 16 FLOPs) | ARMv8 SVE (256-bit)
Cortex-A510 | 1-2 × 128-bit FMA | 2-4 FLOPs/cycle (2-4 FLOPs) | 4-8 FLOPs/cycle (4-8 FLOPs) | ARMv9 SVE2 (128-bit)
AppliedMicro/Ampere Computing Microarchitectures | | | |
Storm, Shadowcat, Skylark | 1 × 64-bit FMA | 2 FLOPs/cycle (2 FLOPs) | 4 FLOPs/cycle (4 FLOPs) | ARMv8 NEON (128-bit)
Cavium Microarchitectures | | | |
Vulcan | 2 × 128-bit FMA | 8 FLOPs/cycle (2 × 4 FLOPs) | 16 FLOPs/cycle (2 × 8 FLOPs) | ARMv8 NEON (128-bit)
Samsung Microarchitectures | | | |
M1, M2 | 1 × 128-bit FMA + 1 × 128-bit Addition | 6 FLOPs/cycle (1 × 4 FLOPs + 1 × 2 FLOPs) | 12 FLOPs/cycle (1 × 8 FLOPs + 1 × 4 FLOPs) | ARMv8 NEON (128-bit)
M3 | 3 × 128-bit FMA | 12 FLOPs/cycle (3 × 4 FLOPs) | 24 FLOPs/cycle (3 × 8 FLOPs) | ARMv8 NEON (128-bit)
Phytium Microarchitectures | | | |
Xiaomi | 1 × 128-bit FMA | 4 FLOPs/cycle (1 × 4 FLOPs) | 8 FLOPs/cycle (1 × 8 FLOPs) | ARMv8 NEON (128-bit)
HiSilicon Microarchitectures | | | |
TaiShan v110 | 1 × 128-bit FMA | 4 FLOPs/cycle (1 × 4 FLOPs) | 8 FLOPs/cycle (1 × 8 FLOPs) | ARMv8 NEON (128-bit)