Revision as of 04:17, 22 September 2018
Floating-point operations per second (FLOPS) is a microprocessor performance unit used to quantify the number of floating-point operations a core, machine, or system can perform in one second.
Overview
FLOPS is a measure used for comparing the peak theoretical floating-point performance of a core, microprocessor, or system. The unit is often used in high-performance computing (e.g., supercomputers) to evaluate the peak theoretical performance available to scientific workloads. Traditionally, the FLOPS of a microprocessor could be calculated using the following equation:
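As a hedged sketch (the subscript and term names are illustrative rather than the page's own notation), the single-core relation is the product of per-cycle throughput and clock rate:

$$\text{FLOPS}_{\text{core}} = \frac{\text{floating-point operations}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}}$$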
With the advent of multi-socket and multi-core architectures, additional levels of explicit parallelism have been introduced resulting in the following modified equation:
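One way to write this for a multi-core processor, assuming every core runs at the same clock and sustains the same per-cycle throughput (term names are illustrative):

$$\text{FLOPS}_{\text{processor}} = \text{cores} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{floating-point operations}}{\text{cycle}}$$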
and,
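For a multi-socket system, a sketch assuming identical sockets populated with identical cores (term names are illustrative):

$$\text{FLOPS}_{\text{system}} = \text{sockets} \times \frac{\text{cores}}{\text{socket}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{floating-point operations}}{\text{cycle}}$$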
Modern microprocessors exploit data parallelism further through vector extensions such as x86's AVX and ARM's SVE. With those extensions, a single instruction can perform multiple floating-point operations. For example, a typical fused multiply-accumulate (FMAC) operation performs two floating-point operations (a multiply and an add) at once. For a single core, this can be expressed as:
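A sketch of the per-core expression, assuming $N_{\text{FMA}}$ identical FMA units, a vector width of $W$ bits, elements of $b$ bits (64 for double precision, 32 for single), and a clock of $f_{\text{clock}}$ cycles per second; these symbols are illustrative, not the page's notation:

$$\text{FLOPS}_{\text{core}} = f_{\text{clock}} \times N_{\text{FMA}} \times \frac{W}{b} \times 2$$

For example, two 256-bit FMA units operating on 64-bit elements give 2 × 4 × 2 = 16 double-precision operations per cycle, which matches the Haswell entry in the x86 table below.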
And for a full system, this can be extended to:
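A sketch of the system-level form, assuming identical sockets and cores, with $N_{\text{FMA}}$ FMA units per core, a vector width of $W$ bits, elements of $b$ bits, and a clock of $f_{\text{clock}}$ cycles per second (illustrative symbols, not the page's notation):

$$\text{FLOPS}_{\text{system}} = \text{sockets} \times \frac{\text{cores}}{\text{socket}} \times f_{\text{clock}} \times N_{\text{FMA}} \times \frac{W}{b} \times 2$$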
FLOPS by microarchitecture
x86
Figures are per core unless noted. Parenthesized terms show how each per-cycle figure breaks down across the execution units (EUs).

| Microarchitecture | Execution units | DP FLOPS/cycle | SP FLOPS/cycle | ISA |
|---|---|---|---|---|
| Intel microarchitectures | | | | |
| Core, Penryn, Nehalem | 1 × 128-bit multiplication + 1 × 128-bit addition | 4 (2 + 2) | 8 (4 + 4) | SSE (128-bit) |
| Sandy Bridge, Ivy Bridge | 1 × 256-bit multiplication + 1 × 256-bit addition | 8 (4 + 4) | 16 (8 + 8) | AVX (256-bit) |
| Haswell, Broadwell, Skylake, Kaby Lake, Coffee Lake, Whiskey Lake, Amber Lake | 2 × 256-bit FMA | 16 (2 × 8) | 32 (2 × 16) | AVX2 & FMA (256-bit) |
| Skylake (server) | 2 × 512-bit FMA (varies by SKU) | 32 (2 × 16) | 64 (2 × 32) | AVX-512 & FMA (512-bit) |
| AMD microarchitectures | | | | |
| K10 | 1 × 128-bit multiplication + 1 × 128-bit addition | 4 (2 + 2) | 8 (4 + 4) | SSE (128-bit) |
| Bulldozer, Piledriver, Steamroller, Excavator | 2 × 128-bit FMA (per two cores) | 8 (2 × 4) | 16 (2 × 8) | AVX & FMA (128-bit) |
| Zen, Zen+ | 2 × 128-bit FMA | 8 (2 × 4) | 16 (2 × 8) | AVX2 & FMA (128-bit) |
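The per-cycle figures in the table can be reproduced from the execution-unit description with a few lines of Python. This is a minimal sketch, not a vendor formula: the function names `flops_per_cycle` and `peak_flops` are made up for illustration, and the 4-core, 3.0 GHz part in the usage lines is a hypothetical configuration rather than a specific product.

```python
def flops_per_cycle(units, vector_bits, element_bits, fma=True):
    """Per-core floating-point operations per cycle.

    units        -- number of vector execution units counted
                    (for non-FMA designs, count the multiplier and adder separately)
    vector_bits  -- width of each unit in bits (128, 256, 512)
    element_bits -- 64 for double precision, 32 for single precision
    fma          -- True if each unit performs a fused multiply-add
                    (2 operations per lane), False for 1 operation per lane
    """
    lanes = vector_bits // element_bits
    ops_per_lane = 2 if fma else 1
    return units * lanes * ops_per_lane


def peak_flops(per_cycle, clock_hz, cores=1, sockets=1):
    """Peak theoretical FLOPS, assuming identical cores at one fixed clock."""
    return sockets * cores * clock_hz * per_cycle


# Haswell-style core: 2 x 256-bit FMA units.
dp = flops_per_cycle(units=2, vector_bits=256, element_bits=64)  # 16
sp = flops_per_cycle(units=2, vector_bits=256, element_bits=32)  # 32

# Hypothetical 4-core processor at 3.0 GHz, double precision.
gflops = peak_flops(dp, clock_hz=3.0e9, cores=4) / 1e9
print(f"{gflops:.1f} GFLOPS")  # 192.0 GFLOPS
```

These are peak theoretical numbers; real workloads rarely sustain them, and clock speed under wide vector instructions can differ from the nominal clock.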