(New info from HotChips) |
(Added compiler support) |
||
(21 intermediate revisions by 3 users not shown) | |||
Line 4: | Line 4: | ||
|name=Falkor | |name=Falkor | ||
|designer=Qualcomm | |designer=Qualcomm | ||
− | |manufacturer= | + | |manufacturer=Samsung |
− | |introduction=2017 | + | |introduction=November 8, 2017 |
|process=10 nm | |process=10 nm | ||
+ | |cores=40 | ||
+ | |cores 2=46 | ||
+ | |cores 3=48 | ||
|type=Superscalar | |type=Superscalar | ||
|type 2=Superpipeline | |type 2=Superpipeline | ||
Line 16: | Line 19: | ||
|decode=4-way | |decode=4-way | ||
|isa=ARMv8 | |isa=ARMv8 | ||
+ | |feature=AArch64 | ||
+ | |extension=Hypervisor (EL2) | ||
+ | |extension 2=TrustZone (EL3) | ||
+ | |extension 3=NEON | ||
+ | |extension 4=CRC32 | ||
+ | |extension 5=Crypto | ||
+ | |extension 6=FP | ||
+ | |extension 7=RDM | ||
|l1i=64 KiB | |l1i=64 KiB | ||
|l1i per=core | |l1i per=core | ||
Line 22: | Line 33: | ||
|l1d per=core | |l1d per=core | ||
|l1d desc=8-way set associative | |l1d desc=8-way set associative | ||
− | | | + | |l2=512 KiB |
− | | | + | |l2 per=duplex |
+ | |l2 desc=8-way set associative | ||
+ | |l3=5 MiB | ||
+ | |l3 per=block | ||
+ | |l3 desc=20-way set associative | ||
+ | |successor=Saphira | ||
+ | |successor link=qualcomm/microarchitectures/saphira | ||
|inst=Yes | |inst=Yes | ||
}} | }} | ||
− | '''Falkor''' is an [[ARM]] microarchitecture | + | '''Falkor''' is an [[ARM]] microarchitecture designed by [[Qualcomm]] for the server market. Falkor-based microprocessors are manufactured on a [[10 nm process]] and sold under the {{qualcomm|Centriq}} brand. |
+ | |||
+ | == Codenames == | ||
+ | {| class="wikitable" | ||
+ | ! Core !! Platform !! Family | ||
+ | |- | ||
+ | | Falkor || Amberwing || {{qualcomm|Centriq}} | ||
+ | |} | ||
+ | |||
+ | == Process Technology == | ||
+ | {{further|10 nm lithography process}} | ||
+ | Falkor-based chips are manufactured on [[Samsung]]'s [[10 nm process|10 nm 10LPE]] process. | ||
+ | |||
+ | == Compiler Support == | ||
+ | {| class="wikitable" | ||
+ | |- | ||
+ | ! Compiler !! Arch-Specific || Arch-Favorable || Arch-Target | ||
+ | |- | ||
+ | | [[GCC]] || <code>-march=armv8-a</code> || <code>-mtune=falkor</code> || <code>-mcpu=falkor</code> | ||
+ | |- | ||
+ | | [[LLVM]] || <code>-march=armv8-a</code> || <code>-mtune=falkor</code> || <code>-mcpu=falkor</code> | ||
+ | |} | ||
+ | |||
+ | == Architecture == | ||
+ | [[File:falkor centriq 2400.png|right|350px]] | ||
+ | Falkor is a new architecture designed by Qualcomm from the ground up for the server market. While some of the core architecture resembles Qualcomm's mobile cores, the overall system architecture is considerably different from anything Qualcomm has previously designed. | ||
+ | |||
+ | === Overview === | ||
+ | * Core | ||
+ | ** 1 V nominal operating voltage | ||
+ | ** 64-bit ARM | ||
+ | *** [[AArch64]] only | ||
+ | **** Fully ARMv8-compliant | ||
+ | *** Supports EL3 (TrustZone) | ||
+ | *** Supports EL2 (hypervisor) | ||
+ | *** Supports AES, SHA1, SHA2-256 optional cryptography instructions | ||
+ | ** [[Out-of-order]] Pipeline | ||
+ | *** 4-way Decode | ||
+ | **** 3 instructions + 1 direct branch per cycle | ||
+ | *** 8-way Scheduler | ||
+ | *** 256-entry ReOrder Buffer | ||
+ | *** 76-entry Scheduler | ||
+ | *** 4 instructions/cycle retirement | ||
+ | *** 16B load + 16B store per cycle | ||
+ | * Core Duplex | ||
+ | ** 2 cores in a duplex | ||
+ | *** Shared L2 per duplex | ||
+ | |||
+ | === Block Diagram === | ||
+ | ==== System Overview ==== | ||
+ | {{empty section}} | ||
+ | ==== Individual Core ==== | ||
+ | {{empty section}} | ||
+ | |||
+ | === Memory Hierarchy === | ||
+ | * L0I Cache: | ||
+ | ** 24 KiB, 3-way set associative | ||
+ | ** 64-byte lines | ||
+ | ** way-predicted | ||
+ | ** parity with auto-correct | ||
+ | ** 0 cycle penalty for L0 hit | ||
+ | ** Exclusive of L1 | ||
+ | * L1I Cache: | ||
+ | ** 64 KiB, 8-way set associative | ||
+ | ** 64-byte lines | ||
+ | ** parity with auto-correct | ||
+ | ** 4 cycles penalty for hit (L0 miss) | ||
+ | *** Hardware prefetch on L1 misses | ||
+ | * L1D Cache: | ||
+ | ** 32 KiB, 8-way set associative | ||
+ | ** 64-byte lines | ||
+ | ** Write-through, read-allocate, write-no-allocate | ||
+ | ** Split virtual and physical tags | ||
+ | ** parity with auto-correct | ||
+ | ** 3 cycles minimum latency on hits | ||
+ | * L2 Cache: | ||
+ | ** Per duplex (shared by both cores) | ||
+ | ** Unified 512 KiB, 8-way set associative | ||
+ | ** 128-byte lines, interleaved | ||
+ | ** Inclusive of L1D$ (both cores) | ||
+ | ** ECC, single-error correction / double-error detection (SEC/DED) | ||
+ | ** 15 cycles minimum latency on hits | ||
+ | ** 32 B per direction per interleave per cycle | ||
+ | * L3 Cache: | ||
+ | ** Distributed in 12 blocks along the ring | ||
+ | ** 5 MiB/block (60 MiB in total) | ||
+ | ** 20-way set associative | ||
+ | ** 128-byte lines, 128 B interleaved | ||
+ | ** Non-inclusive | ||
+ | *** Standard cache or victim mode | ||
+ | ** ECC, single-error correction / double-error detection (SEC/DED) | ||
+ | ** Integrated L2 Snoop Filter | ||
+ | ** [[QoS]] | ||
+ | *** Way-based partitioning | ||
+ | ** Line- and way-based locking support | ||
+ | * System [[DRAM]]: | ||
+ | ** 768 GiB Max | ||
+ | ** x64 DDR4-2666 memory | ||
+ | ** ECC, single-error correction / double-error detection (SEC/DED) | ||
+ | ** 6 channels, interleaved | ||
+ | *** Up to quad-rank 3DS | ||
+ | *** 16-128 GiB/channel | ||
+ | *** RDIMM/LRDIMM | ||
+ | |||
+ | * TLBs: | ||
+ | ** DTLB | ||
+ | *** 64-entry | ||
+ | ** STLB | ||
+ | *** 512-entry "final" | ||
+ | *** 64-entry "non-final" | ||
+ | *** 64-entry Stage-2 | ||
+ | |||
+ | == Overview == | ||
+ | The chip was designed by [[Qualcomm]] specifically for the data center. The architecture targets high concurrency and high thread density while maintaining isolation and quality of service between processes. The chip consists of 24 core duplexes (48 cores) on a ring interconnect, along with 6 channels of DDR4 and an L3 cache distributed across the ring. It also integrates 32 PCIe 3.0 lanes, 8 SATA 3.0 lanes, and various other I/O peripherals. | ||
+ | |||
+ | == System Architecture == | ||
+ | {{empty section}} | ||
+ | |||
+ | == Core == | ||
+ | {{empty section}} | ||
+ | |||
+ | == Die == | ||
+ | * Samsung's [[10 nm process]] (10LPE) | ||
+ | * 18,000,000,000 transistors | ||
+ | * 398 mm² die size | ||
+ | |||
+ | |||
+ | === Additional Shots === | ||
+ | Additional wafer shots by Qualcomm. | ||
+ | |||
+ | <gallery mode=slideshow> | ||
+ | File:qualcomm centriq 2400 wafer (color).png | ||
+ | File:qualcomm centriq 2400 wafer (gold).png | ||
+ | File:qualcomm centriq 2400 wafer (upright, color).png | ||
+ | File:qualcomm centriq 2400 wafer (upright, gold).png | ||
+ | </gallery> | ||
+ | |||
+ | == All Falkor-based chips == | ||
+ | <!-- NOTE: | ||
+ | This table is generated automatically from the data in the actual articles. | ||
+ | If a microprocessor is missing from the list, an appropriate article for it needs to be | ||
+ | created and tagged accordingly. | ||
+ | |||
+ | Missing a chip? please dump its name here: https://en.wikichip.org/wiki/WikiChip:wanted_chips | ||
+ | --> | ||
+ | {{comp table start}} | ||
+ | <table class="comptable sortable tc1 tc4 tc5"> | ||
+ | {{comp table header|main|8:List of Falkor-based Processors}} | ||
+ | {{comp table header|main|8:Main processor}} | ||
+ | {{comp table header|cols|Price|Launched|Cores|Threads|L3$|%Frequency|%Turbo|TDP}} | ||
+ | {{#ask: [[Category:microprocessor models by qualcomm]] [[microarchitecture::Falkor]] | ||
+ | |?full page name | ||
+ | |?model number | ||
+ | |?release price | ||
+ | |?first launched | ||
+ | |?core count | ||
+ | |?thread count | ||
+ | |?l3$ size | ||
+ | |?base frequency#GHz | ||
+ | |?turbo frequency#GHz | ||
+ | |?tdp | ||
+ | |format=template | ||
+ | |template=proc table 3 | ||
+ | |userparam=10 | ||
+ | |mainlabel=- | ||
+ | }} | ||
+ | {{comp table count|ask=[[Category:microprocessor models by qualcomm]] [[microarchitecture::Falkor]]}} | ||
+ | </table> | ||
+ | {{comp table end}} | ||
+ | |||
+ | == References == | ||
+ | * Thomas Speier & Barry Wolford, "Qualcomm Centriq 2400 Processor." Hot Chips 29 Symposium (HCS), 2017 IEEE. IEEE, 2017. | ||
+ | * Barry Wolford, "Architecting a multi-core server SoC for the cloud", 2017 Linley Processor Conference | ||
+ | |||
+ | == Documents == | ||
+ | {{empty section}} |
Latest revision as of 12:01, 19 May 2021
Falkor µarch | |
General Info | |
Arch Type | CPU |
Designer | Qualcomm |
Manufacturer | Samsung |
Introduction | November 8, 2017 |
Process | 10 nm |
Core Configs | 40, 46, 48 |
Pipeline | |
Type | Superscalar, Superpipeline |
OoOE | Yes |
Speculative | Yes |
Reg Renaming | Yes |
Stages | 10-15 |
Decode | 4-way |
Instructions | |
ISA | ARMv8 |
Extensions | Hypervisor (EL2), TrustZone (EL3), NEON, CRC32, Crypto, FP, RDM |
Cache | |
L1I Cache | 64 KiB/core 8-way set associative |
L1D Cache | 32 KiB/core 8-way set associative |
L2 Cache | 512 KiB/duplex 8-way set associative |
L3 Cache | 5 MiB/block 20-way set associative |
Succession | |
Falkor is an ARM microarchitecture designed by Qualcomm for the server market. Falkor-based microprocessors are manufactured on a 10 nm process and sold under the Centriq brand.
Codenames
Core | Platform | Family |
---|---|---|
Falkor | Amberwing | Centriq |
Process Technology
- Further information: 10 nm lithography process
Falkor-based chips are manufactured on Samsung's 10 nm 10LPE process.
Compiler Support
Compiler | Arch-Specific | Arch-Favorable | Arch-Target |
---|---|---|---|
GCC | -march=armv8-a | -mtune=falkor | -mcpu=falkor |
LLVM | -march=armv8-a | -mtune=falkor | -mcpu=falkor |
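As a rough illustration of how the flags in the table might be passed to a compiler driver, the sketch below only assembles a command line and does not run it. The driver name `gcc`, the `-O2` level, and the file names `hello.c` and `hello` are placeholders that do not come from this page; only the three Falkor-related flags do. On a non-AArch64 host a cross-compiling driver would be needed instead.

```python
import shlex

# Flags copied from the compiler table above; everything else is illustrative.
FALKOR_FLAGS = {
    "arch-specific": "-march=armv8-a",
    "arch-favorable": "-mtune=falkor",
    "arch-target": "-mcpu=falkor",
}


def build_command(compiler, source, output, mode="arch-target"):
    """Return an argv list that compiles `source` with one of the table's flags."""
    if mode not in FALKOR_FLAGS:
        raise ValueError(f"unknown mode: {mode}")
    # "-O2" is a placeholder optimisation level, not something this page specifies.
    return [compiler, "-O2", FALKOR_FLAGS[mode], "-o", output, source]


if __name__ == "__main__":
    for mode in FALKOR_FLAGS:
        # Print a shell-quoted command line for each column of the table.
        print(shlex.join(build_command("gcc", "hello.c", "hello", mode)))
```

The returned list can be handed to `subprocess.run` when a suitable toolchain is installed.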
Architecture
Falkor is a new architecture designed by Qualcomm from the ground up for the server market. While parts of the core architecture resemble Qualcomm's mobile cores, the overall system architecture differs considerably from anything Qualcomm has previously designed.
Overview
- Core
  - 1 V nominal operating voltage
  - 64-bit ARM
    - AArch64 only
      - Fully ARMv8-compliant
    - Supports EL3 (TrustZone)
    - Supports EL2 (hypervisor)
    - Supports AES, SHA1, SHA2-256 optional cryptography instructions
  - Out-of-order Pipeline
    - 4-way Decode
      - 3 instructions + 1 direct branch per cycle
    - 8-way Scheduler
    - 256-entry ReOrder Buffer
    - 76-entry Scheduler
    - 4 instructions/cycle retirement
    - 16 B load + 16 B store per cycle
- Core Duplex
  - 2 cores in a duplex
    - Shared L2 per duplex
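The pipeline figures above lend themselves to some back-of-the-envelope arithmetic. The sketch below assumes one reorder-buffer entry per instruction (this page does not say how entries are allocated) and uses the 2.6 GHz turbo frequency listed for the Centriq 2460 in the chip table on this page. The function names are made up for the example.

```python
# Figures from the overview list above.
DECODE_WIDTH = 4            # instructions per cycle (3 + 1 direct branch)
ROB_ENTRIES = 256           # reorder buffer entries
LOAD_BYTES_PER_CYCLE = 16
STORE_BYTES_PER_CYCLE = 16


def rob_fill_cycles(rob_entries=ROB_ENTRIES, width=DECODE_WIDTH):
    """Minimum cycles to fill the ROB at full decode width (ceiling division).

    Assumes one ROB entry per instruction, which is an assumption of this sketch.
    """
    return -(-rob_entries // width)


def peak_gb_per_s(freq_ghz, bytes_per_cycle):
    """Peak per-core bandwidth in GB/s: cycles per nanosecond times bytes per cycle."""
    return freq_ghz * bytes_per_cycle


if __name__ == "__main__":
    print("cycles to fill ROB:", rob_fill_cycles())
    print("peak load GB/s at 2.6 GHz:", peak_gb_per_s(2.6, LOAD_BYTES_PER_CYCLE))
    print("peak store GB/s at 2.6 GHz:", peak_gb_per_s(2.6, STORE_BYTES_PER_CYCLE))
```

These are upper bounds derived from the listed widths, not measured results.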
Block Diagram
System Overview
This section is empty.
Individual Core
This section is empty.
Memory Hierarchy
- L0I Cache:
  - 24 KiB, 3-way set associative
  - 64-byte lines
  - way-predicted
  - parity with auto-correct
  - 0 cycle penalty for L0 hit
  - Exclusive of L1
- L1I Cache:
  - 64 KiB, 8-way set associative
  - 64-byte lines
  - parity with auto-correct
  - 4 cycles penalty for hit (L0 miss)
    - Hardware prefetch on L1 misses
- L1D Cache:
  - 32 KiB, 8-way set associative
  - 64-byte lines
  - Write-through, read-allocate, write-no-allocate
  - Split virtual and physical tags
  - parity with auto-correct
  - 3 cycles minimum latency on hits
- L2 Cache:
  - Per duplex (shared by both cores)
  - Unified 512 KiB, 8-way set associative
  - 128-byte lines, interleaved
  - Inclusive of L1D$ (both cores)
  - ECC, single-error correction / double-error detection (SEC/DED)
  - 15 cycles minimum latency on hits
  - 32 B per direction per interleave per cycle
- L3 Cache:
  - Distributed in 12 blocks along the ring
  - 5 MiB/block (60 MiB in total)
  - 20-way set associative
  - 128-byte lines, 128 B interleaved
  - Non-inclusive
    - Standard cache or victim mode
  - ECC, single-error correction / double-error detection (SEC/DED)
  - Integrated L2 Snoop Filter
  - QoS
    - Way-based partitioning
  - Line- and way-based locking support
- System DRAM:
  - 768 GiB Max
  - x64 DDR4-2666 memory
  - ECC, single-error correction / double-error detection (SEC/DED)
  - 6 channels, interleaved
    - Up to quad-rank 3DS
    - 16-128 GiB/channel
    - RDIMM/LRDIMM
- TLBs:
  - DTLB
    - 64-entry
  - STLB
    - 512-entry "final"
    - 64-entry "non-final"
    - 64-entry Stage-2
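The cache parameters listed above determine how many sets each cache has, since sets = capacity / (ways × line size). The sketch below works this out for each level. It treats every cache as a single array and ignores interleaving, which is a simplification made for the example rather than something stated in the source.

```python
KIB = 1024
MIB = 1024 * KIB

# (name, capacity in bytes, associativity, line size in bytes), from the list above.
CACHES = [
    ("L0I", 24 * KIB, 3, 64),
    ("L1I", 64 * KIB, 8, 64),
    ("L1D", 32 * KIB, 8, 64),
    ("L2", 512 * KIB, 8, 128),
    ("L3 block", 5 * MIB, 20, 128),
]


def num_sets(capacity, ways, line_size):
    """Number of sets in a set-associative cache; rejects inconsistent geometry."""
    sets, remainder = divmod(capacity, ways * line_size)
    if remainder:
        raise ValueError("capacity is not a multiple of ways * line_size")
    return sets


if __name__ == "__main__":
    for name, capacity, ways, line in CACHES:
        print(f"{name}: {num_sets(capacity, ways, line)} sets")
```

Every listed geometry divides evenly. The non-power-of-two associativities (3-way L0I, 20-way L3) are what allow the 24 KiB and 5 MiB capacities to map onto a power-of-two number of sets.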
Overview
The chip was designed by Qualcomm specifically for the data center. The architecture targets high concurrency and high thread density while maintaining isolation and quality of service between processes. The chip consists of 24 core duplexes (48 cores) on a ring interconnect, along with 6 channels of DDR4 and an L3 cache distributed across the ring. It also integrates 32 PCIe 3.0 lanes, 8 SATA 3.0 lanes, and various other I/O peripherals.
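The chip-level totals follow from the per-unit figures quoted on this page. The sketch below recomputes them for the full 48-core configuration. The peak DRAM bandwidth is a theoretical figure that takes DDR4-2666 as exactly 2666 MT/s on a 64-bit (8-byte) channel; that rounding is an assumption of the sketch.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Per-unit figures quoted on this page.
DUPLEXES = 24
CORES_PER_DUPLEX = 2
L2_PER_DUPLEX = 512 * KIB
L3_BLOCKS = 12
L3_PER_BLOCK = 5 * MIB
DDR_CHANNELS = 6
MAX_DRAM_PER_CHANNEL = 128 * GIB
DDR_TRANSFERS_PER_S = 2666e6  # DDR4-2666 taken as 2666 MT/s (assumption)
BUS_BYTES = 8                 # x64 data bus: 8 bytes per transfer


def totals():
    """Aggregate the per-unit figures into chip-level totals."""
    return {
        "cores": DUPLEXES * CORES_PER_DUPLEX,
        "l2_mib": DUPLEXES * L2_PER_DUPLEX / MIB,
        "l3_mib": L3_BLOCKS * L3_PER_BLOCK / MIB,
        "max_dram_gib": DDR_CHANNELS * MAX_DRAM_PER_CHANNEL / GIB,
        "peak_dram_gb_s": DDR_CHANNELS * DDR_TRANSFERS_PER_S * BUS_BYTES / 1e9,
    }


if __name__ == "__main__":
    for key, value in totals().items():
        print(f"{key}: {value}")
```

The 60 MiB L3 and 768 GiB DRAM totals match the values quoted in the memory hierarchy list.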
System Architecture
This section is empty.
Core
This section is empty.
Die
- Samsung's 10 nm process (10LPE)
- 18,000,000,000 transistors
- 398 mm² die size
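The two die figures above give an average transistor density. The sketch below is a simple division of the quoted numbers; it says nothing about how density varies between logic, cache, and I/O regions of the die.

```python
TRANSISTORS = 18_000_000_000  # from the list above
DIE_AREA_MM2 = 398            # from the list above


def density_mtr_per_mm2(transistors, area_mm2):
    """Average density in millions of transistors per square millimetre."""
    return transistors / area_mm2 / 1e6


if __name__ == "__main__":
    print(f"{density_mtr_per_mm2(TRANSISTORS, DIE_AREA_MM2):.1f} MTr/mm²")
```

This works out to roughly 45 million transistors per mm².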
Additional Shots
Additional wafer shots by Qualcomm.
All Falkor-based chips
List of Falkor-based Processors
Model | Price | Launched | Cores | Threads | L3$ | Frequency | Turbo | TDP |
---|---|---|---|---|---|---|---|---|
2434 | $888.00 | 8 November 2017 | 40 | 40 | 50 MiB | 2.3 GHz | 2.5 GHz | 110 W |
2452 | $1,383.00 | 8 November 2017 | 46 | 46 | 57.5 MiB | 2.2 GHz | 2.6 GHz | 120 W |
2460 | $1,995.00 | 8 November 2017 | 48 | 48 | 60 MiB | 2.2 GHz | 2.6 GHz | 120 W |
Count: 3
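The table can be normalised per core to compare the three models. The sketch below uses only the US dollar price, core count, L3 size, and TDP from the table; the tuple layout and function name are made up for the example.

```python
# (model, price in USD, cores, L3 in MiB, TDP in W), from the table above.
CHIPS = [
    ("2434", 888.00, 40, 50.0, 110.0),
    ("2452", 1383.00, 46, 57.5, 120.0),
    ("2460", 1995.00, 48, 60.0, 120.0),
]


def per_core(chip):
    """Return price, L3 capacity, and TDP divided by the core count."""
    _model, price, cores, l3_mib, tdp = chip
    return {
        "usd": price / cores,
        "l3_mib": l3_mib / cores,
        "watts": tdp / cores,
    }


if __name__ == "__main__":
    for chip in CHIPS:
        m = per_core(chip)
        print(f"{chip[0]}: ${m['usd']:.2f}/core, "
              f"{m['l3_mib']:.2f} MiB L3/core, {m['watts']:.2f} W/core")
```

All three models have 1.25 MiB of L3 per core, while the list price per core rises with core count.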
References
- Thomas Speier & Barry Wolford, "Qualcomm Centriq 2400 Processor." Hot Chips 29 Symposium (HCS), 2017 IEEE. IEEE, 2017.
- Barry Wolford, "Architecting a multi-core server SoC for the cloud", 2017 Linley Processor Conference
Documents
This section is empty.