From WikiChip
Falkor - Microarchitectures - Qualcomm
Revision as of 12:01, 19 May 2021 by 78.82.251.111 (talk) (Added compiler support)

Falkor µarch

General Info
Arch Type: CPU
Designer: Qualcomm
Manufacturer: Samsung
Introduction: November 8, 2017
Process: 10 nm
Core Configs: 40, 46, 48

Pipeline
Type: Superscalar, Superpipeline
OoOE: Yes
Speculative: Yes
Reg Renaming: Yes
Stages: 10-15
Decode: 4-way

Instructions
ISA: ARMv8
Extensions: Hypervisor (EL2), TrustZone (EL3), NEON, CRC32, Crypto, FP, RDM

Cache
L1I Cache: 64 KiB/core, 8-way set associative
L1D Cache: 32 KiB/core, 8-way set associative
L2 Cache: 512 KiB/duplex, 8-way set associative
L3 Cache: 5 MiB/block, 20-way set associative

Falkor is an ARM microarchitecture designed by Qualcomm for the server market. Falkor-based microprocessors are manufactured on a 10 nm process and sold under the Centriq brand.

Codenames

Core    Platform   Family
Falkor  Amberwing  Centriq

Process Technology

Further information: 10 nm lithography process

Falkor-based chips are manufactured on Samsung's 10 nm 10LPE process.

Compiler Support

Compiler  Arch-Specific   Arch-Favorable  Arch-Target
GCC       -march=armv8-a  -mtune=falkor   -mcpu=falkor
LLVM      -march=armv8-a  -mtune=falkor   -mcpu=falkor
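
As a minimal sketch of how the flags in the table might be applied, the Python helper below assembles a compiler command line for each column. The helper name `falkor_compile_command`, the `-O2` optimization level, and the file names are assumptions made for illustration; only the `-march`, `-mtune`, and `-mcpu` values come from the table. The compiler is assumed to target AArch64.

```python
# Flag for each column of the compiler support table.
FALKOR_FLAGS = {
    "specific": "-march=armv8-a",   # generic ARMv8-A code generation
    "favorable": "-mtune=falkor",   # tune scheduling for Falkor only
    "target": "-mcpu=falkor",       # select Falkor as the target CPU
}


def falkor_compile_command(compiler, source, output, mode="target"):
    """Build an argument list for an AArch64 compiler (hypothetical helper)."""
    if mode not in FALKOR_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    # -O2 is an illustrative choice, not something the table prescribes.
    return [compiler, "-O2", FALKOR_FLAGS[mode], source, "-o", output]


if __name__ == "__main__":
    # File names are placeholders.
    print(" ".join(falkor_compile_command("gcc", "hello.c", "hello")))
```

The returned list can be passed to `subprocess.run` on a machine where the chosen compiler is installed.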

Architecture

Falkor is a new architecture designed by Qualcomm from the ground up for the server market. While parts of the core architecture resemble Qualcomm's mobile cores, the overall system architecture differs considerably from anything Qualcomm has previously designed.

Overview

  • Core
    • 1 V nominal operating voltage
    • 64-bit ARM
      • AArch64 only
        • Fully ARMv8-compliant
      • Supports EL3 (TrustZone)
      • Supports EL2 (hypervisor)
      • Supports the optional AES, SHA1, and SHA2-256 cryptography instructions
    • Out-of-order Pipeline
      • 4-way Decode
        • 3 instructions + 1 direct branch per cycle
      • 8-way Scheduler
      • 256-entry ReOrder Buffer
      • 76-entry Scheduler
      • 4 instructions/cycle retirement
      • 16B load + 16B store per cycle
  • Core Duplex
    • 2 cores in a duplex
      • Shared L2 per duplex
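
The per-cycle figures above can be turned into rough peak numbers. The sketch below is back-of-the-envelope arithmetic, not a measurement: it multiplies the 16 B load and 16 B store widths by a clock frequency and divides the reorder buffer size by the retirement width. The 2.2 GHz and 2.6 GHz clocks are taken from the processor list further down this page; the variable and function names are illustrative.

```python
LOAD_BYTES_PER_CYCLE = 16
STORE_BYTES_PER_CYCLE = 16
ROB_ENTRIES = 256
RETIRE_WIDTH = 4


def peak_gb_per_s(bytes_per_cycle, freq_ghz):
    """Peak bandwidth in GB/s (10^9 bytes/s) for one core at freq_ghz."""
    return bytes_per_cycle * freq_ghz


# Cycles needed to retire a completely full reorder buffer at full width.
rob_drain_cycles = ROB_ENTRIES // RETIRE_WIDTH

if __name__ == "__main__":
    for ghz in (2.2, 2.6):
        load = peak_gb_per_s(LOAD_BYTES_PER_CYCLE, ghz)
        store = peak_gb_per_s(STORE_BYTES_PER_CYCLE, ghz)
        print(f"{ghz} GHz: {load:.1f} GB/s load + {store:.1f} GB/s store per core")
    print(f"ROB drain: {rob_drain_cycles} cycles")
```

These are upper bounds under the assumption that every cycle issues a full-width load and store; sustained figures depend on the workload.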

Memory Hierarchy

  • L0I Cache:
    • 24 KiB, 3-way set associative
    • 64-byte lines
    • way-predicted
    • parity with auto-correct
    • 0-cycle penalty for L0 hit
    • Exclusive of L1
  • L1I Cache:
    • 64 KiB, 8-way set associative
    • 64-byte lines
    • parity with auto-correct
    • 4-cycle penalty for hit (L0 miss)
      • Hardware prefetch on L1 misses
  • L1D Cache:
    • 32 KiB, 8-way set associative
    • 64-byte lines
    • Write-through, read-allocate, write-no-allocate
    • Split virtual and physical tags
    • parity with auto-correct
    • 3 cycles minimum latency on hits
  • L2 Cache:
    • Per duplex (shared by both cores)
    • Unified 512 KiB, 8-way set associative
    • 128-byte lines, interleaved
    • Inclusive of L1D$ (both cores)
    • ECC, single-error correction / double-error detection (SEC/DED)
    • 15 cycles minimum latency on hits
    • 32 B per direction per interleave per cycle
  • L3 Cache:
    • Distributed in 12 blocks along the ring
    • 5 MiB/block (60 MiB in total)
    • 20-way set associative
    • 128-byte lines, 128 B interleaved
    • Non-inclusive
      • Standard cache or victim mode
    • ECC, single-error correction / double-error detection (SEC/DED)
    • Integrated L2 Snoop Filter
    • QoS
      • Way-based partitioning
    • Line- and way-based locking support
  • System DRAM:
    • 768 GiB Max
    • x64 DDR4-2666 memory
    • ECC, single-error correction / double-error detection (SEC/DED)
    • 6 channels, interleaved
      • Up to quad-rank 3DS
      • 16-128 GiB/channel
      • RDIMM/LRDIMM
  • TLBs:
    • DTLB
      • 64-entry
    • STLB
      • 512-entry "final"
      • 64-entry "non-final"
      • 64-entry Stage-2
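
The cache geometries listed above imply a set count for each level, since sets = capacity / (ways × line size). The sketch below works this out in Python. It treats each cache as a single array and ignores how the L2 and L3 interleaves split the sets, which is a simplifying assumption; the dictionary and function names are illustrative.

```python
KIB = 1024
MIB = 1024 * KIB

# name: (capacity in bytes, associativity, line size in bytes), from the list above
CACHES = {
    "L0I": (24 * KIB, 3, 64),
    "L1I": (64 * KIB, 8, 64),
    "L1D": (32 * KIB, 8, 64),
    "L2": (512 * KIB, 8, 128),
    "L3 block": (5 * MIB, 20, 128),
}


def num_sets(capacity, ways, line_size):
    """Number of sets in a set-associative cache."""
    sets, remainder = divmod(capacity, ways * line_size)
    if remainder:
        raise ValueError("capacity is not a multiple of ways * line size")
    return sets


cache_sets = {name: num_sets(*geometry) for name, geometry in CACHES.items()}

if __name__ == "__main__":
    for name, sets in cache_sets.items():
        print(f"{name}: {sets} sets")
```

Under these assumptions the L0I and L1I both come out at 128 sets, the L1D at 64, the L2 at 512, and each L3 block at 2048.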

Overview

Qualcomm designed the chip specifically for the data center. The architecture targets high concurrency and high thread density while maintaining isolation and quality of service between processes. The chip consists of 24 core duplexes (48 cores in total) on a ring interconnect, along with 6 channels of DDR4 and an L3 cache distributed across the ring. It also integrates 32 PCIe 3.0 lanes, 8 SATA 3.0 lanes, and a mix of other I/O peripherals.
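
The chip-level totals follow directly from the per-unit figures. The sketch below combines the duplex count from this paragraph with the L2, L3, and DRAM figures from the Memory Hierarchy list. The DRAM number is a nominal peak that assumes 8 data bytes per transfer on each 64-bit channel at 2666 MT/s, excluding ECC bits; the variable names are illustrative.

```python
DUPLEXES = 24
CORES_PER_DUPLEX = 2
L2_PER_DUPLEX_KIB = 512
L3_BLOCKS = 12
L3_PER_BLOCK_MIB = 5
DDR_CHANNELS = 6
DDR_MEGATRANSFERS_PER_S = 2666   # DDR4-2666, nominal
BYTES_PER_TRANSFER = 8           # assumed: 64 data bits per channel

total_cores = DUPLEXES * CORES_PER_DUPLEX
total_l2_mib = DUPLEXES * L2_PER_DUPLEX_KIB / 1024
total_l3_mib = L3_BLOCKS * L3_PER_BLOCK_MIB
# MT/s * bytes gives MB/s; divide by 1000 for GB/s (decimal units).
peak_dram_gb_per_s = DDR_CHANNELS * DDR_MEGATRANSFERS_PER_S * BYTES_PER_TRANSFER / 1000

if __name__ == "__main__":
    print(f"cores: {total_cores}")
    print(f"aggregate L2: {total_l2_mib:.0f} MiB")
    print(f"aggregate L3: {total_l3_mib} MiB")
    print(f"nominal peak DRAM bandwidth: {peak_dram_gb_per_s:.1f} GB/s")
```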

Die

  • Samsung's 10 nm process (10LPE)
  • 18,000,000,000 transistors
  • 398 mm² die size
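
The two figures above give an average transistor density for the die. The short sketch below is plain arithmetic over the whole die area, so it blends logic, cache, and I/O regions into one average; the variable names are illustrative.

```python
TRANSISTORS = 18_000_000_000
DIE_AREA_MM2 = 398

# Average density in millions of transistors per square millimetre.
density_mtr_per_mm2 = TRANSISTORS / DIE_AREA_MM2 / 1e6

if __name__ == "__main__":
    print(f"{density_mtr_per_mm2:.1f} MTr/mm²")
```

This works out to roughly 45 million transistors per mm².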


All Falkor-based chips

List of Falkor-based Processors

Model  Price      Launched         Cores  Threads  L3$       Frequency  Turbo    TDP
2434   $888.00    8 November 2017  40     40       50 MiB    2.3 GHz    2.5 GHz  110 W
2452   $1,383.00  8 November 2017  46     46       57.5 MiB  2.2 GHz    2.6 GHz  120 W
2460   $1,995.00  8 November 2017  48     48       60 MiB    2.2 GHz    2.6 GHz  120 W
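
A quick way to compare the three models is to normalize the listed figures per core. The sketch below uses only the model, price, core count, and L3 values from the list above; the tuple layout and variable names are illustrative.

```python
# (model, list price in USD, cores, L3 in MiB), from the processor list above
MODELS = [
    ("2434", 888.00, 40, 50.0),
    ("2452", 1383.00, 46, 57.5),
    ("2460", 1995.00, 48, 60.0),
]

l3_per_core = {model: l3 / cores for model, _, cores, l3 in MODELS}
price_per_core = {model: price / cores for model, price, cores, _ in MODELS}

if __name__ == "__main__":
    for model, _, _, _ in MODELS:
        print(f"{model}: {l3_per_core[model]:.2f} MiB L3/core, "
              f"${price_per_core[model]:.2f}/core")
```

All three models come out at 1.25 MiB of L3 per core, while the list price per core rises with the core count.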
