From WikiChip
Difference between revisions of "qualcomm/microarchitectures/falkor"

Revision as of 18:19, 13 October 2017

Falkor µarch
General Info
Arch Type: CPU
Designer: Qualcomm
Manufacturer: TSMC
Introduction: 2017
Process: 10 nm
Core Configs: 42
Pipeline
Type: Superscalar, Superpipeline
OoOE: Yes
Speculative: Yes
Reg Renaming: Yes
Stages: 10-15
Decode: 4-way
Instructions
ISA: ARMv8
Cache
L1I Cache: 64 KiB/core, 8-way set associative
L1D Cache: 32 KiB/core, 8-way set associative

Falkor is an ARM microarchitecture designed by Qualcomm for the server market. Falkor-based microprocessors are manufactured on a 10 nm process and sold under the Centriq brand.

Process Technology

Further information: 10 nm lithography process

Falkor-based chips are manufactured on TSMC's 10 nm process.

Architecture

Falkor is a new architecture designed by Qualcomm from the ground up for the server market. While some of the core architecture resembles Qualcomm's mobile cores, the overall system architecture is considerably different from anything Qualcomm has previously designed.

Overview

  • Core
    • 1 V nominal operating voltage
    • 64-bit ARM
      • AArch64 only
        • Fully ARMv8-compliant
      • Supports EL3 (TrustZone)
      • Supports EL2 (hypervisor)
      • Supports AES, SHA1, SHA2-256 optional cryptography instructions
    • Out-of-order Pipeline
      • 4-way Decode
        • 3 instructions + 1 direct branch per cycle
      • 8-way Scheduler
      • 256-entry ReOrder Buffer
      • 76-entry Scheduler
      • 4 instructions/cycle retirement
      • 16B load + 16B store per cycle
  • Core Duplex
    • 2 cores in a duplex
      • Shared L2 per duplex
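
As a rough illustration of how the pipeline figures above relate to each other, the following minimal sketch computes how many cycles it takes, at best, to retire a full reorder buffer. It assumes one reorder-buffer entry per instruction and retirement at the peak rate every cycle; neither assumption is stated by the source, and the function name is hypothetical.

```python
import math

# Figures from the Overview list above.
ROB_ENTRIES = 256    # 256-entry ReOrder Buffer
RETIRE_WIDTH = 4     # 4 instructions/cycle retirement


def min_cycles_to_retire(entries: int = ROB_ENTRIES,
                         width: int = RETIRE_WIDTH) -> int:
    """Lower bound on cycles needed to retire `entries` instructions.

    Assumes (hypothetically) one ROB entry per instruction and that the
    core sustains its peak retirement width on every cycle.
    """
    return math.ceil(entries / width)


print(min_cycles_to_retire())  # 64
```

Under the same assumptions, 64 cycles is also roughly how long a 4-wide front end can keep filling the window behind a stalled oldest instruction before the reorder buffer is full.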

Block Diagram

System Overview

This section is empty.

Individual Core

This section is empty.

Memory Hierarchy

  • L0I Cache:
    • 24 KiB, 3-way set associative
    • 64-byte lines
    • way-predicted
    • parity with auto-correct
    • 0 cycle penalty for L0 hit
    • Exclusive of L1
  • L1I Cache:
    • 64 KiB, 8-way set associative
    • 64-byte lines
    • parity with auto-correct
    • 4 cycles penalty for hit (L0 miss)
      • Hardware prefetch on L1 misses
  • L1D Cache:
    • 32 KiB, 8-way set associative
    • 64-byte lines
    • Write-through, read-allocate, write-no-allocate
    • Split virtual and physical tags
    • parity with auto-correct
    • 3 cycles minimum latency on hits
  • L2 Cache:
    • Per duplex (shared by both cores)
    • Unified 512 KiB, 8-way set associative
    • 128-byte lines, interleaved
    • Inclusive of L1D$ (both cores)
    • ECC, single-error correction / double-error detection (SEC/DED)
    • 15 cycles minimum latency on hits
    • 32 B per direction per interleave per cycle
  • L3 Cache:
    • Distributed in 12 blocks along the ring
    • 5 MiB/block (60 MiB in total)
    • 20-way set associative
    • 128-byte lines, 128 B interleaved
    • Non-inclusive
      • Standard cache or victim mode
    • ECC, single-error correction / double-error detection (SEC/DED)
    • Integrated L2 Snoop Filter
    • QoS
      • Way-based partitioning
    • Line- and way-based locking support
  • System DRAM:
    • 768 GiB Max
    • x64 DDR4-2666 memory
    • ECC, single-error correction / double-error detection (SEC/DED)
    • 6 channels, interleaved
      • Up to quad-rank 3DS
      • 16-128 GiB/channel
      • RDIMM/LRDIMM
  • TLBs:
    • DTLB
      • 64-entry
    • STLB
      • 512-entry "final"
      • 64-entry "non-final"
      • 64-entry Stage-2
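
The capacity, associativity, and line-size figures in the list above imply a set count for each cache level. The sketch below derives them, assuming each structure is a conventional set-associative array (sets = capacity / (ways × line size)); the source does not state the set counts or the indexing scheme, so treat the results as derived estimates. The `Cache` class and constant names are illustrative only.

```python
from dataclasses import dataclass

KIB = 1024
MIB = 1024 * KIB


@dataclass(frozen=True)
class Cache:
    name: str
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def lines(self) -> int:
        # Total number of cache lines held by the structure.
        return self.size_bytes // self.line_bytes

    @property
    def sets(self) -> int:
        # sets = capacity / (associativity * line size)
        return self.size_bytes // (self.ways * self.line_bytes)


# Figures taken from the list above; the L3 entry is one of the 12 blocks.
CACHES = [
    Cache("L0I", 24 * KIB, 3, 64),
    Cache("L1I", 64 * KIB, 8, 64),
    Cache("L1D", 32 * KIB, 8, 64),
    Cache("L2", 512 * KIB, 8, 128),
    Cache("L3 block", 5 * MIB, 20, 128),
]

L3_BLOCKS = 12
DRAM_CHANNELS = 6
MAX_GIB_PER_CHANNEL = 128

for c in CACHES:
    print(f"{c.name:9s} {c.sets:5d} sets, {c.lines:6d} lines")

# Cross-check the totals quoted in the list.
print("L3 total:", L3_BLOCKS * CACHES[-1].size_bytes // MIB, "MiB")
print("DRAM max:", DRAM_CHANNELS * MAX_GIB_PER_CHANNEL, "GiB")
```

The last two lines confirm that the quoted totals are consistent with the per-block and per-channel figures: 12 × 5 MiB = 60 MiB of L3 and 6 × 128 GiB = 768 GiB of DRAM.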

Overview

The chip was designed by Qualcomm specifically for the data center. The architecture targets high concurrency and high thread density while maintaining isolation and quality of service between processes. The overall chip consists of 24 core duplexes (48 cores in total) on a ring interconnect, along with 6 channels of DDR4 and an L3 cache distributed across the ring. The chip also integrates 32 PCIe 3.0 lanes, 8 SATA gen 3.0 lanes, and a mixture of various other I/O peripherals.
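
The chip-level totals follow from the per-duplex and per-channel figures. The minimal sketch below combines the counts in this paragraph with the L2 size and the x64 DDR4-2666 configuration listed under Memory Hierarchy. The DRAM number is a theoretical peak that assumes the nominal 2666 MT/s rate on every channel and counts only the 64 data bits (not ECC); it is not a measured result, and the function names are hypothetical.

```python
# Counts from the paragraph above and the Memory Hierarchy list.
DUPLEXES = 24
CORES_PER_DUPLEX = 2
L2_KIB_PER_DUPLEX = 512

DDR4_MT_PER_S = 2666                  # nominal DDR4-2666 rate, mega-transfers/s
DATA_BYTES_PER_TRANSFER = 64 // 8     # x64 data bus; ECC bits not counted
DRAM_CHANNELS = 6


def total_cores() -> int:
    return DUPLEXES * CORES_PER_DUPLEX


def aggregate_l2_mib() -> int:
    # Sum of all per-duplex L2 caches, in MiB.
    return DUPLEXES * L2_KIB_PER_DUPLEX // 1024


def peak_dram_mb_per_s() -> int:
    """Theoretical peak in MB/s (10^6 bytes/s) across all channels."""
    return DDR4_MT_PER_S * DATA_BYTES_PER_TRANSFER * DRAM_CHANNELS


print(total_cores())         # 48
print(aggregate_l2_mib())    # 12
print(peak_dram_mb_per_s())  # 127968
```

That is about 128 GB/s of theoretical DRAM bandwidth shared by 48 cores, or roughly 2.7 GB/s per core at peak.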

System Architecture

This section is empty.

Core

This section is empty.

References

  • Thomas Speier & Barry Wolford, "Qualcomm Centriq 2400 Processor." Hot Chips 29 Symposium (HCS), IEEE, 2017.
  • Barry Wolford, "Architecting a multi-core server SoC for the cloud." Linley Processor Conference, 2017.

Documents

This section is empty.