Difference between revisions of "qualcomm/microarchitectures/falkor"

	Falkor µarch
	General Info
Arch Type	CPU
Designer	Qualcomm
Manufacturer	TSMC
Introduction	2017
Process	10 nm
Core Configs	42
	Pipeline
Type	Superscalar, Superpipeline
OoOE	Yes
Speculative	Yes
Reg Renaming	Yes
Stages	10-15
Decode	4-way
	Instructions
ISA	ARMv8
	Cache
L1I Cache	64 KiB/core; 8-way set associative
L1D Cache	32 KiB/core; 8-way set associative

Revision as of 17:19, 13 October 2017

Falkor is an ARM microarchitecture designed by Qualcomm for the server market. Falkor-based microprocessors are manufactured on a 10 nm process and sold under the Centriq brand.

Process Technology

Further information: 10 nm lithography process

Falkor-based chips are manufactured on TSMC's 10 nm process.

Architecture

Falkor is a new architecture designed by Qualcomm from the ground up for the server market. While some of the core architecture resembles Qualcomm's mobile cores, the overall system architecture is considerably different from anything Qualcomm has previously designed.

Overview

Core
- 1 V nominal operating voltage
- 64-bit ARM
  - AArch64 only
    - Fully ARMv8-compliant
  - Supports EL3 (TrustZone)
  - Supports EL2 (hypervisor)
  - Supports the optional AES, SHA1, and SHA2-256 cryptography instructions
- Out-of-order Pipeline
  - 4-way Decode
    - 3 instructions + 1 direct branch per cycle
  - 8-way Scheduler
  - 256-entry ReOrder Buffer
  - 76-entry Scheduler
  - 4 instructions/cycle retirement
  - 16B load + 16B store per cycle
Core Duplex
- 2 cores in a duplex
  - Shared L2 per duplex

Block Diagram

System Overview


Individual Core


Memory Hierarchy

L0I Cache:
- 24 KiB, 3-way set associative
- 64-byte lines
- way-predicted
- parity with auto-correct
- 0 cycle penalty for L0 hit
- Exclusive of L1
L1I Cache:
- 64 KiB, 8-way set associative
- 64-byte lines
- parity with auto-correct
- 4 cycles penalty for hit (L0 miss)
  - Hardware prefetch on L1 misses
L1D Cache:
- 32 KiB, 8-way set associative
- 64-byte lines
- Write-through, read-allocate, write-no-allocate
- Split virtual and physical tags
- parity with auto-correct
- 3 cycles minimum latency on hits
L2 Cache:
- Per duplex (shared by both cores)
- Unified 512 KiB, 8-way set associative
- 128-byte lines, interleaved
- Inclusive of L1D$ (both cores)
- ECC, single-error correction / double-error detection (SEC/DED)
- 15 cycles minimum latency on hits
- 32 B per direction per interleave per cycle
L3 Cache:
- Distributed in 12 blocks along the ring
- 5 MiB/block (60 MiB in total)
- 20-way set associative
- 128-byte lines, 128 B interleaved
- Non-inclusive
  - Standard cache or victim mode
- ECC, single-error correction / double-error detection (SEC/DED)
- Integrated L2 Snoop Filter
- QoS
  - Way-based partitioning
- Line- and way-based locking support
System DRAM:
- 768 GiB Max
- x64 DDR4-2666 memory
- ECC, single-error correction / double-error detection (SEC/DED)
- 6 channels, interleaved
  - Up to quad-rank 3DS
  - 16-128 GiB/channel
  - RDIMM/LRDIMM
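
The cache parameters listed above imply a set count for each level. The following is a minimal sketch of that arithmetic, assuming the textbook relation sets = capacity / (ways × line size). It ignores interleaving and banking, so the real per-interleave organization of the L2 and L3 may differ. The `Cache` class and `CACHES` list are illustrative names, not anything from Qualcomm's documentation.

```python
from dataclasses import dataclass

KIB = 1024
MIB = 1024 * KIB


@dataclass(frozen=True)
class Cache:
    """Hypothetical container for the figures quoted in the list above."""
    name: str
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def sets(self) -> int:
        # Assumed textbook model: sets = capacity / (associativity * line size).
        return self.size_bytes // (self.ways * self.line_bytes)


# Sizes, associativity, and line sizes are taken from the list above.
CACHES = [
    Cache("L0I", 24 * KIB, 3, 64),
    Cache("L1I", 64 * KIB, 8, 64),
    Cache("L1D", 32 * KIB, 8, 64),
    Cache("L2", 512 * KIB, 8, 128),
    Cache("L3 block", 5 * MIB, 20, 128),
]

for cache in CACHES:
    print(f"{cache.name}: {cache.sets} sets")
```

Under this model the L0I and L1I both come out at 128 sets, the L1D at 64, the L2 at 512, and each L3 block at 2048.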

TLBs:
- DTLB
  - 64-entry
- STLB
  - 512-entry "final"
  - 64-entry "non-final"
  - 64-entry Stage-2
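
As a rough illustration of what these entry counts mean, the sketch below estimates TLB reach (the amount of memory covered without a miss). It assumes that each entry maps exactly one page and uses the three AArch64 translation granule sizes (4 KiB, 16 KiB, 64 KiB). The source does not state which page sizes each Falkor TLB can hold, so the figures are upper-level estimates under that assumption, and `tlb_reach` is a hypothetical helper.

```python
KIB = 1024

# AArch64 translation granules; which ones each Falkor TLB supports is an assumption.
PAGE_SIZES = {"4 KiB": 4 * KIB, "16 KiB": 16 * KIB, "64 KiB": 64 * KIB}

# Entry counts from the list above.
TLB_ENTRIES = {"DTLB": 64, "STLB (final)": 512}


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes covered if every entry maps one distinct page (assumed model)."""
    return entries * page_bytes


for tlb, entries in TLB_ENTRIES.items():
    for label, page_bytes in PAGE_SIZES.items():
        reach_kib = tlb_reach(entries, page_bytes) // KIB
        print(f"{tlb} with {label} pages: {reach_kib} KiB reach")
```

With 4 KiB pages this model gives 256 KiB of reach for the DTLB and 2 MiB for the 512-entry "final" STLB.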

Overview

Qualcomm designed the chip specifically for the data center. The architecture targets high concurrency and high thread density while maintaining isolation and quality of service between processes. The chip consists of 24 core duplexes (48 cores) on a ring interconnect, along with 6 channels of DDR4 and an L3 cache distributed across the ring. It also integrates 32 PCIe 3.0 lanes, 8 SATA 3.0 lanes, and a mix of other I/O peripherals.
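
The chip-level totals follow from the per-unit figures given on this page. The sketch below works through that arithmetic. It assumes a fully populated 24-duplex part, and it estimates peak DRAM bandwidth as channels × transfer rate × 8 bytes, treating DDR4-2666 as exactly 2666 MT/s. That bandwidth is a theoretical ceiling, not a measured result. All constant names are illustrative.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Per-unit figures quoted on this page.
DUPLEXES = 24
CORES_PER_DUPLEX = 2
L2_PER_DUPLEX = 512 * KIB
L3_BLOCKS = 12
L3_PER_BLOCK = 5 * MIB
DRAM_CHANNELS = 6
DRAM_MAX_PER_CHANNEL = 128 * GIB

# Assumptions: DDR4-2666 taken as 2666e6 transfers/s on a 64-bit (8-byte) channel.
TRANSFERS_PER_SEC = 2666 * 10**6
CHANNEL_WIDTH_BYTES = 8

cores = DUPLEXES * CORES_PER_DUPLEX
total_l2 = DUPLEXES * L2_PER_DUPLEX
total_l3 = L3_BLOCKS * L3_PER_BLOCK
max_dram = DRAM_CHANNELS * DRAM_MAX_PER_CHANNEL
peak_dram_bw = DRAM_CHANNELS * TRANSFERS_PER_SEC * CHANNEL_WIDTH_BYTES

print(f"cores:            {cores}")
print(f"total L2:         {total_l2 // MIB} MiB")
print(f"total L3:         {total_l3 // MIB} MiB")
print(f"L3 per core:      {total_l3 / cores / MIB:.2f} MiB")
print(f"max DRAM:         {max_dram // GIB} GiB")
print(f"peak DRAM bw:     {peak_dram_bw / 1e9:.2f} GB/s (theoretical)")
```

The results agree with the totals stated elsewhere on the page: 48 cores, 60 MiB of L3, and a 768 GiB DRAM maximum.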

System Architecture


Core


References

Thomas Speier & Barry Wolford, "Qualcomm Centriq 2400 Processor", Hot Chips 29 Symposium (HCS), IEEE, 2017.
Barry Wolford, "Architecting a multi-core server SoC for the cloud", Linley Processor Conference, 2017.

Documents


@@ Line 81: / Line 81: @@
 ** Split virtual and physical tags
 ** parity with auto-correct
-** 3 cycles latency on hits
+** 3 cycles minimum latency on hits
 * L2 Cache:
-** Per duplex
+** Per duplex (shared by both cores)
-** 512 KiB, ?-way set associative
+** Unified 512 KiB, 8-way set associative
-** 64-byte lines
+** 128-byte lines, interleaved
+** Inclusive of L1D$ (both cores)
+** ECC, single-error correction / double-error detection (SEC/DED)
+** 15 cycles minimum latency on hits
+** 32 B per direction per interleave per cycle
 * L3 Cache:
 ** Distributed in 12 blocks along the ring

