From WikiChip
Editing cavium/microarchitectures/vulcan
Latest revision | Your text | ||
Line 3: | Line 3: | ||
|atype=CPU | |atype=CPU | ||
|name=Vulcan | |name=Vulcan | ||
− | |designer= | + | |designer=Broadcom |
|designer 2=Cavium | |designer 2=Cavium | ||
|manufacturer=TSMC | |manufacturer=TSMC | ||
Line 24: | Line 24: | ||
|isa=ARMv8.1 | |isa=ARMv8.1 | ||
|extension=NEON | |extension=NEON | ||
− | |||
|l1i=32 KiB | |l1i=32 KiB | ||
|l1i per=core | |l1i per=core | ||
Line 38: | Line 37: | ||
|predecessor=XLP II | |predecessor=XLP II | ||
|predecessor link=broadcom/microarchitectures/larrabee | |predecessor link=broadcom/microarchitectures/larrabee | ||
− | |||
− | |||
}} | }} | ||
− | '''Vulcan''' is a [[16 nm]] high-performance {{arch|64}} [[ARM]] microarchitecture designed by [[Broadcom]] and later | + | '''Vulcan''' is a [[16 nm]] high-performance {{arch|64}} [[ARM]] microarchitecture designed by [[Broadcom]] and later [[Cavium]] for the server market. |
Introduced in [[2018]], Vulcan-based microprocessors are branded as part of the {{cavium|ThunderX2}} family. | Introduced in [[2018]], Vulcan-based microprocessors are branded as part of the {{cavium|ThunderX2}} family. | ||
== History == | == History == | ||
− | Vulcan can trace its roots all the way back to [[Raza Microelectronics]] {{raza|XLR}} family of [[MIPS]] processors from [[2006]]. With the introduction of their {{raza|XLR}} family in [[2009]], Raza (and later [[NetLogic]]) moved to a high-performance superscalar design with fine-grained 4-way multithreading support. In [[2011]], [[Broadcom]] acquired [[NetLogic Microsystems]] and integrated them | + | Vulcan can trace its roots all the way back to [[Raza Microelectronics]] {{raza|XLR}} family of [[MIPS]] processors from [[2006]]. With the introduction of their {{raza|XLR}} family in [[2009]], Raza (and later [[NetLogic]]) moved to a high-performance superscalar design with fine-grained 4-way multithreading support. In [[2011]], [[Broadcom]] acquired [[NetLogic Microsystems]] and integrated them into Broadcom's Embedded Processor Group. |
− | In [[2013]], Broadcom announced that they have licensed the ARMv7 and ARMv8 | + | In [[2013]], Broadcom announced that they had licensed the ARMv7 and ARMv8 architectures, allowing them to develop their own microarchitectures based on the ISA. Vulcan is the outcome of this effort, which involved adopting the [[ARM]] ISA instead of [[MIPS]] and enhancing the cores in various ways. Vulcan development started in early [[2012]] and was expected to enter mass production in mid-[[2015]]. |
− | In [[ | + | In [[2017]], [[Cavium]] acquired Vulcan from Broadcom and introduced it later that year. In early [[2018]], Vulcan-based microprocessors entered general availability under the {{cavium|ThunderX2}} brand. |
== Architecture == | == Architecture == | ||
Line 89: | Line 54: | ||
=== Key changes from {{broadcom|XLP II|l=arch}} === | === Key changes from {{broadcom|XLP II|l=arch}} === | ||
* Converted to [[ARM]] ISA (from [[MIPS]]) | * Converted to [[ARM]] ISA (from [[MIPS]]) | ||
− | ** AArch64 | + | ** AArch64, AArch32 |
* [[16 nm lithography process|16nm FinFET process]] (from [[28 nm|28 nm planar]]) | * [[16 nm lithography process|16nm FinFET process]] (from [[28 nm|28 nm planar]]) | ||
* 40% IPC improvement | * 40% IPC improvement | ||
Line 106: | Line 71: | ||
**** Unified scheduler (from distributed) | **** Unified scheduler (from distributed) | ||
***** 60 entries | ***** 60 entries | ||
− | *** Large ROB (180 entries, up from | + | *** Large ROB (180 entries, up from 100) |
*** Execution Units | *** Execution Units | ||
**** New FP Unit (2, up from 1) | **** New FP Unit (2, up from 1) | ||
Line 138: | Line 103: | ||
*** 256 KiB, 8-way set associative | *** 256 KiB, 8-way set associative | ||
** L3 Cache | ** L3 Cache | ||
− | |||
*** 1 MiB/core slice | *** 1 MiB/core slice | ||
*** Shared | *** Shared | ||
− | |||
− | |||
** System DRAM | ** System DRAM | ||
− | |||
*** 8 Channels | *** 8 Channels | ||
*** DDR4, up to 2666 MT/s | *** DDR4, up to 2666 MT/s | ||
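The channel count and transfer rate listed above imply a theoretical peak memory bandwidth. The following is a minimal sketch of that arithmetic, assuming a standard 64-bit (8-byte) DDR4 data bus per channel; the function name and parameters are illustrative, and sustained bandwidth on real hardware will be lower.

```python
def peak_dram_bandwidth_gbs(channels: int, mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second).

    bus_bytes=8 assumes a 64-bit data bus per channel (an assumption, not a disclosed figure).
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


# 8 channels of DDR4 at up to 2666 MT/s, as listed above.
bw = peak_dram_bandwidth_gbs(channels=8, mega_transfers_per_s=2666)
print(f"{bw:.1f} GB/s")
```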
Line 170: | Line 131: | ||
== Core == | == Core == | ||
− | Vulcan is an [[out-of-order]] superscalar with support for up to four simultaneous hardware threads | + | Vulcan is an [[out-of-order]] superscalar with support for up to four simultaneous hardware threads. |
=== Front-end === | === Front-end === | ||
Vulcan's front-end is tasked with fetching instructions from a ready thread's instruction stream and feeding them into the decoder so they can be delivered to the execution units. Since Vulcan supports up to four threads, a thread scheduler determines which thread's instruction stream to fetch from. This determination is made each cycle with the help of the branch predictor, at no added cost. | Vulcan's front-end is tasked with fetching instructions from a ready thread's instruction stream and feeding them into the decoder so they can be delivered to the execution units. Since Vulcan supports up to four threads, a thread scheduler determines which thread's instruction stream to fetch from. This determination is made each cycle with the help of the branch predictor, at no added cost. | ||
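The actual thread-selection policy has not been disclosed. As a rough illustration of per-cycle selection among ready threads, the sketch below assumes a simple round-robin policy; the function <code>pick_thread</code> and the readiness flags are hypothetical and do not describe Vulcan's real logic.

```python
from typing import Optional, Sequence


def pick_thread(ready: Sequence[bool], last: int) -> Optional[int]:
    """Return the next ready thread after `last` in round-robin order, or None if none is ready."""
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if ready[candidate]:
            return candidate
    return None


# Four hardware threads; thread 2 is assumed to be stalled (for example, waiting on a cache miss).
ready = [True, True, False, True]
last = 0
order = []
for _ in range(6):
    last = pick_thread(ready, last)
    order.append(last)
print(order)
```

In this model a stalled thread is simply skipped, so the remaining ready threads share the fetch cycles evenly.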
==== Fetch ==== | ==== Fetch ==== | ||
[[Instruction fetch]] is done on a 32-byte window, or 8 (4-byte) [[ARM]] instructions. This is twice the throughput of the previous architecture and is designed to better absorb bubbles in the pipeline. The instruction stream is decomposed into its constituent instructions, which are queued for the [[instruction decode|decoder]]. The queue is shared by all threads. The size of the queue has not been disclosed. | [[Instruction fetch]] is done on a 32-byte window, or 8 (4-byte) [[ARM]] instructions. This is twice the throughput of the previous architecture and is designed to better absorb bubbles in the pipeline. The instruction stream is decomposed into its constituent instructions, which are queued for the [[instruction decode|decoder]]. The queue is shared by all threads. The size of the queue has not been disclosed. | ||
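To illustrate why fetching 8 instructions per cycle helps absorb bubbles when the decoder takes up to four per cycle, here is a toy queue model. The queue capacity of 32 is an arbitrary placeholder (the real size is undisclosed), and letting fetched instructions decode in the same cycle is a simplification.

```python
def simulate(fetch_pattern, fetch_width=8, decode_width=4, queue_cap=32):
    """Return the number of instructions handed to the decoder on each cycle.

    fetch_pattern: iterable of bools, True = fetch delivers a full window, False = bubble.
    queue_cap is a hypothetical capacity; the real queue size has not been disclosed.
    """
    queue = 0
    decoded = []
    for fetching in fetch_pattern:
        if fetching:
            queue = min(queue_cap, queue + fetch_width)
        issued = min(decode_width, queue)
        queue -= issued
        decoded.append(issued)
    return decoded


pattern = [True, False] * 4  # a bubble on every other cycle
wide = simulate(pattern)                   # 8-wide fetch
narrow = simulate(pattern, fetch_width=4)  # 4-wide fetch for comparison
print(wide)
print(narrow)
```

With the 8-wide fetch, the queue holds enough instructions to keep the decoder fed through each bubble; with a 4-wide fetch, every bubble shows up as an idle decode cycle.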
Line 182: | Line 140: | ||
Each cycle, up to four instructions are sent to the [[instruction decode|decoder]]. In prior designs, [[Broadcom]]'s products decoded [[MIPS]] instructions. With Vulcan, the switch to ARM meant the decoder had to be replaced with much more complex logic that decodes the original [[instruction]] and emits [[micro-ops]]. For the most part, there is a 1:1 mapping between instructions and µOPs; on average, about 15% more µOPs are emitted than instructions decoded. The extra complexity added another pipeline stage to the decode. | Each cycle, up to four instructions are sent to the [[instruction decode|decoder]]. In prior designs, [[Broadcom]]'s products decoded [[MIPS]] instructions. With Vulcan, the switch to ARM meant the decoder had to be replaced with much more complex logic that decodes the original [[instruction]] and emits [[micro-ops]]. For the most part, there is a 1:1 mapping between instructions and µOPs; on average, about 15% more µOPs are emitted than instructions decoded. The extra complexity added another pipeline stage to the decode. | ||
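The 15% µOP expansion puts a rough ceiling on sustained instruction throughput once the 4 µOPs/cycle delivery rate into the back-end is taken into account. The sketch below is only that back-of-the-envelope bound, assuming the average expansion applies uniformly; the function name is illustrative.

```python
def instr_per_cycle_bound(uop_width: int, uops_per_instr: float) -> float:
    """Upper bound on sustained instructions per cycle given a µOP-limited pipeline width."""
    return uop_width / uops_per_instr


# 4 µOPs per cycle, with an average of 1.15 µOPs per instruction (15% expansion).
bound = instr_per_cycle_bound(uop_width=4, uops_per_instr=1.15)
print(f"{bound:.2f} instructions/cycle")
```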
==== Loop Buffer ==== | ==== Loop Buffer ==== | ||
− | Sitting between the [[instruction decode|decoder]] and the [[instruction scheduler|scheduler]] is a | + | Sitting between the [[instruction decode|decoder]] and the [[instruction scheduler|scheduler]] is a [[loop buffer]]. The loop buffer, in conjunction with the [[branch predictor]], queues recent tight-loop operations. The buffer plays back the operations repeatedly until a taken branch occurs. While the buffer is playing back, the front-end (instruction fetch, decode, etc.) is largely power-gated in order to save power. Although Broadcom originally told us the buffer had 48 entries, WikiChip was unable to confirm this number when the product was re-released by [[Cavium]] in late 2018. |
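As a rough model of the benefit, the sketch below counts how many µOPs must come through fetch and decode versus how many could be replayed from the loop buffer. It assumes the unconfirmed 48-entry capacity and assumes a loop is captured after its first iteration; both the capture rule and the function name are hypothetical.

```python
def front_end_activity(body_uops: int, iterations: int, capacity: int = 48):
    """Return (µOPs delivered by fetch/decode, µOPs replayed from the loop buffer).

    capacity=48 is the unconfirmed figure; capture after one iteration is an assumption.
    """
    total = body_uops * iterations
    if body_uops > capacity or iterations < 2:
        # Loop does not fit (or never repeats): everything goes through the front-end.
        return total, 0
    # First iteration goes through fetch/decode and fills the buffer; the rest is replayed.
    return body_uops, total - body_uops


small = front_end_activity(body_uops=12, iterations=100)
large = front_end_activity(body_uops=64, iterations=100)
print(small)
print(large)
```

In this model, a 12-µOP loop running 100 iterations needs the front-end for only the first pass, while a 64-µOP loop exceeds the assumed capacity and gets no benefit.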
− | === Execution engine === | + | ==== Execution engine ==== |
Vulcan's back-end handles the execution of out-of-order operations. Vulcan's back-end has been substantially enhanced from prior designs including a complete redesign of the [[instruction scheduler|scheduler]]. Most of the improvements dealt with entirely reworking the scheduler in order to more efficiently extract additional instruction-level parallelism opportunities. From decode, instructions are sent to the [[Reorder Buffer]] (ROB) at the rate of up to 4 µOPs each cycle. | Vulcan's back-end handles the execution of out-of-order operations. Vulcan's back-end has been substantially enhanced from prior designs including a complete redesign of the [[instruction scheduler|scheduler]]. Most of the improvements dealt with entirely reworking the scheduler in order to more efficiently extract additional instruction-level parallelism opportunities. From decode, instructions are sent to the [[Reorder Buffer]] (ROB) at the rate of up to 4 µOPs each cycle. | ||
− | ==== Renaming & Allocation ==== | + | ===== Renaming & Allocation ===== |
In the prior XLP II microarchitecture, [[NetLogic]] had a five-queue distributed instruction scheduler in which each queue was associated with certain execution units. In Vulcan, Broadcom got rid of the distributed scheduler and replaced it with a more efficient unified scheduler, similar in design to Intel's {{intel|Skylake (Client)|Skylake|l=arch}}. | In the prior XLP II microarchitecture, [[NetLogic]] had a five-queue distributed instruction scheduler in which each queue was associated with certain execution units. In Vulcan, Broadcom got rid of the distributed scheduler and replaced it with a more efficient unified scheduler, similar in design to Intel's {{intel|Skylake (Client)|Skylake|l=arch}}. | ||
Vulcan's [[Reorder Buffer]] has 180 entries, an 80-entry increase over the prior design. The ROB tracks all µOPs in flight. Vulcan renames and retires up to 4 µOPs per cycle. | Vulcan's [[Reorder Buffer]] has 180 entries, an 80-entry increase over the prior design. The ROB tracks all µOPs in flight. Vulcan renames and retires up to 4 µOPs per cycle. | ||
− | ==== Scheduler ==== | + | ===== Scheduler ===== |
Vulcan uses a 60-entry unified scheduler. Each cycle, up to six µOPs can be issued to the execution units, which is wider than the four-µOP issue of the prior design. It's worth noting that in order to support four threads, Vulcan duplicates most of the per-thread logic, such as the registers, architectural state, program counters, and interrupts. Ready µOPs from any thread are issued on each cycle. | Vulcan uses a 60-entry unified scheduler. Each cycle, up to six µOPs can be issued to the execution units, which is wider than the four-µOP issue of the prior design. It's worth noting that in order to support four threads, Vulcan duplicates most of the per-thread logic, such as the registers, architectural state, program counters, and interrupts. Ready µOPs from any thread are issued on each cycle. | ||
− | ==== Execution Units ==== | + | ===== Execution Units ===== |
− | Up to six µOPs can be sent into Vulcan's six execution units each cycle. For integer operations, up to three operations can be issued each cycle. One of the ALUs also handles branch instructions. | + | Up to six µOPs can be sent into Vulcan's six execution units each cycle. For integer operations, up to three operations can be issued each cycle. One of the ALUs also handles branch instructions. In the XLP II, there were two simple integer ALUs and a single complex integer ALU. Only the complex integer ALU was able to perform operations such as multiplication and division. Though unconfirmed, it's suspected that both ALUs can now perform complex integer operations as well. |
+ | Vulcan has doubled the number of [[floating point]] units to two and widened them to 128-bit to support [[ARM]]'s {{arm|NEON}} operations (the prior design was only 64 bits wide). In theory, Vulcan's peak performance now stands at 8 FLOPS/cycle, or 8 GFLOPS at 1 GHz. | ||
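The 8 FLOPS/cycle figure can be reproduced with simple arithmetic. The breakdown below is an assumption rather than a disclosed derivation: it takes double-precision (64-bit) elements and counts a fused multiply-add as two floating-point operations per lane.

```python
def peak_flops_per_cycle(units: int, vector_bits: int, element_bits: int = 64, flops_per_lane: int = 2) -> int:
    """Theoretical peak FLOPs per cycle per core.

    element_bits=64 (double precision) and flops_per_lane=2 (fused multiply-add)
    are assumptions used to reproduce the quoted figure.
    """
    lanes = vector_bits // element_bits
    return units * lanes * flops_per_lane


per_cycle = peak_flops_per_cycle(units=2, vector_bits=128)
gflops_at_1ghz = per_cycle * 1.0  # 1 GHz = 10^9 cycles per second
print(per_cycle, gflops_at_1ghz)
```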
− | + | ==== Memory subsystem ==== | |
+ | {{empty section}} | ||
+ | <!-- | ||
== Die == | == Die == | ||
− | + | * Broadcom's original die size was around 600 mm². It's unknown how much the die has changed when it was modified by Cavium. | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | * Broadcom's original die size was | ||
* TSMC's [[16 nm process]] | * TSMC's [[16 nm process]] | ||
− | |||
Line 240: | Line 171: | ||
:[[File:cavium vulcan die (annotated).png|600px]] | :[[File:cavium vulcan die (annotated).png|600px]] | ||
+ | --> | ||
== All Vulcan Chips == | == All Vulcan Chips == | ||
− | + | {{empty section}} | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | {{ | ||
− | == | + | == References == |
− | * Broadcom | + | * ''Some information was obtained directly from Broadcom'' |
− | * | + | * ''Some information was obtained directly from Cavium'' |
− | |||
− | |||
== See also == | == See also == | ||
* Qualcomm's {{qualcomm|Falkor|l=arch}} | * Qualcomm's {{qualcomm|Falkor|l=arch}} | ||
* Intel's {{intel|Skylake (server)|Skylake|l=arch}} | * Intel's {{intel|Skylake (server)|Skylake|l=arch}} |
Facts about "Vulcan - Microarchitectures - Cavium"
codename | Vulcan + |
core count | 16 +, 20 +, 24 +, 28 +, 30 + and 32 + |
designer | Cavium + and Broadcom + |
first launched | 2018 + |
full page name | cavium/microarchitectures/vulcan + |
instance of | microarchitecture + |
instruction set architecture | ARMv8.1 + |
manufacturer | TSMC + |
microarchitecture type | CPU + |
name | Vulcan + |
pipeline stages (max) | 15 + |
pipeline stages (min) | 13 + |
process | 16 nm (0.016 μm, 1.6e-5 mm) + |