|Introduction||October 12, 2011|
Bulldozer introduced the concept of Clustered MultiThreading (CMT) to the AMD64 architecture. In CMT, major functional units are shared between groups of cores to reduce the average die-area cost of implementing them. Other CMT implementations include the UltraSPARC-T1, codenamed Niagara, which shared a single FPU between many small cores, each of which also implemented four-way SMT. Overall, Bulldozer marked a major and ambitious departure from AMD's previous designs - and not a very successful one.
Bulldozer shared one 4-pipe FPU, one L1 I-cache, one set of branch-prediction hardware, one 4-way instruction decoder, and the unified L2 cache between a pair of cores, referred to by AMD as a module. In the FX-8100 series CPUs, four modules were bundled together with a shared L3 cache to make an 8-core CPU for the AM3+ socket. Six- and four-core versions were also sold, made by disabling one or two faulty modules; an individual core within a module could not be salvaged through die-harvesting. The Bulldozer core was not used in an APU design, but its successors were.
Pre-release marketing by AMD indicated that the hardware dedicated to each core would include four "integer pipelines", which many initially assumed were each functionally equivalent to one of the three integer pipelines in K7, K8 and K10. The L1 D-cache would also be dedicated per core. It was later revealed, however, that only two of the pipelines per core were ALUs, with the other two being AGUs associated with memory loads and stores. Worse, Bulldozer's AGUs could not be used to execute the more complex forms of the LEA (Load Effective Address) instruction efficiently because they had no connection to the result bus, so these instructions had to be cracked for multiple passes through an ALU instead. Effectively, Bulldozer was better described as having two "integer pipelines" and two load-store units per core.
Each of the three "integer pipelines" of K7, K8 and K10 included both an ALU and an AGU, for a total of three of each. Bulldozer, with only two of each, therefore had a 33% reduction in integer throughput per core per clock relative to its immediate predecessor, instead of the 33% increase (i.e. four of each) that AMD marketing had implied. This had an all too predictable effect on integer performance.
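The two percentages follow directly from the unit counts given above. As a minimal sketch of that arithmetic (the function and constant names are illustrative, and unit count is used as a crude stand-in for peak throughput per clock):

```python
def relative_change(new_units: int, old_units: int) -> float:
    """Change in unit count as a fraction of the old count."""
    return (new_units - old_units) / old_units


K10_PIPES = 3          # ALU+AGU pairs per core in K7, K8 and K10
BULLDOZER_ACTUAL = 2   # ALU+AGU pairs per Bulldozer core
BULLDOZER_IMPLIED = 4  # what "four integer pipelines" was taken to mean

actual = relative_change(BULLDOZER_ACTUAL, K10_PIPES)
implied = relative_change(BULLDOZER_IMPLIED, K10_PIPES)

print(f"actual: {actual:+.0%}, implied: {implied:+.0%}")
# prints "actual: -33%, implied: +33%"
```

Real workloads rarely saturate every pipeline, so this is an upper bound on the difference rather than a prediction of benchmark results.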
The shared FPU was considerably beefed up from K10's, with two FMAC pipelines - capable of executing adds, multiplies, and the new fused-multiply-add (FMA) instructions - and two additional pipelines for other FPU-related operations. In principle, even with both cores heavily using the shared FPU, this increased capability should have retained rough parity in throughput per core per clock with K10. It could fairly be claimed that this FPU was Bulldozer's best feature.
The four-way instruction decoder, along with the branch predictor, instruction fetcher, and L1 I-cache, could be dedicated to one core if the other core was in sleep mode. Otherwise, they would each dedicate themselves to alternate cores on successive cycles, effectively halving the fetch and decode bandwidth observed by each core. The decoder was capable of handling four single-op instructions, one double-op and two single-op, or up to four ops from a microcoded instruction, per cycle. Two consecutive double-op instructions had to be decoded in separate cycles. These limitations proved to have a significant effect on performance.
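The grouping rules above can be explored with a toy model. The sketch below is not a description of the real hardware: it assumes in-order greedy grouping, assumes the double-op may sit anywhere in its group, assumes a microcoded instruction occupies the decoder exclusively, and approximates the alternating-cycle sharing by doubling the cycle count. The instruction kinds and the `decode_cycles` function are hypothetical names introduced for illustration.

```python
import math
from typing import List, Tuple

# An instruction is (kind, ops); kind is "single", "double" or "micro".
Instr = Tuple[str, int]


def decode_cycles(stream: List[Instr], other_core_active: bool = False) -> int:
    """Count decode cycles for one core's instruction stream (toy model)."""
    cycles = 0
    i = 0
    while i < len(stream):
        kind, ops = stream[i]
        if kind not in ("single", "double", "micro"):
            raise ValueError(f"unknown instruction kind: {kind}")
        if kind == "micro":
            # Up to four ops of a microcoded instruction per cycle.
            cycles += max(1, math.ceil(ops / 4))
            i += 1
            continue
        # Build one decode group: four singles, or one double plus two singles.
        cycles += 1
        singles = doubles = 0
        while i < len(stream):
            kind, _ = stream[i]
            if kind == "single" and singles < (2 if doubles else 4):
                singles += 1
            elif kind == "double" and doubles == 0 and singles <= 2:
                doubles = 1
            else:
                break  # group is full, or the next instruction needs its own cycle
            i += 1
    # With both cores awake the decoder alternates between them every cycle,
    # so each core sees roughly half the decode bandwidth.
    return cycles * 2 if other_core_active else cycles


single, double = ("single", 1), ("double", 2)

print(decode_cycles([single] * 8))                          # 2
print(decode_cycles([double, double]))                      # 2
print(decode_cycles([double, single, single]))              # 1
print(decode_cycles([("micro", 6)]))                        # 2
print(decode_cycles([single] * 8, other_core_active=True))  # 4
```

Even in this simplified form, a stream dense in double-op instructions decodes at barely one instruction per cycle, and sharing the decoder halves that again.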
Other differences from K10 lay in the cache hierarchy. The L1 I-cache, now shared between two cores as mentioned above, was still 64KB and 2-way set-associative. This quickly proved to be a bottleneck when running heterogeneous workloads, because code running on one core would repeatedly evict code required by the other. The L1 D-cache was sharply reduced in size to 16KB and became write-through instead of write-back. Cache and memory latencies were found to be much higher than in K10, which was particularly disappointing since K8 and K10 had made a point of achieving low latencies with their on-die memory controllers.
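The I-cache eviction problem can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the real hardware: the 64KB size and 2-way associativity come from the text, while the 64-byte line size, LRU replacement, simple modulo indexing, the 48KB code footprint and the base addresses are all assumptions chosen for illustration.

```python
from collections import OrderedDict

LINE = 64                        # bytes per line (assumed)
WAYS = 2                         # 2-way set-associative (from the text)
SIZE = 64 * 1024                 # 64KB (from the text)
SETS = SIZE // (LINE * WAYS)     # 512 sets


class ICache:
    """Toy set-associative cache with LRU replacement (assumed policy)."""

    def __init__(self) -> None:
        self.sets = [OrderedDict() for _ in range(SETS)]

    def fetch(self, addr: int) -> bool:
        """Return True on a hit, False on a miss (the line is then filled)."""
        line = addr // LINE
        ways = self.sets[line % SETS]
        if line in ways:
            ways.move_to_end(line)       # mark as most recently used
            return True
        if len(ways) >= WAYS:
            ways.popitem(last=False)     # evict the least recently used line
        ways[line] = True
        return False


def steady_state_misses(bases, footprint, passes=4):
    """Interleave fetches from each code base; count misses in the last pass."""
    cache = ICache()
    misses = 0
    for p in range(passes):
        for offset in range(0, footprint, LINE):
            for base in bases:
                hit = cache.fetch(base + offset)
                if p == passes - 1 and not hit:
                    misses += 1
    return misses


FOOTPRINT = 48 * 1024            # hypothetical hot-code size per core
alone = steady_state_misses([0x000000], FOOTPRINT)
shared = steady_state_misses([0x000000, 0x100000], FOOTPRINT)

print(alone)    # 0
print(shared)   # 1024
```

With one core active, the 48KB loop fits and runs without misses once warm. With a second core running different code that maps onto the same sets, two thirds of all fetches in this example miss, because each of the affected sets must hold four lines in two ways.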
Benchmarks quickly demonstrated that, despite Bulldozer's high core count, high clock speeds (an overclocking marketing stunt reached 8GHz on LN2) and correspondingly high power consumption, its overall performance was generally no better than K10's, and often worse. Some aspects of this improved slightly with better OS support, but there were fundamental problems that no mere software tweaks could overcome. Against its immediate competitor, Sandy Bridge, Bulldozer was at a clear disadvantage, especially in games, which at the time rarely used more than two or three threads effectively and thrived on low memory latency.
All Bulldozer Chips
|Model||Family||Core||Launched||Power Dissipation||Freq||Max Mem|
|first launched||October 12, 2011|
|full page name||amd/microarchitectures/bulldozer
|instance of||microarchitecture
|instruction set architecture||x86-64
|microarchitecture type||CPU
|process||32 nm (0.032 μm, 3.2e-5 mm)