Advanced Micro Devices, apparently timing the announcement to coincide with a quiet period for rival Intel, on Sept. 10 unveiled a single-chip quad-core processor. The "native" device, code-named Barcelona, is AMD's first quad-core microprocessor and was architected to migrate the K8 core architecture into a product that would compete with, and outperform, Intel's new Core architecture, used in the Core 2 Duo processor line.
The next-generation Opteron processor integrates four enhanced-performance x86 cores, each with 512-Kbyte L2 cache and an enhanced 128-bit floating-point unit. The cores are integrated with a shared 2-Mbyte L3 cache and an improved on-chip memory controller that supports up to four 16-bit HyperTransport links and a dual-channel 128-bit DDR2/DDR3 interface.
The design contains more than 460 million transistors, about 120 million fewer than Intel's quad-core, code-named Clovertown, which itself comprises two dual-core chips in a single package. The AMD chip is fabricated in a 65-nanometer silicon-on-insulator (SOI) CMOS process with dual stress liners and embedded SiGe for pMOS source/drains. The design uses 11 layers of copper interconnect and advanced low-k dielectrics that tie the four cores together. The dual stress nitride liners with embedded SiGe source/drain regions increase both n- and p-channel mobility, resulting in higher current drive. As before, AMD's implementation of its 65-nm technology on an SOI substrate can increase latch-up resistance and reduce short-channel effects relative to an analogous bulk silicon implementation.
Transistor performance
Interestingly, the transistor performance of Intel's Woodcrest and AMD's Barcelona appears to match fairly closely, with the Barcelona's gate leakage about half that of the Woodcrest. This is not so surprising, since Intel uses a gate dielectric that is 25 percent thinner. AMD's device shows consistently lower gate dielectric leakage than Intel's, especially on the pFETs. The current drive for both devices is comparable, with the Barcelona coming out marginally higher for the pFET but lower for the nFET devices measured. However, the off-state leakage current (I_off) for the nFETs was two to five times lower in the Woodcrest, suggesting the need for a bit more optimization of AMD's transistor. Because AMD and Intel have always considered the total package, system-level performance for a particular application generally has the final word. That information, however, is not yet available for the Barcelona.
Some of the changes AMD has made are intended to:
1. Increase bandwidth in operation execution--for example, the decode and instruction fetch--thereby increasing loads per cycle from the cache. This should improve AMD's video-encoding performance.
2. Improve performance by adding an indirect branch predictor, which reduces mispredicted branches and increases processor efficiency. This improvement follows Intel's implementation in the Prescott processor.
3. Offload certain frequent operations to dedicated hardware, using a sideband stack optimizer. This approach, similar in function to Intel's dedicated stack manager, removes some of the load from the processor's decoders and helps reduce pipeline clogging.
4. Add the capability to reorder load instructions and enable memory access optimization; this serves to increase instruction load speed--again, similar to the capability implemented by Intel in its Core 2 processor architecture.
5. Reduce the frequency of switching between read and write memory-control operations by using a "write-bursting" operation (with standard DDR2 memory, one or the other can be done, but not simultaneously; switching from one to the other introduces delays). In Intel's case, the fully buffered dual in-line memory module (FB-DIMM) architecture allows these operations to be performed simultaneously while also increasing reliability.
6. Improve overall chip performance by adding a new DRAM prefetcher (prefetchers have already been used extensively in different areas and components of the microprocessor). This prefetcher is located within the memory controller, where none had existed before. It monitors memory requests for trends, then identifies and pulls in data that appears likely to be used in the future. That data is stored in a separate buffer, which, incidentally, is identical to the write-bursting buffer that the memory controller uses to improve performance and efficiency.
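The write-bursting idea in item 5 can be illustrated with a toy model. The sketch below is not AMD's implementation: the function names, the 'R'/'W' request encoding and the burst threshold of four are all assumptions made for illustration, and it ignores the address hazards (a read of data still sitting in the write buffer) that a real controller must handle.

```python
from collections import deque


def count_turnarounds(ops):
    """Count read<->write direction switches in a sequence of 'R'/'W' ops."""
    return sum(1 for prev, cur in zip(ops, ops[1:]) if prev != cur)


def write_burst_schedule(requests, burst_threshold=4):
    """Toy write-bursting scheduler (hypothetical, for illustration only).

    Reads are issued immediately. Writes are held in a buffer and issued
    back to back once `burst_threshold` of them are pending, so the bus
    changes direction less often. Address hazards are not modeled.
    """
    issued = []
    pending_writes = deque()
    for op in requests:
        if op == "W":
            pending_writes.append(op)
            if len(pending_writes) >= burst_threshold:
                issued.extend(pending_writes)  # flush the whole burst
                pending_writes.clear()
        else:
            issued.append(op)
    issued.extend(pending_writes)  # drain any leftover writes at the end
    return issued
```

For a strictly alternating stream of eight reads and eight writes, `count_turnarounds` reports 15 direction switches for the original order and 3 after `write_burst_schedule` with the default threshold, which is the kind of saving the technique aims for.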
Power efficiency
Each core contains its own PLL, clock distribution system and power grid, with independent power/performance management capability (the core voltage and individual core frequencies operate independently of the Northbridge). This enables the cores to enter power-efficient states while the processor interface operates at full speed to service DDR2/DDR3 memory and HyperTransport traffic.
AMD has incorporated temperature controls for each of the four cores by implementing eight remote temperature sensors distributed across the core, and an additional six remote sensors in the Northbridge block. The controller tracks temperatures against predetermined limits and selects power-saving mode options to reduce die temperature.
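The limit-tracking behavior described above can be sketched in a few lines. This is a minimal illustration, not AMD's controller logic: the temperature limits, the mode names and the policy of acting on the hottest sensor are all assumptions, since the article gives no actual values.

```python
# Hypothetical limits in degrees C, hottest first, paired with made-up mode names.
LIMITS = [
    (95.0, "throttle_hard"),
    (85.0, "throttle_light"),
]


def select_mode(sensor_temps_c):
    """Pick a power mode for one block from its sensor readings.

    Assumed policy: act on the hottest sensor and choose the most
    aggressive mode whose limit it has reached.
    """
    hottest = max(sensor_temps_c)
    for limit, mode in LIMITS:
        if hottest >= limit:
            return mode
    return "full_speed"


def manage_blocks(readings):
    """Map each block name (a core or the Northbridge) to a selected mode."""
    return {block: select_mode(temps) for block, temps in readings.items()}
```

Because each block is evaluated separately, one hot core can be throttled while the others stay at full speed, in keeping with the per-core power management described earlier.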
The cache is implemented with a standard 6T memory cell. AMD has provided custom tuning of the write pulse time after device fabrication by enabling programming with electrical fuses. This helps to provide scalability across a wide range of cache sizes.
So what does all this mean? At the transistor level, performance is fairly well matched, with this exception: Intel and AMD appear to have optimized their devices differently, resulting in Intel having lower I_off leakage current, and AMD having lower gate dielectric leakage. How this relates to overall system performance will be seen in time.
The race continues
Once Intel starts shipping Penryn, the advanced technology expected in that architecture will be difficult or impossible for AMD to match until its own 45-nm technology is introduced in turn. Intel is not only racing the clock with AMD for the microprocessor performance crown, but also with Matsushita Electric Industrial Co. Ltd. for technology leadership. In this less-visible race, Matsushita may beat Intel to 45-nm commercialization, albeit without a high-k gate offering. AMD has chosen not to participate in this contest, pursuing the same objective in its own fashion and on its own timetable.
See related chart