Contemplations about Effective Parallel Computing Systems
In the past, 3D accelerator cards, now known as Graphics Processing Units (GPUs), were found mostly in high-end workstations. Back then, video games and similar applications ran on desktop CPUs, with SIMD extensions like MMX, 3DNow!, and SSE boosting their performance. As GPUs became ubiquitous, they took over most SIMD-style vector workloads. Yet GPUs have never been comfortable as general-purpose parallel computers, much to the frustration of programmers like Raph Levien, and that shortcoming is what prompted him to write up his concerns.
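To make "SIMD vector workloads" concrete: the CPU extensions named above all operate on small fixed-width vectors, applying one instruction to several data elements at once. A minimal C++ sketch using SSE intrinsics follows; the function and parameter names are illustrative, not anything from Levien's post.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Add two float arrays four lanes at a time (n assumed a multiple of 4).
// A hypothetical kernel illustrating the SIMD work MMX/SSE-era CPUs did.
void add_arrays_sse(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             // load 4 floats from a
        __m128 vb = _mm_loadu_ps(b + i);             // load 4 floats from b
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // 4 adds in one instruction
    }
}
```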
Over the years, the relationship between CPUs and GPUs has grown closer, with PCIe a major leap over AGP and plain PCI. Yet GPUs remain poor at arbitrary computing tasks, and even a PCIe link is painfully slow compared with communication within a CPU or GPU die. With the advent of asynchronous graphics APIs, this divide became even more pronounced. Levien's solution is to invert this relationship.
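That bottleneck is visible in everyday GPU code: before a kernel can touch a buffer, the data has to be staged across the PCIe link and copied back afterwards. Here is a minimal CUDA sketch of the round trip; the sizes are illustrative, and the bandwidth figure in the comment is an approximate, commonly cited practical ceiling rather than a measurement.

```cpp
#include <cuda_runtime.h>
#include <vector>

int main() {
    const size_t n = 1 << 24;           // ~16M floats, ~64 MB
    std::vector<float> host(n, 1.0f);

    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));

    // Each of these copies crosses the PCIe bus. On a x16 PCIe 4.0 link the
    // practical ceiling is roughly 25-30 GB/s, orders of magnitude below
    // the bandwidth available within a CPU or GPU die.
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // ... launch kernels that work on dev ...
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return 0;
}
```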
There's historical precedent for this inversion in Intel's Larrabee and IBM's Cell processor, both of which merged CPU and GPU characteristics on a single die. Writing software for such novel architectures proved to be the hard part: the difficulty of programming Cell for graphics is why the PlayStation 3 ended up shipping with a discrete GPU anyway. More recently, DirectX's DirectStorage API blurs the line from the other side, shifting asset loading and decompression onto the GPU so that data streams from storage with minimal CPU involvement.
As Levien points out, contemporary AI accelerators have similar characteristics, pairing many CPU-like cores with wide SIMD units. Maybe the future will follow in Cell's footsteps after all.
Larrabee and Cell were a novel attempt to combine the strengths of CPUs and GPUs in a single design. Though neither achieved widespread adoption, both left a clear mark on the architectures that followed.
Today, single-die CPU-GPU fusion of the Larrabee and Cell variety is not a mainstream trend, but advances in interconnects, software, and system architecture keep redefining what is achievable by combining the strengths of both.
Recently, CPU-based AI acceleration has made considerable progress. Thanks to wider vector units, matrix extensions, and better software, modern CPUs deliver respectable inference throughput, which suggests that general-purpose processors can handle many AI tasks without a separate GPU for every application.
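The workhorse of such inference is usually a dense matrix-vector or matrix-matrix product, which a CPU runtime parallelizes across cores and vector lanes. A minimal C++ sketch of a threaded matrix-vector product follows; the shapes and names are illustrative, and a real runtime would use blocked, cache-aware kernels rather than this naive loop.

```cpp
#include <thread>
#include <vector>

// y = W * x for a row-major m x n matrix W.
// Rows are split across hardware threads; the inner loop is simple enough
// for the compiler to auto-vectorize with SSE/AVX.
void matvec(const std::vector<float>& W, const std::vector<float>& x,
            std::vector<float>& y, size_t m, size_t n) {
    size_t nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 1;  // hardware_concurrency may report 0
    size_t rows_per = (m + nthreads - 1) / nthreads;

    std::vector<std::thread> pool;
    for (size_t t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            size_t begin = t * rows_per;
            size_t end = begin + rows_per < m ? begin + rows_per : m;
            for (size_t i = begin; i < end; ++i) {
                float acc = 0.0f;
                for (size_t j = 0; j < n; ++j)
                    acc += W[i * n + j] * x[j];
                y[i] = acc;
            }
        });
    }
    for (auto& th : pool) th.join();
}
```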
On the other hand, GPUs remain the preferred choice for AI workloads because they handle the massive parallelism that data-intensive AI tasks demand. Leading AI GPUs such as the NVIDIA RTX 4090 and A100 continue to dominate this area.
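What "massive parallelism" means in practice: where the CPU sketch above used a handful of heavyweight threads, a GPU kernel launches one lightweight thread per element, tens of thousands at a time. A minimal CUDA vector-add for contrast (names illustrative):

```cpp
#include <cuda_runtime.h>

// One GPU thread per element; the hardware schedules them in warps
// across thousands of cores.
__global__ void vec_add(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// Launch enough 256-thread blocks to cover all n elements.
void launch_vec_add(const float* a, const float* b, float* out, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(a, b, out, n);
}
```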
Intel has entered the discrete GPU market with parts aimed at AI and workstation applications, while NVIDIA has unveiled its RTX PRO Blackwell GPUs for AI, professional graphics, and high-performance computing.
NVIDIA's NVLink interconnect is being integrated into third-party CPUs by companies like Fujitsu and Qualcomm, giving CPU and GPU far more bandwidth than a PCIe link provides and reducing the data-transfer bottleneck between them.
NVIDIA's Grace Blackwell systems are another example of tight integration and scalability, pairing Grace CPUs and Blackwell GPUs over fast interconnects with a large shared pool of memory. The architecture is aimed at extreme AI performance and shows how far chip and system integration has come.
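As a rough feel for the programming model this kind of integration enables, CUDA's managed memory already lets CPU and GPU share a single allocation with no explicit copies; on a discrete GPU the driver migrates pages over PCIe, while on tightly coupled Grace-based systems coherence is handled in hardware. A minimal sketch (names illustrative):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    // One allocation visible to both CPU and GPU; no explicit cudaMemcpy.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // CPU writes
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // GPU reads and writes
    cudaDeviceSynchronize();
    printf("data[0] = %f\n", data[0]);               // CPU reads the result

    cudaFree(data);
    return 0;
}
```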
With the emergence of asynchronous graphics APIs and AI accelerators, the line between CPUs and GPUs keeps blurring, hinting at a future, perhaps the one Levien envisions, where general-purpose processors handle parallel workloads without a separate GPU for every application.