
WASP: Exploiting GPU Pipeline Parallelism with Hardware-Accelerated Automatic Warp Specialization


Abstract:

Graphics processing units (GPUs) are an important class of parallel processors that offer high compute throughput and memory bandwidth. GPUs are used in a variety of important computing domains, such as machine learning, high performance computing, sparse linear algebra, autonomous vehicles, and robotics. However, some applications from these domains can underperform due to sensitivity to memory latency and bandwidth. Some of this sensitivity can be reduced by better overlapping memory access with compute. Current GPUs often leverage pipeline parallelism in the form of warp specialization to enable better overlap. However, current warp specialization support on GPUs is limited in three ways. First, warp specialization is a complex and manual program transformation that is out of reach for many applications and developers. Second, it is limited to coarse-grained transfers between global memory and the shared memory scratchpad (SMEM); fine-grained memory access patterns are not well supported. Finally, the GPU hardware is unaware of the pipeline parallelism expressed by the programmer, and is unable to take advantage of this information to make better decisions at runtime. In this paper, we introduce WASP, hardware and compiler support for warp specialization that addresses these limitations. WASP enables fine-grained streaming and gather memory access patterns through the use of warp-level register file queues and hardware-accelerated address generation. Explicit warp-to-pipeline-stage naming makes the GPU aware of pipeline parallelism, which WASP capitalizes on with pipeline-aware warp mapping, register allocation, and scheduling. Finally, we design and implement a compiler that can automatically generate warp-specialized kernels, reducing programmer burden. Overall, we find that runtime performance can be improved on a variety of important applications by an average of 47% over a modern GPU baseline.
Date of Conference: 02-06 March 2024
Date Added to IEEE Xplore: 02 April 2024
Conference Location: Edinburgh, United Kingdom

I. Introduction

GPUs are the dominant parallel programming substrate, with widespread use in deep learning, high performance computing, sparse linear algebra, autonomous vehicles, and robotics [26], [35], [41]. GPU programmers spend considerable time optimizing their code to best exploit available GPU resources. However, some GPU applications are unable to consistently attain high compute throughput or memory bandwidth despite the presence of abundant parallelism [16], [17], [36], [39]. One reason these applications cannot reach peak performance is sensitivity to memory latency and bandwidth. This sensitivity generally comes from the kernel's inability to overlap memory accesses with other useful work, leaving GPU resources underutilized. One way to provide better overlap and reduce memory sensitivity is to refactor the application to exploit pipeline parallelism within a kernel.
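To make the pipeline-parallel refactoring concrete, the sketch below shows what manual warp specialization looks like on a current GPU, i.e., the style of transformation the paper argues is complex and manual. This is an illustrative example, not WASP itself: the kernel name, tile size, and warp counts are assumptions. One producer warp stages tiles of input into a double-buffered shared memory scratchpad while the remaining consumer warps compute on the previously staged tile, overlapping global memory access with compute.

```cuda
// Hedged sketch of manual warp specialization (producer/consumer pipeline).
// Assumes blockDim.x == WARPS_PER_BLOCK * 32 and n is a multiple of TILE.
#define TILE 256
#define WARPS_PER_BLOCK 4

__global__ void ws_scale(const float* __restrict__ in, float* out,
                         int n, float k) {
    __shared__ float buf[2][TILE];   // double buffer: one tile staged, one in use
    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    int ntiles = n / TILE;

    // ntiles + 1 pipeline steps: the extra step drains the last staged tile.
    for (int t = 0; t <= ntiles; ++t) {
        if (warp == 0 && t < ntiles) {
            // Producer warp: copy the next tile from global memory into
            // the staging half of the double buffer.
            for (int i = lane; i < TILE; i += 32)
                buf[t & 1][i] = in[t * TILE + i];
        }
        __syncthreads();             // hand the staged tile to the consumers
        if (warp > 0 && t > 0) {
            // Consumer warps: compute on the tile staged in the previous step.
            int prev = t - 1;
            for (int i = (warp - 1) * 32 + lane; i < TILE;
                 i += (WARPS_PER_BLOCK - 1) * 32)
                out[prev * TILE + i] = k * buf[prev & 1][i];
        }
        __syncthreads();             // keep producer from overwriting a live tile
    }
}
```

Note the programmer burden this pattern imposes: explicit double buffering, two barriers per pipeline step, and hand-partitioned work per warp, all of which are invisible to the hardware as pipeline structure. This is the gap WASP's hardware-accelerated, compiler-generated warp specialization targets.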

