I. Introduction
Rapidly-increasing data size in various domains [49], [115] has created huge challenges for the memory system of GPUs. Although High-Bandwidth Memory (HBM) has been adopted to meet the high memory bandwidth (BW) requirements of GPU s, it fails to fulfill the memory capacity needs of critical workloads, such as deep learning and large-scale graph analytics. Moreover, the memory capacity of GPUs has grown much slower than the compute throughput (Fig. 1a).