Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory


Abstract:

We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and a DRAM cache. By significantly increasing the memory capacity with SCM, the GPU can capture a larger fraction of the memory footprint than HBM for workloads that mandate memory oversubscription, resulting in substantial speedups. However, the DRAM cache needs to be carefully designed to address the latency and bandwidth limitations of the SCM while minimizing cost overhead and considering the GPU's characteristics. Because the massive number of GPU threads can easily thrash the DRAM cache and degrade performance, we first propose an SCM-aware DRAM cache bypass policy for GPUs that considers the multi-dimensional characteristics of memory accesses by GPUs with SCM to bypass DRAM for data with low performance utility. In addition, to reduce DRAM cache probe traffic and increase effective DRAM bandwidth with minimal cost overhead, we propose a Configurable Tag Cache (CTC) that repurposes part of the L2 cache to cache DRAM cacheline tags. The L2 capacity used for the CTC can be adjusted by users for adaptability. Furthermore, to minimize DRAM cache probe traffic from CTC misses, our Aggregated Metadata-In-Last-column (AMIL) DRAM cache organization co-locates all DRAM cacheline tags in a single column within a row. AMIL also retains full ECC protection, unlike prior DRAM cache implementations with a Tag-And-Data (TAD) organization. Additionally, we propose SCM throttling to curtail power consumption and exploit SCM's SLC/MLC modes to adapt to the workload's memory footprint. While our techniques can be used for different DRAM and SCM devices, we focus on a Heterogeneous Memory Stack (HMS) organization that stacks SCM dies on top of DRAM dies for high performance. Compared to HBM, the HMS improves performance by up to 12.5× (2.9× overall) and reduces energy by up to 89.3% (48.1% overall). Compared to prior works, we reduce DRAM cache probe and SCM write traffic by 91–9...
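To make the probe-filtering and tag-aggregation ideas concrete, below is a minimal sketch of how an L2 miss could be serviced with a Configurable Tag Cache backed by an AMIL-style tag column. All names (ConfigurableTagCache, RowTags, read_amil_column, dram_cache_hit) and parameters (32 cachelines per row, 1024-row CTC capacity, naive eviction) are illustrative assumptions for exposition, not the paper's actual implementation.

// Sketch of the DRAM cache probe path suggested by the abstract; all names
// and sizes below are assumptions, not the paper's implementation.
#include <array>
#include <cstdint>
#include <iostream>
#include <unordered_map>

constexpr int kLinesPerRow = 32;              // assumed DRAM cachelines per row

// AMIL idea: the tags of every cacheline in a DRAM row sit together in one
// column, so a single column access returns the whole row's tag metadata.
struct RowTags {
    std::array<uint64_t, kLinesPerRow> tag{};
    std::array<bool, kLinesPerRow>     valid{};
};

// CTC idea: a user-configurable slice of the L2 holds recently used row tags,
// filtering DRAM cache probes that would otherwise consume DRAM bandwidth.
class ConfigurableTagCache {
public:
    explicit ConfigurableTagCache(std::size_t capacity_rows) : cap_(capacity_rows) {}

    bool lookup(uint64_t row, RowTags& out) const {
        auto it = rows_.find(row);
        if (it == rows_.end()) return false;  // CTC miss: must fetch the AMIL column
        out = it->second;                     // CTC hit: no extra DRAM probe
        return true;
    }

    void fill(uint64_t row, const RowTags& tags) {
        if (rows_.size() >= cap_) rows_.erase(rows_.begin());  // naive eviction
        rows_[row] = tags;
    }

private:
    std::size_t cap_;
    std::unordered_map<uint64_t, RowTags> rows_;
};

// Stand-in for reading the aggregated tag column of a DRAM row.
RowTags read_amil_column(uint64_t /*row*/) { return RowTags{}; }

// On an L2 miss, resolve the DRAM cache tag check from the CTC when possible;
// on a CTC miss, one column read brings the tags of the entire row.
bool dram_cache_hit(ConfigurableTagCache& ctc, uint64_t row, uint64_t line_tag,
                    int slot) {
    RowTags tags;
    if (!ctc.lookup(row, tags)) {
        tags = read_amil_column(row);
        ctc.fill(row, tags);
    }
    return tags.valid[slot] && tags.tag[slot] == line_tag;
}

int main() {
    ConfigurableTagCache ctc(1024);           // CTC capacity is user-adjustable
    std::cout << dram_cache_hit(ctc, /*row=*/7, /*line_tag=*/0x42, /*slot=*/3)
              << '\n';                        // prints 0: the DRAM cache is empty
}

The key property illustrated is that a CTC hit resolves the DRAM cache tag check without any DRAM access, while a CTC miss costs only a single column read that fills the tags of the whole row.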
Date of Conference: 02-06 March 2024
Date Added to IEEE Xplore: 02 April 2024
Conference Location: Edinburgh, United Kingdom

I. Introduction

Rapidly increasing data sizes in various domains [49], [115] have created huge challenges for the memory system of GPUs. Although High-Bandwidth Memory (HBM) has been adopted to meet the high memory bandwidth (BW) requirements of GPUs, it fails to fulfill the memory capacity needs of critical workloads, such as deep learning and large-scale graph analytics. Moreover, the memory capacity of GPUs has grown much more slowly than their compute throughput (Fig. 1a).

References
1.
“Graph500 benchmark specification.” [Online]. Available: https://graph500.org/?page_id=12
2.
“Nvidia tensor cores.” [Online]. Available: https://www.nvidia.com/en-us/data-center/tensor-cores
3.
“High bandwidth memory (hbm) dram,” JEDEC Standard, 2013.
4.
“Nvidia tesla p100,” NVIDIA whitepaper, 2016.
5.
“Nvidia tesla v100 gpu architecture,” NVIDIA whitepaper, 2017.
6.
“Nvidia nvswitch: The world's highest-bandwidth on-node switch,” NVIDIA whitepaper, 2018.
7.
“Introducing amd cdna architecture,” AMD whitepaper, 2020.
8.
“Nvidia a100 tensor core gpu architecture,” NVIDIA whitepaper, 2020.
9.
“Nvidia grace hopper superchip architecture,” 2020. [Online]. Available: https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper
10.
“Compute express link specification 3.0,” CXL Consortium, 2022.
11.
“Nvidia h100 tensor core gpu architecture,” NVIDIA whitepaper, 2022.
12.
“High bandwidth memory (hbm2e) interface intel agilex® 7 m-series fpga ip user guide,” July 2023. [Online]. Available: https://cdrdv2-public.intel.com/781867/ug-773264-781867.pdf
13.
N. Agarwal and T. F. Wenisch, “Thermostat: Application-transparent page management for two-tiered main memory,” in Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
14.
T. Allen and R. Ge, “Demystifying gpu uvm cost with deep runtime and workload analysis,” in Proceedings of the 33rd International Parallel and Distributed Processing Symposium (IPDPS), 2021.
15.
T. Allen and R. Ge, “In-depth analyses of unified virtual memory system for gpu accelerated computing,” in Proceedings of the 34th International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021.
16.
AMD, “Amd instinct™ mi100 accelerator.” [Online]. Available: https://www.amd.com/en/products/server-accelerators/instinct-mi100
17.
AMD, “Amd instinct™ mi250x accelerator.” [Online]. Available: https://www.amd.com/en/products/server-accelerators/instinct-mi250x
18.
AMD, “Amd instinct™ mi300x accelerator.” [Online]. Available: https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html
19.
AMD, “Amd radeon instinct™ mi25 accelerator.” [Online]. Available: https://www.amd.com/ko/products/professional-graphics/instinct-mi25
20.
V.-G. Anghel, “Exploring current immersion cooling deployments,” March 2023. [Online]. Available: https://www.datacenterdynamics.com/en/analysis/exploring-current-immersion-cooling-deployments/
21.
A. Azad, M. M. Aznaveh, S. Beamer, M. Blanco, J. Chen, L. D’Alessandro, R. Dathathri, T. Davis, K. Deweese, J. Firoz, H. A. Gabb, G. Gill, B. Hegyi, S. Kolodziej, T. M. Low, A. Lumsdaine, T. Manlaibaatar, T. G. Mattson, S. McMillan, R. Peri, K. Pingali, U. Sridhar, G. Szarnyas, Y. Zhang, and Y. Zhang, “Evaluation of graph analytics frameworks using the gap benchmark suite,” in 2020 IEEE International Symposium on Workload Characterization (IISWC), 2020, pp. 216–227.
22.
A. Bakhoda, J. Kim, and T. M. Aamodt, “Throughput-effective on-chip networks for manycore accelerators,” in 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010.
23.
A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing cuda workloads using a detailed gpu simulator,” in Proceedings of the 2nd International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
24.
R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, “Cacti 7: New tools for interconnect exploration in innovative off-chip memories,” in Proceedings of the 14th Transactions on Architecture and Code Optimization (TACO), 2017.
25.
P. Behnam and M. N. Bojnordi, “Redcache: Reduced dram caching,” in Proceedings of the 57th Design Automation Conference (DAC), 2020.
26.
G. Boeing, “Osmnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks,” Computers, Environment and Urban Systems, vol. 65, pp. 126–139, 2017.
27.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS), 2020.
28.
N. Chatterjee, M. O'Connor, D. Lee, D. R. Johnson, S. W. Keckler, M. Rhu, and W. J. Dally, “Architecting an energy-efficient dram system for gpus,” in Proceedings of the 23rd International Symposium on High Performance Computer Architecture (HPCA), 2017.
29.
S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, “Pannotia: Understanding irregular gpgpu graph applications,” in Proceedings of the 16th International Symposium on Workload Characterization (IISWC), 2013.
30.
S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in Proceedings of the 12th International Symposium on Workload Characterization (IISWC), 2009.
