
INTERPRET: Inter-Warp Register Reuse for GPU Tensor Core


Abstract:

Tensor cores in recent NVIDIA GPUs are under the spotlight due to their superior computation throughput for general matrix-matrix multiplication (GEMM), which is widely used in deep learning applications. For massive-scale GEMMs, the entire matrix is practically divided into sub-matrices that are assigned to multiple thread blocks and warps and then processed by the tensor cores. Meanwhile, the same sub-matrix is regularly reused as an input to different sub-GEMMs, which causes redundant load operations from different warps and wastes register file space. To tackle this issue, we propose INTERPRET, a novel tensor core microarchitecture designed to minimize unnecessary accesses to the cache/memory hierarchy by leveraging the inter-warp data reuse characteristics. INTERPRET adopts a register renaming scheme to reduce redundant load requests as well as the waste of register file space, thereby reducing the effective data load latency. INTERPRET further improves performance via non-speculative tensor preloading that leverages the register file space saved by register renaming. As INTERPRET builds on the highly regular data access patterns of tensor core operations, the synergistic integration of register renaming and tensor preloading can significantly improve processing efficiency. Our experiments show that the proposed design achieves an average speedup of 34.1% and reduces energy consumption by 27.9%.
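To illustrate the inter-warp reuse pattern described above, the following minimal CUDA sketch (not taken from the paper; the kernel name, tile sizes, and launch geometry are illustrative assumptions) tiles a GEMM with the WMMA API so that each warp computes one 16x16 output tile. Warps that share the same row of output tiles load exactly the same 16x16 tile of A at every step of the k-loop, issuing redundant load requests and holding duplicate copies in their register files.

    #include <mma.h>
    using namespace nvcuda;

    // Sketch of a tiled GEMM (C = A * B) on tensor cores via the WMMA API.
    // A is M x K row-major (half), B is K x N column-major (half), C is
    // M x N row-major (float); M, N, K are assumed multiples of 16, and the
    // launch is assumed to supply one warp per 16x16 output tile.
    __global__ void wmma_gemm_sketch(const half *A, const half *B, float *C,
                                     int M, int N, int K) {
      // Each warp owns one 16x16 tile of C, indexed by (warpM, warpN).
      int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
      int warpN = blockIdx.y * blockDim.y + threadIdx.y;

      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;
      wmma::fill_fragment(accFrag, 0.0f);

      for (int k = 0; k < K; k += 16) {
        // Every warp with the same warpM (same row of output tiles) loads
        // this identical 16x16 tile of A: redundant requests to the
        // cache/memory hierarchy and duplicate copies in the register file,
        // i.e., the inter-warp reuse that INTERPRET targets.
        wmma::load_matrix_sync(aFrag, A + warpM * 16 * K + k, K);
        // Likewise, warps sharing warpN reload the same tile of B.
        wmma::load_matrix_sync(bFrag, B + warpN * 16 * K + k, K);
        wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);
      }
      wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, accFrag, N,
                              wmma::mem_row_major);
    }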
Date of Conference: 21-25 October 2023
Date Added to IEEE Xplore: 27 December 2023
Conference Location: Vienna, Austria


I. Introduction

The enthusiasm for researching deep neural networks (DNNs), particularly driven by convolutional neural networks (CNNs), is now expanding into various application domains, including sequence-to-sequence (seq2seq) models, recurrent neural networks (RNNs), and graph neural networks (GNNs) [1]–[3]. Accordingly, DNN applications have become indispensable tools in various fields [4]–[15]. These currently prevalent neural networks share a common feature. Convolution layers, the primary building blocks of CNNs, generally go through a process called lowering (im2col) [16]. This strategy improves thread-level parallelism by replacing the deeply nested loops of convolution with a simple general matrix-matrix multiplication (GEMM). Transformers, which became widely known through BERT [13] and GPT [17], include an attention mechanism consisting of multiple GEMMs to obtain the key, query, and value matrices and, eventually, to compute the attention distribution. Likewise, GEMM is the principal operation of RNN-type networks for obtaining hidden states and several state vectors; this is a general matrix-vector multiplication (GEMV) by default but is computed as a GEMM when batched. The same is true for the fully connected layers and embedding operations that many types of networks adopt to extract inference results or specific information.
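As a concrete example of the lowering step mentioned above, the hypothetical host-side routine below (the function name and the single-channel, unit-stride, no-padding configuration are simplifying assumptions, not the paper's code) unfolds each receptive field of the input into one row of a lowered matrix, so the convolution reduces to a single GEMM between the (outH*outW) x (R*S) lowered matrix and the (R*S) x numFilters flattened filters.

    // Hypothetical im2col lowering for a single-channel H x W input and an
    // R x S filter with unit stride and no padding. Each output pixel
    // becomes one row of the lowered matrix.
    void im2col(const float *input, float *lowered, int H, int W, int R, int S) {
      int outH = H - R + 1;
      int outW = W - S + 1;
      for (int oh = 0; oh < outH; ++oh)
        for (int ow = 0; ow < outW; ++ow)
          for (int r = 0; r < R; ++r)
            for (int s = 0; s < S; ++s)
              // Row (oh*outW + ow) holds the receptive field at (oh, ow).
              lowered[(oh * outW + ow) * (R * S) + r * S + s] =
                  input[(oh + r) * W + (ow + s)];
    }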
