Abstract:
Tensor cores in recent NVIDIA GPUs have drawn attention for their superior computation throughput on general matrix-matrix multiplication (GEMM), which is widely used in deep learning applications. For large-scale GEMMs, the matrices are in practice divided into sub-matrices that are assigned to multiple thread blocks and warps and then processed by the tensor cores. Meanwhile, the same sub-matrix is regularly reused as an input to different sub-GEMMs, causing redundant load operations from different warps and wasting register file space. To tackle this issue, we propose INTERPRET, a novel tensor core microarchitecture designed to minimize unnecessary accesses to the cache/memory hierarchy by leveraging inter-warp data reuse. INTERPRET adopts a register renaming scheme that eliminates redundant load requests and reduces register file waste, thereby lowering the effective data load latency. INTERPRET further improves performance via non-speculative tensor preloading, which exploits the register file space freed by register renaming. Because INTERPRET builds on the highly regular data access patterns of tensor core operations, the synergistic integration of register renaming and tensor preloading significantly improves processing efficiency. Our experiments show that the proposed design achieves an average speedup of 34.1% and reduces energy consumption by 27.9%.
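To illustrate the inter-warp reuse the abstract describes, the following is a minimal sketch of a conventional tiled WMMA GEMM kernel, not the paper's implementation: each warp computes one output tile, so every warp sharing a tile row reloads the same A tile and every warp sharing a tile column reloads the same B tile into its own fragment registers. Kernel name, grid mapping, and tile sizes are illustrative assumptions.

```cuda
#include <mma.h>
using namespace nvcuda;

// Standard 16x16x16 WMMA shape for half inputs with float accumulation.
constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;

// Hypothetical baseline kernel: one warp per 16x16 tile of C = A * B.
// Warps with the same warpM issue identical loads of the A tile, and
// warps with the same warpN identical loads of the B tile -- the
// redundant inter-warp loads that INTERPRET targets.
__global__ void wmma_gemm(const half *A, const half *B, float *C,
                          int M, int N, int K) {
    // Map warps onto the output tile grid (rows in x, columns in y).
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    for (int k = 0; k < K; k += WMMA_K) {
        int aRow = warpM * WMMA_M;              // A tile depends on warpM only
        int bCol = warpN * WMMA_N;              // B tile depends on warpN only

        if (aRow < M && bCol < N) {
            // Each warp loads the shared tiles into private registers,
            // duplicating data already fetched by neighboring warps.
            wmma::load_matrix_sync(aFrag, A + aRow * K + k, K);    // A: row-major, ld = K
            wmma::load_matrix_sync(bFrag, B + bCol * K + k, K);    // B: col-major, ld = K
            wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
        }
    }

    int cRow = warpM * WMMA_M, cCol = warpN * WMMA_N;
    if (cRow < M && cCol < N)
        wmma::store_matrix_sync(C + cRow * N + cCol, cFrag, N, wmma::mem_row_major);
}
```

In this baseline, the per-warp fragment loads are where INTERPRET's register renaming would let warps alias an already-loaded tile instead of re-fetching it; this sketch shows only the reuse pattern, not the proposed microarchitecture.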
Published in: 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT)
Date of Conference: 21-25 October 2023
Date Added to IEEE Xplore: 27 December 2023