I. Introduction
In recent years, general-purpose graphics processing units (GPGPUs) have become increasingly popular in artificial intelligence workloads such as large language models and generative AI. To maximize GPGPUs' parallelism and simplify their programming, specialized programming models and libraries have been developed, such as CUDA [1] and OpenCL [2]. In these programming models, programmers describe the behavior of a single work item (OpenCL terminology) or thread (CUDA terminology), and the software stack generates instructions that execute on the hardware in parallel.
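As a minimal sketch of this single-thread programming style, the following CUDA kernel adds two vectors: the code describes the work of one thread, which computes a single output element, and the launch configuration (grid and block dimensions, chosen here for illustration) instructs the hardware to run many such threads in parallel.

```cuda
#include <cuda_runtime.h>

// Each thread computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        c[i] = a[i] + b[i];
}

void launchVecAdd(const float *a, const float *b, float *c, int n) {
    int threadsPerBlock = 256;                      // illustrative choice
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
}
```

The same element-wise pattern is expressed in OpenCL with a `__kernel` function and `get_global_id(0)` in place of the CUDA thread-index arithmetic.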