Abstract:
In large language model inference, efficient utilization of GPU memory is of utmost importance. Current systems allocate GPU memory poorly: excessive memory sits idle under low load, while memory shortages occur under high load. Moreover, existing GPU memory management methods do not account for the different memory requirements of the prefill and decode stages of inference. This paper proposes a new solution, DynamicAttention. It reserves a contiguous virtual GPU memory space at startup without actually allocating physical GPU memory, and instead allocates physical memory dynamically according to runtime demand. Separate memory consumption prediction algorithms are designed for the prefill and decode stages. Experimental results show that, compared with existing SOTA solutions, DynamicAttention improves GPU memory utilization by 4x under low load and by 15% under high load, and increases throughput by 1.6x.
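The abstract does not describe the paper's implementation, but the "reserve virtual address space at startup, commit physical memory on demand" pattern it outlines maps naturally onto CUDA's virtual memory management driver API (cuMemAddressReserve, cuMemCreate, cuMemMap, cuMemSetAccess). The following is a minimal sketch of that mechanism under this assumption; the page count and sizes are illustrative, not taken from the paper.

```c
/* Sketch: reserve a large contiguous *virtual* GPU range up front,
 * then commit physical memory one page at a time as demand appears.
 * Uses the CUDA driver VMM API; error handling kept minimal. */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    fprintf(stderr, "CUDA error %d at %s:%d\n", r, __FILE__, __LINE__); \
    exit(1); } } while (0)

int main(void) {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    /* Physical backing must be allocated at the device's granularity. */
    CUmemAllocationProp prop = {0};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = dev;
    size_t gran;
    CHECK(cuMemGetAllocationGranularity(&gran, &prop,
          CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    /* Step 1 (startup): reserve a contiguous virtual range.
     * No physical GPU memory is consumed yet. */
    size_t virt_size = 64 * gran;          /* illustrative: 64 pages */
    CUdeviceptr base;
    CHECK(cuMemAddressReserve(&base, virt_size, 0, 0, 0));

    /* Step 2 (runtime): commit one physical page where it is needed,
     * e.g. when a request's KV cache grows during decode. */
    CUmemGenericAllocationHandle h;
    CHECK(cuMemCreate(&h, gran, &prop, 0));
    CHECK(cuMemMap(base, gran, 0, h, 0));

    CUmemAccessDesc acc = {0};
    acc.location = prop.location;
    acc.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(base, gran, &acc, 1));

    /* The first `gran` bytes of `base` are now usable; the rest of the
     * reservation stays virtual until more pages are mapped. */
    CHECK(cuMemsetD8(base, 0, gran));

    /* Teardown: unmap, release the physical handle, free the range. */
    CHECK(cuMemUnmap(base, gran));
    CHECK(cuMemRelease(h));
    CHECK(cuMemAddressFree(base, virt_size));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```

One reason stage-specific prediction helps with this layout: prefill memory is knowable up front (KV-cache size scales with the prompt length, which is visible at admission), whereas decode memory grows by one token's worth of KV per step, so the two stages call for different commit policies over the same virtual range. How the paper's predictors are actually formulated is not stated in the abstract.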
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025