
DynamicAttention: Dynamic KV Cache for Disaggregate LLM Inference


Abstract:

In large language model inference, efficient use of GPU memory is critical. Current systems allocate GPU memory poorly: large amounts of memory sit idle under low load, while memory runs short under high load. Moreover, existing GPU memory management methods do not account for the different memory requirements of the prefill and decode stages of inference. This paper proposes a new solution, DynamicAttention. It reserves a contiguous virtual GPU memory space at startup without allocating physical GPU memory, and instead allocates physical memory dynamically according to runtime demand. Separate memory consumption prediction algorithms are designed for the prefill and decode stages. Experimental results show that, compared with existing state-of-the-art solutions, DynamicAttention improves GPU memory utilization by 4 times under low load and by 15% under high load, and increases throughput by 1.6 times.
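The abstract outlines the core mechanism (reserve a contiguous virtual address range at startup, then back it with physical GPU memory only as demand grows) without giving implementation details. The sketch below is a minimal illustration of that pattern using CUDA's virtual memory management driver API (cuMemAddressReserve, cuMemCreate, cuMemMap, cuMemSetAccess); it is an assumption about how such a scheme could be realized, not the paper's implementation. The GrowableKVBuffer type, the reservation size, and the growth chunk sizes are hypothetical.

```cpp
// Hypothetical sketch: reserve a large virtual range up front and map physical
// GPU memory on demand with CUDA's virtual memory management (VMM) driver API.
// The API calls are real CUDA driver functions; the buffer type and the sizes
// used in main() are illustrative only.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

#define CHECK(call)                                                  \
    do {                                                             \
        CUresult r_ = (call);                                        \
        if (r_ != CUDA_SUCCESS) {                                    \
            const char *msg_ = nullptr;                              \
            cuGetErrorString(r_, &msg_);                             \
            std::fprintf(stderr, "CUDA error: %s\n", msg_);          \
            std::exit(1);                                            \
        }                                                            \
    } while (0)

struct GrowableKVBuffer {
    CUdeviceptr base = 0;     // start of the reserved virtual range
    size_t reserved = 0;      // virtual bytes reserved at startup
    size_t mapped = 0;        // physical bytes actually mapped so far
    size_t granularity = 0;   // minimum mapping granularity for the device
    int device = 0;
    std::vector<CUmemGenericAllocationHandle> handles;

    void init(size_t virtual_bytes, int dev) {
        device = dev;
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = device;
        CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
                                            CU_MEM_ALLOC_GRANULARITY_MINIMUM));
        reserved = ((virtual_bytes + granularity - 1) / granularity) * granularity;
        // Reserve a contiguous virtual range; no physical memory is consumed yet.
        CHECK(cuMemAddressReserve(&base, reserved, 0, 0, 0));
    }

    // Map additional physical memory to back [mapped, mapped + bytes).
    void grow(size_t bytes) {
        size_t chunk = ((bytes + granularity - 1) / granularity) * granularity;
        if (mapped + chunk > reserved) {
            std::fprintf(stderr, "virtual reservation exhausted\n");
            std::exit(1);
        }
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = device;

        CUmemGenericAllocationHandle h;
        CHECK(cuMemCreate(&h, chunk, &prop, 0));          // allocate physical memory
        CHECK(cuMemMap(base + mapped, chunk, 0, h, 0));   // map it into the range

        CUmemAccessDesc access = {};
        access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        access.location.id = device;
        access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        CHECK(cuMemSetAccess(base + mapped, chunk, &access, 1));

        handles.push_back(h);
        mapped += chunk;
    }
};

int main() {
    CHECK(cuInit(0));
    CUdevice dev;
    CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx;
    CHECK(cuCtxCreate(&ctx, 0, dev));

    GrowableKVBuffer kv;
    kv.init(8ull << 30, 0);   // reserve 8 GiB of virtual space for the KV cache
    kv.grow(64ull << 20);     // back only 64 MiB physically at first
    kv.grow(64ull << 20);     // grow again as tokens accumulate during decode

    std::printf("reserved %zu bytes, mapped %zu bytes\n", kv.reserved, kv.mapped);
    return 0;
}
```

Because the virtual range stays fixed, kernels can address the KV cache through a single contiguous pointer while physical backing is added incrementally (or released again with cuMemUnmap/cuMemRelease) as the per-stage memory predictions change.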
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India

