Abstract:
Large Language Models (LLMs) are developing at a rapid pace, which requires effective inference approaches to keep up with the increasing demand for computational resources. This study investigates the inference capabilities of LLMs, with a specific emphasis on the prefill and decode stages. We evaluate and contrast multiple cutting-edge techniques, such as vLLM, SplitFuse, and Sarathi's chunked-prefill strategy. Every strategy has its own advantages in maximizing GPU utilization and enhancing throughput during inference. We present an innovative method for scheduling tasks that adapts the allocation of resources according to the proportion of prefill and decode requests. This strategy aims to enhance the efficiency of current methods by improving the time per output token (TPOT) and throughput while maintaining competitive time to first token (TTFT) metrics. The experimental evaluation, conducted using the vLLM inference framework and the Llama2-7b-chat model, shows that our approach significantly improves inference performance in comparison to existing strategies. The findings demonstrate that dynamic scheduling can efficiently enhance resource utilization and system responsiveness, providing a more adaptable solution for demanding LLM applications.
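The abstract describes the scheduling idea only at a high level, so the following is a minimal sketch of what a proportion-aware split of a per-batch token budget between prefill and decode work could look like. All names, the `token_budget` parameter, and the proportional allocation rule are illustrative assumptions, not the authors' implementation or the vLLM scheduler API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int          # tokens still awaiting prefill
    generated_tokens: int = 0   # tokens produced so far (decode stage)

    @property
    def in_prefill(self) -> bool:
        return self.prompt_tokens > 0


class ProportionAwareScheduler:
    """Illustrative scheduler: splits each batch's token budget between
    prefill and decode work according to the current mix of waiting requests."""

    def __init__(self, token_budget: int = 4096):
        self.token_budget = token_budget
        self.queue: deque[Request] = deque()

    def add(self, req: Request) -> None:
        self.queue.append(req)

    def next_batch(self) -> list[tuple[Request, int]]:
        prefill = [r for r in self.queue if r.in_prefill]
        decode = [r for r in self.queue if not r.in_prefill]
        total = len(prefill) + len(decode)
        if total == 0:
            return []

        # Allocate the token budget proportionally to the request mix, so a
        # decode-heavy queue keeps TPOT low while prefill-heavy bursts still
        # receive enough budget to bound TTFT.
        prefill_budget = int(self.token_budget * len(prefill) / total)
        decode_budget = self.token_budget - prefill_budget

        batch: list[tuple[Request, int]] = []
        for r in prefill:
            if prefill_budget <= 0:
                break
            chunk = min(r.prompt_tokens, prefill_budget)  # chunked prefill
            batch.append((r, chunk))
            prefill_budget -= chunk
        for r in decode:
            if decode_budget <= 0:
                break
            batch.append((r, 1))  # decode emits one token per step
            decode_budget -= 1
        return batch
```

This sketch only conveys the general idea of adapting resource allocation to the prefill/decode ratio; the paper's actual mechanism is integrated into the vLLM inference framework and evaluated with Llama2-7b-chat as described above.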
Date of Conference: 26-28 October 2024
Date Added to IEEE Xplore: 11 December 2024