LORA: A Latency-Oriented Recurrent Architecture for GPT Model on Multi-FPGA Platform with Communication Optimization


Abstract:

Large Language Models (LLMs) have been widely deployed in data centers to provide various services, the most representative being the Generative Pre-trained Transformer (GPT). The GPT model has heavy memory and computing overhead, and its inference process has two stages with distinct computing characteristics: Prefill and Decode. Constructing a platform from existing GPU and FPGA accelerators to deploy GPT in data centers is challenging: it requires more effective synchronization schemes or structures with higher computational intensity. This paper proposes LORA, a low-latency end-to-end GPT acceleration platform built on multiple FPGAs. First, we optimize the synchronization timing of the GPT model to reduce computation and communication overhead. Second, we devise efficient synchronization steps for specific layers of the GPT model that overlap part of the computation and communication delay, further reducing the platform's latency. Finally, we deploy recurrent structures on each FPGA to accelerate the different stages of the GPT model. Implemented on Xilinx Alveo U280 FPGAs, LORA achieves an average 11.1× speedup over NVIDIA V100 GPUs on the modern GPT-2 model. Compared to an existing multi-FPGA accelerator appliance, LORA shows performance improvements of up to 4× and 2.7× in the Prefill and Decode stages, respectively.
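
The abstract's second contribution, overlapping part of the computation and communication delay, is a standard latency-hiding technique. As a rough illustration of that general idea only (not the paper's actual multi-FPGA pipeline), the hypothetical Python sketch below overlaps a stand-in send of one layer's activations with the computation of the next layer; every name, size, and delay in it is an assumption.

# Illustrative sketch only (not the paper's FPGA implementation): a toy
# demonstration of overlapping one layer's output communication with the
# next layer's computation, the general technique the abstract describes.
# All names, sizes, and the sleep-based send stand-in are hypothetical.
import threading
import time

import numpy as np


def compute_layer(x, w):
    # Stand-in for one transformer layer's work on the local device.
    return np.tanh(x @ w)


def send_to_next_device(activation):
    # Stand-in for inter-FPGA communication; sleep() models link latency.
    time.sleep(0.01)


def pipeline_with_overlap(x, layer_weights):
    sender = None
    for w in layer_weights:
        x = compute_layer(x, w)              # compute layer i
        if sender is not None:
            sender.join()                    # previous send ran during this compute
        sender = threading.Thread(target=send_to_next_device, args=(x.copy(),))
        sender.start()                       # ship layer i's output in the background
    if sender is not None:
        sender.join()
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer_weights = [rng.standard_normal((512, 512)) for _ in range(8)]
    tokens = rng.standard_normal((16, 512))
    start = time.perf_counter()
    pipeline_with_overlap(tokens, layer_weights)
    print(f"overlapped pipeline took {time.perf_counter() - start:.3f} s")

Because the background send proceeds while the next layer's matrix multiply runs, the communication cost is largely hidden behind computation; a serialized version would pay both costs back to back.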
Date of Conference: 02-06 September 2024
Date Added to IEEE Xplore: 09 October 2024
Conference Location: Torino, Italy
