LORA: A Latency-Oriented Recurrent Architecture for GPT Model on Multi-FPGA Platform with Communication Optimization


Abstract:

Large Language Models (LLMs) have been widely deployed in data centers to provide various services, the most representative being the Generative Pre-trained Transformer (GPT). The GPT model has heavy memory and computing overhead, and its inference process has two stages with distinct computing characteristics: Prefill and Decode. Constructing a platform from existing GPU and FPGA accelerators to deploy GPT in data centers is challenging: it requires more effective synchronization schemes or structures with higher computational intensity. This paper proposes LORA, a low-latency end-to-end GPT acceleration platform built on multiple FPGAs. First, we optimize the synchronization timing of the GPT model to reduce computation and communication overhead. Second, we devise efficient synchronization steps for specific layers of the GPT model that overlap part of the computation and communication delay, further reducing the platform's latency. Finally, we deploy recurrent structures on each FPGA to accelerate the different stages of the GPT model. Implemented on Xilinx Alveo U280 FPGAs, LORA achieves an average 11.1× speedup over NVIDIA V100 GPUs on the modern GPT-2 model. Compared to an existing multi-FPGA accelerator appliance, LORA shows performance improvements of up to 4× and 2.7× in the Prefill and Decode stages, respectively.
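
The abstract's second contribution, overlapping part of the computation and communication delay, is a standard latency-hiding technique. As a rough illustration of that general idea only (not the paper's actual multi-FPGA pipeline), the hypothetical Python sketch below overlaps a stand-in send of one layer's activations with the computation of the next layer; every name, size, and delay in it is an assumption.

# Illustrative sketch only (not the paper's FPGA implementation): a toy
# demonstration of overlapping one layer's output communication with the
# next layer's computation, the general technique the abstract describes.
# All names, sizes, and the sleep-based send stand-in are hypothetical.
import threading
import time

import numpy as np


def compute_layer(x, w):
    # Stand-in for one transformer layer's work on the local device.
    return np.tanh(x @ w)


def send_to_next_device(activation):
    # Stand-in for inter-FPGA communication; sleep() models link latency.
    time.sleep(0.01)


def pipeline_with_overlap(x, layer_weights):
    sender = None
    for w in layer_weights:
        x = compute_layer(x, w)              # compute layer i
        if sender is not None:
            sender.join()                    # previous send ran during this compute
        sender = threading.Thread(target=send_to_next_device, args=(x.copy(),))
        sender.start()                       # ship layer i's output in the background
    if sender is not None:
        sender.join()
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer_weights = [rng.standard_normal((512, 512)) for _ in range(8)]
    tokens = rng.standard_normal((16, 512))
    start = time.perf_counter()
    pipeline_with_overlap(tokens, layer_weights)
    print(f"overlapped pipeline took {time.perf_counter() - start:.3f} s")

Because the background send proceeds while the next layer's matrix multiply runs, the communication cost is largely hidden behind computation; a serialized version would pay both costs back to back.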
Date of Conference: 02-06 September 2024
Date Added to IEEE Xplore: 09 October 2024
Conference Location: Torino, Italy
