Abstract:
Large Language Models (LLMs) have achieved significant success in various Natural Language Processing (NLP) tasks, becoming essential to modern intelligent computing. However, their large memory footprint and high computational cost hinder efficient deployment. Post-Training Quantization (PTQ) is a promising technique to alleviate this issue and accelerate LLM inference, yet the presence of outliers impedes the advancement of LLM quantization to lower bit levels. In this paper, we introduce OFQ-LLM, an algorithm-hardware co-design solution that adopts outlier-flexing quantization to efficiently accelerate LLMs at low bit levels. The key insight of OFQ-LLM is that normal data can be efficiently quantized in a slightly reduced data encoding space, while the remaining encoding space can be used for flexible outlier values. During quantization, we use rescale-based clipping (RBC) to optimize accuracy for normal data and group outlier clustering (GOC) to flexibly represent outlier values. At the hardware level, we introduce a memory-aligned outlier-flexing encoding scheme that encodes LLM activations and weights at low bit widths. An outlier-normal mixed hardware architecture is devised to leverage this encoding scheme and accelerate LLMs with high speed and high energy efficiency. Our experiments show that OFQ-LLM achieves better accuracy than state-of-the-art (SOTA) low-bit LLM PTQ methods. The OFQ-LLM-based accelerator surpasses SOTA outlier-aware accelerators by up to 2.69× in core energy efficiency, with up to 3.83× speedup and 2.44× energy reduction in the LLM prefilling phase and up to 2.01× speedup and 2.88× energy reduction in the LLM decoding phase, while maintaining superior accuracy.
Published in: IEEE Transactions on Circuits and Systems I: Regular Papers (Early Access)
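
As a rough illustration of the outlier-flexing idea described in the abstract, the Python sketch below splits a small integer code space between uniformly quantized normal values and a handful of explicitly stored outliers. It is a toy example under assumed parameters (4-bit codes, two reserved outlier slots, per-tensor grouping) and does not implement the paper's actual RBC or GOC algorithms or the memory-aligned hardware encoding.

import numpy as np

def outlier_flexing_quantize(x, n_bits=4, n_outlier_codes=2):
    # Toy sketch: reserve most of the 2**n_bits codes for uniformly
    # quantized "normal" values and the rest for explicit outliers.
    # Hypothetical parameters; not the paper's RBC/GOC algorithms.
    n_codes = 2 ** n_bits
    n_normal_codes = n_codes - n_outlier_codes

    flat = x.astype(np.float64).ravel()
    # Treat the largest-magnitude entries as outliers (one per reserved code).
    outlier_pos = np.argsort(np.abs(flat))[-n_outlier_codes:]
    is_outlier = np.zeros(flat.size, dtype=bool)
    is_outlier[outlier_pos] = True

    # Normal values: uniform quantization over the slightly reduced code space.
    normal = flat[~is_outlier]
    lo, hi = normal.min(), normal.max()
    scale = (hi - lo) / (n_normal_codes - 1)

    codes = np.empty(flat.size, dtype=np.int64)
    codes[~is_outlier] = np.round((normal - lo) / scale).astype(np.int64)
    # Outliers: the remaining codes index a small table of exact values.
    codes[is_outlier] = n_normal_codes + np.arange(n_outlier_codes)
    outlier_table = flat[is_outlier]

    # Dequantize to check the reconstruction error.
    deq = np.where(
        codes < n_normal_codes,
        codes * scale + lo,
        outlier_table[np.clip(codes - n_normal_codes, 0, n_outlier_codes - 1)],
    )
    return codes.reshape(x.shape), deq.reshape(x.shape)

x = np.random.randn(64).astype(np.float32) * 0.1
x[3], x[17] = 8.0, -6.5                  # inject two large outliers
codes, deq = outlier_flexing_quantize(x)
print("max abs error:", np.abs(x - deq).max())

In this toy example the two injected outliers are reconstructed exactly, while the normal values incur only the small uniform-quantization error of the reduced 14-code range; the paper's grouped outlier clustering, memory-aligned encoding, and mixed outlier-normal datapath go well beyond this per-tensor sketch.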