Abstract:
Large Language Models (LLMs) have achieved significant success in various Natural Language Processing (NLP) tasks, becoming essential to modern intelligent computing. However, their large memory footprint and high computational cost hinder efficient deployment. Post-Training Quantization (PTQ) is a promising technique to alleviate this issue and accelerate LLM inference, yet the presence of outliers impedes the advancement of LLM quantization to lower bit levels. In this paper, we introduce OFQ-LLM, an algorithm-hardware co-design solution that adopts outlier-flexing quantization to efficiently accelerate LLMs at low bit levels. The key insight of OFQ-LLM is that normal data can be efficiently quantized in a slightly reduced data encoding space, while the remaining encoding space can be used for flexible outlier values. During quantization, we use rescale-based clipping (RBC) to optimize accuracy for normal data and group outlier clustering (GOC) to flexibly represent outlier values. At the hardware level, we introduce a memory-aligned outlier-flexing encoding scheme that encodes LLM activations and weights at low bit widths. An outlier-normal mixed hardware architecture is devised to leverage this encoding scheme and accelerate LLMs with high speed and high energy efficiency. Our experiments show that OFQ-LLM achieves better accuracy than state-of-the-art (SOTA) low-bit LLM PTQ methods. The OFQ-LLM-based accelerator surpasses SOTA outlier-aware accelerators by up to 2.69× in core energy efficiency, with up to 3.83× speedup and 2.44× energy reduction in the LLM prefilling phase and up to 2.01× speedup and 2.88× energy reduction in the LLM decoding phase, while maintaining superior accuracy.
Published in: IEEE Transactions on Circuits and Systems I: Regular Papers (Early Access)
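
As a rough illustration of the outlier-flexing idea described in the abstract, the Python sketch below splits a small integer code space between uniformly quantized normal values and a handful of explicitly stored outliers. It is a toy example under assumed parameters (4-bit codes, two reserved outlier slots, per-tensor grouping) and does not implement the paper's actual RBC or GOC algorithms or the memory-aligned hardware encoding.

import numpy as np

def outlier_flexing_quantize(x, n_bits=4, n_outlier_codes=2):
    # Toy sketch: reserve most of the 2**n_bits codes for uniformly
    # quantized "normal" values and the rest for explicit outliers.
    # Hypothetical parameters; not the paper's RBC/GOC algorithms.
    n_codes = 2 ** n_bits
    n_normal_codes = n_codes - n_outlier_codes

    flat = x.astype(np.float64).ravel()
    # Treat the largest-magnitude entries as outliers (one per reserved code).
    outlier_pos = np.argsort(np.abs(flat))[-n_outlier_codes:]
    is_outlier = np.zeros(flat.size, dtype=bool)
    is_outlier[outlier_pos] = True

    # Normal values: uniform quantization over the slightly reduced code space.
    normal = flat[~is_outlier]
    lo, hi = normal.min(), normal.max()
    scale = (hi - lo) / (n_normal_codes - 1)

    codes = np.empty(flat.size, dtype=np.int64)
    codes[~is_outlier] = np.round((normal - lo) / scale).astype(np.int64)
    # Outliers: the remaining codes index a small table of exact values.
    codes[is_outlier] = n_normal_codes + np.arange(n_outlier_codes)
    outlier_table = flat[is_outlier]

    # Dequantize to check the reconstruction error.
    deq = np.where(
        codes < n_normal_codes,
        codes * scale + lo,
        outlier_table[np.clip(codes - n_normal_codes, 0, n_outlier_codes - 1)],
    )
    return codes.reshape(x.shape), deq.reshape(x.shape)

x = np.random.randn(64).astype(np.float32) * 0.1
x[3], x[17] = 8.0, -6.5                  # inject two large outliers
codes, deq = outlier_flexing_quantize(x)
print("max abs error:", np.abs(x - deq).max())

In this toy example the two injected outliers are reconstructed exactly, while the normal values incur only the small uniform-quantization error of the reduced 14-code range; the paper's grouped outlier clustering, memory-aligned encoding, and mixed outlier-normal datapath go well beyond this per-tensor sketch.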