
AirGun: Adaptive Granularity Quantization for Accelerating Large Language Models


Abstract:

Transformer-based models have evolved into Large Language Models (LLMs) by scaling up model size to achieve higher accuracy, but this scaling incurs significant computational and memory costs. Although quantization is a promising way to mitigate these costs, the presence of outliers can cause accuracy drops during quantization. Previous work showed that LLM activations contain outliers only in specific input channels, which suggests that per-input-channel quantization would be beneficial; without optimization, however, it imposes excessive computational overhead. To address these challenges, we propose a hardware/software co-design that mitigates the overhead of per-input-channel quantization. We first propose AirGun, a quantization method that adaptively quantizes LLM modules. We observe that only specific modules in LLMs exhibit high quantization sensitivity. Based on this observation, AirGun applies hardware-efficient per-tensor quantization to non-sensitive modules and per-input-channel quantization to sensitive modules. For per-input-channel quantization, we introduce early reconstruction and adaptive dyadic numbering, which eliminate its overhead while preserving its accuracy benefits. Additionally, we propose the AirGun accelerator, which fully exploits these techniques. As a result, the AirGun accelerator achieves a 4.19× speedup and 63.16% lower energy consumption than the previous LM accelerator while achieving higher accuracy.
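
To illustrate the adaptive-granularity idea described in the abstract, the short PyTorch sketch below contrasts per-tensor and per-input-channel symmetric quantization and selects between them with a per-module sensitivity flag. This is a minimal sketch, not the paper's implementation: the function names, the sensitive flag, and the symmetric rounding scheme are assumptions for illustration, and the paper's early reconstruction and adaptive dyadic numbering optimizations are not shown.

import torch

def per_tensor_quantize(x: torch.Tensor, n_bits: int = 8):
    # One scale for the whole tensor: cheap and hardware friendly, but a
    # single outlier channel inflates the scale for every element.
    qmax = 2 ** (n_bits - 1) - 1
    scale = (x.abs().max() / qmax).clamp_min(1e-8)
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

def per_input_channel_quantize(x: torch.Tensor, n_bits: int = 8):
    # One scale per input channel (last dim of a [tokens, channels]
    # activation): isolates outlier channels so the remaining channels
    # keep fine-grained resolution.
    qmax = 2 ** (n_bits - 1) - 1
    scale = (x.abs().amax(dim=0, keepdim=True) / qmax).clamp_min(1e-8)
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

def quantize_module_activations(x: torch.Tensor, sensitive: bool, n_bits: int = 8):
    # Adaptive granularity: per-input-channel quantization only for modules
    # flagged as quantization-sensitive, per-tensor everywhere else.
    if sensitive:
        return per_input_channel_quantize(x, n_bits)
    return per_tensor_quantize(x, n_bits)

In this sketch, how the sensitivity flag is obtained (e.g., profiling each module's accuracy drop under per-tensor quantization) is left open; the paper's contribution is in making the per-input-channel path efficient in hardware, which the sketch does not model.
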
Date of Conference: 18-20 November 2024
Date Added to IEEE Xplore: 02 January 2025
Conference Location: Milan, Italy
