
AirGun: Adaptive Granularity Quantization for Accelerating Large Language Models


Abstract:

Transformer-based models have evolved into Large Language Models (LLMs) by scaling up model size to achieve higher accuracy, but this scaling incurs significant computational and memory costs. Although quantization is a promising way to mitigate these costs, the presence of outliers can cause accuracy drops during quantization. Previous work showed that LLM activations contain outliers only in specific input channels, which suggests that per-input-channel quantization would be beneficial; without optimization, however, it imposes excessive computational overhead. To address these challenges, we propose a hardware/software co-design that mitigates the overhead of per-input-channel quantization. We first propose AirGun, a quantization method that adaptively quantizes LLM modules. We observe that only specific modules in LLMs exhibit high quantization sensitivity. Based on this observation, AirGun applies hardware-efficient per-tensor quantization to non-sensitive modules and per-input-channel quantization to sensitive modules. For per-input-channel quantization, we introduce early reconstruction and adaptive dyadic numbering, which eliminate its overhead while preserving its accuracy benefits. Additionally, we propose the AirGun accelerator, which fully exploits these techniques. As a result, the AirGun accelerator achieves a 4.19× speedup and 63.16% lower energy consumption than the previous LM accelerator while achieving higher accuracy.
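
To illustrate the adaptive-granularity idea described in the abstract, the short PyTorch sketch below contrasts per-tensor and per-input-channel symmetric quantization and selects between them with a per-module sensitivity flag. This is a minimal sketch, not the paper's implementation: the function names, the sensitive flag, and the symmetric rounding scheme are assumptions for illustration, and the paper's early reconstruction and adaptive dyadic numbering optimizations are not shown.

import torch

def per_tensor_quantize(x: torch.Tensor, n_bits: int = 8):
    # One scale for the whole tensor: cheap and hardware friendly, but a
    # single outlier channel inflates the scale for every element.
    qmax = 2 ** (n_bits - 1) - 1
    scale = (x.abs().max() / qmax).clamp_min(1e-8)
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

def per_input_channel_quantize(x: torch.Tensor, n_bits: int = 8):
    # One scale per input channel (last dim of a [tokens, channels]
    # activation): isolates outlier channels so the remaining channels
    # keep fine-grained resolution.
    qmax = 2 ** (n_bits - 1) - 1
    scale = (x.abs().amax(dim=0, keepdim=True) / qmax).clamp_min(1e-8)
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

def quantize_module_activations(x: torch.Tensor, sensitive: bool, n_bits: int = 8):
    # Adaptive granularity: per-input-channel quantization only for modules
    # flagged as quantization-sensitive, per-tensor everywhere else.
    if sensitive:
        return per_input_channel_quantize(x, n_bits)
    return per_tensor_quantize(x, n_bits)

In this sketch, how the sensitivity flag is obtained (e.g., profiling each module's accuracy drop under per-tensor quantization) is left open; the paper's contribution is in making the per-input-channel path efficient in hardware, which the sketch does not model.
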
Date of Conference: 18-20 November 2024
Date Added to IEEE Xplore: 02 January 2025
Conference Location: Milan, Italy
