Optimizing Exponent Bias for Sub-8bit Floating-Point Inference of Fine-tuned Transformers


Abstract:

Fine-tuned Transformer-based neural networks have demonstrated remarkable success in natural language processing (NLP) at the cost of a substantial computational burden. Post-training quantization (PTQ) is a promising technique for reducing this cost without expensive re-training, but prior works either demand complex calibration or suffer noticeable accuracy degradation. This paper proposes a practical method for sub-8-bit floating-point (FP) PTQ. The proposed method optimizes the exponent bias to minimize quantization error, measured as signal-to-quantization-noise ratio (SQNR), progressively in the manner of stochastic gradient descent. Our evaluation shows that the proposed method achieves accuracy close to that of the full-precision model for 6- to 8-bit FP PTQ of fine-tuned BERT on GLUE and SQuAD tasks, with negligible run-time overhead.
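
The abstract does not spell out the quantization format or the exact progressive search, so the sketch below is only an illustration under assumptions: a simple IEEE-style low-bit FP quantizer with a configurable exponent bias, an SQNR measure, and a brute-force per-tensor bias search standing in for the paper's SGD-like procedure. The function names (fp_quantize, sqnr_db, search_exponent_bias) and the format parameters (4 exponent bits, 3 mantissa bits) are hypothetical, not taken from the paper.

    import numpy as np

    def fp_quantize(x, exp_bits=4, man_bits=3, bias=7):
        # Simulate round-to-nearest quantization of x to a low-bit floating-point
        # format with the given exponent bias. Values below the smallest normal
        # are flushed to zero; the all-ones exponent code is reserved, IEEE-style.
        x = np.asarray(x, dtype=np.float64)
        sign = np.sign(x)
        mag = np.abs(x)

        e_min = 1 - bias                      # smallest normal exponent
        e_max = (2 ** exp_bits - 2) - bias    # largest usable exponent
        max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max

        # Per-element exponent of the input, clamped to the format's normal range.
        e = np.floor(np.log2(np.maximum(mag, np.finfo(np.float64).tiny)))
        e = np.clip(e, e_min, e_max)

        # Round the mantissa to man_bits fractional bits at that exponent,
        # then saturate overflow and flush underflow.
        step = 2.0 ** (e - man_bits)
        q = np.round(mag / step) * step
        q = np.where(mag < 2.0 ** e_min, 0.0, np.minimum(q, max_val))
        return sign * q

    def sqnr_db(x, x_q):
        # Signal-to-quantization-noise ratio in decibels.
        x = np.asarray(x, dtype=np.float64)
        noise = np.sum((x - x_q) ** 2) + 1e-30
        return 10.0 * np.log10(np.sum(x ** 2) / noise)

    def search_exponent_bias(x, exp_bits=4, man_bits=3):
        # Brute-force stand-in for the paper's progressive, SGD-like search:
        # pick the per-tensor exponent bias that maximizes SQNR.
        best_bias, best_sqnr = None, -np.inf
        for bias in range(0, 2 ** exp_bits + 16):
            s = sqnr_db(x, fp_quantize(x, exp_bits, man_bits, bias))
            if s > best_sqnr:
                best_bias, best_sqnr = bias, s
        return best_bias, best_sqnr

    # Example: choose a bias for one weight tensor of a hypothetical model.
    weights = np.random.randn(768, 768) * 0.05
    bias, sqnr = search_exponent_bias(weights, exp_bits=4, man_bits=3)
    print(f"best exponent bias = {bias}, SQNR = {sqnr:.1f} dB")

In such a scheme, the search would be run once per weight (and possibly activation) tensor of the fine-tuned model before deployment, and only the chosen per-tensor bias would be stored, which is consistent with the abstract's claim of negligible run-time overhead.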
Date of Conference: 13-15 June 2022
Date Added to IEEE Xplore: 05 September 2022
Conference Location: Incheon, Korea, Republic of