
Post-training Quantization or Quantization-aware Training? That is the Question



Abstract:

Quantization has been demonstrated to be one of the most effective model compression solutions that can potentially be adapted to support large models on resource-constrained edge devices while maintaining a minimal power budget. There are two forms of quantization: post-training quantization (PTQ) and quantization-aware training (QAT). The former starts from a trained model with floating-point computation and quantizes it afterward, while the latter compensates for quantization-related errors by training the neural network with the quantized version in the forward pass. Though QAT can produce accuracy benefits, it suffers from a long training process and less flexibility during deployment. Traditionally, researchers make a one-time decision between QAT and PTQ depending on the quantized bit-width and hardware requirements. In this work, we observe that even when the hardware cost is approximately the same for various quantization schemes, the sensitivity to training of each quantized layer differs, so certain schemes require QAT more than others. We argue that it is necessary to examine this dimension by measuring the accuracy difference for each layer under QAT and PTQ conditions. In this paper, we introduce a methodology that provides a systematic and explainable way to quantify the tradeoffs between the two quantization forms. This is especially beneficial for evaluating layer-wise mixed-precision quantization (MPQ) schemes, where different bit-widths are allowed across layers and the search space is enormous.
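To make the distinction concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the symmetric per-tensor quantizer, the 4-bit setting, and the module names are assumptions for illustration. PTQ quantizes the weights of an already-trained model in one shot, while QAT inserts fake quantization into the forward pass and uses a straight-through estimator so that training can compensate for the rounding error.

import torch
import torch.nn as nn

def uniform_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric uniform quantization to `bits` bits, returned in dequantized form.
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for INT8, 7 for INT4
    scale = w.abs().max().clamp(min=1e-8) / qmax    # per-tensor scale (assumed)
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def ptq(model: nn.Module, bits: int) -> nn.Module:
    # Post-training quantization: quantize the weights of a trained model, no retraining.
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, (nn.Linear, nn.Conv2d)):
                m.weight.copy_(uniform_quantize(m.weight, bits))
    return model

class FakeQuantLinear(nn.Linear):
    # Quantization-aware training building block: the forward pass uses quantized
    # weights, and the straight-through estimator lets gradients reach the
    # floating-point weights so training can compensate for rounding error.
    bits = 4  # assumed bit-width for illustration

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = uniform_quantize(self.weight, self.bits)
        w_ste = self.weight + (w_q - self.weight).detach()  # straight-through estimator
        return nn.functional.linear(x, w_ste, self.bias)

In PTQ the model is fixed after the one-shot rounding step, whereas in QAT the floating-point weights keep adapting during training and are only frozen to their quantized values at deployment.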
Date of Conference: 26-27 June 2023
Date Added to IEEE Xplore: 22 August 2023
Conference Location: Shanghai, China



I. Introduction

As intensive computing tasks such as AI and machine learning computations shift toward the edge, a major challenge for hardware designers is how to fit these large models into resource-constrained tiny devices. These constraints have driven the development of various model compression techniques [1]. Among these, quantization appears to be an effective solution: it reduces the bit-widths of weights and activations while allowing an accuracy drop, leading to a significant reduction in memory, lower network latency, and better power efficiency [2]. Despite these benefits, quantization always involves a tradeoff between accuracy loss and improvements in hardware resources in terms of latency, memory usage, and power. Uniform quantization uses uniformly spaced quantization levels across the model; a well-accepted scheme is the INT8 model [3]. However, if we push further to lower quantization levels such as INT4 or below, the accuracy drop becomes significant and unacceptable. Layer-wise mixed-precision quantization (MPQ) is an alternative solution in which each layer is quantized with a different bit precision, enabling very fine-grained tradeoff exploration between accuracy and hardware resources.

Quantization can be performed either by retraining the model, a process called Quantization-Aware Training (QAT) [4], or without retraining, a process called Post-Training Quantization (PTQ). QAT is generally preferred over PTQ for accuracy, but it incurs the huge computational cost of retraining the model. To recover the accuracy, the retraining process can require several hundred epochs, especially for lower-bit quantization levels such as INT4 and INT2 [2]. In addition, retraining is costly from economic and environmental perspectives [5]. The decision between QAT and PTQ is usually based on the quantization levels and hardware requirements, and it also depends on the lifetime of a model. For mixed-precision quantization schemes that span quantization levels from low to high bit-widths, the decision is typically treated as binary, which can lead to training that could be avoided entirely for certain schemes.

In this paper, we postulate that different MPQ schemes with similar hardware cost or similar accuracy perform differently with respect to the necessity of PTQ or QAT. To quantify this, we propose a metric that measures the influence of QAT on accuracy over PTQ. Our experiments confirm this hypothesis and reveal strong layer-wise sensitivity to this metric across various quantization schemes. Based on this observation, we develop a new methodology for searching for the optimal layer-wise MPQ scheme by introducing a new dimension called QAT necessity. This methodology also provides a qualitative way of analyzing the impact of training on various quantized models, as illustrated by the sketch below.
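To illustrate the kind of measurement this suggests, the following Python sketch evaluates each candidate MPQ scheme under both PTQ and QAT and records the accuracy gap. The callables and names are assumed placeholders, not the paper's implementation, and the paper's exact metric may differ.

import copy

def qat_necessity_per_scheme(base_model, schemes, ptq_fn, qat_fn, eval_fn):
    # schemes : dict mapping a scheme name to a per-layer bit-width assignment,
    #           e.g. {"conv1": 8, "conv2": 4, ...}
    # ptq_fn  : quantizes a copy of the model without retraining
    # qat_fn  : retrains a copy of the model with fake quantization in the forward pass
    # eval_fn : returns the validation accuracy of a quantized model
    # (All callables are illustrative placeholders, not the paper's API.)
    gaps = {}
    for name, bits_per_layer in schemes.items():
        acc_ptq = eval_fn(ptq_fn(copy.deepcopy(base_model), bits_per_layer))
        acc_qat = eval_fn(qat_fn(copy.deepcopy(base_model), bits_per_layer))
        gaps[name] = acc_qat - acc_ptq   # large gap -> QAT needed; small gap -> PTQ suffices
    return gaps

Under this view, the per-scheme gap is the extra dimension, QAT necessity, that can be folded into the MPQ search alongside accuracy and hardware cost.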

