I. Introduction
As compute-intensive tasks such as AI and machine learning workloads shift toward the edge, a major challenge for hardware designers is how to fit these large models onto resource-constrained tiny devices. These constraints have driven the development of various model compression techniques [1]. Among these, quantization is an effective solution that reduces the bit-widths of weights and activations at the cost of some accuracy. It leads to a significant reduction in memory footprint, lower inference latency, and better power efficiency [2]. Despite these benefits, quantization always trades accuracy loss against hardware improvements in latency, memory usage, and power. Uniform quantization relies on uniformly spaced quantization levels across the model; a widely adopted scheme is the INT8 model [3]. However, pushing further to lower quantization levels such as INT4 or below causes a significant and often unacceptable accuracy drop. Layer-wise mixed-precision quantization (MPQ) has emerged as an alternative in which each layer is quantized with a different bit precision, enabling very fine-grained exploration of the trade-off between accuracy and hardware resources.

Quantization can be performed either by retraining the model, a process called Quantization-Aware Training (QAT) [4], or without retraining, a process called Post-Training Quantization (PTQ). QAT is generally preferred over PTQ when accuracy matters most, but it incurs the substantial computational cost of retraining the model: recovering accuracy may require several hundred epochs, especially at lower-bit quantization levels such as INT4 and INT2 [2]. In addition, retraining is costly from both economic and environmental perspectives [5]. The choice between QAT and PTQ is usually based on the quantization levels and hardware requirements, and it also depends on the expected lifetime of the model. For mixed-precision quantization schemes that span quantization levels from low to high bit-widths, however, the decision is still treated as binary, which can lead to retraining effort that could be avoided entirely for certain schemes.

In this paper, we postulate that different MPQ schemes with similar hardware cost or similar accuracy differ in how much they actually require QAT over PTQ. To quantify this, we propose a metric that measures the influence of QAT on accuracy relative to PTQ. Our experiments confirm this hypothesis and reveal a strong layer-wise sensitivity of this metric across various quantization schemes. Based on this observation, we develop a new methodology for searching the optimal layer-wise MPQ scheme by introducing a new dimension, QAT necessity. This methodology also provides a qualitative way of analyzing the impact of training on various quantized models.
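As background for the schemes discussed above, uniform quantization maps a real value x onto one of 2^b uniformly spaced levels. A standard affine formulation is sketched below purely for illustration; the symbols s, z, and b are introduced here and are not part of this paper's notation:

\[
x_q = \operatorname{clip}\!\left(\left\lfloor \frac{x}{s} \right\rceil + z,\; 0,\; 2^{b}-1\right),
\qquad
\hat{x} = s\,(x_q - z),
\]

where s is the scale factor, z is the zero point, b is the bit-width, and \(\lfloor\cdot\rceil\) denotes rounding to the nearest integer. In a layer-wise MPQ scheme, b is chosen per layer rather than globally, which defines the per-layer bit-width search space considered in this work.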