Training Multi-bit Quantized and Binarized Networks with A Learnable Symmetric Quantizer

Quantizing weights and activations of deep neural networks is essential for deploying them in resource-constrained devices, or cloud platforms for at-scale services. While binarization is a special case of quantization, this extreme case often leads to several training difficulties, and necessitates specialized models and training methods. As a result, recent quantization methods do not provide binarization, thus losing the most resource-efficient option, and quantized and binarized networks have been distinct research areas. We examine binarization difficulties in a quantization framework and find that all we need to enable the binary training are a symmetric quantizer, good initialization, and careful hyperparameter selection. These techniques also lead to substantial improvements in multi-bit quantization. We demonstrate our unified quantization framework, denoted as UniQ, on the ImageNet dataset with various architectures such as ResNet-18,-34 and MobileNetV2. For multi-bit quantization, UniQ outperforms existing methods to achieve the state-of-the-art accuracy. In binarization, the achieved accuracy is comparable to existing state-of-the-art methods even without modifying the original architectures.


Introduction
Deep neural networks have achieved tremendous success in various fields including computer vision [31], natural language processing [50], and speech recognition [8], demonstrating unprecedented predictive performance. However, the computational complexity and memory access count required by existing models pose a challenge in deploying them to resource-constrained devices and applying them to latency-critical services. To address this challenge, efficient network architectures, manually designed [44] or automatically searched [46], and model compression techniques such as pruning [17] and quantization [5,24,45,54] have been studied. In practical model deployment, quantization is a necessary step, and often the last means to control the trade-off between performance and efficiency once the model architecture is fixed [24]. Quantizing weights and activations to a lower precision reduces not only the computational complexity but also the model size, memory footprint, and memory access count, at the cost of degraded accuracy. Recent quantization methods [11,29] can avoid accuracy degradation relative to their full-precision counterparts even with 4-bit weights and activations. However, most of these methods do not present 1-bit results, possibly because the accuracy is severely degraded or because 1-bit training does not converge, and thus give up the most efficient option.
Figure 2: Overview of our method. A unified framework is used for both model quantization and binarization.
While Zhou et al. [55] considered binarized neural networks together with multi-bit quantization, binarized neural networks have been a research topic distinct from quantized models. Most studies on binarized models [7,23,36,42] focused solely on the 1-bit case. Binarized networks have gained attention because expensive floating-point multiplications and additions are replaced by efficient XNOR and POPCOUNT operations. However, this is not specific to binarized networks, and any low-precision network can be executed efficiently by such operations, as shown in [55]. Thus, a binary kernel can execute networks of any bit-width seamlessly, depending on the accuracy and efficiency requirements.
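As a minimal illustration of why XNOR and POPCOUNT can replace multiply-accumulate for ±1 values, the sketch below computes a dot product of two binarized vectors on packed bits. The helper names `pack_bits` and `binary_dot` are hypothetical and are not taken from any kernel discussed in the paper.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer bit mask (bit i set means +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +/-1 vectors stored as bit masks."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit is 1 where the two signs agree
    matches = bin(xnor).count("1")     # POPCOUNT
    return 2 * matches - n             # agreements minus disagreements


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
result = binary_dot(pack_bits(a), pack_bits(b), len(a))
```

Here `result` equals the ordinary dot product `reference`; a multi-bit product can be built from several such bit-plane products, which is the sense in which one binary kernel can serve any bit-width.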
In contrast, the process of building binarized networks is different from that of building quantized networks. Binarized network researchers often modify a base architecture to improve performance. Several studies increased the representation capacity by using more weight and activation bases [34,56]. Most studies incorporate changes to improve training efficiency, such as dual skip connections [20,36,53]. More aggressive changes are sought via neural architecture search [27,39,53]. In addition to model changes, binarized networks use a quantizer specialized for the binary case, such as the sign function.
The distinct creation process for binarized networks causes several difficulties in practice. First, modifications of a base model often affect the number of floating-point operations and memory access count [10]. Thus, the actual latency and power consumption can differ from the expected results, and a binarized model cannot be guaranteed to be better than the 2-bit counterpart. Second, training binarized networks requires additional skilled workforce because it needs special expertise, especially a deep understanding of the model itself. Finally, given a model, binarization provides only one option for accuracy and efficiency, and the limited exploration of the trade-off may lead to a suboptimal solution.
This paper proposes a unified quantization framework, denoted as UniQ, for multi-bit quantized networks and binarized networks. UniQ achieves up to 4.8% and 2.7% accuracy improvements on the ImageNet dataset over the existing state-of-the-art quantization and binarization methods, respectively, as shown in Figure 1. Figure 2 illustrates UniQ in contrast to the conventional approach.
This paper first considers two popular weight quantizers in quantized networks. A weight quantizer, first proposed in [55] and adopted in [4,26] later, maps weight values into the range [0, 1] first and then performs quantization and re-maps the quantized values to the range [-1,1]. This method implies the importance of the symmetry; however, the weight mapping becomes an impediment to taking advantage of pre-trained models. The other weight quantizer, used in [2,11], maps inputs to a real value represented by the product of a scaling factor and a signed integer. The signed integer is assumed to be represented by two's complement, which has asymmetric ranges. This quantization method does not transform the weights of the pre-trained models. However, we hypothesize that the asymmetry has a negative impact on extremely low-precision training. Thus, we design a symmetric quantizer, where the step size can be learned via the gradient descent procedure.
While the step size is learnable with the task loss, we observe that the initialization of the learnable parameter has a significant impact on the final solution. For the initialization of this parameter, prior works used a heuristic [11] or a numerical method [2]. In contrast, we propose an analytic initialization method that is optimal in the mean squared sense. According to our ablation study, our proposed framework shows significant improvements over prior works as a combined result of the symmetric quantizer and the optimal initialization.
The proposed quantizer and init method can be applied to the binary case seamlessly, but the significant improvements demonstrated in multi-bit are not shown in 1-bit. We scrutinize the training dynamics of the binary case and find that it receives strong gradient signals at the beginning of training and that the distribution of the quantizer input changes extremely fast compared to multi-bit cases. We hypothesize that this difference arises because the initial point after binarization is too far from the pre-trained model solution. As a simple yet effective solution, we suggest using a warm-up strategy, which has been widely used for large-batch training. We empirically show that in 1-bit training, warm-up improves accuracy substantially even when a small batch size is used.
Our major contributions are summarized as follows:
• In multi-bit quantization, the proposed unified method outperforms existing methods consistently to achieve the state-of-the-art accuracy of ResNet-18, -34 and MobileNetV2 on ImageNet.
• In binarization, the proposed method achieves comparable results to the state-of-the-art methods. These results have been achieved only by enhancing the training process without modifying the original network architectures, meaning that our method can be used in conjunction with network modification techniques.
• We propose an optimal, analytic initialization for step sizes.

Related Work
Modern neural networks have increased their computational complexity and memory requirements. Therefore, recent works have proposed efficient architectures [21,22,44,47] and model compression techniques such as network binarization [7,23,36,42], low-bit quantization [5,24,45,54], and knowledge distillation [29,40] to reduce the model size and amount of computation. Among these, low-bit quantization is one of the most popular methods and is widely used in the research literature and real-life applications [6,35,49].
Efficient Models. Recently optimized networks such as EfficientNet [47] and MobileNet-v1 [22], -v2 [44], -v3 [21] have achieved high accuracy by replacing the standard convolutional layers with depth-wise separable convolutions, thereby significantly reducing the number of parameters. Even for such efficient architectures, quantization is necessary to deploy them on specialized hardware [13] and provides further reductions in their size and number of calculations. Recent works [16,25,29] attempted to quantize these models, but at the expense of a significant loss in prediction accuracy. Compared to DSQ [16] and QKD [29], our method yields consistently higher results for all bit-widths when tested with MobileNetV2, which again demonstrates its effectiveness even for highly optimized networks.
Model Binarization. As a special case of quantization, model binarization has been studied extensively and has received much attention owing to its efficiency for deployment in edge devices. Using binarized weights and activations resulted in ∼32× memory saving over the full-precision counterpart and brought ∼58× computational efficiency on CPUs by taking advantage of bitwise operations [42]. Unfortunately, these networks usually suffer severe accuracy degradation. To mitigate this problem, many existing methods proposed modifying the original architecture. In [42], the order of layers within a block was changed to improve the information flow. In [36], an additional skip connection was added to each block in the residual networks. In [28], the input and output widths of each layer were adjusted. Rather than modifying the original architecture, this study focuses only on improving the quantizer itself and the training process. Our proposed method is orthogonal to the aforementioned model modification methods.
Multi-bit Quantization. In contrast, recent works on quantization [5,12,14,48,51,54] have achieved substantial efficiency improvements without the need to re-design the architecture or develop a whole new one, which substantially reduces the design effort. PACT [5] and LQ-Nets [54] first proposed the idea of learning quantizer parameters. LQ-Nets parameterizes the quantization levels of a non-uniform quantizer. In QIL [14], a non-uniform quantizer was constructed using a non-linear transformer followed by a uniform quantizer. While these non-uniform quantizers provide a higher degree of freedom, they usually require more computation and memory than uniform quantizers. PACT [5] parameterized the clipping level in a uniform activation quantizer. LSQ [11] showed better accuracy by making the step size learnable. SAT [26] studied efficient training for quantized networks. Both PACT [5] and SAT [26] adopted the weight quantizer of DoReFa-Net [55], which is symmetric but transforms the weights into a new range. This makes it difficult to take advantage of pre-trained models. In contrast, we design a uniform, symmetric quantizer that does not require the transformation and can fully leverage pre-trained models. Besides, none of the recent methods reports results for binarized neural networks, either because of severe performance degradation or because their quantizers are not designed to support binarization. In contrast, we propose a unified framework that supports all bit-widths, including 1-bit binarization. We obtain new state-of-the-art results for multi-bit quantization and promising results for model binarization by using the new quantizer design, without any modification of the original architecture.
Knowledge Distillation. Another popular method is knowledge distillation that is widely used in many computer vision tasks. The basic idea is that the knowledge from the teacher networks is transferred to the student networks, providing an additional guidance signal to the training process of the smaller-sized student network. Applying distillation methods to low-precision networks was performed by [29,38,40,52] where a real-valued network is used as the teacher and a low-precision bit network as a student. QKD [29] reported competitive results in multi-bit quantization using knowledge distillation. LSQ [11] also showed that knowledge distillation provided additional improvements in their quantization results. However, we outperform these methods even without resorting to the idea of transferring knowledge, by focusing more on the fundamental issues. In addition, our method is orthogonal to knowledge distillation, and can be used in conjunction to further boost the performance.

Preliminaries
In this section, we first review the weight quantizers commonly used in the literature. In [2,11,32], a weight w ∈ R is approximately represented by
$$w \approx \Delta\,\hat{w}, \tag{1}$$
where ∆ is a scaling factor, called the step size, and ŵ is a signed integer, which is assumed to be represented by two's complement. This is often referred to as the fixed-point representation. This representation has asymmetric ranges; it can represent one more negative number than positive numbers. For example, when ∆ = 1, it can represent -2, -1, 0, and 1 for 2-bit weights. We hypothesize that this asymmetry has a negative impact on low-precision training. While it has an asymmetric range in the strict sense, it is considered symmetric in [2]. Thus, to avoid confusion, we refer to this as semi-symmetric.
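The asymmetry of the two's complement grid can be checked with a few lines. The sketch below enumerates the representable values for a given bit-width; the function name `twos_complement_levels` is hypothetical and used only for illustration.

```python
def twos_complement_levels(bits, step=1.0):
    """Values step * w_hat, where w_hat is a signed `bits`-bit two's complement integer."""
    lo = -(2 ** (bits - 1))
    hi = 2 ** (bits - 1) - 1
    return [step * i for i in range(lo, hi + 1)]


levels_2bit = twos_complement_levels(2)
levels_1bit = twos_complement_levels(1)
```

For 2 bits this yields -2, -1, 0, 1 as in the example above. For 1 bit the grid contains only -1 and 0, so no positive value is representable under this representation.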
In [4,26,55], a weight w ∈ R is quantized into a k-bit value by the quantizer introduced in DoReFa-Net [55]. In this quantizer, the weights are first mapped into the range [0,1], quantized into a k-bit value, and then re-mapped to the range [-1,1]. This quantizer has the symmetric property, but is problematic for two reasons. First, it transforms the weights into a new range and loses the knowledge of pre-trained models. Second, the transformed range may cause the vanishing or exploding gradient problem because the variance of the weights becomes substantially different from that suggested in Xavier [15] or Kaiming initialization [18]. Thus, SAT [26] proposed to scale the transformed weights again using the constant in Xavier initialization. However, the first problem remains. We address these two problems by using a new symmetric quantizer and an optimal initialization that minimizes the mean squared quantization error.
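The range transformation can be made concrete with a small sketch. It assumes the tanh-based normalization commonly attributed to DoReFa-Net [55]; the exact form used in [4,26] may differ, and the function names are hypothetical.

```python
import math


def quantize_k(r, k):
    """Uniformly quantize r in [0, 1] to one of 2**k levels (round half up)."""
    n = 2 ** k - 1
    return math.floor(r * n + 0.5) / n


def dorefa_weight_quantize(weights, k):
    """Assumed form: 2 * quantize_k(tanh(w) / (2 * max|tanh(w)|) + 1/2) - 1."""
    t = [math.tanh(w) for w in weights]
    t_max = max(abs(v) for v in t)
    out = []
    for v in t:
        r = v / (2 * t_max) + 0.5             # mapped into [0, 1]
        out.append(2 * quantize_k(r, k) - 1)  # re-mapped into [-1, 1]
    return out


weights = [-0.8, -0.1, 0.05, 0.3, 0.8]
quantized = dorefa_weight_quantize(weights, k=2)
```

Although the largest input magnitude is 0.8, the outputs always span the full range [-1, 1], which illustrates how the original weight scale of a pre-trained model is discarded.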

Learnable Symmetric Quantizer
A quantizer Q : R → R is a piecewise constant function in which each interval is mapped to a corresponding output. The end points of the intervals are referred to as decision levels and the outputs are called reconstruction levels. A uniform quantizer has evenly spaced decision levels and reconstruction levels. The length of the intervals is called the step size, denoted as ∆. The total number of reconstruction levels is denoted by N. We denote the clip function used by quantizers by clip_N(x) = min(max(x, 0), N − 1). The round function is denoted by ⌊·⌉. In practice, N is usually even, and thus a symmetric quantizer in the strict sense does not include the value of zero as a reconstruction level. For example, when N = 4 and ∆ = 1, the symmetric quantizer has -1.5, -0.5, 0.5, and 1.5 as the reconstruction levels. We quantize weights by the uniform symmetric quantizer
$$Q_w(x; \Delta) = \Delta \cdot \mathrm{clip}_N\!\left(\left\lfloor \frac{x + \alpha}{\Delta} \right\rceil\right) - \alpha, \tag{3}$$
where α = ∆ · (N − 1)/2. Eq. (3) can be rewritten as
$$Q_w(x; \Delta) = \frac{\Delta}{2}\,\bar{q}_w, \tag{4}$$
where $\bar{q}_w = 2\,\mathrm{clip}_N\!\left(\lfloor (x + \alpha)/\Delta \rceil\right) - N + 1$; $\bar{q}_w$ can be encoded into $\log_2 N$ bits using ±1 encoding.
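The following sketch evaluates a symmetric weight quantizer of the form (∆/2) · (2 · clip_N(round((x + α)/∆)) − N + 1) on a grid of inputs and collects the reconstruction levels. Rounding half up is an assumption made here for the tie-breaking rule, and the function names are hypothetical.

```python
import math


def clip_n(x, n):
    """clip_N(x) = min(max(x, 0), N - 1)."""
    return min(max(x, 0), n - 1)


def round_half_up(x):
    """Rounding with ties going up (Python's round() uses banker's rounding)."""
    return math.floor(x + 0.5)


def quantize_weight(x, step, n):
    """Symmetric uniform quantizer with n levels centered around zero."""
    alpha = step * (n - 1) / 2
    q = 2 * clip_n(round_half_up((x + alpha) / step), n) - n + 1  # odd integer code
    return step / 2 * q


grid = [x / 10 for x in range(-30, 31)]
levels_4 = sorted({quantize_weight(x, 1.0, 4) for x in grid})
levels_2 = sorted({quantize_weight(x, 1.0, 2) for x in grid})
```

With N = 4 and ∆ = 1 the levels are -1.5, -0.5, 0.5, and 1.5, matching the example in the text; with N = 2 the same function reduces to a binary quantizer with levels -0.5 and 0.5.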
We consider the ReLU non-linearity, which is widely used in the deep learning literature, as the activation function. Because almost half of the ReLU responses are zero, we fix the zero value as a reconstruction level instead of parameterizing an offset. Thus, for activations, we use
$$Q_a(x; \Delta) = \Delta \cdot \mathrm{clip}_N\!\left(\left\lfloor \frac{x}{\Delta} \right\rceil\right). \tag{5}$$
Designing a uniform quantizer usually boils down to deciding one parameter, the step size ∆. Instead of designing the quantizers manually, we make ∆ a learnable parameter as in recent prior works [2,11], and optimize it with the task loss via the gradient descent procedure. The round function has a zero derivative almost everywhere, and the exact derivative is not useful in learning. Thus, we adopt the straight-through estimator (STE), which assumes a unit derivative for the entire input range of the round function. Then, we have
$$\frac{\partial Q_w}{\partial \Delta} = \begin{cases} \left\lfloor \dfrac{x + \alpha}{\Delta} \right\rceil - \dfrac{x + \alpha}{\Delta}, & |x| < \alpha, \\[4pt] -\dfrac{N - 1}{2}, & x \le -\alpha, \\[4pt] \dfrac{N - 1}{2}, & x \ge \alpha. \end{cases} \tag{6}$$
We can also find ∂Q_a/∂∆ similarly. While this allows us to learn the step size, the parameter still has to be initialized, and in our experience, careful initialization improves accuracy substantially.
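To make the straight-through estimator concrete for the activation case, the sketch below assumes an activation quantizer of the form ∆ · clip_N(round(x/∆)) and computes its derivative with respect to ∆ when the round function is treated as the identity in the backward pass. The handling of the clipping boundaries follows a common convention and is an assumption, as are the function names.

```python
import math


def quantize_activation(x, step, n):
    """Uniform quantizer with levels 0, step, ..., (n - 1) * step."""
    k = min(max(math.floor(x / step + 0.5), 0), n - 1)
    return step * k


def ste_grad_step(x, step, n):
    """d Q_a / d step under the STE (round treated as identity when differentiating)."""
    u = x / step
    if u <= 0:
        return 0.0                  # output clipped to level 0, independent of step
    if u >= n - 1:
        return float(n - 1)         # output is (n - 1) * step
    return math.floor(u + 0.5) - u  # round(u) - u inside the unclipped range


g_inside_low = ste_grad_step(0.3, 1.0, 4)
g_inside_high = ste_grad_step(0.7, 1.0, 4)
g_clipped = ste_grad_step(10.0, 1.0, 4)
```

Inside the unclipped range the gradient is the signed rounding residual, so inputs just below a reconstruction level push ∆ in the opposite direction from inputs just above it.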

Optimal MSE Initialization
The quantized networks are usually initialized with a pre-trained model, and the learnable step sizes are also initialized depending on the statistics of the pre-trained model. Let X be the random variable for a quantizer input, with its pdf denoted by p(x). The optimal step size for Q_w is defined in the mean squared error (MSE) sense by
$$\Delta_w^{*} = \arg\min_{\Delta} D_w(\Delta), \tag{7}$$
where D_w(∆) = E[(X − Q_w(X; ∆))²]. Using the Leibniz integral rule, we take the derivative of D_w and obtain
$$\frac{dD_w}{d\Delta} = -\frac{2}{\Delta} \int_{-\infty}^{\infty} \big(x - Q_w(x; \Delta)\big)\, Q_w(x; \Delta)\, p(x)\, dx. \tag{8}$$
By setting (8) to zero, we can find the optimal step size. In general, this equation does not have a closed-form solution and a numerical method is required. However, for common probability distributions such as Gaussian and Laplace, the step size for each N of interest can be pre-computed assuming a unit variance, and can be scaled by the standard deviation of the quantizer input. In our implementation, a Gaussian distribution is assumed for weights. For the activation quantizer Q_a, we also define the optimal step size $\Delta_a^{*}$ and the mean squared error D_a, and derive dD_a/d∆ similarly. However, in the case of the activation quantizer, a Gaussian distribution is assumed for pre-activations (activations prior to the non-linearity). The activations after the ReLU non-linearity follow a rectified Gaussian, a modification of the Gaussian where the negative elements are reset to zero. While a rectified Gaussian is a mixture of a discrete distribution for zero and a continuous distribution for the positive elements, we pre-compute the step size using the continuous part only because Q_a is designed to include zero as a reconstruction level by construction. When we pre-compute the step sizes, the standard Gaussian is assumed for pre-activations. We denote the pre-computed step sizes for activations and weights by $\Delta_a^{u}$ and $\Delta_w^{u}$, respectively. The constants for each N are summarized in Table 1, which also shows the optimal signal-to-quantization-noise ratio (SQNR).
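As a rough sketch of how a unit-variance step size could be pre-computed for weights, the code below evaluates the MSE of a symmetric N-level uniform quantizer under a standard Gaussian in closed form and minimizes it with a golden-section search. The search method, the bracket [0.01, 4], and all function names are assumptions made for illustration; the paper does not state which numerical method is used.

```python
import math


def phi(x):
    """Standard Gaussian pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def cdf(x):
    """Standard Gaussian cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def x_phi(x):
    """x * phi(x), with the limit 0 at +/- infinity."""
    return 0.0 if math.isinf(x) else x * phi(x)


def interval_error(a, b, r):
    """Integral of (x - r)^2 * phi(x) over [a, b]."""
    mass = cdf(b) - cdf(a)
    second = mass - (x_phi(b) - x_phi(a))  # integral of x^2 * phi
    first = -(phi(b) - phi(a))             # integral of x * phi
    return second - 2 * r * first + r * r * mass


def weight_mse(step, n):
    """MSE of the symmetric n-level quantizer for a unit-variance Gaussian input."""
    total = 0.0
    for i in range(n):
        r = step * (i - (n - 1) / 2)                   # reconstruction level
        a = -math.inf if i == 0 else r - step / 2      # lower decision level
        b = math.inf if i == n - 1 else r + step / 2   # upper decision level
        total += interval_error(a, b, r)
    return total


def optimal_step(n, lo=0.01, hi=4.0, iters=100):
    """Golden-section search; assumes the MSE is unimodal on [lo, hi]."""
    g = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c = hi - g * (hi - lo)
        d = lo + g * (hi - lo)
        if weight_mse(c, n) < weight_mse(d, n):
            hi = d
        else:
            lo = c
    return (lo + hi) / 2


step_1bit = optimal_step(2)
step_2bit = optimal_step(4)
```

For N = 2 the result can be checked analytically: the two levels ±∆/2 are optimal when ∆/2 equals E|X| = sqrt(2/π), so the search should return approximately 2 · sqrt(2/π).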
We use the SQNR later to analyze the training dynamics of quantized models. Even if the step size is set optimally, the MSE is proportional to the signal energy (the variance of the quantizer input), so the MSE alone is not useful for judging the optimality of the step size during training, where the signal energy varies substantially over time. Using the pre-computed step sizes, we finally obtain
$$\Delta_w^{*} = \Delta_w^{u}\,\sqrt{\mathrm{Var}[X]} \tag{9}$$
and
$$\Delta_a^{*} = \Delta_a^{u}\,\sqrt{2\,E[X^2]}. \tag{10}$$
Note that $\sqrt{2\,E[X^2]}$ is the standard deviation of the pre-activations because they were assumed to have a symmetric distribution around zero. The statistics of the quantizer input are estimated by the sample statistics. For activations, we use a given number of batches to estimate the standard deviation. For a simple implementation, we calculate the sample standard deviation of each batch and use the maximum value over the batches. In order to accurately estimate the input statistics of an activation quantizer using multiple batches, we would need to forward-propagate multiple batches for each layer in a layer-wise manner. In our experience, this adds unnecessary complexity to the implementation, and the simple method provides similar performance.
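The scaling step can be sketched as follows. The unit-variance constants are passed in as parameters because their values are not reproduced here; the function names and the toy inputs are hypothetical.

```python
import math
import statistics


def init_weight_step(weights, unit_step):
    """Unit-variance step size scaled by the standard deviation of the weights."""
    return unit_step * statistics.pstdev(weights)


def init_activation_step(batches, unit_step):
    """Unit-variance step size scaled by the largest per-batch sqrt(2 * E[X^2]).

    Each batch holds post-ReLU activations; sqrt(2 * E[X^2]) estimates the
    standard deviation of the zero-centered pre-activations.
    """
    scales = [math.sqrt(2 * sum(a * a for a in batch) / len(batch)) for batch in batches]
    return unit_step * max(scales)


toy_weights = [-1.0, 1.0, -1.0, 1.0]
toy_batches = [[0.0, 0.0, 1.0, 2.0], [0.0, 3.0, 0.0, 1.0]]
w_step = init_weight_step(toy_weights, unit_step=0.5)
a_step = init_activation_step(toy_batches, unit_step=1.0)
```

Taking the maximum over batches is the simple estimate described above; it avoids a layer-wise forward pass over all batches at the cost of a slightly conservative scale.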

Training Instability in 1-bit
While our symmetric weight quantizer and MSE init support all bit-widths seamlessly down to 1-bit, in our experience, binary training is not effective in the same setup as that for other bit-widths. We investigate the training dynamics, which leads to the following observations. First, 1-bit SGD training receives strong gradient signals initially because the initial point after binarization is far from the solution of the pre-trained model, which we use for init, in contrast to 2-bit or higher training. Second, the step sizes are not adapted to maximize the signal-to-quantization-noise ratio (SQNR) or to maintain the initial SQNR during the initial fast learning. While the objective of the learnable quantizer is not to maximize the SQNR, we observe that the step size usually changes along with the standard deviation of the quantizer input in 2-bit or higher training, maintaining a reasonable SQNR. Thus, an SQNR significantly lower than the optimal level is not considered desirable. We hypothesize that abrupt changes in the quantizer input distribution cause the step size to get stuck in a local minimum. We show empirical evidence that warm-up training mitigates this issue. In addition, we empirically find that Adam is more robust to this problem than SGD. This appears to be owing to the gradient normalization in Adam.
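The effect of a step size that no longer matches the input scale can be illustrated with a toy SQNR computation. The sketch uses the standard definition of SQNR as signal power over quantization noise power, expressed in decibels (the unit is an assumption), with a binary quantizer whose levels are ±∆/2. The inputs are made up for illustration.

```python
import math


def binary_quantize(x, step):
    """Symmetric 1-bit quantizer with levels -step/2 and +step/2."""
    return step / 2 if x >= 0 else -step / 2


def sqnr_db(inputs, step):
    """10 * log10(signal energy / quantization noise energy)."""
    signal = sum(x * x for x in inputs)
    noise = sum((x - binary_quantize(x, step)) ** 2 for x in inputs)
    return 10 * math.log10(signal / noise)


inputs = [-2.0, -1.0, 1.0, 2.0]
sqnr_matched = sqnr_db(inputs, step=3.0)  # levels +/-1.5 sit at the mean magnitude
sqnr_stale = sqnr_db(inputs, step=1.0)    # levels +/-0.5, as if the inputs had grown
```

When the input spread grows while the step size stays at its old value, the SQNR drops sharply, which is the kind of mismatch described above for the early phase of 1-bit training.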

Experimental Results
To demonstrate the effectiveness of our proposed method, we evaluate it on the CIFAR-100 [30] and ImageNet [43] datasets. The CIFAR-100 dataset consists of 60,000 32×32 color images from 100 classes, with a total of 50,000 training and 10,000 test images. The ImageNet dataset consists of more than 1.2M training images from 1000 classes and 50K validation images. We use various popular network architectures such as ResNet-18, -32, -34 [19] and MobileNetV2 [44] for evaluation. The experimental results are compared with various recent works on multi-bit quantization and neural network binarization.

ImageNet Results
Implementation details. In the following experiments, we quantize all convolutional and fully connected layers to ultra-low precision except the first and last layers, which are represented with 8-bit precision as was done in [11]. In the case of binarized networks, we keep the first, last, and downsampling layers in full precision, as was done in prior works [28,36]. We quantize weights and activations to the same bit-width for all experiments. For multi-bit and binarized networks, we use SGD and Adam, respectively. For SGD, we use 0.01 as the initial learning rate and decay it using the cosine learning rate schedule without restarts [37]. For Adam, the learning rate is fixed to 0.001 for 5 epochs as the warm-up, then increased to 0.004, and then follows the cosine schedule. For all experiments, we use layer-wise and kernel-wise quantization for activations and weights, respectively, with the exception of MobileNetV2, in which layer-wise quantization is used for both weights and activations considering the relative parameter overhead. In addition, weight decay is not used for step size parameters. Our implementation is based on PyTorch.
[Table caption, beginning lost] ... [51]. The dash symbol "-" indicates no data available and "FP" represents the full-precision network accuracy in our implementation.
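The Adam schedule described above (a constant warm-up rate, a jump to the peak rate, then cosine decay) can be sketched as a function of the epoch index. The per-epoch granularity, the decay to zero at the end of training, and the 90-epoch horizon as a default are assumptions of this sketch; the name `adam_lr` is hypothetical.

```python
import math


def adam_lr(epoch, total_epochs=90, warmup_epochs=5, warmup_lr=0.001, peak_lr=0.004):
    """Learning rate for a given epoch: flat warm-up followed by cosine decay."""
    if epoch < warmup_epochs:
        return warmup_lr
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))


schedule = [adam_lr(e) for e in range(90)]
```

The first five entries of `schedule` stay at the warm-up rate, the sixth starts at the peak rate, and the remaining entries decrease monotonically.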
We use the original ResNet-18, ResNet-34, and MobileNetV2 architectures, without any changes in their structure. For ResNets, we use the pre-activation version. We follow the commonly used data augmentation strategy as in [11,54], where the training images are randomly cropped and resized to 224 × 224, and horizontally flipped half the time. For testing, a single center crop of size 224 × 224 is applied. The transformed images are finally normalized by the mean and standard deviation. All networks are trained for 90 epochs with a batch size of 256 (2 GPUs), a momentum of 0.9, and a weight decay of 10^−4, 0.5 × 10^−4, 0.25 × 10^−4, and 0 for 4-bit, 3-bit, 2-bit, and 1-bit quantized models, respectively. We use the pre-trained models available at PyTorchCV for weight initialization. For multi-bit and binarized networks, we use the floating-point and 2-bit models for initialization, respectively. For step size initialization, the first 1000 training batches are used to estimate the statistics of activations.
Comparison with prior works on multi-bit quantization. We compare our method to existing methods in Table 2. For the existing methods, the results are cited directly from the original papers unless mentioned otherwise. Our method outperforms all the previous quantization methods in top-1 accuracy. Specifically, we achieve significant performance gains over the recent state-of-the-art methods LSQ, QKD, and SAT on all compared architectures. The improvements range from 0.1% to 4.8% compared to the second-best method (QKD). Note that QKD and SAT need a total of 120 and 150 training epochs, respectively, while our method requires only 90 epochs to obtain better accuracy. In addition, it is worth mentioning that QKD uses knowledge distillation but UniQ does not. Knowledge distillation is known to provide additional improvements on quantization results, as shown in LSQ [11]. MobileNetV2 has an architecture already optimized for efficiency, for example through depthwise convolutions, and its accuracy is known to be sensitive to quantization [11,29]. For MobileNetV2, UniQ outperforms the existing state-of-the-art method, QKD, by significant margins of 2.4% and 4.8% for 2-bit and 3-bit quantized models, respectively. A substantial increase in prediction accuracy can be seen in other 2-bit models. With ResNet-18, UniQ achieves 67.8% top-1 accuracy, with 0.2% and 0.4% improvements over LSQ and QKD, respectively. With ResNet-34, it achieves a top-1 accuracy of 72.1%, which is a 0.4% improvement over QKD and LSQ.
Table 3: Top-1 accuracy comparison to the existing state-of-the-art binarization methods on ImageNet. "Original" is used to denote methods not requiring architecture modification. †DoReFa-Net uses 2-bit for activations.
Method | Top-1 (%)
[...] | 56.4
XNOR-Net++ [3] | 57.1
IR-Net [41] | 58.1
ProxyBNN [20] | 58.7
RBNN [33] | 59.9
BinaryDuo [28] | 60.4
UniQ (Ours) | 60.5
Comparison with prior works on binarized neural networks. We further compare our method with the state-of-the-art binarization methods in Table 3. As shown in the last column, many existing methods modify a base architecture, which makes a direct comparison difficult. Even when the model size and the number of operations are the same, the memory access count can be different. For example, the dual skip connection employed in Bi-Real does not affect the model size or the number of operations for convolutions, but it does affect the memory access count. Nonetheless, to the best of our knowledge, UniQ outperforms the previous state-of-the-art accuracy for binarized ResNet-34 by a significant margin of 2.69%. For ResNet-18, UniQ achieves an accuracy comparable to that of BinaryDuo, which requires increasing the width of the skip connections.

CIFAR-100 Results
Implementation details. We use the pre-activation variant of ResNet-32 for all the experiments on CIFAR-100. We train for 350 epochs with a mini-batch size of 128. All quantized models are initialized from the pre-trained full-precision counterparts, which we train from scratch. For simplicity, we use the same weight decay of 5 × 10^−4 across all CIFAR-100 experiments. Standard data augmentation, consisting of random cropping and horizontal flipping, is applied to each training image. The first 100 training batches are used for step size initialization. We use layer-wise quantization for both weights and activations. For other settings, we follow the settings described in Section 5.1.
Comparison with LSQ. For a fair comparison, we compare LSQ to our method in our setting. We implement LSQ carefully and cross-check the correctness with [2,11]. Our final results are summarized in Table 4. When using ResNet-32, we observe that, for 4-bit quantized models, LSQ can match the accuracy of the full-precision baseline, which is in line with the results reported in the original paper [11]. It is worth noting that when the bit-width is reduced to 3, our method still achieves 71.4%, the same accuracy as the 4-bit LSQ quantized model. For 2-bit, the accuracy drops by only 2.1% when using our method compared to 2.9% for LSQ. For the most aggressive 1-bit quantization, we achieve 62.4%, while no data is reported for LSQ as its quantizer is not suitable for binarization.
Imbalanced weight distribution. For UniQ and LSQ, we show the distributions of the trained weights in Figure 3, for two different layers. We can see that the trained weights of LSQ have a negatively skewed distribution, with a long tail on the negative side and many weight values around the maximum reconstruction level. In contrast, UniQ has a symmetric distribution around the zero value, and the weights are relatively evenly distributed. The higher entropy of the balanced distribution may suggest that UniQ allows the network to retain more knowledge in the weights than LSQ.
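The entropy argument can be illustrated by computing the Shannon entropy of the occupancy of the quantization levels. The two weight lists below are synthetic and chosen only to contrast a balanced with a skewed occupancy; they are not measurements from the trained models.

```python
import math
from collections import Counter


def level_entropy_bits(quantized_weights):
    """Shannon entropy (in bits) of the empirical distribution over levels."""
    counts = Counter(quantized_weights)
    total = len(quantized_weights)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Synthetic 2-bit examples: evenly used symmetric levels vs. a skewed occupancy.
balanced = [-1.5, -0.5, 0.5, 1.5] * 25
skewed = [1.0] * 70 + [0.0] * 20 + [-1.0] * 8 + [-2.0] * 2

h_balanced = level_entropy_bits(balanced)
h_skewed = level_entropy_bits(skewed)
```

An evenly occupied 4-level grid reaches the maximum of 2 bits per weight, while the skewed occupancy carries less information per weight.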
Step size initialization. To show how important the initialization of the step sizes is, we also compare the results when the step size of the proposed quantizer is initialized with 0.1, 0.2, LSQ's heuristic, and the proposed optimal method. In LSQ, the step size is initialized to
$$\Delta = \frac{2\,E[|X|]}{\sqrt{Q_P}},$$
where X is the quantizer input and $Q_P = 2^{\mathrm{bit}} - 1$ ($Q_P = 2^{\mathrm{bit}-1} - 1$) for activations (weights). The results are summarized in Table 5. The accuracy values vary substantially depending on the initialization. It is interesting to see that our proposed quantizer with LSQ init performs poorly for all bit-widths. In contrast, by replacing LSQ init with our init, a significant performance boost can be seen for all bit-widths. A 1.0% performance boost over LSQ init is seen in 2-bit. These results suggest that the step size parameters need to be initialized properly; otherwise, performance degrades.
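For reference, the LSQ heuristic can be sketched as below, assuming the form 2 · mean(|x|) / sqrt(Q_P) with Q_P as defined for activations and weights. The function name and the toy inputs are hypothetical.

```python
import math


def lsq_init_step(values, bits, signed):
    """LSQ-style heuristic: 2 * mean(|x|) / sqrt(Q_P).

    Q_P = 2**(bits - 1) - 1 for signed inputs (weights) and
    Q_P = 2**bits - 1 for unsigned inputs (activations).
    """
    q_p = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    mean_abs = sum(abs(v) for v in values) / len(values)
    return 2 * mean_abs / math.sqrt(q_p)


w_init = lsq_init_step([-1.0, 1.0, -2.0, 2.0], bits=2, signed=True)
a_init = lsq_init_step([0.0, 1.0, 2.0, 3.0], bits=2, signed=False)
```

Note that for 1-bit signed inputs Q_P evaluates to 0, so this sketch raises a ZeroDivisionError; the heuristic in this form is not defined for the binary weight case.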
Training dynamics. While our quantizer and init method are general to all bit-widths, the binarized networks trained by UniQ do not provide satisfactory performance with the same hyperparameter setting as for 2-bit or higher. We thus investigate the activation quantizer of a convolutional layer. Specifically, we observe the evolution of the step size, the standard deviation (SD) of the quantizer input, and the SQNR during 30 epochs of training. Figures 4a and 4b show 2-bit and 1-bit training, respectively, in the same setting, where we use SGD and a learning rate of 0.01. In 2-bit, the step size shrinks as the SD decreases, which keeps the SQNR not very far from the optimal level, shown by the dotted orange line. However, in 1-bit, the SD jumps sharply at the beginning, whereas the step size does not change accordingly, dropping the SQNR substantially and suggesting that the step size might be trapped at a local minimum. Figure 4c shows how the warm-up can help mitigate this problem. We use 5 epochs of warm-up with a learning rate of 0.001 and observe that the SD changes smoothly and the step size seems to adapt to it, maintaining a better SQNR. Figure 4d shows evidence that Adam is more robust to this issue. While we use a learning rate of 0.004, which is a high rate for Adam, we do not observe the issue, and the SQNR stays approximately at the optimal level in this case.
Warm-up for 1-bit Training. We further perform experiments to find the best optimizer and warm-up strategy for binary training. The empirical results in Table 6 are in line with our analysis of the training dynamics. The results indicate that initially training a binarized model with a small learning rate for some epochs can improve the prediction accuracy substantially. In contrast, the models trained with a higher rate at the beginning can easily be trapped in a saddle point or a bad local minimum; without increasing the learning rate at some point, however, the training may converge too slowly. The results also suggest that Adam may perform better than SGD, but SGD also provides decent performance, in contrast to the common belief that SGD gives severely degraded results compared to Adam in binary training [1]. It is also shown that warm-up training helps Adam as well as SGD. Moreover, we observe that increasing the number of warm-up epochs does not improve the accuracy significantly. Thus, we choose a 5-epoch warm-up period.

Conclusion
In this paper, we have proposed a quantization method generalized for both multi-bit quantized and binarized models. We have designed a symmetric quantizer with a trainable step size and proposed an analytic, optimal initialization of the step size. In addition, we have investigated the difficulties in 1-bit training and suggested practical methods to overcome them. For multi-bit quantization, the proposed method has achieved new record accuracies for ResNet-18, -34, and MobileNetV2 on ImageNet. For binarization, without modifying the original network architectures, we have achieved better or comparable accuracy to that of recent binarized networks by focusing on the fundamental training problems.