Learning Sparse Low-Precision Neural Networks With Learnable Regularization

We consider learning deep neural networks (DNNs) that consist of low-precision weights and activations for efficient inference of fixed-point operations. In training low-precision networks, gradient descent in the backward pass is performed with high-precision weights while quantized low-precision weights and activations are used in the forward pass to calculate the loss function for training. Thus, the gradient descent becomes suboptimal, and accuracy loss follows. In order to reduce the mismatch in the forward and backward passes, we utilize mean squared quantization error (MSQE) regularization. In particular, we propose using a learnable regularization coefficient with the MSQE regularizer to reinforce the convergence of high-precision weights to their quantized values. We also investigate how partial L2 regularization can be employed for weight pruning in a similar manner. Finally, combining weight pruning, quantization, and entropy coding, we establish a low-precision DNN compression pipeline. In our experiments, the proposed method yields low-precision MobileNet and ShuffleNet models on ImageNet classification with the state-of-the-art compression ratios of 7.13 and 6.79, respectively. Moreover, we examine our method for image super resolution networks to produce 8-bit low-precision models at negligible performance loss.


I. INTRODUCTION
Deep neural networks (DNNs) have achieved performance breakthroughs in many computer vision tasks [1]. The revolutionary progress of deep learning comes with overparameterized multi-layer network architectures, and nowadays millions or tens of millions of parameters in more than one hundred layers are no longer exceptional. Network compression for efficient inference is of great interest for deployment of large DNNs on resource-limited platforms such as battery-powered mobile devices [2], [3]. On such resource-constrained hardware, not only are memory and power limited, but basic floating-point arithmetic operations are in some cases not supported. Hence, it is preferred and sometimes necessary to deliver compressed DNNs with low-precision fixed-point weights and activations (feature maps).
The associate editor coordinating the review of this manuscript and approving it for publication was Kang Li.
In this paper, we propose a network compression scheme that produces sparse low-precision DNNs through learning with regularization. In particular, we let the regularization coefficient be learnable, instead of treating it as a fixed hyperparameter, to make a smooth and efficient transition of a high-precision model into a sparse quantized model. The proposed compression pipeline is summarized in Figure 1.
• For weight pruning, we utilize partial L2 regularization to make a portion of small-value weights tend to zero so that we can safely prune them at negligible accuracy loss.
• For weight quantization, we regularize (unpruned) weights with another regularization term, the mean squared quantization error (MSQE). In this stage, we also quantize the activations (feature maps) of each layer to mimic low-precision operations at inference time. The quantization bin sizes for weights and activations are optimized to minimize their MSQEs in each layer.
• The pruned and quantized model is converted into a low-precision model, and its low-precision weights are further compressed in size with lossless entropy coding such as Huffman coding and universal source coding algorithms (e.g., see [4, Section 11.3]) for memory-efficient deployment.
It is difficult to train low-precision DNNs with standard gradient descent, since the learning rate is typically set to a small floating-point value but low-precision weights cannot be adjusted in fine resolution. To enable training of low-precision DNNs, a series of papers on binary neural networks suggests utilizing high-precision shadow weights to accumulate the negatives of the gradients in fine resolution, while the gradients are obtained from the network loss function calculated with binarized (or quantized) weights [5]-[7]. That is, high-precision weights are quantized in the forward pass, but the quantization function is replaced with the identity function in the backward pass for gradient descent. This approximate gradient descent algorithm is further refined in subsequent works [8]-[15].
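The shadow-weight scheme described above can be illustrated on a toy one-parameter problem. This is a minimal sketch under stated assumptions, not the implementation of [5]-[7]: the quadratic loss, the bin size `delta`, the learning rate, and the function names `quantize` and `train` are all hypothetical choices made for illustration.

```python
import math

def quantize(w, delta=0.5):
    """Round w to the nearest multiple of delta (assumed uniform quantizer)."""
    return delta * math.floor(w / delta + 0.5)

def train(w=0.9, target=0.3, lr=0.05, steps=200):
    """Gradient descent on a high-precision shadow weight w.

    Forward pass: the toy loss (q - target)^2 is evaluated at the quantized weight q.
    Backward pass: dq/dw is approximated by 1 (identity), so the gradient
    with respect to q is applied directly to the shadow weight w.
    """
    for _ in range(steps):
        q = quantize(w)
        grad_q = 2.0 * (q - target)  # d loss / d q
        w -= lr * grad_q             # identity approximation: d q / d w := 1
    return w, quantize(w)

w_final, q_final = train()
```

In this toy run the shadow weight settles near a bin boundary, because neither neighboring quantization level matches the target exactly. This is a small-scale picture of the forward/backward mismatch that the regularization discussed later is meant to reduce.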
BinaryRelax [16] proposed relaxation of the quantization problem via Moreau envelope (also known as Moreau-Yosida regularization) [17], [18] and used pseudo quantized weights in the forward pass to solve the relaxed quantization problem. In particular, the pseudo quantized weights are obtained by weighted average of high-precision weights and their quantized values. By manually adjusting the weighting factor in the weighted average, the pseudo quantized weights are pushed towards their quantized values gradually in training. In [19], the blended coarse gradient descent (BCGD) algorithm was proposed, where the BinaryConnect scheme [5] and the standard projected gradient descent algorithm (PGD) [20] are combined with some blending parameter. For quantization of activations, parameterized clipping activation (PACT) [21] proposed using an activation clipping parameter that is optimized during training to find the right quantization scale. The two-valued proxy derivative of the parametric activation function in [21] was further enhanced by three-valued proxy partial derivative in [19]. LQ-Nets [22] proposed finding optimal quantization levels in a subspace compatible with bit-wise operations. In [23], it was proposed to learn separate scaling factors for fine-grained weight subgroups (e.g., pixel-wise or row-wise scaling factors).
The mismatch in the forward and backward passes results in sub-optimal gradient descent that causes accuracy loss. The mismatch is more problematic for models using lower-precision weights and activations, since the quantization error is more significant. There have been some attempts to reduce this mismatch by introducing a better backward pass approximation, e.g., using clipped ReLU and log-tailed ReLU instead of the linear function (e.g., see [11]). Recently, it was proposed to use a smooth differentiable approximation of the staircase quantization function. In [24], an affine combination of high-precision weights and their quantized values, called alpha blending, was used to replace the quantization function. In [25], the quantization function was approximated as a linear combination of several sigmoid functions with learnable biases and scales. Similarly, differentiable soft quantization (DSQ) [26] exploited a series of hyperbolic tangent functions to approximate the staircase quantization function. In these methods, the approximation gradually approaches the quantization function during training by adjusting the blending factor or the temperature parameter in the sigmoid function. Our approach is different from these efforts. Instead of enhancing the backward pass approximation, we use regularization to steer high-precision weights to converge to their quantized values so that the mismatch between high-precision weights and quantized weights becomes smaller.
We reduce the mismatch between high-precision weights and quantized weights with MSQE regularization. In particular, we propose making the regularization coefficient learnable. Using learnable regularization, high-precision weights are reinforced to converge to their quantized values gradually in training. We empirically show that our learnable regularization yields more accurate low-precision models than the conventional regularization with a fixed regularization coefficient. MSQE is a well-known distortion metric in data quantization, and it has been used in network quantization as well to reduce the performance loss from quantization (e.g., see [8], [27]). Our contribution is to use MSQE as a regularizer with a learnable coefficient, which is new to the best of our knowledge. The loss-aware weight quantization in [12], [13] proposed the proximal Newton algorithm to minimize the loss function under the constraints of low-precision weights, which is however impractical for large-size networks due to the prohibitive computational cost to estimate the Hessian matrix of the loss function. Our method simply uses the stochastic gradient descent, while the mismatch between high-precision weights and quantized weights is minimized with the MSQE regularization. No regularization is considered in [12], [13]. Relaxed quantization [28] introduced a differentiable quantization procedure by transforming continuous distributions of weights and activations to differentiable soft categorical distributions. Our method is much simpler than the relaxation procedure in [28], since it only requires MSQE regularization. Furthermore, it shows better performance than [28] empirically in MobileNet quantization.
Weight pruning curtails redundant weights completely from DNNs so one can skip the computations for pruned weights. Some successful pruning algorithms can be found in [29]- [33]. In this paper, we discuss how partial L2 regularization can be used for weight pruning. Finally, combining weight pruning, quantization, and entropy coding, as shown in Figure 1, we achieve the state-of-the-art compression results for low-precision MobileNet [34] and ShuffleNet [35] on ImageNet classification.
Weight sharing is another network compression scheme studied in [36]-[43]. It reduces the number of distinct weight values in DNNs by quantization. In contrast to low-precision weights from uniform quantization, weight sharing allows non-uniform quantization. For non-uniform quantization (e.g., k-means clustering), quantization output levels (e.g., cluster centers) do not have to be evenly spaced, and they are usually high-precision floating-point values. The quantization output levels are the shared weight values used in inference. Thus, floating-point arithmetic operations are still needed in inference, although the quantized weights can be compressed in size by lossless source coding (e.g., Huffman coding).
We finally note that reinforcement learning has been proposed as a promising methodology to search for quantized and/or compressed models that satisfy certain latency, energy, and/or model size requirements, given hardware specifications to deploy the models [44], [45].

II. LOW-PRECISION DNN MODEL
We consider low-precision DNNs that are capable of efficient processing in the inference stage by using fixed-point arithmetic operations. In particular, we focus on the fixed-point implementation of convolutional and fully-connected layers, since they are the dominant parts of computational costs and memory requirements in DNNs (see [2, Table 2]).
The major bottleneck of efficient DNN processing is known to be in memory accesses [2, Section V-B]. Horowitz provides rough energy costs of various arithmetic and memory access operations for 45 nm technology [46, Figure 1.1.9], where we can find that memory accesses typically consume more energy than arithmetic operations, and the memory access cost increases with the read size. Hence, for example, deploying binary models instead of 32-bit models is expected to reduce energy consumption by at least 32×, owing to 32 times fewer memory accesses.
Low-precision weights and activations basically stem from uniform quantization (e.g., see [47, Section 5.4]), where quantization bin boundaries are uniformly spaced and quantization output levels are the midpoints of bin intervals. Quantized weights and activations are represented by fixed-point numbers of small bit-width. Scaling factors (i.e., quantization bin sizes) are defined in each layer for fixed-point weights and activations, respectively, to alter their dynamic ranges. Figure 2 shows the fixed-point design of a general convolutional layer consisting of convolution, bias addition and non-linear activation. Fixed-point weights and input feature maps are given with common scaling factors $\delta_l$ and $\Delta_l$, respectively, where $l$ is the layer index. Then, the convolution operation can be implemented by fixed-point multipliers and accumulators. Biases are added, if present, after the convolution, and then the output is scaled properly by the product of the scaling factors for weights and input feature maps, i.e., $\delta_l \Delta_l$, as shown in the figure. Here, the scaling factor for the biases is specially set to be $\delta_l \Delta_l$ so that fixed-point bias addition can be done easily without another scaling. Then, a non-linear activation function follows. Finally, the output activations are fed into the next layer as the input.
Using rectified linear unit (ReLU) activation, two scaling operations across two layers, i.e., scaling operations by $\delta_l \Delta_l$ and $1/\Delta_{l+1}$, can be combined into one scaling operation by $\delta_l \Delta_l / \Delta_{l+1}$ before (or after) ReLU activation. Furthermore, if the scaling factors are power-of-two numbers, then one can even implement scaling by bit-shift. Similarly, low-precision fully-connected layers can be implemented by replacing convolution with matrix multiplication in the figure.
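The integer data path described above can be sketched for a fully-connected layer as follows. This is an illustrative sketch only: the function name `fixed_point_dense`, the argument layout, and the assumption that the combined scaling factor (weight scale times input scale, divided by output scale) equals `2 ** (-shift)` are hypothetical, and clipping of the output to the activation bit-width is omitted for brevity.

```python
def fixed_point_dense(x_int, w_int, b_int, shift):
    """Integer-only dense layer followed by ReLU and power-of-two rescaling.

    x_int: input activations as integers (real value = x_int * input scale)
    w_int: rows of integer weights       (real value = w_int * weight scale)
    b_int: integer biases                (real value = b_int * weight scale * input scale)
    shift: assumes weight scale * input scale / output scale == 2 ** (-shift)
    """
    out = []
    for row, b in zip(w_int, b_int):
        acc = sum(w * x for w, x in zip(row, x_int)) + b  # integer multiply-accumulate + bias
        acc = max(acc, 0)                                 # ReLU on the accumulator
        out.append(acc >> shift)                          # rescale by bit-shift (rounds down)
    return out

y = fixed_point_dense([3, 1, 2], [[2, -1, 4], [-3, 0, 1]], [1, 2], shift=2)
```

Because the bias shares the product of the weight and input scales, it can be added directly to the accumulator without a separate rescaling step. The right shift here rounds down, which is a simplification; a rounding shift could be used instead.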

III. REGULARIZATION FOR LOW-PRECISION DNNs
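In this section, we present the regularizers that are utilized to learn pruned and quantized DNNs of low-precision weights and activations. We first define the quantization function. Given the number of bits, i.e., bit-width $n$, the quantization function yields
$$Q_n(x; \delta) = \delta \cdot \mathrm{clip}_n\big(\mathrm{round}(x/\delta)\big), \tag{1}$$
where $x$ is the input and $\delta$ is the scaling factor; we let
$$\mathrm{round}(x) = \mathrm{sign}(x) \lfloor |x| + 0.5 \rfloor, \qquad \mathrm{clip}_n(x) = \min(\max(x, -2^{n-1}), 2^{n-1} - 1),$$
where $\lfloor x \rfloor$ is the largest integer smaller than or equal to $x$. For ReLU activation, the ReLU output is always non-negative, and thus we use the unsigned quantization function given by
$$Q_n^+(x; \delta) = \delta \cdot \mathrm{clip}_n^+\big(\mathrm{round}(x/\delta)\big), \tag{2}$$
for $n \ge 1$, where $\mathrm{clip}_n^+(x) = \min(\max(x, 0), 2^n - 1)$.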
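The two quantization functions can be sketched in a few lines of Python. The unsigned clipping range follows the definition given in the text; the rounding convention (ties away from zero) and the signed clipping range of `[-2**(n-1), 2**(n-1) - 1]` are assumptions made here for illustration, and the function names are hypothetical.

```python
import math

def round_half_away(x):
    """sign(x) * floor(|x| + 0.5): round to nearest, ties away from zero."""
    return math.copysign(math.floor(abs(x) + 0.5), x)

def quantize_signed(x, delta, n):
    """Signed n-bit uniform quantizer (assumed clip range [-2^(n-1), 2^(n-1) - 1])."""
    lo, hi = -(2 ** (n - 1)), 2 ** (n - 1) - 1
    return delta * min(max(round_half_away(x / delta), lo), hi)

def quantize_unsigned(x, delta, n):
    """Unsigned n-bit quantizer with clip range [0, 2^n - 1], as used after ReLU."""
    return delta * min(max(round_half_away(x / delta), 0), 2 ** n - 1)
```

With a bin size of 0.1 and 4 bits, for example, any non-negative input above 1.5 saturates at the top unsigned level, and any negative input maps to zero.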

A. REGULARIZATION FOR WEIGHT QUANTIZATION
Consider a general non-linear neural network consisting of $L$ layers. Let $W_1, W_2, \dots, W_L$ be the sets of high-precision weights in layers 1 to $L$, respectively. For notational simplicity, we let $A_1^L = A_1, A_2, \dots, A_L$ for any symbol $A$. We define the MSQE regularizer for weights of all $L$ layers as
$$R_n(W_1^L; \delta_1^L) = \frac{1}{N} \sum_{l=1}^{L} \sum_{w \in W_l} |w - Q_n(w; \delta_l)|^2, \tag{3}$$
where $n$ is the bit-width for quantized weights, $\delta_l$ is the scaling factor (i.e., quantization bin size) for quantized weights, and $N$ is the total number of weights from all layers, i.e.,
$$N = \sum_{l=1}^{L} |W_l|,$$
where $|W_l|$ is the number of weights in layer $l$. We assumed that bit-width $n$ is the same for all layers, just for notational simplicity, but it can be easily extended to more general cases such that each layer has a different bit-width.
VOLUME 8, 2020
Including the MSQE regularizer in (3), the cost function to optimize in training is given by
$$C(X, W_1^L, \delta_1^L) = E(X; Q_n(W_1^L; \delta_1^L)) + \lambda R_n(W_1^L; \delta_1^L), \tag{4}$$
where, with a slight abuse of notation, $Q_n(W_1^L; \delta_1^L)$ denotes the set of quantized weights of all $L$ layers, $E(X; Q_n(W_1^L; \delta_1^L))$ is the target loss function evaluated on the training dataset $X$ using the quantized weights, and $\lambda$ is the regularization coefficient. We set the scaling factors $\delta_1^L$ to be learnable parameters and optimize them along with weights $W_1^L$.
Remark 1: We clarify that we use high-precision weights in the backward pass for gradient descent by approximately replacing the quantization function $Q_n$ with the identity function. In the forward pass, we use quantized weights and activations, and the target objective function $E$ is also calculated with the quantized weights and activations to mimic the low-precision inference-stage loss. Hence, the final trained models are low-precision models, which can be operated on low-precision fixed-point hardware in inference with no accuracy loss. Note that our method still has the gradient mismatch problem, similar to the existing approaches (see Section I). However, by adding the MSQE regularizer, we encourage high-precision weights to converge to their quantized values so that we reduce the mismatch.
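The MSQE regularizer and the regularized cost can be sketched as below. This is a sketch under assumptions rather than the authors' code: the signed quantizer (rounding convention and clip range) is an assumed form, the task loss is taken as a number computed elsewhere with quantized weights, and the names `quantize`, `msqe`, and `cost` are hypothetical.

```python
import math

def quantize(w, delta, n):
    """Assumed signed n-bit uniform quantizer with bin size delta."""
    k = math.copysign(math.floor(abs(w) / delta + 0.5), w)
    return delta * min(max(k, -(2 ** (n - 1))), 2 ** (n - 1) - 1)

def msqe(layers, deltas, n):
    """Mean squared quantization error over all weights of all layers."""
    total = sum((w - quantize(w, d, n)) ** 2
                for ws, d in zip(layers, deltas) for w in ws)
    count = sum(len(ws) for ws in layers)  # N: total number of weights
    return total / count

def cost(task_loss, layers, deltas, n, lam):
    """Task loss (evaluated with quantized weights) plus lam times the MSQE."""
    return task_loss + lam * msqe(layers, deltas, n)

layers = [[0.1, 0.26], [-0.5]]   # two toy layers
deltas = [0.25, 0.5]             # one bin size per layer
reg = msqe(layers, deltas, n=4)
```

The regularizer costs one pass over the weights, which is consistent with the linear complexity noted later in the text.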

1) LEARNABLE REGULARIZATION COEFFICIENT
The regularization coefficient λ in (4) is a hyper-parameter that controls the trade-off between the loss and the regularization. It is conventionally fixed ahead of training. However, searching for a good hyper-parameter value is usually time-consuming. Hence, we propose the learnable regularization coefficient, i.e., we let the regularization coefficient be another learnable parameter.
We start training with a small initial value for $\lambda$, i.e., with little regularization. However, we promote the increase of $\lambda$ in training by adding a penalty term for a small regularization coefficient, which is $-\log \lambda$ for $\lambda > 0$, in the cost function (see (5)). The increasing coefficient $\lambda$ reinforces the convergence of high-precision weights to their quantized values for reducing the MSQE. It consequently alleviates the gradient mismatch problem (see Remark 1). The cost function in (4) is altered into
$$C(X, W_1^L, \delta_1^L, \lambda) = E(X; Q_n(W_1^L; \delta_1^L)) + \lambda R_n(W_1^L; \delta_1^L) - \log \lambda. \tag{5}$$
For gradient descent, we need the gradients of (5) with respect to weights, scaling factors and the regularization coefficient, respectively, which are provided in the Appendix.
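As a short worked step, holding the weights and scaling factors fixed and differentiating the two $\lambda$-dependent terms of the cost with respect to the coefficient alone shows why the $-\log \lambda$ penalty pushes $\lambda$ upward as the MSQE shrinks. This only characterizes the stationary point in $\lambda$ (with $R_n$ abbreviating the MSQE regularizer); it is not a description of the joint training dynamics.

```latex
\frac{\partial}{\partial \lambda}\bigl(\lambda R_n - \log \lambda\bigr)
  = R_n - \frac{1}{\lambda} = 0
\quad\Longrightarrow\quad
\lambda^{\star} = \frac{1}{R_n}
```

For fixed weights, the balancing value of $\lambda$ is therefore the reciprocal of the current MSQE: halving $R_n$ doubles $\lambda^{\star}$, so the regularization strengthens as the weights approach their quantized values.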
Remark 2: In (5), observe that we use quantized weights in the forward pass to compute the loss while we update high-precision weights with gradient descent in the backward pass, as in BinaryConnect [5]. Thus, our method is different from BinaryRelax [16], which uses pseudo quantized weights in the forward pass. The pseudo quantized weights are computed by a weighted average of high-precision weights and their quantized values. Our MSQE regularization resembles Moreau-Yosida regularization in BinaryRelax. However, the Moreau-Yosida regularization factor in BinaryRelax is manually increased with a fixed rate at every iteration in training so that the pseudo quantized weights are pushed towards quantized values as training goes on. In our scheme, the difference between high-precision weights and quantized weights is reduced by the MSQE regularization. Moreover, we propose letting the regularization coefficient $\lambda$ be learnable and adding another penalty term $-\log \lambda$ to promote increasing $\lambda$; hence, $\lambda$ does not necessarily increase with a fixed rate and can saturate after some point of training to find a better local optimum, as shown in Figure 6(a). We do not constrain the range of $\lambda$ in (5), so it is possible that $\lambda$ diverges in optimization. However, we empirically found that $\lambda$ saturates after some point of training in practice as the loss saturates (e.g., see Figure 6(a)).
Figure 3 presents an example of how high-precision weights are gradually quantized by our regularization scheme. We plotted weight histogram snapshots captured at the second convolutional layer of the MNIST LeNet-5 model while a pre-trained model is quantized to a 4-bit fixed-point model. The histograms in the figure from left to right correspond to 10k, 21k, 23k, and 30k batch iterations in training, respectively. Observe that the weight distribution gradually converges to a sum of uniformly spaced delta functions and all high-precision weights converge to quantized values completely in the end.

3) COMPARISON TO SOFT WEIGHT SHARING
In soft weight sharing [38], [48], a Gaussian mixture prior is assumed, and the model is regularized to form groups of weights that have similar values around the Gaussian component centers (e.g., see [49, Section 5.5.7]). The learnable regularization coefficient can be related to the learnable variance in the Gaussian mixture prior. However, our weight regularization method is different from soft weight sharing since we consider uniform quantization and optimize quantization bin sizes, instead of optimizing individual Gaussian component centers for non-uniform quantization. We employ the simple MSQE regularization term for quantization, so that it is applicable to large DNNs. Note that soft weight sharing yields a regularization term given by the logarithm of a sum of exponential functions, which is sometimes too complex to compute for large DNNs. In our method, the additional computational complexity for MSQE regularization is not expensive. It only scales in the order of $O(N)$, where $N$ is the number of weights. Hence, the proposed scheme is easily applicable to state-of-the-art DNNs with millions or tens of millions of weights.
We note that biases are treated similarly to weights. However, for the fixed-point design presented in Section II, we use $\delta_l \Delta_l$ instead of $\delta_l$ as the scaling factor in (3), where $\Delta_l$ is the scaling factor for input feature maps (i.e., activations from the previous layer), which is determined by the following activation quantization procedure.

B. QUANTIZATION OF ACTIVATIONS
We quantize the output activation (feature map) $x$ of layer $l$ for $1 \le l \le L$ and yield $Q_m^+(x; \Delta_l)$, where $Q_m^+$ is the quantization function in (2) for bit-width $m$ and $\Delta_l$ is the learnable scaling factor for quantized activations of layer $l$. We note that $\Delta_l$ is the scaling factor for activations of layer $l$ here, whereas it denotes the scaling factor for input feature maps of layer $l$ in Section II (see Figure 2). This is just one index shift in the notation, since the output of layer $l$ is the input to layer $l + 1$. We adopt this change just for notational simplicity. Similar to (3), we assumed that activation bit-width $m$ is the same for all layers, but this constraint can be easily relaxed to cover the cases where each layer has a different bit-width. We assumed ReLU activation and used the unsigned quantization function $Q_m^+$, while we can replace $Q_m^+$ with $Q_m$ in case of general non-linear activation (see (1) and (2)). We optimize $\Delta_l$ by minimizing the MSQE for activations of layer $l$, i.e., we minimize
$$S_m(A_l; \Delta_l) = \frac{1}{|A_l|} \sum_{x \in A_l} |x - Q_m^+(x; \Delta_l)|^2, \tag{6}$$
where $A_l$ is the set of activations of layer $l$ for $1 \le l \le L$.
In the backward pass, we first perform gradient descent for weights and their scaling factors using the loss function in (5), and then we update $\Delta_l$ with gradient descent using (6). We do not utilize (6) in gradient descent for weights.
Backpropagation Through Quantized Activations: Backpropagation is not feasible through quantized activations analytically since the gradient is zero almost everywhere. For backpropagation through the quantization function, we adopt the straight-through estimator [50]. In particular, we pass the gradient through the quantization function when the input is within the clipping boundary. If the input is outside the clipping boundary, we pass zero.
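The clipped straight-through rule for activations can be sketched for a single scalar as follows. The exact clipping boundary used in the backward pass is an assumption here (taken as zero up to the largest representable unsigned level), and the function names are hypothetical.

```python
import math

def act_quant_forward(x, delta, m):
    """Unsigned m-bit quantization of a ReLU output x with bin size delta."""
    k = math.floor(x / delta + 0.5)
    return delta * min(max(k, 0), 2 ** m - 1)

def act_quant_backward(x, delta, m, grad_out):
    """Straight-through estimator: pass grad_out inside the clipping range, zero outside.

    Assumption: the clipping range is [0, delta * (2^m - 1)].
    """
    upper = delta * (2 ** m - 1)
    return grad_out if 0.0 <= x <= upper else 0.0

g_inside = act_quant_backward(0.5, 0.1, 4, grad_out=2.0)   # within range: gradient passes
g_outside = act_quant_backward(3.0, 0.1, 4, grad_out=2.0)  # saturated: gradient is blocked
```

Blocking the gradient for saturated inputs means that activations far outside the representable range do not receive updates through the quantizer, while the scaling factor itself is adjusted separately through the activation MSQE.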

C. REGULARIZATION FOR WEIGHT PRUNING
For weight pruning, we propose using partial L2 regularization. In particular, given a target pruning ratio $r$, we find the $r$-th percentile of weight magnitude values. Assuming that we prune the weights below this $r$-th percentile value in magnitude, we define an L2 regularizer only for them as follows:
$$R_r(W_1^L) = \frac{1}{N} \sum_{l=1}^{L} \sum_{w \in W_l} |w|^2 \, 1_{|w| < \theta(r)},$$
where $\theta(r)$ is the $r$-th percentile of weight magnitude values, which is the threshold for pruning. Adopting the learnable regularization coefficient as in (5), we have
$$C(X, W_1^L, \lambda) = E(X; W_1^L) + \lambda R_r(W_1^L) - \log \lambda.$$
The partial L2 regularizer encourages the weights below the threshold to move towards zero, while the other unregularized weights are updated to minimize the loss due to pruning. The threshold $\theta(r)$ is also updated at every iteration of training based on the instant weight distribution. We note that the threshold $\theta(r)$ decreases as training goes on since the regularized weights gradually converge to zero (see Figure 4). After finishing the regularized training, we finally have a set of weights clustered very near zero. The loss due to pruning these small-value weights is negligible.
After weight pruning, the pruned model is quantized by following the quantization procedure in Section III-A and Section III-B. In this stage, pruned weights are fixed to be zero while only unpruned weights are updated and quantized. After pruning, we still use quantization bins around zero for the weights that are not pruned but have small magnitude, or for the weights that are made to be small after training the quantized network; unpruned weights between $-\delta/2$ and $\delta/2$ are still quantized to zero, where $\delta$ is the quantization bin size. However, the number of (unpruned) weights that are quantized to zero becomes much smaller after pruning.
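A minimal sketch of the partial L2 regularizer and the final pruning step is given below. The percentile convention (index into the sorted magnitudes), the normalization by the total number of weights, and the function names are assumptions for illustration; in training the threshold would be recomputed at every iteration, as the text describes.

```python
def percentile_threshold(weights, r):
    """Magnitude below which roughly a fraction r of the weights fall (0 <= r < 1)."""
    mags = sorted(abs(w) for w in weights)
    return mags[int(r * len(mags))]

def partial_l2(weights, r):
    """Mean of w^2 over all weights, counting only those with |w| below the threshold."""
    theta = percentile_threshold(weights, r)
    return sum(w * w for w in weights if abs(w) < theta) / len(weights)

def prune(weights, r):
    """Set weights whose magnitude is below the threshold to zero."""
    theta = percentile_threshold(weights, r)
    return [0.0 if abs(w) < theta else w for w in weights]

weights = [0.5, -0.1, 0.02, -0.8, 0.3, 0.05]
penalty = partial_l2(weights, r=0.5)
pruned = prune(weights, r=0.5)
```

Only the smaller half of the weights contributes to the penalty in this example, so the larger weights remain free to compensate for the loss caused by pruning.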

IV. EXPERIMENTS
We evaluate the proposed low-precision DNN compression for ImageNet classification and image super resolution. Image super resolution is included in our experiments as a regression problem since its accuracy is more sensitive to quantization than classification. Note that TensorFlow Lite already supports a very efficient 8-bit weight and activation quantization tool for network development on mobile platforms. Thus, our experimental results focus on more extreme cases of quantization using fewer than 8 bits, where a more sophisticated algorithm is needed for smaller loss. We use FLP and FXP to denote the floating-point and fixed-point formats, respectively.

A. EXPERIMENTAL SETTINGS
For ImageNet classification, we use the ImageNet ILSVRC 2012 dataset [51]. For image super resolution, we use the Open Images dataset as the training dataset, which is pre-processed as described in [52]. The proposed network pruning, quantization, and compression pipeline is implemented with Caffe. The pre-trained models used in our ImageNet classification experiments are obtained from the links in Table 1. For image super resolution, we train (CT-)SRCNNs from scratch as described in [52].
Provided a pre-trained high-precision model, weight scaling factors $\delta_1^L$ are initialized to cover the dynamic range of the pre-trained weights, i.e., the 99-th percentile magnitude of the weights in each layer. Similarly, activation scaling factors $\Delta_1^L$ are set to cover the dynamic range of the activations in each layer, which are obtained by feeding a small number of training data to the pre-trained model.
For quantization of ImageNet classification networks, we employ the Adam optimizer [53]. The learning rate is set to be $10^{-5}$ and we train 300k batches with the batch size of 256, 128, 32 and 64 for AlexNet, ResNet-18, MobileNet and ShuffleNet, respectively. Then, we decrease the learning rate to $10^{-6}$ and train 200k more batches. For the learnable regularization coefficient $\lambda$, we let $\lambda = e^{\omega}$ and learn $\omega$ instead in order to make $\lambda$ always positive in training. The initial value of $\omega$ is set to be 0, and it is updated with the Adam optimizer using the learning rate of $10^{-4}$.
For pruning of MobileNet and ShuffleNet, the Adam optimizer is used for 500k batches with learning rate $10^{-5}$, without decreasing the learning rate to $10^{-6}$ at 300k batches. The initial value of $\omega$ is set to be 10 in pruning. The other settings are the same as described above for quantization. Then, pruned MobileNet and ShuffleNet models are quantized by following the same training procedure as described above for quantization.
For quantization of image super resolution networks, we train the quantized models using the Adam optimizer for 3M batches with the batch size of 128. We use the learning rate of $10^{-5}$. The initial value for $\omega$ is set to be 0 and it is updated by the Adam optimizer using the learning rate of $10^{-5}$.
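One possible reading of the percentile-based initialization of a weight scaling factor is sketched below. The text only states that the scaling factor is initialized to cover the 99-th percentile magnitude; mapping that magnitude onto the largest positive level of a signed `n`-bit grid (which requires `n >= 2`), the percentile indexing, and the function name are assumptions made here.

```python
def init_weight_scale(weights, n, pct=0.99):
    """Bin size such that the pct-percentile magnitude lands on the top positive level.

    Assumption: signed n-bit grid with largest positive level 2^(n-1) - 1, n >= 2.
    """
    mags = sorted(abs(w) for w in weights)
    idx = min(int(pct * len(mags)), len(mags) - 1)
    return mags[idx] / (2 ** (n - 1) - 1)

weights = [i / 100 for i in range(-100, 101)]  # made-up weights in [-1, 1]
delta = init_weight_scale(weights, n=4)
```

Using a percentile rather than the maximum keeps a few outlier weights from inflating the bin size; since the scaling factors are learnable, this only sets the starting point.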
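The exponential parameterization of the regularization coefficient can be tried out on a toy problem in which the regularizer value is held fixed. This sketch uses plain gradient descent instead of Adam, and the step size, step count, and function name are hypothetical; it only illustrates that the coefficient stays positive and drifts towards the reciprocal of the regularizer value.

```python
import math

def learn_lambda(reg_value, omega=0.0, lr=1e-2, steps=2000):
    """Gradient descent on omega for lam * R - log(lam), with lam = exp(omega), R fixed."""
    for _ in range(steps):
        lam = math.exp(omega)
        grad_omega = lam * reg_value - 1.0  # d/d omega of (exp(omega) * R - omega)
        omega -= lr * grad_omega
    return math.exp(omega)

lam = learn_lambda(reg_value=0.1)  # approaches 1 / 0.1 = 10
```

In actual training the regularizer value changes along with the weights, so the coefficient tracks a moving target rather than converging to a constant as it does here.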

1) AlexNet QUANTIZATION
In Table 2, we compare our quantization method to DoReFa-Net [9] for the AlexNet model in [54]. Since DoReFa-Net does not consider weight pruning, we do not apply pruning here either. The DoReFa-Net results in Table 2 are (re-)produced by us from their code, and we use the same training hyperparameters and epochs as we described in Section IV-A for fair comparison. We evaluate two cases where (1) all layers are quantized, and (2) all layers except the first and the last layers are quantized. The results in Table 2 show that 4-bit quantization is needed for accuracy loss less than 1%. For binary weights, we observe an accuracy loss of around 10%. However, we can see that our quantization scheme performs better than DoReFa-Net, in particular for low-precision cases, where the quantization error is larger and the mismatch problem of the forward and backward passes is more severe.

2) ResNet-18 QUANTIZATION
Figure 5 presents the accuracy of the low-precision ResNet-18 [55] models obtained from our quantization method. The experiments on ResNet-18 are mainly for ablation study. In particular, we compare weight and activation quantization for various low-precision settings. The loss due to weight quantization is smaller than the loss due to activation quantization, which is consistent with the results from DoReFa-Net [9]. We also compare the low-precision models obtained with and without the constraint of power-of-two scaling factors. In fixed-point computations (see Figure 2), it is more appealing for scaling factors (i.e., quantization bin sizes) to be powers of two so that they can be implemented by simple bit-shift, rather than with scalar multiplication. For power-of-two scaling factors, we perform rounding of scaling factors into their closest power-of-two values in the forward pass, while the rounding function is replaced with the identity function in the backward pass. We observe small performance degradation due to the constraint of power-of-two scaling factors in our experiments.
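Rounding a scaling factor to a power of two can be sketched as follows. "Closest" is interpreted here in the log2 domain, which is an assumption (the text does not say whether closeness is measured linearly or logarithmically), and the function name is hypothetical.

```python
import math

def round_to_power_of_two(delta):
    """Nearest power of two to delta > 0, measured in the log2 domain (assumed)."""
    return 2.0 ** round(math.log2(delta))

examples = [round_to_power_of_two(d) for d in (0.3, 0.1, 3.0)]
```

In training, this rounding would be applied only in the forward pass, with the gradient passed to the unrounded scaling factor as if the rounding were the identity, mirroring the treatment of the weight quantizer.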
In Table 3, we compare the proposed quantization scheme to the existing quantization methods from [19], [21], [26] for 4-bit weight and 4-bit activation quantization of ResNet-18. All convolutional and fully-connected layers of ResNet-18 are quantized in [19], [26], and ours, while the first and the last layers are not quantized in [21]. Since the baseline 32-bit model shows different accuracy in each method, we compare the accuracy difference between 32-bit floating-point models and 4-bit fixed-point models. For our method, we also show the accuracy obtained by using the average score from 10 different crops of the input (called 10-crop testing), where the baseline accuracy of our 32-bit floating-point model is aligned with the others. The results show that the proposed quantization scheme achieves 4-bit ResNet-18 quantization whose accuracy loss is comparable to the state-of-the-art methods. In particular, the accuracy loss from 4-bit quantization is shown to be very small and less than 1% in our scheme.
Learnable Versus Fixed Regularization Coefficients: In Table 4, we compare the performance of quantized ResNet-18 [55] models when we use learnable and fixed regularization coefficients, respectively. Observe that the proposed learnable regularization method outperforms the conventional regularization method with a fixed coefficient in various low-precision settings.
TABLE 5. Low-precision MobileNet and ShuffleNet compression results for ImageNet classification. For ablation study, we compare pruning-only results and pruning+quantization results with various low-precision settings. We also show the compression results with and without entropy coding, where we used bzip2 as a specific entropy coding scheme.
In Figure 6, we compare the convergence curves when learnable and fixed regularization coefficients are used, respectively. Using a learnable regularization coefficient, the MSQE regularization term decreases (although there is a bump in the middle) while λ increases in training. However, using a fixed regularization coefficient, the MSQE regularization term saturates and even increases after some point as training goes on, which implies that the mismatch of the forward and backward passes is not resolved. The unresolved mismatch eventually turns into accuracy loss, as shown in the figure.

3) MobileNet AND ShuffleNet COMPRESSION
We mainly evaluate our method to obtain compressed lowprecision MobileNet [34] and ShuffleNet [35] models for ImageNet classification. MobileNet and ShuffleNet are stateof-the-art ImageNet classification networks developed for efficient inference on resource-limited platforms. Compression and quantization of such efficient networks are important in practice to lower latency and to improve power-efficiency further in mobile and edge devices. It is typically more difficult to compress and quantize such networks of efficient architectures. For MobileNet and ShuffleNet compression, we prune 50% weights from their pre-trained models as described in Section III-C so that the accuracy loss due to pruning is marginal. Then, we employ our weight and activation quantization method. After converting into sparse lowprecision models, universal source coding with bzip2 [56] follows to compress the fixed-point low-precision weights.
In Table 5, for ablation study, we compare pruning-only results and pruning+quantization results with various low-precision settings. We also show the compression results with and without entropy coding, where we use bzip2 as a specific entropy coding scheme. Observe that the accuracy loss is marginal when we prune 50% of the weights for both MobileNet and ShuffleNet. After pruning 50% of the weights, we quantize the pruned models. Similar to the AlexNet and ResNet-18 results, the accuracy loss from quantization is more severe when we decrease the activation bit-width than when we decrease the weight bit-width. From the experiments, we obtain low-precision models of 5-bit weights and 8-bit activations with a top-1 accuracy loss of only 0.6%. The compression ratio of these low-precision models is 6.40 without bzip2 compression, and it increases to 7.13 and 6.79 for MobileNet and ShuffleNet, respectively, after bzip2 compression. We also show that our scheme outperforms the existing quantization schemes from TensorFlow and [28].
In Figure 7, we compare the compression ratios of our scheme and the existing network compression methods in [36], [42], [45], [57]. Our low-precision network compression scheme shows compression ratios comparable to those of the state-of-the-art weight compression schemes. We emphasize that our scheme produces low-precision models of fixed-point weights and activations that support efficient inference with fixed-point operations, while the previous compression schemes, except [45], produce quantized weights that are still floating-point numbers, so floating-point operations are necessary to achieve their reported accuracy. The hardware-aware automated quantization in [45] achieved impressive compression results by using reinforcement learning to search for a quantized model of "mixed" precision across layers, but not all hardware supports mixed-precision operations.

4) IMAGE SUPER RESOLUTION NETWORK QUANTIZATION
The image super resolution problem is to synthesize a high-resolution image from a low-resolution one. The DNN output is the high-resolution image corresponding to the input low-resolution image, and thus the loss due to quantization is more prominent. We evaluate the proposed method on SRCNN [58] and cascade-trained SRCNN (CT-SRCNN) [52] for image super resolution. The objective image quality metric measured by the peak signal-to-noise ratio (PSNR) and the perceptual score measured by the structural similarity index (SSIM) [59] are compared for the Set-14 image dataset [60] in Table 6 for 3-layer SRCNN, 5-layer CT-SRCNN, and 9-layer CT-SRCNN, respectively. Observe that our method successfully yields low-precision models of 8-bit weights and activations at negligible loss, and they are better than the results that we obtain with one of the previous works, Ristretto [14]. It is interesting to see that the PSNR loss of using binary weights and 8-bit activations is only 0.5 dB.
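To put the reported PSNR differences in perspective, the following minimal sketch computes PSNR from its standard definition, 10 log10(peak^2 / MSE). The function name, the flat-list image representation, and the 8-bit peak value of 255 are assumptions made for illustration; they are not taken from the evaluation code behind Table 6.

```python
import math

def psnr(reference, test, peak=255.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(test):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio_for_psnr_loss(loss_db):
    """Factor by which the MSE grows when PSNR drops by loss_db decibels."""
    return 10.0 ** (loss_db / 10.0)

reference = [10, 50, 90, 130]
degraded = [12, 49, 91, 128]
print(psnr(reference, degraded))
print(mse_ratio_for_psnr_loss(0.5))
```

Because PSNR is logarithmic in the mean squared error, a 0.5 dB loss corresponds to roughly a 12% increase in MSE, since 10^0.05 is about 1.122.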

V. CONCLUSION
In this paper, we proposed a method to quantize deep neural networks (DNNs) by regularization to produce low-precision DNNs for efficient fixed-point inference. Although our training happens in high precision particularly for its backward passes and gradient descent, its forward passes use quantized low-precision weights and activations, and thus the resulting networks can be operated on low-precision fixed-point hardware at inference time. The proposed scheme alleviates the mismatch problem in the forward and backward passes of low-precision network training by using MSQE regularization. Moreover, we proposed a novel learnable regularization coefficient to reinforce the convergence of high-precision weights to their quantized values when using MSQE regularization. We also discussed how a similar regularization technique can be employed for weight pruning with partial L2 regularization.
We showed by experiments that the proposed quantization algorithm successfully produces low-precision DNNs of binary weights for classification problems, such as ImageNet classification, as well as for regression and image synthesis problems, such as image super resolution. For MobileNet and ShuffleNet compression, we obtained sparse (50% weights are pruned) low-precision models of 5-bit weights and 8-bit activations with compression ratios of 7.13 and 6.79, respectively, at marginal accuracy loss. For image super resolution, we only lost 0.04 dB PSNR when using 8-bit weights and activations, instead of 32-bit floating-point numbers.

APPENDIX GRADIENT DESCENT

A. GRADIENTS FOR WEIGHTS
The gradient of the cost function $C_n$ in (5) for $w$ satisfies
$$\nabla_w C_n = \nabla_w \mathcal{L} + \lambda \nabla_w R_n, \tag{7}$$
for weight $w$ of layer $l$, $1 \le l \le L$, where $\mathcal{L}$ here denotes the network loss function. The first partial derivative in the right side of (7) can be obtained efficiently by the backpropagation algorithm. For backpropagation through the weight quantization function, we adopt the following approximation similar to straight-through estimator [50]:
$$\nabla_w Q_n(w; \delta_l) = \begin{cases} 1_{-(2^{n-1} + 0.5)\delta_l \le w \le (2^{n-1} - 0.5)\delta_l}, & n > 1, \\ 1_{|w| \le 2\delta_l}, & n = 1, \end{cases} \tag{8}$$
where $1_E$ is an indicator function that is one if $E$ is true and zero otherwise. Namely, we pass the gradient through the quantization function when the weight is within the clipping boundary. To give some room for the weight to move around the boundary in stochastic gradient descent, we additionally allow some margin of $\delta_l/2$ for $n \ge 2$ and $\delta_l$ for $n = 1$. Outside the clipping boundary with some margin, we pass zero. For weight $w$ of layer $l$, $1 \le l \le L$, the partial derivative of the regularizer $R_n$ satisfies
$$\nabla_w R_n = \frac{2}{N}\left(w - Q_n(w; \delta_l)\right), \tag{9}$$
almost everywhere except some non-differentiable points of $w$ at quantization bin boundaries $U_n(\delta_l)$ given by
$$U_n(\delta_l) = \left\{ \frac{2i + 1 - 2^n}{2}\,\delta_l,\ i = 0, 1, \ldots, 2^n - 2 \right\}, \tag{10}$$
for $n > 1$ and $U_1(\delta_l) = \{0\}$. If the weight is located at one of these boundaries, it makes no difference, in terms of its quantization error, whether $w$ is updated towards $w - \epsilon$ or $w + \epsilon$. Thus, we let
$$\nabla_w R_n \triangleq 0, \quad \text{if } w \in U_n(\delta_l). \tag{11}$$
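The weight update rules above can be sketched for a single scalar weight as follows. This is a hedged illustration rather than the authors' implementation: the pass-through window in `ste_mask` is read off from the verbal description (clipping range plus a margin of half a step for n > 1, and a full step for n = 1), the n-bit clipping range is inferred from the bin boundaries in (10), the sign convention at exactly zero is arbitrary, and all function names are hypothetical.

```python
import math

def clip_n(x, n):
    """Clip an integer code to the n-bit range [-2^(n-1), 2^(n-1) - 1]."""
    return min(max(x, -(2 ** (n - 1))), 2 ** (n - 1) - 1)

def quantize(w, delta, n):
    """Q_n(w; delta): sign-based for n = 1, rounded and clipped otherwise."""
    if n == 1:
        return math.copysign(delta, w)  # assumption: zero maps to +delta
    return delta * clip_n(round(w / delta), n)

def on_bin_boundary(w, delta, n):
    """True if w lies on a bin boundary in U_n(delta), see (10)."""
    if n == 1:
        boundaries = [0.0]
    else:
        boundaries = [(2 * i + 1 - 2 ** n) / 2 * delta for i in range(2 ** n - 1)]
    return any(math.isclose(w, u, abs_tol=1e-12) for u in boundaries)

def ste_mask(w, delta, n):
    """Straight-through factor: 1 inside the clipping range plus margin, else 0."""
    if n == 1:
        return 1.0 if abs(w) <= 2 * delta else 0.0
    lower = -(2 ** (n - 1) + 0.5) * delta
    upper = (2 ** (n - 1) - 0.5) * delta
    return 1.0 if lower <= w <= upper else 0.0

def regularizer_grad(w, delta, n, num_weights):
    """(2/N)(w - Q_n(w)), set to zero on bin boundaries."""
    if on_bin_boundary(w, delta, n):
        return 0.0
    return 2.0 / num_weights * (w - quantize(w, delta, n))

def weight_grad(loss_grad_q, w, delta, n, lam, num_weights):
    """Loss gradient w.r.t. the quantized weight passed through the mask,
    plus lam times the regularizer gradient."""
    return (loss_grad_q * ste_mask(w, delta, n)
            + lam * regularizer_grad(w, delta, n, num_weights))

print(weight_grad(loss_grad_q=0.5, w=0.3, delta=1.0, n=2, lam=2.0, num_weights=1))
```

In a real training loop `loss_grad_q` would come from backpropagation; here it is simply supplied as a number so that the two contributions to the gradient can be inspected separately.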

Remark 3:
If the weight is located at one of the bin boundaries, the weight gradient is solely determined by the derivative of the network loss function, and thus the weight is updated in the direction that minimizes the network loss function. Otherwise, the regularization term impacts the gradient as well and encourages the weight to converge to the closest bin center as long as doing so changes the loss function only slightly. The regularization coefficient trades off these two contributions of the network loss function and the regularization term.

B. GRADIENT FOR THE REGULARIZATION COEFFICIENT
The gradient of the cost function for $\lambda$ is given by
$$\nabla_\lambda C_n = R_n - \frac{1}{\lambda}. \tag{12}$$
Observe that $\lambda$ tends to $1/R_n$ in gradient descent.

Remark 4:
Recall that weights gradually tend to their closest quantization output levels to reduce the regularizer $R_n$ (see Remark 3). As the regularizer $R_n$ decreases, the regularization coefficient $\lambda$ gets larger by gradient descent using (12). Then, a larger regularization coefficient further forces weights to move towards quantized values in the following update. In this manner, weights gradually converge to quantized values.
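The claim that λ tends to 1/R_n can be checked numerically with a small sketch. It assumes the λ-dependent part of the cost has the form λ R_n − log λ, which is what a stationary point at λ = 1/R_n implies, and it holds R_n fixed, whereas in actual training R_n changes along with the weights. The function name, learning rate, and step count are illustrative choices.

```python
def fit_lambda(reg_value, steps=50000, lr=0.1, lam=1.0):
    """Run gradient descent on lam * R - log(lam) with R held fixed."""
    for _ in range(steps):
        grad = reg_value - 1.0 / lam  # derivative with respect to lam
        lam -= lr * grad
    return lam

lam_large_r = fit_lambda(0.25)  # larger regularizer value
lam_small_r = fit_lambda(0.10)  # smaller regularizer value
print(lam_large_r, lam_small_r)
```

The smaller the regularizer value, the larger the coefficient that gradient descent settles on, which is the feedback loop described in Remark 4.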

C. GRADIENTS FOR SCALING FACTORS
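For scaling factor optimization, we approximately consider the MSQE regularization term only for simplicity. Using the chain rule, it follows that
$$\nabla_{\delta_l} C_n \approx \lambda \nabla_{\delta_l} R_n = -\frac{2\lambda}{N} \sum_{w \in W_l} \left(w - Q_n(w; \delta_l)\right) \nabla_{\delta_l} Q_n(w; \delta_l), \tag{13}$$
for $1 \le l \le L$. Moreover, it can be shown that
$$\nabla_{\delta_l} Q_n(w; \delta_l) = r_n(w; \delta_l) \triangleq \begin{cases} \mathrm{clip}_n(\mathrm{round}(w/\delta_l)), & n > 1, \\ \mathrm{sign}(w), & n = 1, \end{cases} \tag{14}$$
almost everywhere except some non-differentiable points of $\delta_l$ satisfying
$$\frac{w}{\delta_l} \in \left\{ \frac{2i + 1 - 2^n}{2},\ i = 0, 1, \ldots, 2^n - 2 \right\}, \tag{15}$$
for $n > 1$. Similar to (11), we let
$$\nabla_{\delta_l} Q_n(w; \delta_l) \triangleq 0, \quad \text{if } w \in U_n(\delta_l), \tag{16}$$
so that the scaling factor $\delta_l$ is not impacted by the weights at the bin boundaries. From (13)-(16), it follows that
$$\nabla_{\delta_l} C_n \approx -\frac{2\lambda}{N} \sum_{w \in W_l} \left(w - Q_n(w; \delta_l)\right) r_n(w; \delta_l)\, 1_{w \notin U_n(\delta_l)}. \tag{17}$$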
Similarly, one can derive the gradients for activation scaling factors $\Delta_0^L$, which we omit here.
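The scaling factor gradient for weights can be sanity-checked against a finite-difference derivative of the MSQE term, as in the sketch below. The assumptions are labeled in the comments: a single layer whose weight count plays the role of N, an n-bit clipping range inferred from the bin boundaries in (10), and hypothetical function names and toy weight values chosen so that no weight sits on a bin boundary.

```python
import math

def code(w, delta, n):
    """r_n(w; delta): sign(w) for n = 1, clip_n(round(w / delta)) otherwise."""
    if n == 1:
        return math.copysign(1.0, w)
    return min(max(round(w / delta), -(2 ** (n - 1))), 2 ** (n - 1) - 1)

def on_bin_boundary(w, delta, n):
    """True if w lies on a bin boundary in U_n(delta)."""
    if n == 1:
        boundaries = [0.0]
    else:
        boundaries = [(2 * i + 1 - 2 ** n) / 2 * delta for i in range(2 ** n - 1)]
    return any(math.isclose(w, u, abs_tol=1e-12) for u in boundaries)

def msqe(weights, delta, n):
    """Mean squared quantization error of one layer (assumption: N = len(weights))."""
    return sum((w - delta * code(w, delta, n)) ** 2 for w in weights) / len(weights)

def scale_grad(weights, delta, n, lam, num_weights):
    """Approximate gradient of the cost with respect to delta,
    skipping weights that sit on a bin boundary."""
    total = 0.0
    for w in weights:
        if on_bin_boundary(w, delta, n):
            continue
        r = code(w, delta, n)
        total += (w - delta * r) * r
    return -2.0 * lam / num_weights * total

weights = [0.3, 1.2, -0.8, 1.7]  # toy values, none on a boundary for delta = 1
analytic = scale_grad(weights, 1.0, 2, lam=1.0, num_weights=len(weights))
eps = 1e-6
numeric = (msqe(weights, 1.0 + eps, 2) - msqe(weights, 1.0 - eps, 2)) / (2 * eps)
print(analytic, numeric)
```

The two values agree because, away from bin boundaries, the integer codes do not change under a small perturbation of the scaling factor, so the MSQE is locally quadratic in it.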