Model Compression via Position-Based Scaled Gradient

We propose the position-based scaled gradient (PSG), which scales the gradient depending on the position of a weight vector to make the model more compression-friendly. First, we theoretically show that applying PSG to standard gradient descent (GD), which we call PSGD, is equivalent to GD in a warped weight space, i.e., a space made by warping the original weight space via an appropriately designed invertible function. Second, we empirically show that PSG, acting as a regularizer on the weight vectors, is favorable for model compression domains such as quantization, pruning, and knowledge distillation. PSG reduces the gap between the weight distributions of a full-precision model and its compressed counterpart. This enables the versatile deployment of a model either in an uncompressed mode or in a compressed mode depending on the availability of resources. Experimental results on the CIFAR-10/100 and ImageNet datasets show the effectiveness of the proposed PSG in model compression, including iterative pruning and knowledge distillation.


I. INTRODUCTION
To reduce the generalization error and induce a prior into the model, many regularization techniques have been proposed [1], [2], [3], [4]. To inject a prior for a specific purpose, we propose a novel regularization method designed for model compression. This regularizer non-uniformly scales gradients to constrain the weights toward a set of compression-friendly grid points; the scale of each gradient element depends on the position of the corresponding weight.
In this work, we propose a new optimizer, position-based scaled gradient descent, dubbed PSGD. Compared to conventional stochastic gradient descent (SGD), we replace the gradient with a position-based scaled gradient. We prove that optimizing a model in the original weight space with PSGD is equivalent to optimizing it with SGD in the warped weight space, a space warped by a proposed invertible warping function. This warping function helps merge the original weights toward the desired target positions by scaling the gradients.
PSGD, which scales individual gradient elements, is a branch of the variable metric methods [5], which scale the gradient vector by a positive-definite matrix that depends on the loss function. Unlike variable metric methods, our PSGD considers only the current position of the weight when scaling the gradient elements.
In recent years, deploying deep neural networks (DNNs) on resource-restricted edge devices such as smartphones and IoT devices has become a very important issue. For this reason, reducing the bit-width of model weights (quantization), removing unimportant model weights (pruning), and improving the performance of a given model with additional knowledge (knowledge distillation) have been studied and widely used in applications [10]. We apply the proposed PSG method to model compression problems such as quantization, pruning, and knowledge distillation. Fig. 1 shows the performance of quantized models trained with various regularization methods: our PSGD is compared with SGD, DQ [6], G-L2 [7], and G-L1 [8], where FP indicates the full-precision accuracy and W#A# represents the number of bits for weights and activations. More details are in Table 6.
Since Quantization-Aware Training (QAT) methods need a pre-trained model or the entire training dataset for training, many works have focused on post-training quantization (PTQ) methods that do not require full-scale training [11], [12], [13], [14]. For example, [12] starts from a pre-trained model and makes only minor modifications to the weights by equalizing the scales across channels and correcting biases. Because of the inherent discrepancy between the weight distribution of the pre-trained model and that of the quantized model, PTQ methods try to minimize this distribution gap. Fig. 2 illustrates the fundamental differences between full-precision weights and quantized weights: because of the differences in weight distributions, the quantization error and the classification error increase as the bit-width decreases.
Meanwhile, another line of research in quantization has recently emerged that approaches the task from the initial training phase [8]; these are considered regularization methods. Compared to PTQ methods, regularization methods try to reduce the inherent differences by adding a regularizer in the pre-training phase.
Our method is classified as a regularization method. PSGD trains a model from scratch like traditional SGD but, unlike SGD, aims to attain a compression-friendly model that can be effectively pruned or quantized owing to the shape of its weight distribution. Consequently, a model pre-trained with PSGD requires no additional post-processing, re-training, or access to the data when resources are limited. To achieve this, PSGD regularizes the original weights to merge toward a set of grid points by scaling the gradients of the weights according to the error between the original weights and their compressed (pruned or quantized) counterparts (Fig. 3).
This work is an expanded version of our previous research [15]. We additionally verify PSGD within a recent iterative pruning framework. We also show that PSGD, as an implicit regularizer that does not modify the objective function [16], works well with knowledge distillation, which is an explicit regularization. We further interpret the geometry of the warped space induced by PSGD using the steepest descent method. Finally, we provide an analysis of the weight distributions in both the warped and the original space.
Our contributions can be summarized as follows:
• We interpret the warped space of PSGD using the steepest descent method with a quadratic norm, which expands the space inversely proportionally to the quantization error (Eq. 26). This phenomenon is also experimentally observed in Sec. V-B3.
• We verify the adaptability of PSGD as an implicit regularizer that does not modify the objective function, by combining it with iterative pruning methods and a traditional knowledge distillation loss function.
• We provide an analysis of the weights in both the warped space and the original space in Sec. V-A.

II. RELATED WORK
A. QUANTIZATION
Quantization-aware training (QAT) trains the model to attain a quantized model that performs well at lower bit-widths such as 4, 3, and 2 bits. QAT updates the model in the full-precision domain, but gradients are calculated in the low-precision domain using the training dataset [17], [18], [19]. To avoid using the whole training dataset and a retraining phase, post-training quantization (PTQ) has been researched. These methods do not need the whole training dataset and rely on simple calculations suited to resource-constrained devices [12], [13], [14]. Channel-wise quantization methods require storing quantization parameters and computing the quantization bin size per channel [13], [20]. In contrast, layer-wise quantization is more hardware-friendly, as it calculates the quantization bin size and stores quantization parameters once per layer [11], [12], [14]. Reference [12] proposes a bias correction and a range equalization of channels, which maintain quantization performance down to 8-bit. On the other hand, [14] splits outliers to reduce the clipping error they cause. However, these methods still suffer from a significant accuracy drop at low bit-widths. References [21] and [22] propose to directly minimize the quantization error using a calibration dataset to achieve higher performance below 6-bit.
Contrary to previous QAT and PTQ methods, regularization methods have focused on quantization robustness with explicit or implicit regularization terms in the initial training phase. Reference [6] minimizes the Lipschitz constant for robustness against adversarial attacks. Reference [8] proposes an L1 penalty term on the gradients for quantization robustness across different bit-widths. This enables quantization without additional training, dubbed on-the-fly quantization.

B. PRUNING
Model pruning methods [23], [24], [25], [26], [27], [28], [29], [30] try to prune weights or filters in the model that are considered unimportant according to proposed criteria [31], [32]. Many works prune the model in the training phase [33], [34], [35], [36], [37]. Reference [35] proposes an L0 regularization term to train a sparse model. Reference [36] finds a sparse model in a single shot using gradients. Similarly, PSGD does not need a pruning schedule or a retraining phase: it makes a model sparse by scaling gradients so that model weights merge toward zero.

C. KNOWLEDGE DISTILLATION
Knowledge distillation (KD) is one of the most popular regularization methods, widely used in the model compression domain [3]. This framework uses a larger teacher network's knowledge to boost the performance of a small student network. In general, KD encourages the student network to mimic the softened output distribution of the teacher network. Another approach, feature distillation, utilizes feature maps from the teacher network to teach the student network [4], [38].
In doing so, the student network absorbs the knowledge of the teacher by mimicking the teacher's logits through the Kullback-Leibler (KL) divergence loss. KD modifies conventional objectives such as the cross-entropy loss by adding the KL loss. Our PSGD can be combined with this kind of explicit regularizer because PSGD acts regardless of the objective function. We apply KD with PSGD in Sec. IV-F.

III. PROPOSED METHOD
PSGD regularizes the original weights to converge to desired target points, which helps the model perform well in both the uncompressed and compressed domains. PSGD optimization in the original weight space is equivalent to SGD optimization in the warped weight space. Through the invertible function between the original and warped spaces, PSGD yields a compression-friendly solution by converging to a local minimum different from the SGD solution in the original weight space.

A. OPTIMIZATION IN WARPED SPACE
Theorem 1: Let $F : \mathcal{X} \to \mathcal{Y}$, $\mathcal{X}, \mathcal{Y} \subset \mathbb{R}^{n}$, be an arbitrary invertible multivariate function that warps the original weight space $\mathcal{X}$ into $\mathcal{Y}$, and consider the loss function $L : \mathcal{X} \to \mathbb{R}$ and the equivalent loss function $\bar{L} = L \circ F^{-1} : \mathcal{Y} \to \mathbb{R}$. Then, the gradient descent (GD) method in the warped space $\mathcal{Y}$ is equivalent to applying a scaled gradient descent in the original space $\mathcal{X}$ such that

$$\mathbf{x}^{t+1} = \mathbf{x}^{t} - \eta\,(J^{F}_{\mathbf{x}})^{-2}\,\nabla L_{\mathbf{x}}(\mathbf{x}^{t}), \tag{1}$$

where $\mathbf{y} = F(\mathbf{x})$, and $\nabla^{b}_{a}$ and $J^{b}_{a}$ respectively denote the gradient and Jacobian of the function $b$ with respect to the variable $a$.
Proof: Consider the point $\mathbf{x}^{t} \in \mathcal{X}$ at time $t$ and its warped version $\mathbf{y}^{t} \in \mathcal{Y}$. To find a local minimum of $\bar{L}(\mathbf{y})$, the standard gradient descent method at time step $t$ in the warped space can be applied as follows:

$$\mathbf{y}^{t+1} = \mathbf{y}^{t} - \eta\,\nabla \bar{L}_{\mathbf{y}}(\mathbf{y}^{t}). \tag{2}$$
Here, $\nabla \bar{L}_{\mathbf{y}}(\mathbf{y}^{t}) = \partial \bar{L}/\partial \mathbf{y}\,|_{\mathbf{y}^{t}}$ is the gradient and $\eta$ is the learning rate. Applying the inverse function $F^{-1}$ to $\mathbf{y}^{t+1}$, we obtain the updated point $\mathbf{x}^{t+1}$:

$$\mathbf{x}^{t+1} = F^{-1}(\mathbf{y}^{t+1}) = F^{-1}\big(\mathbf{y}^{t} - \eta\,\nabla \bar{L}_{\mathbf{y}}(\mathbf{y}^{t})\big) \approx F^{-1}(\mathbf{y}^{t}) - \eta\,J^{\mathbf{x}}_{\mathbf{y}}(\mathbf{y}^{t})\,\nabla \bar{L}_{\mathbf{y}}(\mathbf{y}^{t}),$$

where the last step is the first-order Taylor approximation around $\mathbf{y}^{t}$ and $J^{\mathbf{x}}_{\mathbf{y}} = J^{F^{-1}}_{\mathbf{y}} = \partial \mathbf{x}/\partial \mathbf{y} \in \mathbb{R}^{n \times n}$ is the Jacobian of $\mathbf{x} = F^{-1}(\mathbf{y})$ with respect to $\mathbf{y}$. By the chain rule,

$$\nabla \bar{L}_{\mathbf{y}}(\mathbf{y}^{t}) = (J^{\mathbf{x}}_{\mathbf{y}})^{T}\,\nabla L_{\mathbf{x}}(\mathbf{x}^{t}), \tag{3}$$

so that

$$\mathbf{x}^{t+1} = \mathbf{x}^{t} - \eta\,J^{\mathbf{x}}_{\mathbf{y}}(J^{\mathbf{x}}_{\mathbf{y}})^{T}\,\nabla L_{\mathbf{x}}(\mathbf{x}^{t}). \tag{4}$$

Now Eq. 2 and Eq. 4 are equivalent and, since $J^{\mathbf{x}}_{\mathbf{y}} = (J^{F}_{\mathbf{x}})^{-1}$, Eq. 1 is proved. In other words, the scaled gradient descent (PSGD) in the original space $\mathcal{X}$, whose scaling is determined by the matrix $(J^{F}_{\mathbf{x}})^{-2}$, is equivalent to gradient descent in the warped space $\mathcal{Y}$. If a weight element is close to a quantization grid point, its gradient is scaled down proportionally to prevent it from escaping; conversely, if it is distant, the gradient is scaled up so as to accelerate its escape from its original position. This idea is equivalent to multiplying the gradients by a scaling factor based on the distance from the nearest grid point.
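To make the equivalence concrete, the following minimal numeric sketch (our illustration, not part of the original derivation) takes one GD step in a 1-D warped space, maps it back through $F^{-1}$, and compares it with the position-scaled step taken directly in the original space. The warp $f(x) = \operatorname{sign}(x)\sqrt{|x|}$, with a single target point at 0, is a simplified stand-in for the function designed in Sec. III-B.

```python
import numpy as np

# Toy 1-D loss and its gradient in the original space.
L  = lambda x: (x - 0.9) ** 2
dL = lambda x: 2.0 * (x - 0.9)

# Simplified invertible warp with a single target point at 0, and its inverse.
f     = lambda x: np.sign(x) * np.sqrt(np.abs(x))
f_inv = lambda y: np.sign(y) * y ** 2
df    = lambda x: 1.0 / (2.0 * np.sqrt(np.abs(x)) + 1e-12)   # f'(x)

x, eta = 0.25, 0.01

# (a) One GD step in the warped space, mapped back to the original space.
y = f(x)
grad_y = dL(x) / df(x)                 # chain rule: dL/dy = (dL/dx) * (dx/dy)
x_from_warped = f_inv(y - eta * grad_y)

# (b) One PSG step in the original space: gradient scaled by (f'(x))^-2.
x_psg = x - eta * dL(x) / df(x) ** 2

print(x_from_warped, x_psg)            # ~0.26317 vs 0.263: equal to first order
```

The two updates coincide up to the first-order Taylor approximation used in the proof.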

B. POSITION-BASED SCALED GRADIENT
In this part, we introduce one example of designing the invertible function $F(\mathbf{x})$ for scaling the gradients. This invertible function should cause the original weight vector $\mathbf{x}$ to merge toward a set of desired target points $\{\bar{\mathbf{x}}\}$. These desired target weights act as a prior in the optimization process, constraining the original weights to merge at specific positions. The details of how to set the target points are deferred to the next subsection.
The gist of weight-dependent gradient scaling is simple. For a given weight vector, if the specific weight element is far from the desired target point, a higher scaling value is applied so as to escape this position faster. On the other hand, if the distance is small, lower scaling value is applied to prevent the weight vector from deviating from the position (See Fig. 3). From now on, we focus on the design of the scaling function for the quantization problem. For pruning, the procedure is analogous and we omit the detail.

1) SCALING FUNCTION
We use the same warping function $f$ for each coordinate of the weight vector (i.e., $F(\mathbf{x}) = [f(x_1), \ldots, f(x_n)]^{T}$), so the Jacobian $J^{F}_{\mathbf{x}}$ is diagonal and our method belongs to the diagonally scaled gradient methods.
Consider the following warping function, applied elementwise:

$$f(x) = \operatorname{sign}(x - \bar{x})\sqrt{|x - \bar{x}| + \epsilon} + c(\bar{x}), \tag{5}$$

where the target $\bar{x}$ is determined as the closest grid point from $x$, $\operatorname{sign}(x) \in \{\pm 1, 0\}$ is the sign function, $\epsilon$ is an arbitrarily small constant that avoids an infinite gradient at $x = \bar{x}$, and $c(\bar{x})$ is a constant dependent on the specific grid point $\bar{x}$ that makes the function continuous. Without $c(\bar{x})$, $f(x)$ would have points of discontinuity at every $\{(n + 0.5)\Delta \mid n \in \mathbb{Z}\}$, as depicted in Fig. 4, where $\Delta$ denotes the step size and $n$ indexes the $n$-th quantized value, identical to the $\bar{x}$ corresponding to $x$. We can calculate the left-sided and right-sided limits at $(n + 0.5)\Delta$ using Eq. 5.
Based on the condition that the left-sided and the right-sided limits should be the same, we get the following recurrence relation:

$$c(\bar{x}_{n+1}) = c(\bar{x}_{n}) + 2\sqrt{\Delta/2 + \epsilon}. \tag{6}$$

Using successive substitution with $c(0) = 0$ and $n\Delta = \bar{x}$, $c(\bar{x})$ can be calculated as

$$c(\bar{x}) = \frac{2\bar{x}}{\Delta}\sqrt{\Delta/2 + \epsilon}.$$

Then, from Eq. 4, the elementwise scaling function becomes

$$s(x) = \Big(\frac{\partial f}{\partial x}\Big)^{-2} = 4\big(|x - \bar{x}| + \epsilon\big), \tag{7}$$

and the elementwise weight update rule for PSG descent (PSGD) becomes

$$x^{t+1} = x^{t} - \eta\, s(x^{t})\,\nabla L_{x}(x^{t}), \tag{8}$$

where $\eta$ is the learning rate.¹ We further elaborate on the geometry of the warped space using the concept of steepest descent in the $p$-norm in Section III-E.

¹We set $\eta = \eta_0 \lambda_s$, where $\eta_0$ is the conventional learning rate and $\lambda_s$ is a hyper-parameter that can be set differently for various scaling functions depending on their range.
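As a minimal sketch of Eqs. 7-8 (our illustration; the helper names `psg_scale` and `psgd_step` are ours, and the constant factor 4 can be absorbed into $\lambda_s$):

```python
import torch

def psg_scale(x, delta, eps=1e-8):
    """Elementwise PSG scale s(x) = 4(|x - xbar| + eps) of Eq. 7, where
    xbar is the nearest point of a uniform grid with step size delta."""
    xbar = delta * torch.round(x / delta)
    return 4.0 * ((x - xbar).abs() + eps)

def psgd_step(x, grad, lr0, lambda_s, delta):
    """One elementwise PSGD update (Eq. 8) with eta = lr0 * lambda_s."""
    return x - lr0 * lambda_s * psg_scale(x, delta) * grad
```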
PSGD operates independently of the type of the loss function, as it does not modify the loss term but rather non-uniformly scales the gradient elements. Therefore, it can be applied to a KD loss consisting of a task loss $L$ (e.g., cross-entropy) and a KL loss. Assuming there are $n$ classes, the softmax posterior with temperature $T$ can be calculated as follows:

$$p_k = \frac{\exp(z_k / T)}{\sum_{j=1}^{n} \exp(z_j / T)}, \tag{9}$$

where $z_k$ represents the $k$-th logit. The temperature $T$ is used to soften the logits for knowledge distillation. We can compute the KL loss between the student and teacher networks as

$$L_{KL} = \mathrm{KL}\big(\sigma(Z_T / T)\,\|\,\sigma(Z_S / T)\big), \tag{10}$$

where $Z_T$ and $Z_S$ are the teacher and student logits, respectively, and $\sigma$ denotes the softmax. Then, we can use PSGD with the KD loss combining the task loss and the KL loss:

$$L_{KD} = L + T^{2} L_{KL}, \tag{11}$$

where $L_{KD}$ refers to the KD loss. We multiply by $T^{2}$ because the gradient scale of the KL term decreases at a rate of $1/T^{2}$. Using the KD loss, the update rule for PSGD with KD becomes

$$x^{t+1} = x^{t} - \eta\, s(x^{t})\,\nabla (L_{KD})_{x}(x^{t}). \tag{12}$$
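A minimal sketch of Eqs. 9-11 (the function name is ours and the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def kd_psgd_loss(student_logits, teacher_logits, targets, T=4.0):
    """KD loss of Eq. 11: task loss plus T^2-scaled KL term (Eqs. 9-10).
    The T^2 factor compensates the 1/T^2 shrinkage of the KL gradient."""
    task = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),   # student log-probs
        F.softmax(teacher_logits / T, dim=1),       # softened teacher probs
        reduction="batchmean",
    )
    return task + (T ** 2) * kl
```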

Applying PSGD from the beginning hinders training because of its regularization effect. To relieve this issue, PSGD is applied after a few warm-up epochs; more details are in Sec. IV-A2. The overall process of PSGD is depicted in Algorithm 1.

Algorithm 1: Position-Based Scaled Gradient Descent — use the standard gradient while Iter < W (warm-up), then update with the position-based scaled gradient (Eq. 8).
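A minimal sketch of our reading of Algorithm 1 (the loop structure and the names `warmup_iters`, `lambda_s`, and `delta` are ours; in the actual experiments the warm-up is counted in epochs):

```python
import torch

def train_psgd(model, loader, loss_fn, lr0, lambda_s, delta, warmup_iters):
    """Sketch of Algorithm 1: plain SGD while Iter < W (warm-up),
    position-based scaled gradient (Eq. 8) afterwards."""
    it = 0
    for inputs, targets in loader:
        loss = loss_fn(model(inputs), targets)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                if it < warmup_iters:                      # warm-up: plain SGD
                    p -= lr0 * p.grad
                else:                                      # PSGD step
                    xbar = delta * torch.round(p / delta)  # nearest grid point
                    s = 4.0 * ((p - xbar).abs() + 1e-8)    # Eq. 7
                    p -= lr0 * lambda_s * s * p.grad       # Eq. 8
        it += 1
```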
C. TARGET POINTS
1) QUANTIZATION
In this paper, we use the uniform symmetric quantization method [11] and the per-layer quantization scheme for hardware friendliness. Consider a floating-point range $[\min_x, \max_x]$ of the model weights. A weight $x$ is quantized to an integer in $[-2^{n-1}+1,\ 2^{n-1}-1]$ for $n$-bit precision. Quantization-dequantization for the weights of a network is defined with the step size $\Delta$ and clipping values. The overall quantization process is as follows:

$$x_Q = \operatorname{clip}\Big(\Big\lfloor \frac{x}{\Delta} \Big\rceil,\ -2^{n-1}+1,\ 2^{n-1}-1\Big), \qquad \Delta = \frac{\max(|\min_x|, |\max_x|)}{2^{n-1}-1}, \tag{13}$$

where $\lfloor \cdot \rceil$ is the round-to-closest-integer operation and $\operatorname{clip}(x; a, b)$ returns $a$ if $x < a$, $b$ if $x > b$, and $x$ otherwise. We obtain the quantized weights with the de-quantization process as $\bar{x} = x_Q \times \Delta$ and use these quantized weights as the target positions for quantization.
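A minimal sketch of the layer-wise symmetric scheme of Eq. 13 (the helper name is ours):

```python
import torch

def quantize_dequantize(w, n_bits):
    """Uniform symmetric per-layer quantization (Eq. 13) followed by
    de-quantization; the result gives the target points xbar = x_Q * delta."""
    qmax = 2 ** (n_bits - 1) - 1
    delta = w.abs().max() / qmax                     # step size from the range
    w_q = torch.clamp(torch.round(w / delta), -qmax, qmax)
    return w_q * delta
```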

2) PRUNING
For magnitude-based pruning methods, weights near zero are removed. Therefore, we choose zero as the target value (i.e., $\bar{x} = 0$).

D. PSGD FOR DEEP NETWORKS
Much of the literature on the optimization of DNNs with stochastic gradient descent (SGD) has reported that multiple experiments give consistently similar performance although DNNs have many local minima (e.g., see Sec. 2 of [39]). Reference [40] analyzed the loss surface of DNNs and showed that large networks have many local minima with similar performance on the test set, and that the lowest critical values of the random loss function are located in a specific band lower-bounded by the global minimum. From this perspective, we explain informally how PSGD works for deep networks. As illustrated in Fig. 5, we posit that there exist many local minima (A, B) in the original weight space X with similar performance, only some of which (A) are close to one of the target points (0) and thus exhibit high performance also in the compressed domain. As in Fig. 5 (left), assume that the region of convergence for B is much wider than that of A, meaning that a random initialization is more likely to yield solution B than A. By the warping function F specially designed as described above (Eq. 5), the original space X is warped into Y such that the areas near target points are expanded while those far from the targets are contracted. If we apply gradient descent in this warped space, optimization has a better chance of converging to A′. Correspondingly, PSGD in the original space will more likely output A rather than B, which is favorable for compression. Note that F transforms the original weight space into the warped space Y, not into the compressed domain.

E. GEOMETRY OF THE WARPED SPACE
In this section, we further illustrate the exact geometry of the warped space when PSGD is applied to quantization. Recall from Eq. 4, Eq. 8, and Eq. 13 that the absolute magnitude of the quantization error is used to scale the gradient elements. This corresponds to left-multiplying a diagonal matrix with the elements determined by the magnitude of the quantization error. We use the concept of p-norm steepest descent [41] to illustrate why this leads to a warped space that induces the weight vectors to merge to the target points. First, we explain some necessary preliminary details for completeness.

1) STEEPEST DESCENT METHOD
For a first-order optimization method, the steepest descent direction $v$ is determined by minimizing the first-order Taylor approximation of $L(x + v)$ around $x$:

$$L(x + v) \approx L(x) + \nabla L(x)^{T} v. \tag{14}$$

Since $v$ can be chosen to have arbitrarily large magnitude in a particular direction, the magnitude is normalized:

$$v_{\mathrm{nsd}} = \underset{v}{\arg\min}\,\{\nabla L(x)^{T} v \,:\, \|v\| \le 1\}. \tag{15}$$

Naturally, using different values of $p$ for the $p$-norm yields distinct steepest directions. Other families of norms can also be used, such as the quadratic norm, defined for a positive-definite matrix $A$ as

$$\|v\|_{A} = (v^{T} A v)^{1/2}. \tag{16}$$

One can also consider the unnormalized steepest descent, which scales the normalized steepest descent by the dual norm:

$$v_{\mathrm{sd}} = \|\nabla L(x)\|_{*}\, v_{\mathrm{nsd}}, \tag{17}$$

where $\|\cdot\|_{*}$ denotes the dual norm. For the Euclidean norm ($p = 2$), $v_{\mathrm{sd}}$ corresponds to $-\nabla L(x)$, i.e., gradient descent. Now we present our lemma by interpreting PSGD as a steepest descent method in a quadratic norm.

Lemma 1: For a fixed iteration $t$, the unnormalized steepest descent direction in the quadratic norm $\|\cdot\|_{A}$ is equivalent to the PSG descent direction if the symmetric, positive-definite matrix $A$ is given by

$$A = \operatorname{diag}\big(s(x_1)^{-1}, \ldots, s(x_n)^{-1}\big), \tag{18}$$

where $s(x)$ is given by Eq. 7 and $n$ is the dimension of the weight vector.
Proof: First note that the normalized steepest descent in the Euclidean norm is simply the negative gradient direction scaled by its norm:

$$v_{\mathrm{nsd}} = -\frac{\nabla L(x)}{\|\nabla L(x)\|_{2}}. \tag{19}$$

The steepest descent in the quadratic norm can be formulated likewise with a change of variables:

$$v_{\mathrm{nsd}} = \underset{v}{\arg\min}\,\{\nabla L(x)^{T} v \,:\, \|v\|_{A} \le 1\} = A^{-1/2}\,\underset{h}{\arg\min}\,\{(A^{-1/2}\nabla L(x))^{T} h \,:\, \|h\|_{2} \le 1\}, \tag{20}$$

where the last equality follows from the change of variable $h = A^{1/2} v$. Then, the descent direction is given by

$$v_{\mathrm{nsd}} = -A^{-1/2}\,\frac{A^{-1/2}\nabla L(x)}{\|A^{-1/2}\nabla L(x)\|_{2}}, \tag{21}$$

or equivalently,

$$v_{\mathrm{nsd}} = -\frac{A^{-1}\nabla L(x)}{\|\nabla L(x)\|_{A^{-1}}}. \tag{22}$$

To yield the unnormalized descent direction, we compute the dual norm of $\nabla L(x)$ under $\|\cdot\|_{A}$, which is precisely

$$\|\nabla L(x)\|_{*} = \sup_{\|v\|_{A} \le 1} \nabla L(x)^{T} v = \|A^{-1/2}\nabla L(x)\|_{2} = \|\nabla L(x)\|_{A^{-1}}. \tag{23}$$

Multiplying Eq. 22 by Eq. 23 gives

$$v_{\mathrm{sd}} = -A^{-1}\nabla L(x), \tag{24}$$

which, written element-wise for the $i$-th element with $A$ from Eq. 18,

$$(v_{\mathrm{sd}})_{i} = -s(x_i)\,\frac{\partial L}{\partial x_i}, \tag{25}$$

is equivalent to the PSGD update rule given in Eq. 8.
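A quick numeric check of Lemma 1 (our illustration): with $A = \operatorname{diag}(1/s(x_i))$, the unnormalized steepest descent direction $-A^{-1}\nabla L$ coincides with the elementwise PSG direction $-s \odot \nabla L$.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(0.1, 2.0, 5)            # elementwise PSG scales s(x_i) > 0
grad = rng.normal(size=5)               # gradient at the current iterate

A = np.diag(1.0 / s)                    # Lemma 1 (Eq. 18): A = diag(1/s(x_i))
v_sd  = -np.linalg.inv(A) @ grad        # unnormalized steepest descent, Eq. 24
v_psg = -s * grad                       # elementwise PSG direction, Eq. 25

print(np.allclose(v_sd, v_psg))         # True
```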
Theorem 2: Given weight spaces $\mathcal{X}, \mathcal{Y} \subset \mathbb{R}^{n}$ and a symmetric, positive-definite matrix $A \in \mathbb{R}^{n \times n}$, let $\mathcal{X}$ and $\mathcal{Y}$ be the weight spaces in which the PSG descent method and the gradient descent method operate, respectively. Then, the linear transformation from $\mathcal{X}$ to $\mathcal{Y}$ at iteration $t$ is given by

$$\mathbf{y}^{t} = (A^{t})^{1/2}\,\mathbf{x}^{t}. \tag{26}$$

Thus, for a weight element $x^{t}_{j}$ with small quantization error, the $j$-th basis is expanded inversely proportionally to the error, rendering $x^{t+1}_{j}$ in the vicinity of the target point for a given update.
Proof: For simplicity of notation, $t$ is omitted below, as the proof applies to any fixed $t$. Consider the loss function defined in $\mathcal{Y}$:

$$\bar{L}(\mathbf{y}) = L(A^{-1/2}\mathbf{y}) = L(\mathbf{x}). \tag{27}$$
The gradient descent direction in $\mathbf{y}$ is given by

$$\Delta \mathbf{y} = -\nabla \bar{L}_{\mathbf{y}}(\mathbf{y}) = -A^{-1/2}\,\nabla L_{\mathbf{x}}(\mathbf{x}). \tag{28}$$

Applying the inverse transformation of Eq. 26 yields the descent direction in $\mathbf{x}$,

$$\Delta \mathbf{x} = A^{-1/2}\,\Delta \mathbf{y} = -A^{-1}\,\nabla L_{\mathbf{x}}(\mathbf{x}),$$

which is equivalent to the unnormalized steepest descent direction in the quadratic norm given by Eq. 24.
By Lemma 1, this is equivalent to the PSG descent.

IV. EXPERIMENTS
To verify the effectiveness of PSGD for model compression, we apply it in three domains: pruning, quantization, and knowledge distillation. For pruning, we train a sparse model by setting the target point to 0, without any pruning method, and compare it with L0 regularization [35] and SNIP [36], a regularization method and a single-shot pruning method that do not require additional fine-tuning or pruning schedules. Then, we apply PSGD with iterative pruning methods, which require a pruning phase and schedule, using a magnitude-based pruning criterion. For quantization, we compare PSGD with regularization methods that train the model from scratch with a regularizer [6], [7], [8]; we choose the L1, L2, and Lipschitz regularization methods as baselines, following the original paper of [8]. We also compare with layer-wise PTQ methods that utilize a pre-trained model [12], [14]. Finally, we apply PSGD at extremely low bit-widths (2 and 3 bits), on various architectures, and with the Adam optimizer [42] to verify its adaptability.
For knowledge distillation, we conduct experiments to validate the compatibility of PSGD with knowledge distillation by applying it to the KD loss. To show the adaptability of PSGD with post-training methods, we also apply a post-training method to a PSGD-trained model.
A. IMPLEMENTATION DETAILS
1) HYPER-PARAMETER λ_s
λ_s is the hyper-parameter related to the scaling function (Eq. 8). We searched for a λ_s that does not degrade the performance of the uncompressed model, similar to [8]. For the search, we split the whole training set into two disjoint subsets, a training set and a validation set. After finding the hyper-parameter, we train the model on the whole training set with the chosen value. Tables 1 and 2 show the values of λ_s used in the experiments. λ_s tended to rise for lower target bit-widths or for higher sparsity ratios. On CIFAR-10, we observe that the same λ_s value yields fair performance across all bit-widths, whereas CIFAR-100 and ImageNet need a wider range of values.

2) METHODS
All experiments are conducted with the PyTorch framework. For single-shot pruning, we used ResNet-32 [9] on CIFAR-100, following the training hyper-parameters of [48]. We used the released official implementation of [35] and re-implemented [36] for the PyTorch framework. In the iterative pruning of Table 4, we followed the same setting as [44]. For the quantization experiments of Tables 6 and 7, we used ResNet-18 and followed the settings of [8] for CIFAR-10 and ImageNet. For [14], the released official implementation was used. All other numbers are either from the original papers or re-implemented. For fair comparison, all quantization experiments followed layer-wise uniform symmetric quantization [11], and when quantizing the activations, we clipped the activation range using batch normalization parameters as described in [12], the same as [8]. PSGD is applied for the last 15 epochs in the ImageNet experiments and from the first learning-rate decay epoch in the CIFAR experiments. We use an additional 30 epochs for PSGD in the extremely low-bit experiments (Table 8). Also, we tuned the hyper-parameter λ_s for each bit-width and sparsity ratio; our search criterion is to ensure that the performance of the uncompressed model is not degraded, similar to [8].

3) DATASETS
We use the CIFAR-10/100 and ImageNet datasets for the experiments. CIFAR-10 consists of 50,000 training images and 10,000 test images in 10 classes with 6,000 images per class. CIFAR-100 consists of 100 classes with 600 images per class. The ImageNet dataset consists of 1.2 million training images; we use the 50,000 validation images, which are not included in the training samples, for testing. We use the conventional data pre-processing steps.

a: ImageNet / CIFAR-10
For ResNet-18, we started training with an L2 weight decay of 10^-4 and a learning rate of 0.1, then decayed the learning rate by a factor of 0.1 every 30 epochs. Training was terminated at 90 epochs. We used only the last 15 epochs for training the model with PSGD, similar to [8]; that is, we applied the PSG method after 75 epochs with learning rate 0.001. For the extremely low-bit experiments, we did not use any weight decay after 75 epochs. We tuned the hyper-parameter λ_s for the target bit-widths. All numbers are results of the last epoch. We used the official code of [14] for comparisons, with 0.02 for the Expand Ratio.

b: CIFAR-100
For ResNet-32, the same weight decay and initial learning rate were used as above, and the learning rate was decayed at epochs 82 and 123, following [48]. Training was terminated at 150 epochs. For VGG16 with batch normalization (VGG16-bn), we decayed the learning rate at epoch 145 instead. We applied PSG after the first learning-rate decay. The first convolutional layer and the last linear layer are quantized at 8-bit for the 2-bit and 3-bit experiments. For sparse training, training was terminated at 200 epochs and weight decay was not used at higher sparsity ratios, while all other training hyper-parameters were the same. For [35], we used the official implementation for the results.

B. PRUNING
1) SINGLE-SHOT PRUNING
To verify that PSGD can train a sparse model by setting the target point at zero, we apply magnitude-based pruning [43] after PSGD training across different sparsity ratios. This setting allows a fair comparison with the sparsity regularization method and the single-shot pruning method because no fine-tuning is needed. Table 3 shows that our PSGD outperforms the two competitive methods in terms of maintaining performance across different sparsity ratios. Although all methods show promising results at low sparsity (∼10%), [35] suffers a significant accuracy degradation, a phenomenon also observed in another work [49]. Single-shot pruning [36] maintains performance relatively well at high sparsity, but PSGD is more potent in making a sparse model. Fig. 6 shows the distribution of weights in SGD- and PSGD-trained models, confirming that the PSGD weights are well clustered at the zero target value.
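For concreteness, a minimal sketch of global magnitude-based pruning as applied once after PSGD training (the helper name is ours; per-layer variants are analogous):

```python
import torch

def magnitude_prune(model, sparsity):
    """Global magnitude pruning: zero out the smallest-|w| fraction of
    all weights. Applied once after PSGD training, with no fine-tuning."""
    flat = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    k = int(sparsity * flat.numel())
    if k == 0:
        return
    threshold = flat.kthvalue(k).values
    with torch.no_grad():
        for p in model.parameters():
            p *= (p.abs() > threshold).float()
```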

2) ITERATIVE PRUNING
We also consider the iterative pruning case. Compared to single-shot pruning methods, iterative pruning gradually increases the sparsity while training, which helps recover performance via the fine-tuning steps included in the iterative pruning schedule. We bring PSGD to the iterative pruning methods DPF [44] and AC/DC [47]. Table 4 shows the performance of our method alongside competitive pruning methods, where DPF+PSGD and AC/DC+PSGD refer to combining DPF and AC/DC with PSGD, respectively. PSGD is also very effective in the iterative pruning schemes, showing promising results when the iterative pruning method is combined with gradient scaling in SGD. These results verify that PSGD performs comparably to the competitive iterative pruning methods using only their schedules. This is possible because PSGD merely scales the gradient to regularize the weights toward zero, which is friendly to pruning. We also report the multiply-accumulate operations (MACs) and floating-point operations (FLOPs) of the dense model and the PSGD model to compare efficiency in Table 5.

C. QUANTIZATION
1) ON-THE-FLY QUANTIZATION
Table 6 provides the results in the on-the-fly quantization domain on CIFAR-10 and ImageNet. (Table 6: test accuracy of regularization methods without a post-training process for ResNet-18 on the ImageNet and CIFAR datasets; PSGD@W# indicates that the target number of weight bits in PSGD is #; all numbers except ours are from [8]; at #-bit, PSGD@W# performs best in most cases.) The on-the-fly setting evaluates performance across various bit-widths using a single model without any modification. We followed the same setting as [8]. PSGD performs well on CIFAR-10 at every bit-width. On ImageNet, PSGD targeting 8-bit and 6-bit (PSGD@W8 and PSGD@W6) shows promising accuracy except at 4-bit. Gradient L1 (λ = 0.05) and PSGD@W4 maintain the performance of the quantized models even at 4-bit. In general, PSGD outperforms the other competitive methods at every bit-width owing to its quantization-friendly weight distributions.

2) COMPARISON WITH POST-TRAINING QUANTIZATION
Table 7 shows the performance of post-training quantization methods and the PSGD model. PTQ methods suffer drastic drops at low bits, as depicted in Fig. 1 of the original DFQ paper [12]. On the other hand, PSGD outperforms OCS at 4-bit by about 19% without any post-training method. PSGD can also be combined with a PTQ method, as stated in Sec. IV-H.

3) EXTREMELY LOW BITS QUANTIZATION
Naturally, the lower the bit-width of a model, the more the performance of the quantized model decreases because of reduced representation power. To verify the extendability of PSGD to extremely low bit-widths such as 2-bit and 3-bit, we conducted experiments targeting 2-bit and 3-bit, keeping the first layer, the last layer, and the activation maps at 8-bit (Table 8). These experiments show that PSGD can be a key solution for extremely low-bit quantization.

D. ADAM OPTIMIZER WITH PSG
All previous experiments were conducted with the plain stochastic gradient descent method. To confirm that PSGD can be combined with other optimizers such as Adam, we apply PSGD with the Adam optimizer using ResNet-32 on the CIFAR-100 dataset. We followed the same setting except for the initial learning rate, for which 10^-3 was used. Table 9 shows that PSGD can also be used with another type of optimizer.
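One plausible way to realize this combination is to feed the position-scaled gradient into Adam's usual moment estimates; the following is our sketch of that idea, not the authors' verified implementation:

```python
import torch

def adam_psg_step(p, grad, state, lr, lambda_s, delta,
                  betas=(0.9, 0.999), eps=1e-8):
    """One Adam step taken on the position-scaled gradient. Initialize
    state as {'step': 0, 'm': torch.zeros_like(p), 'v': torch.zeros_like(p)}."""
    xbar = delta * torch.round(p / delta)
    g = lambda_s * 4.0 * ((p - xbar).abs() + 1e-8) * grad  # PSG scaling (Eq. 7)
    state['step'] += 1
    state['m'] = betas[0] * state['m'] + (1 - betas[0]) * g
    state['v'] = betas[1] * state['v'] + (1 - betas[1]) * g * g
    m_hat = state['m'] / (1 - betas[0] ** state['step'])
    v_hat = state['v'] / (1 - betas[1] ** state['step'])
    return p - lr * m_hat / (v_hat.sqrt() + eps)
```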

E. VARIOUS ARCHITECTURES WITH PSGD
To verify the adaptability of PSGD across model architectures, in this section we show the results of applying PSGD to various architectures. Table 10 shows the quantization results of VGG16 [50] with batch normalization on the CIFAR-100 dataset and of DenseNet-121 [51] on the ImageNet dataset, respectively.
For DenseNet, we ran an additional 15 epochs from the pre-trained model to reduce the training time. For a fair comparison in terms of the number of epochs, we also trained SGD for an additional 15 epochs with the same final learning rate (0.001); however, we only observed oscillation in performance during the additional epochs. Similar to the extremely low-bit experiments, we fixed the activation bit-width to 8-bit. For VGG16 on the CIFAR-100 dataset, a tendency similar to ResNet-32 was observed: the 4-bit targeted model was able to maintain its full-precision accuracy, while models targeting lower bit-widths had some accuracy degradation.

F. KNOWLEDGE DISTILLATION
In this part, we show the adaptability of PSGD, which only manipulates the magnitude of the gradients from the loss function, by applying it with another regularizer, knowledge distillation. We follow the update rule of Eq. 12 for quantization within a KD framework, utilizing a powerful teacher network to train a relatively small student network. We conduct two experiments, on CIFAR-100 and ImageNet. On CIFAR-100, we use ResNet-32 as the student and ResNet-54 as the teacher network; on ImageNet, we use ResNet-18 and ResNet-34 as the student and teacher, respectively. Tables 11 and 12 show a similar tendency: regardless of bit-width, network, and dataset, combining KD and PSGD (Eq. 12) outperforms using PSGD alone (Eq. 8). In this respect, we validate that PSGD can be used alongside other regularizers because of its adaptability.

G. QUANTIZATION-AWARE TRAINING VS PSGD
Conventional QAT methods [17], [18], [52] start with a pre-trained model initially trained with SGD and further update the weights by considering only the low-precision weights. In contrast, regularization methods such as our work and [8] start from scratch and update the full-precision weights analogously to SGD. In our work, the sole purpose of PSGD is to find a set of full-precision weights that are quantization-friendly, so that versatile deployment at low precision (LP) is possible without further operation. Therefore, regularization methods start from the initial training phase like SGD, whereas QAT methods start from a pre-trained model produced by an initial training phase such as SGD or PSGD, and their purpose is solely focused on the LP weights. In general, QAT updates the weights with a coarse gradient obtained by forwarding the LP weights instead of the FP weights, using the straight-through estimator (STE) [11]. Additionally, the quantization scheme is modified to include trainable parameters dependent on the low-precision weights and activations. Thus, QAT cannot maintain full-precision performance, as it only focuses on low-precision performance at a specific bit-width such as 4-bit.
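For reference, a minimal sketch of the straight-through estimator used by QAT methods (the class name is ours):

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in the forward pass; pass the gradient straight through,
    treating rounding as the identity in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

# Usage: w_q = RoundSTE.apply(w / delta) * delta   # differentiable fake-quant
```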

H. POST-TRAINING WITH PSGD-TRAINED MODEL
Our model attains full-precision performance similar to SGD and reasonable performance at low precision even with naive quantization. Thus, a PSGD-trained model can potentially be used as a pre-trained model for QAT or PTQ methods. We performed additional experiments using the model trained with PSGD in Table 7 and Fig. 7 by applying a concurrent PTQ work, LAPQ [21], using its official code (https://github.com/ynahshan/nn-quantization-pytorch/tree/master/lapq). This attains 66.5% accuracy for W4A4, which is 3.1 and 6.2 percentage points higher than PSGD-only and LAPQ-only, respectively, as shown in Table 13. This shows that PTQ methods can benefit from using our pre-trained model.

V. ANALYSIS
A. THE WARPED WEIGHT SPACE AND THE ORIGINAL WEIGHT SPACE
In this section, we provide an analysis of the weight distribution in both the warped weight space and the original weight space. We have already proved that PSGD in the original weight space is equal to SGD in the warped space, and weights in the warped space can be calculated by Eq. 5. Fig. 8 shows the weight distributions trained with SGD and PSGD. Based on Eq. 5, if the quantization error is small, the warped weights converge at c(x̄), which is closely related to x̄. The results in Fig. 8 reflect this phenomenon: warped weights of PSGD converge at specific points, whereas warped weights of SGD are spread around those points.
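To reproduce this analysis, the warped weights can be computed directly from Eq. 5; a minimal sketch (the function name is ours, constants as in Sec. III-B):

```python
import torch

def warp(w, delta, eps=1e-8):
    """Map weights into the warped space via Eq. 5, e.g. for histograms."""
    xbar = delta * torch.round(w / delta)                  # nearest grid point
    d = w - xbar                                           # quantization error
    c = (2.0 * xbar / delta) * (delta / 2.0 + eps) ** 0.5  # continuity offset
    return torch.sign(d) * torch.sqrt(d.abs() + eps) + c
```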

B. TOY EXAMPLE
We provide a toy example to give an intuition for the PSGD solution. We trained two models, one with the SGD optimizer and one with the PSGD optimizer, under a 2-bit target on the MNIST dataset. Each model consists of two layers containing 50 and 20 neurons. We show the weight distribution of the first layer and the eigenvalues of the Hessian matrix.

1) MULTI-MODAL WEIGHT DISTRIBUTION
SGD produces a bell-shaped weight distribution, which is not suitable for model compression. In contrast, PSGD generates a multi-modal weight distribution: as we choose 2-bit as the target, three modes exist in the weight distribution of the PSGD model, as depicted in Fig. 9a. PSGD has nearly the same accuracy as FP (∼96%) at W2A32, whereas the accuracy of SGD at W2A32 is about 9% although its FP accuracy is 97%. This tendency is also shown in Fig. 2b, which demonstrates that PSGD reduces the quantization error.

2) QUANTIZED AND SPARSE MODEL
Because symmetric quantization also contains zero as a target point, a large proportion of the weights of the PSGD model lie near zero, so the model is simultaneously sparse as well as quantization-friendly. Consistently, we note that the sparsity of ResNet-18@W4 shown in Table 6 is 72.4% at LP.

3) CURVATURE OF PSGD SOLUTION
In Sec. III-D and Fig. 5, we claimed that PSG finds minima in sharp valleys that are more compression-friendly but have less chance of being found by vanilla SGD. As the curvature along a Hessian eigenvector is determined by the corresponding eigenvalue [53], we compare the curvature of the solutions yielded by SGD and PSGD by assessing the magnitudes of the eigenvalues, similar to [54]. SGD provides minima with relatively wide valleys, having many near-zero eigenvalues; a similar tendency is observed in [54]. In contrast, the weights trained by PSGD have many more large positive eigenvalues, which means the solution lies in a relatively sharp valley. Specifically, the number of large eigenvalues (λ > 10^-3) in PSGD is 9 times that of SGD. From this toy example, we confirm that PSG helps find minima that are more compression-friendly (Fig. 9a) and lie in sharp valleys (Fig. 9b) hard to reach by vanilla SGD. We have also used the official code of [55] (https://github.com/tomgoldstein/loss-landscape) to qualitatively assess the curvature under the same experimental setting as Fig. 9; the result, depicted in Fig. 10, shows a similar tendency: the PSGD solution lies in a sharper valley than that of SGD.
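A small sketch of this curvature measurement (our illustration on a linear least-squares toy problem; the paper's toy network is a two-layer MLP):

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
X, y = torch.randn(64, 10), torch.randn(64, 1)

def loss_fn(w):                              # tiny least-squares model
    return ((X @ w.reshape(10, 1) - y) ** 2).mean()

w = torch.randn(10)
H = hessian(loss_fn, w)                      # 10 x 10 Hessian at w
eigvals = torch.linalg.eigvalsh(H)           # eigenvalues (H is symmetric)
n_sharp = (eigvals > 1e-3).sum().item()      # count of 'large' curvature dirs
print(eigvals, n_sharp)
```

Counting eigenvalues above a small threshold, as in the last line, mirrors the comparison reported above.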

VI. CONCLUSION
We propose a new regularization method for model compression. The position-based scaled gradient (PSG) scales the gradient to merge the current weights toward specific target points, such as quantization bins or zero, which are compression-friendly. Hence, by training the model with PSGD, we can obtain a compression-friendly model. We proved that optimizing the model with position-based scaled gradient descent (PSGD) in the original space is equivalent to optimizing the model in the warped space with SGD. The proposed PSGD can be applied to quantization and pruning. We also showed that PSGD performs well with knowledge distillation, which is an explicit regularization. We expect PSGD to facilitate further research in model compression, including quantization and pruning.