Distilled Gradual Pruning With Pruned Fine-Tuning

Neural networks (NNs) have been driving machine learning progress in recent years, but the growing size of these models presents challenges in resource-limited environments. Weight pruning reduces computational demand, but often at the cost of performance degradation and long training procedures. This work introduces distilled gradual pruning with pruned fine-tuning (DG2PF), a comprehensive algorithm that iteratively prunes pretrained NNs using knowledge distillation. We employ a magnitude-based unstructured pruning function that selectively removes a specified proportion of unimportant weights from the network. This function also leads to an efficient compression of the model size while minimizing classification accuracy loss. Additionally, we introduce a simulated pruning strategy that provides the benefits of weight recovery while maintaining stable convergence. Furthermore, we propose a multistep self-knowledge distillation strategy to effectively transfer the knowledge of the full, unpruned network to its pruned counterpart. We validate the performance of our algorithm through extensive experimentation on diverse benchmark datasets, including CIFAR-10 and ImageNet, as well as on a set of model architectures. The results highlight how our algorithm prunes and optimizes pretrained NNs without substantially degrading their classification accuracy while delivering significantly faster and more compact models.


I. INTRODUCTION
Deep neural networks (NNs) have shown state-of-the-art performance on various visual tasks, such as image classification [1], [2], [3], [4], object detection [5], [6], and semantic segmentation [7], [8]. Despite their success, the substantial size and computational demands of these models present a major challenge for their deployment on resource-limited devices. Several compression techniques have been developed to reduce the size and computational demands of deep NNs while retaining their performance and thus overcome the aforementioned challenges. Neural architecture search (NAS) has been explored as a method to design efficient architectures; for instance, in [9] an optimization for specific hardware platforms is proposed, and in [10] a curriculum search strategy that progressively expands the search space is explored. Techniques such as the contrastive learning framework [11], the "once-for-all" approach [12], and the neural architecture Transformer [13] have further advanced the field. Lastly, the disturbance-immune update strategy [14] addresses the performance disturbance issue in weight-sharing NAS methods. However, while NAS offers automated design, the need for more direct compression techniques remains paramount. This is where pruning comes into play. This work delves into the intricacies and advancements of pruning techniques.
The primary goal of weight pruning is to remove nonrelevant weights from a NN. This process aims to reduce the network's size and computational requirements while minimizing the loss of performance. There are two types of pruning methods: structured and unstructured. Structured pruning involves modifying or removing layers or parts of the network. This method may change the input and output dimensions of the layers, which can cause issues in networks with long-range dependencies among layers [15]. This problem is often circumvented by constraining pruning to target only structures that do not induce such issues, such as filters [16] and channels [17], [18], or by adopting a mixed approach [19]. Whatever the pruning method, it usually requires careful fine-tuning [20] to maximize performance. However, such constraints are expected to decrease the efficiency of pruning. Unstructured pruning, on the other hand, produces sparse matrices that are difficult to accelerate [21], although some recent works challenge this claim [22], [23]. In this context, different strategies for unstructured pruning have been proposed over the years in several application areas. The optimal brain damage algorithm [24] and the magnitude-based pruning algorithm [25] are two popular unstructured pruning techniques. Other popular methods include Taylor expansion pruning [26], which prunes based on the second-order Taylor approximation of the loss function, and random pruning [25], which prunes randomly to improve computation times. However, simply pruning the weights may lead to a drop in performance. To this extent, weight recovery between training cycles [27], [28] and fine-tuning the pruned model through additional training have proven to be effective approaches to mitigate this issue [29].
As a popular approach for model compression, knowledge distillation has received significant attention in recent years [30], [31]. The basic idea behind this technique is to train a smaller model, referred to as the student model, to mimic the behavior of a larger model, referred to as the teacher model. The student model is trained by minimizing the difference between its predictions and those of the teacher model, which is often a pretrained NN. Self-distillation refers to a knowledge distillation approach in which a NN is distilled into a smaller, more compact version of itself [32].
Research is moving toward more complex pruning and distillation strategies, often at a large computational cost; Srinivas et al. [28] introduced a cyclical pruning and weight-recovery schedule, but significantly increased the complexity of the algorithm at the price of only a slight classification improvement.
We present a novel unstructured pruning algorithm that seamlessly integrates knowledge distillation techniques to achieve significant model compression without compromising accuracy. Our proposed method commences with a gradual weight pruning phase that employs knowledge distillation to remove unimportant weights and reduce the model size. Once the desired sparsity level is achieved, the model undergoes a distilled fine-tuning process until convergence. This is followed by a final fine-tuning process without the teacher. We demonstrate that our approach outperforms existing methods in terms of compression-accuracy tradeoffs through extensive experimental evaluations conducted on publicly available benchmark datasets. These results show that the algorithm has a potential impact in the field of deep learning by enabling the deployment of large, accurate models on a wide range of devices with limited computational resources, and by average users.
To summarize, the contributions of this work are as follows.
1) We build upon a well-known baseline function exploiting magnitude-based unstructured pruning to minimize memory and storage requirements by selectively removing a specified proportion of weights from a pretrained NN.
2) We propose a unique simulated pruning technique. This method stands out as it replicates the benefits of weight recovery while consistently maintaining stable convergence. Notably, this is achieved at each training iteration, setting it apart from conventional practices in the weight-recovery literature.
3) We introduce distilled gradual pruning with pruned fine-tuning (DG2PF), a comprehensive algorithm that integrates unstructured weight pruning and knowledge distillation to prune pretrained NNs without incurring a substantial reduction in performance.
4) We conduct experiments on publicly available benchmark datasets and models to validate the performance of our method. The results of this evaluation provide quantifiable evidence of the effectiveness of the proposed algorithm.
The rest of this article is organized as follows: Section II presents the related work, where we review and discuss the existing literature relevant to our study; Section III details the proposed algorithm, including pseudocode; Section IV details the evaluation of DG2PF and the comparative studies with state-of-the-art pruning techniques on two representative datasets; and Section V presents the conclusion and future work.

II. RELATED WORK
Pruning in NNs can involve either structured pruning, which removes model structures, or unstructured pruning, which removes individual parameters. In general, structured pruning methods [16], [33] do not depend on specialized hardware. In contrast, unstructured pruning approaches [34], [35] explicitly require support for sparse computations. Recent advancements in structured pruning include [17], which aims to enhance network performance through channel pruning by eliminating redundant components. The work in [18] offers a distinctive method for lossless channel pruning, drawing inspiration from neurobiology, and ensures structured sparsity without sacrificing accuracy. Meanwhile, Liu et al. [36] introduce a combined approach of discrimination-aware channel and kernel pruning. In the context of unstructured pruning, there are three distinct pruning schedules: one-shot, gradual, and cyclical pruning. One-shot pruning [37] involves the simultaneous removal of unimportant weights in a single step, followed by a final fine-tuning stage. Gradual pruning [27] prunes the network weights over multiple iterations; this approach is interleaved with training steps and culminates in a final fine-tuning stage. Cyclical pruning [28] involves multiple gradual pruning schedules, with weight recovery at the beginning of each cycle. Parameter-efficient masking networks [38] introduce a new paradigm for model compression that utilizes one randomly initialized layer accompanied by different masks, so that the model can be expressed as a single layer with a set of masks. The work in [39] smoothly induces sparsity while learning pruning thresholds, providing a nonuniform sparsity budget. This article, inspired by [27], proposes an algorithm that fuses pruning and knowledge distillation techniques, introducing a novel approach called simulated pruning. Simulated pruning introduces weight recovery without the need for cyclical schedules. In [40] and [41], the authors suggest automatically tuning thresholds for magnitude pruning to improve global sparsity by removing unimportant weights based on their absolute value. Alternative approaches to magnitude pruning, such as second-order [24], [42] and Fisher-based [43], [44] approximations of the loss function, have been proposed. However, recent work [45] suggests they may not be more effective, especially when combined with fine-tuning. Probabilistic pruning approaches, such as those described in [46] and [47], involve stochastic relaxations, but research [48] shows they often perform similarly to simple magnitude-based methods. The works described in [49] and [50] use gradient updates computed on a sparse proxy model by exploiting the straight-through estimator (STE), similarly to [51] and [52], and claim that this method can lead to weight recovery. These approaches make use of one-shot pruning. However, Srinivas et al. [28] show that weight recovery is complicated to achieve in practice in this setting.
Knowledge distillation is a compression strategy that transfers relevant feature representations from a larger teacher network to a smaller student network, followed by fine-tuning. This method was proposed in [53] for networks that tackle the classification task. The approach introduces a distillation loss that utilizes the softened output of the teacher network's last layer. In [30], the authors improved the performance of this approach by using an intermediate representation of the teacher model as a hint in addition to the output layer. In [54], knowledge distillation is applied to the ResNet architecture by minimizing the L2 loss between the Gramian feature matrices of the ResNet modules of teacher and student. As in our article, recent works [55], [56] combine pruning and distillation for optimal performance.

III. PROPOSED METHOD: DG2PF
In this section, we describe our proposed method, called DG2PF. The algorithm is composed of two phases. The first phase, called distilled gradual pruning (DGP) (Algorithm 1), incorporates two distinct types of pruning mechanisms. The first type of pruning is carried out according to the procedure outlined in Section III.A. This pruning is applied gradually, once per epoch, during the first phase, until the desired sparsity level is attained. We call the second type of pruning "simulated," as described in Section III.B. This type of pruning is performed during each iteration of every epoch of the DGP phase. It selectively removes and recovers a portion of the weights that have not yet been pruned in the network. The second phase is called pruned fine-tuning (PF) (Algorithm 2) and starts upon completion of the first one. Here, the network has already been pruned to its intended sparsity level, and the simulated pruning strategy is terminated. This phase aims at recovering most of the performance lost during DGP. In Section III.C, a knowledge distillation strategy is presented. It merges two knowledge distillation losses, namely, the Kullback-Leibler (KL) divergence and the performance-weighted loss. In Section III.D, we present DG2PF, our novel two-phase algorithm that merges the techniques mentioned above.

A. Pruning Function
In line with previous research [40], [41], we operate under the assumption that weights with magnitudes closer to zero have less impact on the final output of a NN. Therefore, we propose to prune these weights by collapsing them to zero and flagging them as pruned [27], [28], [37]. The rationale behind this assumption is that weights with smaller magnitudes have minor effects on the output of the NN. This can be deduced by considering the activation functions commonly used in NNs. In these activation functions, the signal is passed through a hard or soft threshold, which means that small changes in the input signal do not affect the output, or affect it only marginally, unless they cross this threshold. Thus, weights with smaller magnitudes have a lower probability of crossing the threshold and are therefore less influential in determining the final output. Based on these assumptions, we can remove the weights with smaller magnitudes without a significant loss of accuracy. Consequently, the number of parameters in the network is reduced, improving its efficiency without significant performance degradation.
Let s ∈ R be the chosen sparsity of the network, with 0 < s < 1. Each weight θ_i of a NN parameterized by θ is pruned as follows: θ_i is set to zero, and flagged as pruned, whenever it falls between the margins m_l and m_r, and is left unchanged otherwise, where m_l, m_r ∈ R are computed as the ((1 − s)/2)th and (s + (1 − s)/2)th percentiles of the weights θ, respectively. The weights falling inside these margins are thus pruned. Fig. 1 shows an example of the margins and of the weights to prune on a pretrained network. Moreover, Fig. 2 depicts an example of pruned and unpruned parameters within a layer of a sparse network.
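As a concrete illustration, the sketch below shows one way this margin-based magnitude pruning could be realized in PyTorch. The function name, the global (rather than per-layer) computation of the margins, and the mask bookkeeping are illustrative assumptions rather than the authors' implementation.

import torch

def magnitude_prune(model, sparsity: float):
    """Prune a fraction `sparsity` of the weights of `model` by zeroing those
    that fall between the left and right margins m_l and m_r (Section III-A).
    Returns per-parameter boolean masks marking the weights that remain unpruned."""
    all_weights = torch.cat([p.detach().flatten() for p in model.parameters()])
    sorted_w, _ = torch.sort(all_weights)
    n = sorted_w.numel()
    m_l = sorted_w[int((1.0 - sparsity) / 2.0 * (n - 1))]                # left margin
    m_r = sorted_w[int((sparsity + (1.0 - sparsity) / 2.0) * (n - 1))]   # right margin

    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            keep = (p < m_l) | (p > m_r)    # weights outside the margins survive
            p.mul_(keep.to(p.dtype))        # collapse pruned weights to zero
            masks[name] = keep
    return masks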

B. Simulated Pruning Function
We assume that reducing the importance of weights that are likely to become zero during the upcoming pruning stage has a comparatively minor impact on the network's overall performance. When we employ this technique, we essentially carry out a cyclical pruning step in a single training iteration on a single batch of data. This means that in each iteration we start with the (simulated) pruning stage and recover the pruned weights by the end of the iteration. This methodology stands in contrast to the approach presented in [28], where the pruning process is initiated only after completing a predetermined number of training epochs. In particular, in [28] each cycle spans several training epochs and ensures that weights undergo gradual pruning, so that only a fraction is restored at the end of the cycle. A notable limitation emerges when these weights, especially in the earlier stages of the cycle, are pruned based on a constrained pool of information. This predominantly happens when specific policies, such as magnitude-based pruning, are adopted. Despite the evident efficacy of the cyclical pruning mechanism, our methodology addresses and rectifies its core shortcomings. We guarantee that the heuristic responsible for the pruning decision always has the same amount of information available for each weight, facilitating both pruning and recovery within each iteration, ensuring an informed decision-making process, and enabling more stable convergence. We use the straight-through estimator (STE), thus allowing the gradient to pass through the weights pruned in this phase. As theoretically shown in [57], this technique speeds up the learning process and helps ensure stability.
Let s_sim ∈ R be the chosen simulated sparsity of the network, with 0 < s_sim < 1. At the start of each training step of the first phase of the algorithm, a fraction s_sim of the unpruned weights is pruned and then recovered after the backpropagation of the loss. Each weight θ_i of a NN parameterized by θ is pruned according to a probability governed by the threshold m_s, where m_s corresponds to the (1 − s_sim)th percentile of a vector p ∈ [0, 1]^|θ| obtained from the weights θ and from the vector 1 = [1]^|θ|, a vector of the same length as θ in which each position is filled with 1.
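A minimal sketch of how such a simulated pruning step could look in a PyTorch training loop is given below. Since the exact probabilistic scoring of (2)-(3) is defined by the paper's equations, the magnitude-based selection used here is only an illustrative stand-in; the loss_fn callback and the masks dictionary (from the pruning sketch above) are likewise assumptions.

import torch

def simulated_pruning_step(model, masks, s_sim, loss_fn, batch):
    """One training iteration with simulated pruning (Section III-B), as a sketch.

    `masks[name]` is True for weights that are still unpruned. A fraction `s_sim`
    of the unpruned weights is temporarily zeroed before the forward pass and
    restored after backpropagation, so gradients still reach them
    (straight-through behavior)."""
    saved = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            unpruned = masks[name]
            vals = p[unpruned].abs()
            k = int(s_sim * vals.numel())
            if k < 1:
                continue
            thresh = torch.kthvalue(vals, k).values
            sim = unpruned & (p.abs() <= thresh)   # weights to prune temporarily
            saved[name] = (sim, p[sim].clone())
            p[sim] = 0.0                           # simulated pruning

    loss = loss_fn(model, batch)   # forward pass on the temporarily pruned model
    loss.backward()                # gradients flow to all (even zeroed) weights

    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in saved:      # recover the simulated-pruned weights
                sim, vals = saved[name]
                p[sim] = vals
    return loss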

C. Knowledge Distillation Procedure
As will be described in Section III.D, the proposed method follows two phases. The first one entails knowledge distillation from the original, unpruned model. The loss used to train the student model during these steps combines a variation of the performance-weighted loss [58] and the pointwise KL divergence loss [59].
The rationale behind their combination is to use the performance-weighted loss to gain resilience against outliers and challenging instances, and the pointwise KL divergence to align the student model's output distribution with that of the teacher model.
The performance-weighted loss is a modification of the well-known cross-entropy loss. The cross-entropy loss is commonly used for training classification networks and is expressed mathematically as L_CE = −(1/B) Σ_{i=1}^{B} y_i · log(ŷ_i), where B is the batch size, y_i is the ground-truth label vector for the ith sample, and ŷ_i contains the predicted probabilities for sample i. The logarithm in the formula amplifies the loss when the model is highly confident but incorrect: it grows as the predicted probability approaches 0, penalizing the model for being overly confident in incorrect predictions. As a more robust alternative to the cross-entropy loss, during the distillation phase of the algorithm, a modified version of the performance-weighted loss [58] is employed. In this procedure, each sample is given a weight proportional to the teacher network's confidence when classifying it. Thus, the weight w_i of the sample of index i in a batch is defined starting from the score of the teacher network for the correct class c_i, ŷ^(t)_{i,c_i} ∈ R, with γ > 0 set to 1 and β ∈ [0, 1] set to 0.1. Since (5) puts more emphasis on incorrect labels, the original authors propose to compare the student network's predictions to corrected soft-labels ŷ*_i instead of always using the ground-truth labels: the student network's predictions ŷ_i replace the one-hot encoded ground-truth vector y_i for samples where the model has made a correct classification. Given this, the modified performance-weighted loss is defined as a per-sample weighted cross-entropy over the batch, where B is the batch size, L_CE is the cross-entropy function (4), w_i is the weight of the ith sample in the batch (5), and ŷ*_i is the corrected soft-label vector (6).
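To make the corrected soft-labels concrete, the sketch below implements the weighted cross-entropy described above. The per-sample weights w_i are taken as an input rather than reconstructed, since their exact form in (5) is not reproduced here, and the function name is our own.

import torch
import torch.nn.functional as F

def corrected_soft_label_ce(student_logits, targets, weights):
    """Weighted cross-entropy against corrected soft-labels (sketch of (6)-(7)).

    `weights` plays the role of the per-sample weights w_i of (5). For samples
    the student already classifies correctly, its own detached prediction
    replaces the one-hot ground truth."""
    num_classes = student_logits.size(1)
    student_probs = F.softmax(student_logits, dim=1)
    one_hot = F.one_hot(targets, num_classes=num_classes).to(student_logits.dtype)
    correct = student_probs.argmax(dim=1).eq(targets)
    soft_targets = torch.where(correct.unsqueeze(1), student_probs.detach(), one_hot)

    log_probs = F.log_softmax(student_logits, dim=1)
    ce_per_sample = -(soft_targets * log_probs).sum(dim=1)   # cross-entropy per sample, cf. (4)
    return (weights * ce_per_sample).mean()                  # performance-weighted loss, cf. (7)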
The pointwise KL divergence loss measures the dissimilarity between two probability distributions. It is commonly used in knowledge distillation to match the soft predictions of a larger, pretrained teacher network to those of a smaller student network [53], [59]. In the pointwise KL loss, B is the batch size, while y^(t)_i and y_i, respectively, contain the predictions of the teacher and the student networks on the ith sample.
The final loss function utilized in the first two stages of the procedure is a modified version of the one proposed in a previous study [53]. Here, L_KD emerges as a linear combination of the two sublosses L_KL (8) and L_PW (7), modulated by the parameters α ∈ [0, 1] and τ ∈ R. The coefficient α acts as a balancing factor, determining the proportional influence of L_KL on the overall loss. Meanwhile, τ functions as a temperature parameter. Notably, the combination is weighted by τ², thereby adjusting the scale and sensitivity of the combined loss. In a broader sense, α and τ adjust the balance and sensitivity of the loss function, determining the importance of replicating the teacher network's behavior via L_KL and of classifying examples via L_PW.
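Putting the pieces together, a hedged sketch of the combined loss could look as follows. The τ² scaling of the KL term follows the standard Hinton-style formulation suggested by the text, and the weighted cross-entropy term reuses the corrected_soft_label_ce sketch above, so the exact balance may differ from the authors' equation (9).

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      weights=None, alpha: float = 0.9, tau: float = 0.5):
    """Combined distillation loss, sketched as alpha * tau**2 * L_KL + (1 - alpha) * L_PW."""
    if weights is None:
        weights = torch.ones(targets.size(0), dtype=student_logits.dtype,
                             device=student_logits.device)
    # Pointwise KL divergence between temperature-softened student and teacher distributions (8).
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    l_kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    # Performance-weighted cross-entropy term (7), reusing the sketch above.
    l_pw = corrected_soft_label_ce(student_logits, targets, weights)
    return alpha * tau ** 2 * l_kl + (1.0 - alpha) * l_pw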

D. Distilled Gradual Pruning With Pruned Fine-Tuning
The proposed algorithm is composed of two phases. The first phase, called DGP, involves gradually removing parts of the model while minimizing the loss in classification performance. This process uses self-distillation to make the pruned model behave as much like the original model as possible. The algorithm gradually prunes the model over a specific number s_e of epochs and then continues training until it reaches convergence. The procedure's pseudocode is shown in Algorithm 1. Let δ ∈ [0, s]^{s_e} be a vector containing s_e evenly spaced numbers in increasing order. At the beginning of each epoch i ≤ s_e, the model is pruned to a sparsity of δ_i and then trained on the batched training dataset D_t^(b). At the beginning of each training step, the model undergoes an additional simulated pruning process, as explained in Section III.B. This procedure happens only if epoch i ≤ s_e, and it targets the unpruned weights, temporarily pruning a fraction s_sim of them. Then, the algorithm makes predictions ŷ, ŷ^(t) ∈ R^{b_s × c} using the pruned and teacher models, respectively, where b_s denotes the batch size and c is the number of labels in the dataset. These predictions are compared to the ground-truth labels y ∈ R^{b_s}, and the knowledge distillation loss L is calculated using (9). From this loss, we compute the gradients Δθ and then restore the weights set to zero during the simulated pruning step. After that, the algorithm updates the unpruned weights and proceeds to the next batch in the epoch. At the end of each training epoch, the model is tested on the batched validation dataset D_v^(b), and its top-1 accuracy score is saved. We use AdamW [60] as the optimizer to speed up convergence. The DGP process ends when the maximum number of epochs has been reached or when the top-1 accuracy score on the batched validation dataset D_v^(b) does not improve for a fixed number of epochs, triggering an early stop.
The second phase of the algorithm, known as PF, follows the DGP phase. In this phase, the model is fine-tuned without a teacher, allowing it to focus on classification performance without being constrained by the teacher's predictions. Additionally, the model is not pruned further, as the desired sparsity level was achieved during the previous phase. The pseudocode for PF is provided in Algorithm 2. The algorithm loops through the batches of the training dataset D_t^(b) with the same stopping criteria as the previous phase. During training, the unpruned weights are trained using the cross-entropy loss (4) to enhance classification performance. The unpruned parameters are updated through stochastic gradient descent (SGD) with a low learning rate. We opted for SGD over AdamW since, in our experiments, it yielded better generalization performance.
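For orientation, the following sketch ties the pieces together into the two-phase schedule of Algorithms 1 and 2. It reuses the illustrative helpers magnitude_prune, simulated_pruning_step, and distillation_loss defined above, and the early-stopping and mask re-application details are simplified assumptions rather than the authors' exact implementation.

import torch
import torch.nn.functional as F

def apply_masks(model, masks):
    """Re-zero pruned weights after each optimizer step so they stay pruned."""
    if masks is None:
        return
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_(masks[name].to(p.dtype))

def dg2pf(model, teacher, train_loader, val_loader, s=0.95, s_sim=0.1, s_e=15, max_epochs=100):
    """Sketch of the two-phase DG2PF schedule (Algorithms 1 and 2)."""
    deltas = torch.linspace(0.0, s, s_e)   # gradually increasing sparsity targets
    masks = None
    teacher.eval()

    # Phase 1: distilled gradual pruning (DGP).
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-2)
    for epoch in range(max_epochs):
        if epoch < s_e:
            masks = magnitude_prune(model, float(deltas[epoch]))   # prune to delta_i sparsity
        for images, labels in train_loader:
            with torch.no_grad():
                t_logits = teacher(images)
            opt.zero_grad()
            if epoch < s_e:
                # Simulated pruning: temporarily zero a fraction s_sim of the unpruned
                # weights, backpropagate the distillation loss, then restore them.
                simulated_pruning_step(
                    model, masks, s_sim,
                    lambda m, b: distillation_loss(m(b[0]), t_logits, b[1]), (images, labels))
            else:
                distillation_loss(model(images), t_logits, labels).backward()
            opt.step()
            apply_masks(model, masks)   # keep pruned weights at zero
        # ... evaluate top-1 accuracy on val_loader and early-stop when it stops improving ...

    # Phase 2: pruned fine-tuning (PF), without the teacher.
    opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9, weight_decay=5e-4)
    for epoch in range(max_epochs):
        for images, labels in train_loader:
            opt.zero_grad()
            F.cross_entropy(model(images), labels).backward()
            opt.step()
            apply_masks(model, masks)
        # ... evaluate top-1 accuracy on val_loader and early-stop when it stops improving ...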

IV. EXPERIMENTAL RESULTS
In this section, we evaluate our proposal on two widely adopted datasets and compare it to several state-of-the-art methods for unstructured pruning.

A. Datasets
CIFAR-10 [61] is a small dataset containing 50 000 training images and 10 000 test images, split into ten classes. The images in CIFAR-10 are relatively simple and small, making it a popular dataset for testing algorithms and architectures in their early stages of development.
ImageNet (also known as ImageNet-1K) [62] is a much larger and more complex dataset, containing over 1 million training images and 50 000 validation images, split into 1000 classes. ImageNet offers a wide variety of classes, ranging from ordinary objects to abstract concepts, e.g., mountains and handwriting. The larger image size of ImageNet provides a more realistic and challenging benchmark for computer vision models.

B. Metrics
The metric we used to quantify the classification performance of a model is the top-k accuracy. When classifying a sample, the model outputs a probability distribution over the possible labels and is trained to give more weight to the more plausible labels. The top-k predictions Ŷ_{i,k} for a sample of index i are the k labels with the highest scores. This metric measures the proportion of times the model predicts the correct label to be among the top-k predictions, i.e., acc@k = (1/N) Σ_{i=1}^{N} 1[y_i ∈ Ŷ_{i,k}], where N is the number of samples in the dataset, with 1 ≤ i ≤ N, y_i is the true label for the ith sample, and Ŷ_{i,k} is the set of the top-k predicted labels for the ith sample, with |Ŷ_{i,k}| = k.
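A straightforward PyTorch realization of this metric could look as follows; the function name and the tensor layout (logits of shape N × c) are illustrative.

import torch

def top_k_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring predictions."""
    topk = logits.topk(k, dim=1).indices              # (N, k) indices of the top-k classes
    hits = topk.eq(targets.unsqueeze(1)).any(dim=1)   # True where the true label is in the top-k
    return hits.float().mean().item()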
According to typical practices in the related literature, we have decided to present the top-1 accuracy results in comparison with the state of the art in Section IV.E.
We assessed the effectiveness of our compression method using the compression rate metric. The compression rate is calculated from the target sparsity, which represents the fraction of weights that are pruned from the original model, as 1/(1 − s), where 0 < s < 1 represents the target sparsity of the network.
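For completeness, the sketch below computes both the compression rate implied by the target sparsity and a measured counterpart obtained from the actual nonzero parameters; the measured variant is our own illustrative complement to the metric described above.

import torch

def theoretical_compression_rate(s: float) -> float:
    """Compression rate implied by a target sparsity 0 < s < 1 (e.g., s = 0.95 gives 20x)."""
    return 1.0 / (1.0 - s)

def measured_compression_rate(model: torch.nn.Module) -> float:
    """Total number of parameters divided by the number of nonzero (unpruned) parameters."""
    total = sum(p.numel() for p in model.parameters())
    nonzero = sum(int(p.count_nonzero()) for p in model.parameters())
    return total / max(nonzero, 1)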

C. Implementation Details
The experiments were conducted on a high-performance computing (HPC) system equipped with an Nvidia Quadro RTX 6000 GPU with 24 GB of VRAM. Minimal data augmentation was applied to ensure a fair comparison with the previous literature [3], [27], [28], [37]. In addition, this choice reduces the potential confounding effects that could be introduced by more complex data preprocessing and allows for a fairer and more comprehensive evaluation of the impact of the proposed methods. The optimizer used during the self-distillation phase is AdamW [60], with a learning rate of 10^−5, β_1 and β_2 equal to 0.9 and 0.999, and a weight decay of 10^−2. After the teacher is detached from the pruned model, AdamW is replaced with plain SGD with a learning rate of 10^−4, a momentum of 0.9, and a weight decay of 5 × 10^−4. This optimizer swap is motivated by the fact that, in our experiments, AdamW tended to converge in fewer epochs while SGD showed better generalization capabilities. We made this change to improve our model's classification performance. In all experiments, the maximum number of epochs was set to 100 for a fair comparison with other works; however, thanks to the early-stopping strategy and AdamW, no experiment reached this limit.
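Transcribed into PyTorch, the optimizer settings reported above could look as follows; the ResNet-18 instantiation is only a placeholder for the pruned student network.

import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10)   # placeholder student network (CIFAR-10 setting)

# Phase 1 (DGP): AdamW with the hyperparameters reported above.
optimizer_dgp = torch.optim.AdamW(model.parameters(), lr=1e-5,
                                  betas=(0.9, 0.999), weight_decay=1e-2)

# Phase 2 (PF): plain SGD with a low learning rate, favored here for generalization.
optimizer_pf = torch.optim.SGD(model.parameters(), lr=1e-4,
                               momentum=0.9, weight_decay=5e-4)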

D. Ablation Study
In this section, we assess the impact of the hyperparameters used in the pruning and distillation stages of the method. To conduct the ablation study, we selected the CIFAR-10 dataset [61] and the ResNet-18 model, which are relatively small and enable a quicker and more comprehensive evaluation of various combinations of hyperparameters. The model was initially configured with 95% sparsity, a 10% simulated sparsity percentage during self-distillation with α = 0.75, and ten pruning epochs. For each experiment, we trained and tested the model in this base configuration, varying one hyperparameter at a time. Each table row shows the mean and standard deviation of the top-1 accuracy obtained from three runs of the same experiment with different seeds. The notation "acc@1" is used as an abbreviation for the top-1 accuracy.
1) Effect of s for Sparsity: In this study, we examined how increasing the target sparsity s of the model affects classification accuracy. The results are presented in Table I. From the results, we observe that the loss in accuracy is negligible for sparsity values up to 90%, after which the accuracy begins to decline significantly. Specifically, the drop in accuracy from 90% to 95% amounts to 1.65%, which is consistent with the findings of other studies on unstructured pruning [27], [28], [37]. These results demonstrate that while higher sparsity levels can lead to a more compact and efficient model, there is a tradeoff between sparsity and accuracy. Therefore, the target sparsity s should be carefully selected, considering the specific model, dataset, and desired tradeoff between size and accuracy.
2) Number of Pruning Epochs s_e: In this study, we investigated whether increasing the number of pruning epochs leads to a more accurate model. The results are shown in Table II and demonstrate a clear trend of higher accuracy with increased pruning epochs. The peak gain of 1.33% was observed at s_e = 15 compared to the one-shot pruning setting. The gradual and careful selection of the parameters to prune explains this improvement. It should be noted that this result may be further improved if the pruning is performed multiple times per epoch, as shown in [28]; however, this is left for future research and requires further investigation.
3) Simulated Sparsity s_sim: The objective of this study was to observe how the network behaves as the percentage of simulated sparsity is increased. The outcomes of the experiments are presented in Table III. The performance of the network without simulated pruning was better than that with 20% simulated sparsity by 0.11%, but inferior to that with 10% simulated sparsity by 0.37%. This implies that the simulated sparsity level must be cautiously selected, as a higher level may remove too many parameters, making learning more difficult.

4) Knowledge Distillation α: This study aimed to measure the impact of α in the knowledge distillation loss (9) on the accuracy of the model. The results are shown in Table IV. The experiments revealed that the best results were achieved with α values of 25% and 90%. Specifically, the mean top-1 accuracy was improved by 1.71% and 1.74%, respectively, compared to the undistilled setting. It was observed that, in general, the experiments with an α greater than 0 showed better mean top-1 accuracy and reduced standard deviation, indicating that the application of knowledge distillation can improve the model's accuracy.

5) Loss Temperature τ: This study aimed to measure the impact of τ in the knowledge distillation loss (9) on the accuracy of the model. The results are shown in Table V. The experiments revealed that the best result was achieved with a τ value of 0.5, where the mean top-1 accuracy was 92.79%. The accuracy achieved at this temperature was slightly higher than the others, with a very low standard deviation of 0.08%, indicating consistent performance. Furthermore, varying the temperature τ from 0.1 to 8 led to minimal variations in the top-1 accuracy, with all values hovering in the 92.59% to 92.79% range. The standard deviations were also relatively low for all the experiments, suggesting that the model's performance was stable across different τ settings. This suggests that the knowledge distillation process is robust to changes in the temperature τ within the explored range for the ResNet-18 model on the CIFAR-10 dataset.

E. Comparison With SOTA
In order to provide a quantitative assessment of the efficacy of DG2PF, we conducted a comprehensive set of experiments on two widely used benchmark datasets, namely CIFAR-10 [61] and ImageNet [62]. We compared our proposed algorithm with various state-of-the-art techniques to demonstrate its effective performance in network pruning. Throughout our experiments, we set the number of pruning epochs s_e to 15, s_sim to 10%, the distillation factor α to 90%, and the temperature τ to 0.5. The results for the two datasets are shown in Tables VI and VII. The tables show the baseline top-1 accuracy (acc@1) for both the unpruned models and the pruned ones, alongside the difference between the two. In the tables, models marked with * are obtained from [28] by reimplementing the original methods, the superscript † indicates that the reported data are obtained from the reimplementation in [70], and bold values indicate the best result in a column. It is crucial to note a few disparities when comparing pruning methods. While we focused on keeping uniformity in our implementations, the baseline accuracy among models with the same architecture may differ. This variation stems from the different pretrained weights adopted by each study. As a significant number of these weights are inaccessible to the public, replicating the exact initializations is unfeasible. For this reason, our evaluation criteria do not involve directly comparing the best scores between models with the same architecture but possibly different weights. Instead, we give prominence to the relative accuracy difference between the pruned and unpruned versions of the same model, offering a more insightful measure of a method's efficacy.

1) CIFAR-10: We compared the VGG-16 [2], ResNet-18, and ResNet-50 [3] architectures for CIFAR-10 [61] classification and evaluated our DG2PF algorithm against One-Cycle Pruning [63], SNIP [64], Iterative Pruning [65], Gradual Pruning [27], and DPF [52]. The performance comparisons are presented in Table VI. The results of our experiments show that DG2PF outperformed all the benchmarked methods, achieving the highest top-1 accuracy on all the tested architectures at the same sparsity levels. Specifically, on VGG-16, our algorithm achieved an improvement of 0.23% top-1 accuracy over the baseline and 0.1% over [52]. ResNet-18 and ResNet-50 exceed the baseline by 0.31% and 0.89%, respectively. To the best of our knowledge, and according to a recent review [65], our work is the first to address the ResNet-50 architecture in this specific application area.
2) ImageNet: As part of our research, we tested several deep learning architectures for ImageNet [62] classification, including ResNet-18, ResNet-50 [3], and MobileNet V2 [68]. We evaluated the effectiveness of our DG2PF algorithm against state-of-the-art pruning techniques, such as One-Shot Pruning [37], Gradual Pruning [27], Cyclical Pruning [28], and SWD [69]. The performance comparison is shown in Table VII. We can see that DG2PF outperforms the competitors on all the benchmarked models, yielding an improvement of 0.32% top-1 accuracy on ResNet-18 and ResNet-50, and 1.19% on MobileNet V2, against the previous best scores of [28]. The results show that DG2PF performed well on this more extensive dataset, achieving better accuracy than existing methods.

V. CONCLUSION, LIMITATIONS, AND FUTURE WORKS
We have introduced DG2PF, a novel and comprehensive algorithm that gradually prunes pretrained NNs using magnitude-based unstructured pruning techniques and knowledge distillation. The method has been designed to minimize the performance loss due to compression. Based on a well-known pruning function, a specified proportion of weights from a pretrained NN is selectively removed to minimize memory and storage requirements. A novel simulated pruning strategy with the advantages of weight recovery and without the disadvantage of unstable convergence has also been presented. The combination of these techniques is used in the DGP phase of the algorithm. Then, the PF phase further supports the recovery of the performance lost due to pruning. The algorithm's effectiveness has been rigorously evaluated on publicly available benchmark datasets and models, demonstrating significant improvements in memory usage and computational efficiency while maintaining high accuracy. Consequently, this method provides a promising avenue for optimizing pruned pretrained NNs, with potential applications in various domains.
For future work, there are several areas to explore. One avenue is to investigate different pruning functions to determine their effectiveness in reducing memory and storage requirements while maintaining accuracy. The simulated pruning strategy could also be enhanced to achieve even better weight recovery and convergence properties. Additionally, exploring domain-specific applications and scaling up the algorithm to larger models would further validate its effectiveness. This study relies on the following assumption: for magnitude-based pruning methods, weights closer to zero have less impact on the final prediction than weights with larger values [27], [28], [45]. Despite the results achieved by this method and the related work, it is crucial to recognize the limitations of this assumption. For instance, research indicates that Transformer-based networks typically achieve a lower level of sparsity with this class of pruning algorithms [75], [76], [77]. While our method can indeed be adapted to different activation functions and network architectures, appropriate adjustments might be essential to accommodate the specific attributes of these networks in future work. Lastly, integrating the algorithm with other optimization techniques, such as quantization and neural architecture search, could yield even better results. Overall, the DG2PF algorithm presents a comprehensive solution for optimizing pruned pretrained NNs, and future research can further improve its performance and applicability in various domains.

Algorithm 1: Distilled Gradual Pruning
i ← 1
δ ← linearly sample s_e numbers in [0, s]
while i ≤ s_e or the score keeps improving do
    if i ≤ s_e then
        prune δ_i percent of the model
    end if
    for b ∈ D_t^(b) do
        if i ≤ s_e then
            apply simulated pruning to the weights (2)
        end if
        y ← ground-truth labels for the b-th batch
        ŷ ← model's predictions for the b-th batch
        ŷ^(t) ← teacher's predictions for the b-th batch
        L ← KD loss (9)
        Δθ ← gradients from L
        if i ≤ s_e then
            recover the weights of the simulated pruning
        end if
        update unpruned weights with Δθ using AdamW
    end for
    score ← top-1 validation accuracy (10) on D_v^(b)
    i ← i + 1
end while

Algorithm 2: Pruned Fine-Tuning
while the score keeps improving do
    for b ∈ D_t^(b) do
        y ← ground-truth labels for the b-th batch
        ŷ ← model's predictions for the b-th batch
        L ← CE loss (4)
        Δθ ← gradients from L
        update unpruned weights with Δθ using SGD
    end for
    score ← top-1 validation accuracy (10) on D_v^(b)
end while

Fig. 1. Histogram representation of the 90% of weights that would be pruned on an unpruned ResNet-50 model [3]. The abscissa depicts the values of the weights, while the ordinate depicts the frequency count of weights with the corresponding value. The vertical bars represent the left and right margins, respectively. The margins delimit the weights within the p-percentile of the total weights, where p is the arbitrary percentage of pruning, set to 0.9 in the plot.

Fig. 2. Illustration of pruned and unpruned parameters within a layer of a 70% sparse MobileNet V2 network. Each 3 × 3 matrix depicts a channel of the layer's weights. Within each filter, the pruned parameters are shaded in a darker tone, whereas the unpruned parameters are highlighted in yellow.

TABLE VI. COMPARISON WITH THE STATE-OF-THE-ART ON THE CIFAR-10 DATASET. THE COMPRESSION RATE IS SHOWN ALONGSIDE SPARSITY PERCENTAGES.

TABLE VII. COMPARISON WITH THE STATE-OF-THE-ART ON THE IMAGENET DATASET. THE COMPRESSION RATE IS SHOWN ALONGSIDE SPARSITY PERCENTAGES.