A Novel Dynamic Network Pruning via Smooth Initialization and its Potential Applications in Machine Learning Based Security Solutions

With the development of artificial intelligence technology, the demand for new digital security and privacy solutions is growing rapidly. Inspired by synaptic pruning in mammalian brains, we develop a network pruning method, called the dynamic network pruning (DNP) method, which reduces the number of free parameters required by convolutional neural networks. DNP integrates seamlessly with gradient descent training and can be performed at any point, even multiple times, during training. We show that pruning connections (filters) is more intrinsic than pruning neurons (channels), and we relate the significance of filters to the dispersion of their values. With our proposed weight initialization technique, called smooth initialization (SI), unimportant filters can be easily identified with simple thresholds. The DNP method needs neither pre-training to learn the connectivity of the network nor a long period of fine-tuning to restore performance. DNP can also lead to better data privacy in distributed environments owing to improved learning efficiency and convergence. Experiments show that our method outmatches several weight reduction methods in terms of reduction ratio and test accuracy on various models and datasets, and the generalization abilities of the pruned models are not damaged.


I. INTRODUCTION
In the past decades, neuroscientists have discovered that during the development of mammalian brains, mass neurogenesis is followed by regressive events called apoptosis and synaptic pruning [1]-[3]. Apoptosis [4] is the genetically programmed death of neurons that have failed in the competition with other neurons to connect with predetermined targets. Synaptic pruning [3] deletes synapses that are inappropriately functional, and it eventually leads to the loss of about half of the synapses in human brains. The pruning of synapses is largely determined by environmental influences and is considered to represent the actual process of learning [2]. Although the deep mechanisms of apoptosis and synaptic pruning are still being discussed [5] and more secrets about human brains are yet to be discovered, the concept of synaptic pruning is highly valuable for one of the most thriving machine learning algorithms, i.e., neural networks [6]. (The associate editor coordinating the review of this manuscript and approving it for publication was Xiaochun Cheng.)
As neural network models get better at addressing dedicated tasks, their sizes increase dramatically and thus require much more computational resources. Therefore, the reduction of necessary parameters in deep models has attracted researchers' attention. In convolutional neural networks (CNNs), the connections correspond to the convolution kernels and the neurons correspond to the feature maps or channels [7]-[9]. Both apoptosis and synaptic pruning are feasible in CNNs. Dropping unimportant neurons (along with all their connections to other neurons) simulates apoptosis and is often called channel pruning. To simulate synaptic pruning, certain connections are removed while the neurons are kept intact; this process is often referred to as filter pruning. Sometimes the accumulation of synaptic pruning eventually leads to apoptosis, because when all input/output connections of a neuron are pruned, the neuron no longer participates in the forward/backward computation. We will show later that the importance of a neuron is not as obvious as that of a connection. Therefore, the critical part of network pruning is to decide which connections are unimportant, so that their removal does not lead to dramatic degradation of performance. In this paper, we propose an on-the-fly network pruning method called dynamic network pruning (DNP) that reduces the number of parameters in CNNs by removing unimportant connections during a one-time training (as shown in Figure 1). The pruning process can be performed at any point during training, and even multiple times. Neither a long period of pre-training nor retraining (fine-tuning) is needed.
We also propose a weight initialization method called smooth initialization (SI) which is extremely helpful in distinguishing the important connections from the unimportant ones.
Experimental results show that SI leads to higher accuracy and allows DNP to outperform several state-of-the-art parameter reduction methods. We will also show that the generalization abilities of pruned models are not damaged by DNP. The method can also lead to better data privacy in distributed environments owing to improved learning efficiency and convergence, which has great potential in the field of healthcare [10]-[12].
The paper is organized as follows. Section 2 gives a brief survey of parameter reduction methods in neural networks. Section 3 describes the DNP method in detail. Section 4 presents multiple comparison experiments with other methods. Further discussions regarding the method and the results are given in Section 5, and the conclusion is drawn in Section 6.

II. RELATED WORK
A great deal of effort has been made, from different perspectives, to reduce the parameters in CNNs while inflicting minimal impact on performance. Based on the pruning space, current methods can be roughly categorized into three areas.

A. CHANNEL PRUNING
These methods directly prune unimportant neurons from the network. For example, Hu et al. [13] considered a neuron redundant if most of its outputs were zeros. He et al. [14] designed a pruning method for pre-trained models and adopted different pruning strategies for branching architectures. Molchanov et al. [15] proposed to iteratively prune neurons of pretrained models based on a Taylor expansion criterion. Yu et al. [16] proposed to prune neurons based on the reconstruction error produced by the flatten layer. Hu et al. [17] approximated the Hessian matrix at each layer to determine the sensitivities of neurons. He et al. [18] proposed a learning strategy based on reinforcement learning to select unimportant neurons. Luo and Wu [19] designed an independent pruning layer that could be integrated into the training process. Game-theoretic random channel pruning for better defense against adversarial examples has been studied as well [20].

B. FILTER PRUNING
These methods aim to prune redundant connections (entire convolution kernels) in CNNs. In 2015, Han et al. [21] proposed a network pruning method that relied on a pre-training stage to determine which connections were unimportant and required a long period of retraining. Li et al. [22] proposed to prune those kernels whose element sums were relatively small. Luo et al. [23] considered that the importance of a kernel should be determined by the output of the next layer, rather than its own layer. He et al. [24] proposed to allow the pruned kernels to continue to be updated and used the same pruning ratio in all layers.

C. FINE-GRAINED PRUNING
We include two types of techniques in this category because they are often combined to increase the training speed in distributed systems: (1) weight/gradient quantization, which converts parameters or their gradients to lower bit widths (e.g., from floats to integers) [25]-[28], and (2) weight/gradient sparsification, which approximates the parameters or gradients with sparse tensors [29]. Xu et al. [30] proposed a weight quantization method that took layer-depth information into consideration. Lin et al. [31] proposed Deep Gradient Compression, which takes advantage of multiple techniques including gradient sparsification, local gradient clipping and warm-up training, and achieved an impressive reduction ratio on popular deep models.

D. OTHER STRATEGIES
There have been many tensor factorization-based methods that provide profound mathematical support for network pruning. For example, Jaderberg et al. [32] proposed to approximate the kernels by a linear combination of a smaller basis set of kernels. The approximation was based on an iterative scheme and designed for pre-trained models. Tai et al. [33] proposed to directly solve for the global optimum of a low-rank tensor decomposition. Yet low-rank constrained CNNs had to be trained from scratch, which made the approach hard to integrate directly with common training procedures. Network pruning during the evaluation stage has also been studied: Lin et al. [34] proposed to prune at runtime based on a Markov decision process, while the models were in fact kept intact. Pruning at higher levels than neurons has also been studied. For example, Liu and Deng [36] proposed to prune modules consisting of one or several layers and solved the network pruning problem by architecture search via reinforcement learning.
Li et al. [37] proposed to merge non-tensor and tensor layers together to obtain compact mobile-friendly networks. Other approaches such as teacher-student networks and hardware accelerators have been discussed by Chen et al. [38].
In this work, we aim to develop a workflow that is capable of removing unimportant connections during the normal training of CNNs. The approach is pruning-based and designed for filter-level pruning. We use the standard deviations independently calculated within each kernel as the criterion to determine the importance of weights. Therefore, the approach has no effect on the architecture or training process and can be performed at any point, or even multiple times, during training.

III. METHOD
Our approach consists of two steps: (1) smooth initialization and (2) dispersion-based pruning during training. Smooth initialization of the weights is crucial for the successful pruning. The pruning process can be performed at any point during the gradient descent training. The overall procedure of DNP is described in Figure 2.
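The two-step procedure can be sketched in NumPy as follows. This is a minimal illustration only: `train_epoch` stands in for one epoch of gradient descent, and the fixed threshold and masking-based pruning are simplifying assumptions, not our actual implementation.

```python
import numpy as np

def prune_low_std(weights, threshold):
    # DNP criterion: std computed independently within each (kh, kw) kernel.
    stds = weights.std(axis=(2, 3))
    keep = stds >= threshold
    return weights * keep[:, :, None, None], keep

def dnp_training_loop(weights, train_epoch, total_epochs, prune_at, threshold):
    """Ordinary training with pruning interleaved at the chosen epochs;
    pruned kernels stay masked for the rest of training."""
    mask = np.ones(weights.shape[:2], dtype=bool)
    for epoch in range(1, total_epochs + 1):
        # One epoch of (stand-in) training, re-applying the pruning mask.
        weights = train_epoch(weights) * mask[:, :, None, None]
        if epoch in prune_at:
            weights, keep = prune_low_std(weights, threshold)
            mask &= keep
    return weights, mask
```

Because pruning is just a masking step between epochs, it can be invoked once or several times without altering the training procedure itself.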

A. WEIGHT INITIALIZATION
We propose a simple yet practical weight initialization technique called smooth initialization (SI). As indicated by the name, after the weights have been initialized from a selected distribution, we apply smoothing to the initial weights before training begins. As a result, the weights converge into two subsets after training: (1) a small portion of the weights have highly varied values (large dispersion), and (2) most of the weights tend to be flat (small dispersion). As shown in Figure 3, where a 3 × 3 Gaussian blur is used for SI, comparing the 2nd and 4th groups shows that the number of flat kernels increases (red boxes) and larger dispersion occurs in some kernels (blue boxes) when weights are initialized with SI. To avoid ambiguity in terminology, henceforth kernel refers to a convolution kernel in the CNN and filter refers to the smoothing filter used in SI.
The two main factors that affect the final distribution of weights are (1) the distribution from which the original weights are generated and (2) the smoothing filter applied to the weights. To demonstrate the influence of different distributions and smoothing filters on the model's performance, we present a set of experiments in which we train a 3-layer CNN on the SVHN [39] dataset. The model has three convolution layers and a flatten layer as hidden layers. We use a convolution layer instead of a fully-connected layer as the flatten layer, in which the kernel sizes equal the sizes of the input feature maps (like the C5 layer in LeNet-5 [40]). Hence, we can use the same strategies to initialize and prune the flatten layer as for the previous layers. The detailed network architecture and data preprocessing are described in Appendix A. As shown in Figure 4a, the Xavier [41] distribution outperforms the He et al. [42] distribution when no smoothing is performed. However, the He distribution almost always performs better than the Xavier distribution when smoothed. Among all smoothing methods, the 3 × 3 Gaussian filter based on the He distribution achieves the best performance.
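SI amounts to a per-kernel blur applied before training. The following NumPy sketch illustrates the idea; the experiments actually use OpenCV's Gaussian filter, so the hand-rolled convolution, the edge-padding mode and the helper names here are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel2d(size=3, sigma=0.8):
    # Normalized 2-D Gaussian filter: all elements sum to 1, as in Section IV.
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth_init(shape, sigma=0.8, rng=None):
    """He-normal initialization followed by per-kernel Gaussian smoothing.

    `shape` is (out_channels, in_channels, kh, kw); fan_in = in_channels*kh*kw.
    """
    rng = np.random.default_rng(rng)
    out_c, in_c, kh, kw = shape
    w = rng.normal(0.0, np.sqrt(2.0 / (in_c * kh * kw)), size=shape)
    g = gaussian_kernel2d(3, sigma)
    smoothed = np.empty_like(w)
    for o in range(out_c):
        for i in range(in_c):
            p = np.pad(w[o, i], 1, mode="edge")  # same-size output
            for r in range(kh):
                for c in range(kw):
                    smoothed[o, i, r, c] = (p[r:r + 3, c:c + 3] * g).sum()
    return smoothed
```

Smoothing a white-noise initialization necessarily lowers the per-kernel dispersion, which is what lets training later separate the kernels into flat and dispersed subsets.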

B. NETWORK PRUNING
It is natural to consider that the dispersed kernels may play more important roles in classification, while the flat ones scarcely contribute to decision-making because it is hard for them to extract useful features from their inputs.

VOLUME 7, 2019
We use the standard deviations independently calculated within each kernel as the measurement of such dispersion, and use it to decide which kernels are unimportant. Specifically, after a certain number of training epochs, we prune all the kernels with small standard deviations and continue training on the remaining kernels. The pruning may be performed multiple times, and the selection criteria in different layers and at different pruning stages may differ.
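The std-based selection can be sketched as follows (an illustrative NumPy version; real pruning removes the kernels from the network, whereas this sketch merely masks them to zero):

```python
import numpy as np

def prune_by_std_threshold(weights, threshold):
    """Mask out every kernel whose internal std falls below `threshold`.

    `weights` has shape (out_channels, in_channels, kh, kw); the std is
    computed independently within each (kh, kw) kernel. Returns the masked
    weights and a boolean keep-mask of shape (out_channels, in_channels).
    """
    stds = weights.std(axis=(2, 3))  # one std per kernel
    keep = stds >= threshold
    return weights * keep[:, :, None, None], keep
```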

FIGURE 5. Relations between the standard deviation of a kernel (σ), connection sensitivity (α) and neuron sensitivity (β) in the 3-layer CNN on SVHN. In (a) and (c), the horizontal axes order the kernels by increasing α. In (b) and (d), kernels are ordered by increasing β, and the curves of β are in fact step functions in which the "width" of each step corresponds exactly to its 64 input connections (whose α values are marked by red dots). Better viewed in color.

1) NEURON SENSITIVITY VS CONNECTION SENSITIVITY
Here, we let σ denote the standard deviation calculated within a single kernel, α denote connection (kernel) sensitivity and β denote neuron sensitivity. α is defined as the accuracy drop inflicted by pruning a connection, and β as the accuracy drop inflicted by pruning a neuron. When calculating β for the jth neuron, we simply remove all of its input connections at once, i.e., β_j is the accuracy drop caused by jointly pruning the connections {(i, j)}_{i=1,...,M}, where i denotes the ith input neuron, M denotes the total number of input neurons and α_(i,j) denotes the sensitivity of the connection between input neuron i and output neuron j. A larger α or β value indicates higher significance to decision-making. As pointed out by Morcos et al. [43], the diversity among the importance of neurons is not as evident as traditionally believed. Their experiments suggested that the ablation of highly selective (high-sensitivity) neurons was no more impactful than that of non-selective (low-sensitivity) ones; put another way, confusing neurons worked almost as well in the absence of interpretable neurons. However, the reason behind this observation was not given. In fact, when comparing α and β in the 3-layer CNN previously trained on SVHN (Figure 5b and Figure 5d, in which input connections are sorted in increasing order of their output neurons' sensitivities β), we find that important kernels (higher red dots) are almost evenly distributed across all neurons. When examining the relation between σ and α (Figure 5a and Figure 5c, in which connections are sorted in increasing order of α), we find that important kernels (higher red curves) tend to have large σ (higher blue dots). There are also a few kernels whose pruning leads to accuracy increases (negative α values). Further correlation analysis (see Appendix B) indicates that kernels with large standard deviations are highly likely to have more influence on decision-making and that the importance of a kernel is more intrinsic than the importance of a neuron.
Therefore, we will use the standard deviation of each kernel as the selection criterion for pruning.
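For reference, the ablation-based definitions of α and β can be expressed as a short sketch. The `evaluate` callable, standing in for a full test-set evaluation of the network, is an assumption of this illustration.

```python
import numpy as np

def connection_sensitivity(weights, evaluate, base_acc, i, j):
    """alpha_(i,j): accuracy drop from ablating the single kernel between
    input neuron i and output neuron j. A negative value means that pruning
    the kernel actually improves accuracy."""
    w = weights.copy()
    w[j, i] = 0.0
    return base_acc - evaluate(w)

def neuron_sensitivity(weights, evaluate, base_acc, j):
    """beta_j: accuracy drop from ablating all M input kernels of
    neuron j at once."""
    w = weights.copy()
    w[j, :] = 0.0
    return base_acc - evaluate(w)
```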

2) PRUNING STRATEGY
Two pruning strategies are applied in DNP in different scenarios:
• Std threshold. By defining standard deviation thresholds, we prune all the kernels whose standard deviations are smaller than the thresholds.
• Target percentage. By defining target percentages, we first sort all the kernels according to their standard deviations and then remove those with the smallest standard deviations to meet the pruning ratios defined by the target percentages.
Different criteria are allowed in different layers and at different pruning stages. The main difference between the two criteria is that the std threshold has more delicate control over which weights are pruned, while the reduction ratio may vary across runs (depending on the actual values of the trained weights). However, if a high reduction ratio is more of a concern than accuracy, the target percentage becomes handier, and the reduction ratio stays the same across runs.
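The target-percentage strategy can be sketched per layer as follows (illustrative NumPy code; as before, masking stands in for actual removal):

```python
import numpy as np

def prune_by_target_percentage(weights, target_pct):
    """Mask the `target_pct` fraction of kernels with the smallest stds
    in this layer. Returns masked weights and the boolean keep-mask."""
    stds = weights.std(axis=(2, 3))
    n_prune = int(round(target_pct * stds.size))
    keep = np.ones(stds.size, dtype=bool)
    keep[np.argsort(stds, axis=None)[:n_prune]] = False  # smallest stds go
    keep = keep.reshape(stds.shape)
    return weights * keep[:, :, None, None], keep
```

Unlike the std threshold, this guarantees a fixed reduction ratio regardless of the trained weight values.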

3) NEURON DROPPING
After pruning the unimportant connections, it becomes reasonable to further drop the neurons that have no input or output connections. Specifically, for any neuron in the hidden layers, if the number of its input or output connections is zero, the neuron is dropped permanently from the network. These neurons are often referred to as dead neurons. Dropping them further simplifies the connectivity of the network and has no effect on performance. There exists a rare situation, however, where a stranded neuron would be left out, which is discussed in detail in Appendix C. Moreover, as pointed out by Han et al. [21], continued training after pruning also guarantees that most of the remaining connections of these dead neurons get automatically disconnected through back-propagation. However, our experiments show that sometimes a few dead neurons may "survive" if the continued training is relatively short. To reduce computational costs, neuron dropping is not performed until training is finished.
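Detecting dead neurons amounts to checking the keep-masks of the two adjacent layers; a minimal sketch (the mask shapes and names are illustrative assumptions):

```python
import numpy as np

def dead_channels(keep_in, keep_out):
    """Find channels with no surviving input or output connections.

    keep_in : bool, shape (channels, in_channels)   -- kernels producing each channel
    keep_out: bool, shape (out_channels, channels)  -- kernels consuming each channel
    """
    no_input = ~keep_in.any(axis=1)    # every input kernel of the channel pruned
    no_output = ~keep_out.any(axis=0)  # every output kernel of the channel pruned
    return no_input | no_output
```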

IV. EXPERIMENTS
We evaluate our DNP method on different neural network models (e.g., LeNet [40], ResNets [42], [44] and VGGnets [45]) that are trained on different datasets.

TABLE 1. Comparison of test accuracy and weight reduction ratio with other methods on different models. The abbreviation "c/o" denotes Cutout. The reduction ratio of the low-rank regularization method is approximated based on the K1, K2, K3 values provided by [33]. Top-1 accuracy is used for ImageNet. Note that the letters behind "DNP" are unrelated and only represent different pruning settings. A reduction ratio of T times (TX) means that 1 - 1/T of the parameters are removed.
We use PyTorch (http://pytorch.org) for the software implementation. All models are trained on a single NVIDIA GTX 1080Ti video card. We use the fast implementations of the Gaussian and median filters provided by OpenCV (https://opencv.org) for SI. The σ for the Gaussian filter in both axes is set to 0.8 (for 3 × 3) and 1.1 (for 5 × 5), and the filter values are normalized so that the sum of all elements is 1.

A. MNIST
We employ DNP on LeNet-5 trained on the MNIST dataset. The original LeNet-5 model achieved 99.20% test accuracy on MNIST with data distortion. We use the same network architecture as Han et al. [21], except that we replace the first fully-connected layer (the flatten layer) with a convolution layer, leaving the number of parameters unchanged. SI is performed based on a truncated He normal distribution, in which values with magnitudes more than twice the standard deviation are dropped and re-generated, and a 3 × 3 Gaussian blur is used. We use Momentum [48] as the gradient descent method with a momentum term of 0.9. The learning rate is set to 0.01. Data distortion is not used in our experiments. DNP is performed during a 30-epoch training of LeNet-5, in which two pruning stages are carried out after the 10th and 20th epochs, respectively (DNP-A). In the 1st pruning stage, 47.7% of the parameters are pruned and less than 1% accuracy drop is inflicted (test accuracy before pruning is 99.40% and after, 98.55%). After another 10 epochs of training, the test accuracy of the pruned LeNet-5 is restored to 99.40%. We use higher thresholds to prune 95.0% of the parameters in the 2nd stage of DNP-A. The final compression ratio is 20X, which is higher than the 12X of the pruning method by Han et al. [21]. Accuracy drops to 79.56% immediately after the 2nd pruning and arrives at 99.29% when training is finished. The pruning ratio in each layer is shown in Table 2. Weights in the output layer are not pruned, considering that they only take up around 1% of all parameters and any modification to them leads to dramatic impacts on performance.
If we omit the 1st pruning stage and perform only one pruning stage after the 20th epoch with the exact same std thresholds as in the 2nd stage, more parameters are removed, with a reduction ratio of 27X (DNP-B). The test accuracy drops to 64.24% after pruning and is restored to 99.28% after training. We manage to represent 60K digit samples with only 15.8K parameters. If we prune the same amount of parameters as in the 2nd stage by setting the target percentages (DNP-C), the accuracy drops to 72.80% after pruning and reaches 99.31% after training. Although a one-shot pruning can inflict more damage on the network's performance than multi-stage pruning, the final accuracy after the same number of training iterations is hardly affected.

TABLE 2. Pruning results compared with [21] for LeNet-5 on MNIST data. Threshold denotes the std thresholds used for pruning, and Pruned denotes the ratio of pruned parameters. A baseline accuracy of 99.20% [40] is used.

B. CIFAR-10
We evaluate 3 different models on the Cifar-10 dataset. First, we use the same 3-layer CNN architecture and training method as in [33], except that we replace the first fully-connected layer with a convolution layer. The model is trained on Cifar-10 [49] with data augmentation. If we perform a 1-stage pruning during a 200-epoch training, a reduction ratio (3X) similar to [33] is achieved with a higher test accuracy of 87.41%. If we increase the training time to 400 epochs and perform a 3-stage pruning, where each stage is performed after 100 epochs of training, a higher reduction ratio (8X) is achieved and the test accuracy is increased by 1%. With DNP, the 3-layer CNN model achieves 88.41% test accuracy on Cifar-10 with only 0.4M parameters.
We apply DNP to ResNet-56 and train it on Cifar-10. We replace the average pooling layer with a depthwise separable convolution layer [50], which introduces slightly more than 4K additional parameters to the network. The weights are initialized based on a truncated He normal distribution before applying a 3 × 3 Gaussian smoothing. Skip connections and downsampling layers are not directly pruned, since their kernel sizes are all 1 × 1. We use Momentum as the training method with the momentum term set to 0.9. The ResNet-56 is trained for a total of 320 epochs. The initial learning rate is set to 0.1 and scaled by a factor of 0.1 every 80 epochs during the first 240 epochs. A single pruning stage is performed after the 240th epoch, followed by another 80 epochs of training. The learning rate is rescaled to 0.01 after the 240th epoch and changed to 0.001 at the end of the 280th epoch. This learning rate schedule resembles the cosine schedule with warm restarts designed by Loshchilov and Hutter [51], which we find helpful in restoring the performance in this case. Three different pruning settings are applied to ResNet-56. The performances and reduction ratios for different target percentages are listed in Table 3. The accuracy of the base model, 93.60%, is higher than the baseline accuracy (93.04%), which is attributable to SI. The target percentages for all layers in the residual blocks responsible for doubling the channel depths are set to 25%, and the parameters in the first and last convolution layers of the network are not pruned, because we find these layers more sensitive to pruning. Pruning ratios in the other layers are varied simultaneously. As expected, the test accuracy tends to decrease as more parameters are pruned. We also find that Dropout [52] is not helpful here in increasing the generalization ability of the network.
We believe this is because, as more channels are pruned, the sub-models selected by Dropout become even weaker. Cutout [53], on the other hand, is extremely helpful in increasing the generalization ability on Cifar-10. The ResNet-56 model achieves 93.47% test accuracy when nearly half (47.4%) of the parameters are pruned. When trained with Cutout, the accuracy increases to 93.53% with 70.0% of the parameters pruned, a reduction ratio of 3.3X.
We also apply DNP to the VGG-16 model. The results on Cifar-10 are shown in Table 4. We use the same network architecture as [22] and the same training schedule as for ResNet-56 above, except that the learning rate is always further scaled by 0.1. Three different pruning settings are applied to VGG-16. Parameters in the first layer are always pruned by half, and the last 6 layers have more redundant weights than the others. The specific target percentages for each layer in VGG-16 are presented in Appendix D. Even with only one pass of pruning, the performance degradation is slightly more than 1% at worst. The final test accuracy decreases as more parameters are pruned. When trained with Cutout, the network's accuracy reaches 93.77% with a reduction ratio of 3.7X (denoted by DNP+c/o-A). When even more parameters are pruned (5.9X, denoted by DNP+c/o-B), the accuracy (93.32%) is still higher than that of the unpruned model. Detailed analyses of the impact of pruning on ResNet-56 and VGG-16 are given in Section 5.

C. IMAGENET
We apply our DNP method to the ResNet-34 model trained on the ILSVRC-2012 training set. Unlike in the Cifar-10 experiments, the average pooling layer is not replaced. Momentum is used, with the momentum term set to 0.9. The network is trained for a total of 120 epochs. The initial learning rate is 0.1 and is scaled by a factor of 0.1 every 30 epochs, remaining unchanged from the 61st epoch onward. No warm restarts are used, because we find their effects limited. Two different pruning settings are applied to ResNet-34. A single pruning stage is performed after the 90th epoch. As pointed out by Li et al. [22], the layers in each residual block respond differently to pruning. We find that the residual blocks that map the channel dimensions to twice the depth, as well as the first layer of the network, are the most sensitive to pruning. Therefore, the target percentages for these layers are set to zero in DNP-A (in Table 1), and to no more than 25% in DNP-B. The second layer in each residual block is more sensitive than the first. Therefore, in DNP-A, the target percentages for the second layers in the remaining residual blocks are set to 20% and for the first layers, 50%. In DNP-B, both layers are pruned by 50%. The downsampling layers, skip connections and the output layer are not directly pruned. Detailed target percentages for ResNet-34 are listed in Appendix E. DNP-A achieves 73.04% top-1 accuracy and 91.31% top-5 accuracy on the ILSVRC-2012 validation set, which is higher than Filter Pruning [22], NISP [16] and SFP [24]. 27.2% of the parameters are pruned in DNP-A, a reduction ratio of 1.4X. DNP-B achieves 71.91% top-1 accuracy and 90.80% top-5 accuracy with a reduction ratio of 1.8X (44.5% of the parameters are pruned, which is among the highest in Table 1). The impacts of DNP on ImageNet classification are discussed in Section 5.

D. TEXT CLASSIFICATION
We use the CNN model proposed by Kim [54] to perform text classification on the Movie Review Data [55], which contains 5331 positive and 5331 negative processed sentences. The feature extraction layer in the model is the concatenation of three convolution layers with kernel sizes of 3 × 3, 4 × 4 and 5 × 5. All convolution kernels are initialized from a truncated normal distribution with a standard deviation of 0.1. Adam [56] is used as the training method, with a learning rate of 0.001 and moment terms of 0.9 and 0.999. The model is trained for 20 epochs and achieves a test accuracy of 74.11%, which is higher than the 73.36% reported by Kim [54]. Massive redundancy in the parameters of the trained model is observed. We find that even when all of the 3 × 3 and 4 × 4 kernels are removed and most of the 5 × 5 kernels are pruned, the overall accuracy drops by only 1.12%, and is fully restored after a very short period of training. This indicates that the classification relies mainly on the 5 × 5 kernels. Finally, we prune a total of 91% of the parameters, a reduction ratio of 11X.

V. DISCUSSION
Luo and Wu [19] designed an independent pruning layer that could be integrated into the training process, yet additional parameters needed to be trained to label the importance of each neuron. Liu et al. [57] also proposed a pruning scheme based on the scaling factors in batch normalization [58] layers that had no effect on training. However, that method was designed for channel pruning and thus did not address the redundancy in connections. Lin et al. [31] proposed Deep Gradient Compression to reduce the bandwidth exchanged in distributed systems and managed to reduce the bandwidth consumption by 277X for ResNet-50. Despite having a different goal from our approach, which aims to reduce the number of free parameters in the networks themselves, their method provides an exciting inspiration: it is completely feasible to first prune the unimportant kernels and then perform gradient sparsification and binary coding on the remaining parameters to dramatically improve the training speed of pruned models. We look forward to exploring the potential of such an approach in the future.

A. IMPORTANCE OF SMOOTH INITIALIZATION
We have previously shown that smooth initialization is extremely helpful in identifying the unimportant kernels during training (Figure 3). We also find that in most cases it improves the performance of CNNs: for example, unpruned LeNet-5 with SI outperforms the baseline by 0.20% and ResNet-56 by 0.56%. When comparing dispersion-based pruning with random pruning, as shown in Appendix F, we find that SI not only helps the training converge faster during the early iterations, but also improves the accuracies of pruned models.

B. OTHER PRUNING CRITERIA
As pointed out by Luo et al. [23], two common criteria for weight pruning are (1) the weight sum [22] and (2) APoZ (average percentage of zeros) [13]. Both are designed to select weights that have many close-to-zero elements. The weight sum is similar to our std threshold in that it examines each kernel independently, while APoZ, like our target percentage, selects weights based on their relative magnitudes within the entire layer. Other norms have been used as pruning criteria as well [24]. Meanwhile, Ye et al. [59] suggested that criteria other than weight norms could work better and proposed an ISTA-based method that took advantage of the scaling factor in batch normalization. Essentially, however, their method selects neurons with nearly constant outputs during pruning, which resembles our criterion. We use standard deviations independently calculated within kernels as the selection criterion, so that weights to be pruned do not necessarily need to be very close to zero. In fact, weights with constant high values can also be pruned by DNP, as explained by the comparison between the std threshold and the weight sum in Appendix G. This is a reasonable strategy, since a flat non-zero kernel is incapable of extracting useful information from its inputs.
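The difference between the two criteria can be seen on a toy example (the kernel values below are chosen purely for illustration): a flat non-zero kernel scores high under the weight sum but exactly zero under our std criterion.

```python
import numpy as np

# A flat non-zero kernel versus a dispersed kernel of comparable magnitude.
flat = np.full((3, 3), 0.5)
varied = np.array([[ 0.9, -0.8,  0.1],
                   [-0.2,  0.7, -0.9],
                   [ 0.3, -0.5,  0.6]])

def weight_sum(k):
    # Weight-sum criterion [22]: sum of absolute values.
    return np.abs(k).sum()

def std_criterion(k):
    # DNP criterion: dispersion within the kernel.
    return k.std()
```

The weight sum rates the two kernels as similar (4.5 vs. 5.0), so the flat kernel would survive a sum-based pruning, while the std criterion assigns it a score of exactly 0 and removes it.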

C. IMPACT ON THE GENERALIZATION
During our experiments, we observe the following intriguing phenomena. (1) Models trained with Cutout are more prone to being compromised by pruning but reach higher accuracies after continued training. For example, the test accuracy of ResNet-56-A drops by 12.79% after pruning, while that of ResNet-56+c/o-A degrades by as much as 65.53%. However, ResNet-56-A is easily outmatched by the latter after a few epochs of training. More statistics on ResNets and VGGnets are listed in Appendix D.
(2) When we prune 44.5% of the parameters in ResNet-34 trained on ImageNet and continue training for only 1 additional epoch, we find that the model's accuracies on the evaluation set surpass those on the training set by roughly 3% (69.66%/89.40% on evaluation vs. 66.68%/86.24% on training). These observations suggest that pruning low-dispersion weights may damage the memory of details, while the abstract high-level features learned in the latent space are preserved by the highly dispersed kernels. Therefore, the generalization ability of the pruned model is not compromised by DNP.

VI. CONCLUSION
We propose the dynamic network pruning (DNP) method to reduce redundant weights in convolutional neural networks. The pruning does not affect the training process and can be performed at any point during training. With our proposed smooth initialization technique, unimportant kernels can easily be isolated based on a simple criterion, i.e. the standard deviation of each kernel, which allows pruning to be performed multiple times. Experimental results show that DNP outmatches several weight pruning/reduction methods in reduction ratio and test accuracy on various models and datasets.

A. DETAILS ON SVHN EXPERIMENTS
Cropped digits are used in our experiments: character-level 32 × 32 RGB images. 531131 additional samples are also used as training data. The images are converted to 1-channel grayscale before being fed into the network. There are 64, 96 and 128 neurons in the three convolution layers, respectively, and 2048 neurons in the flatten layer, which means that there are 6144 kernels in the 2nd layer and 12288 kernels in the 3rd layer. All convolution kernels are 5 × 5.
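The quoted kernel counts follow directly from fan-in × fan-out; a quick arithmetic check using the channel sizes given above:

```python
# Each convolution layer holds in_channels * out_channels 2-D kernels (5 x 5).
layers = [(1, 64), (64, 96), (96, 128)]   # (in_channels, out_channels)
kernel_counts = [c_in * c_out for c_in, c_out in layers]
# -> [64, 6144, 12288]: 6144 kernels in the 2nd layer, 12288 in the 3rd
```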
B. RELATIONS BETWEEN σ, α AND β
If we replot all four subfigures in Figure 5 only in α − σ and β − α coordinates, we obtain Figure 6. In Figure 6a, the Pearson correlation coefficient (PCC for short) between σ and α is 0.792 and the P-value is 0.000 (there are 6144 sample points). In Figure 6c, the PCC between σ and α is 0.772 and the P-value is 0.000 (there are 12288 sample points). Therefore, we are confident that there is a positive linear correlation between σ and α (the PCC is close to 1 and P < 0.05). In Figure 6b, the PCC between α and β is −0.0264 and the P-value is 0.0385, while in Figure 6d, the PCC between α and β is 0.0400 and the P-value is 9.18 × 10⁻⁶. Therefore, we are confident that there is no linear relationship between α and β (although P < 0.05, the PCC is close to 0, so any correlation is negligible).

FIGURE 6. Relations between σ and α, and between α and β in the 3-layer CNN on SVHN, in which σ is the standard deviation of each kernel, α is the kernel sensitivity and β is the neuron sensitivity.
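The PCC values reported here come from the trained kernels; the computation itself is a one-liner per pair. A sketch on synthetic stand-in data, where the coupling and noise levels are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 6144                                        # sample points, as in Figure 6a/6b
sigma = rng.uniform(0.0, 1.0, n)                # per-kernel standard deviation
alpha = 2.0 * sigma + rng.normal(0.0, 0.2, n)   # kernel sensitivity, coupled to sigma
beta = rng.normal(0.0, 1.0, n)                  # neuron sensitivity, unrelated to alpha

pcc_sigma_alpha = np.corrcoef(sigma, alpha)[0, 1]   # close to 1: linear correlation
pcc_alpha_beta = np.corrcoef(alpha, beta)[0, 1]     # close to 0: no linear relation
```

With thousands of sample points even a tiny PCC can be statistically significant (P < 0.05), which is why the magnitude of the coefficient, not the P-value alone, decides whether a linear relation exists.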

C. STRANDED NEURONS IN NEURON DROPPING
As shown in Figure 7, after connection pruning, only neuron B has no output connections and thus qualifies for neuron dropping, while neuron A survives the neuron dropping. After dropping neuron B, however, neuron A loses its only output connection and becomes another dead neuron. We refer to neuron A as a stranded neuron, although both neuron A and neuron B contribute nothing to the network before/after neuron dropping. To address this situation, we simply perform neuron dropping k times on a k-layer network.
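The k-pass rule can be illustrated on a toy connectivity list (the graph encoding is ours, not the paper's implementation): each pass deletes connections into neurons that have no surviving output, and k passes suffice because each pass can strand neurons at most one layer further upstream.

```python
def drop_dead_neurons(conn):
    """conn[i] is the set of surviving (src, dst) connections between
    layer i and layer i+1 of a k-layer network (len(conn) == k - 1).
    Each pass removes connections feeding neurons with no output left;
    running k passes also eliminates stranded neurons."""
    k = len(conn) + 1
    for _ in range(k):
        for i in range(len(conn) - 1):
            has_output = {src for src, _ in conn[i + 1]}
            conn[i] = {(s, d) for s, d in conn[i] if d in has_output}
    return conn

# X -> A -> B, with all of B's output connections already pruned away:
conn = [{("X", "A")}, {("A", "B")}, set()]
drop_dead_neurons(conn)
# pass 1 drops B (no outputs), which strands A; pass 2 then drops A
```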

D. DETAILS ON Cifar-10 EXPERIMENTS
Accuracy changes caused by DNP in ResNet-56 and VGG-16 on Cifar-10 are listed in Table 5. Larger degradation in performance occurs when the pruning ratio is higher, and networks trained with Cutout are more sensitive to pruning. However, based on Table 3 and Table 4 in the paper, models with Cutout have higher test accuracies when training is finished. This indicates that the memory of details in the data might be damaged, but the latent knowledge learned from the data is still preserved by high-dispersion kernels.
Target percentages in each layer for different pruning settings of the VGG-16 model in Section 4.2 are listed in Table 6. Parameters in the first layer (conv1), the flatten layer and the output layer are not pruned.

E. DETAILS ON ImageNet EXPERIMENTS
Target percentages in each layer for different pruning settings of the ResNet-34 model in Section 4.3 are listed in Table 7. The blocks represent the residual blocks in ResNet-34; the 1st column under each block represents the 1st layer in the residual block, and the 2nd column represents the 2nd layer. Parameters in the first layer, the average pooling layer and the output layer are not pruned.

F. IMPORTANCE OF SMOOTH INITIALIZATION
Training loss in the first 500 iterations of ResNet-56 on Cifar-10 is shown in Figure 8. We can see that the loss curves with SI are generally lower in the early training stage, with or without Cutout. In some literature, random pruning is studied, which does not take the importance of the pruned parameters into account. A randomly pruned model can be considered a random sub-model derived by Dropout. Nevertheless, it is a valid baseline for any well-founded pruning method. By comparing with random pruning, we find that DNP performs better when pruning ratios are higher, and that SI improves the test accuracies of pruned models. Again, we use ResNet-56 on Cifar-10 for comparison, and the results are shown in Figure 9.
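A random-pruning baseline simply draws a value-agnostic mask at the target ratio; a minimal sketch, where the helper name and mask layout are our assumptions:

```python
import numpy as np

def random_kernel_mask(shape, prune_ratio, rng):
    """Drop each 2-D kernel of an (out, in, kh, kw) weight tensor with
    probability prune_ratio, ignoring the kernel's actual values."""
    out_c, in_c = shape[:2]
    keep = rng.random((out_c, in_c)) >= prune_ratio
    return keep[:, :, None, None].astype(np.float32)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32, 3, 3)).astype(np.float32)
w_pruned = w * random_kernel_mask(w.shape, 0.5, rng)  # ~50% of kernels zeroed
```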

G. STD THRESHOLD VS WEIGHT SUM
Standard deviations and weight sums of the pruned parameters in ResNet-34 trained on ImageNet are illustrated in Figure 10. A clear lower bound can be observed, because the presence of relatively large values is a necessary condition for a large standard deviation. On the one hand, a notable number of weights with large sums are pruned by our approach; on the other hand, kernels with very low sums cannot have large deviations, so they are also caught by our approach. This indicates that, by using the std thresholds, more weights are pruned compared to the weight sum criterion.