Pruning Convolutional Filters using Batch Bridgeout

State-of-the-art computer vision models are rapidly increasing in capacity, where the number of parameters far exceeds the number required to fit the training set. This results in better optimization and generalization performance. However, the huge size of contemporary models results in large inference costs and limits their use on resource-limited devices. In order to reduce inference costs, convolutional filters in trained neural networks could be pruned to reduce the run-time memory and computational requirements during inference. However, severe post-training pruning results in degraded performance if the training algorithm results in dense weight vectors. We propose the use of Batch Bridgeout, a sparsity inducing stochastic regularization scheme, to train neural networks so that they could be pruned efficiently with minimal degradation in performance. We evaluate the proposed method on common computer vision models VGGNet, ResNet, and Wide-ResNet on the CIFAR image classification task. For all the networks, experimental results show that Batch Bridgeout trained networks achieve higher accuracy across a wide range of pruning intensities compared to Dropout and weight decay regularization.

age [Huttenlocher et al., 1979]. In pruning, the elements of a trained network are ranked according to some importance criteria and the least important elements are removed from the network, resulting in a smaller network with lower inference cost. Unstructured pruning removes individual weights from the model resulting in sparse matrices with the same dimensions as the original model, requiring sparse computational techniques to save inference cost [Gale et al., 2020]. Structured pruning, on the other hand, removes filters from the convolutional layer reducing the dimension of the matrix multiplication and directly resulting in a smaller number of operations and runtime memory.
The simplest and most common importance criterion for removing filters is the magnitude of the filters. Therefore, several regularization approaches have been proposed to train CNNs such that most weights have small values at the end of training to facilitate pruning [Bui et al., 2019, Louizos et al., 2018]. In order to train neural networks that are robust to post-hoc pruning, stochastic regularization techniques provide the advantage that they make the network rely less on any individual weight during training. One such stochastic regularization technique is Bridgeout [Khan et al., 2018], proposed for fully connected neural networks. Bridgeout, along with being stochastic, is also proven theoretically to induce sparsity in the model weights.
In this work, we propose a simple and computationally less expensive variant of Bridgeout called Batch Bridgeout. Batch Bridgeout is applicable to both fully connected and convolutional layers. Batch Bridgeout is significantly faster compared to Bridgeout and requires less GPU memory. Importantly, Batch Bridgeout can be easily implemented without changing the optimized GPU based kernels such as cuDNN [Chetlur et al., 2014]. Unlike Bridgeout that uses a different set of weights per example, Batch Bridgeout uses a single set of weights per mini-batch of examples. We show that Batch Bridgeout results in sparse filter weights in CNNs similar to Bridgeout inducing sparsity into fully connected layers.
Given its stochastic nature, sparsity inducing characteristics, and ease of use, we propose the use of Batch Bridgeout for pruning convolutional filters in CNNs. The stochastic nature of Batch Bridgeout regularization makes the CNN robust to pruning since the network does not rely on any single filter, whereas the sparse nature of the regularization results in networks that distill their knowledge in a smaller set of weights making the network naturally appropriate for pruning.
We train contemporary computer vision models such as VGG-16 [Simonyan and Zisserman, 2014], a very small ResNet [He et al., 2016] and a large Wide-ResNet [Zagoruyko and Komodakis, 2016], with Batch Bridgeout targeted to the lowest-magnitude filter weights, on the CIFAR image classification task. We perform one-shot structured pruning of the filters in the networks and show that Batch Bridgeout results in less degradation than Targeted Dropout for all the networks.
The contributions of this work include: 1. a novel stochastic regularization method, Batch Bridgeout, that can be used to induce sparsity into CNNs while also being easy to implement and computationally less expensive; 2. the novel application of one-shot filter pruning with sparsity inducing regularization; and, 3. the evaluation of structured pruning with Batch Bridgeout across a range of DNN models.

Background and Related Work
To make state-of-the-art computer vision models more practical to deploy, many pruning techniques have been proposed. This section provides a taxonomy of pruning techniques followed by a description of the works related to the main contributions of this paper.

Classification of Pruning Techniques
We broadly classify pruning techniques for lower inference cost based on the elements pruned, the number of train-prune iterations and the criteria used for pruning decisions as follows.

Unstructured vs. Structured Pruning
Unstructured pruning removes individual weights from the weight matrices of convolutional filters or the fully connected units [Han et al., 2015]. Unstructured pruning results in higher compression rates for the same task performance due to the flexibility of fine grained selection of which weights should be eliminated. Unstructured pruning results in sparse weight matrices with the same dimensions as the unpruned ones. Thus, specialized sparse matrix multiplication techniques are needed to exploit the sparsity for faster inference [Gale et al., 2020].
Structured pruning, on the other hand, removes model weights in groups corresponding to a neural unit or convolutional filter [Li et al., 2016]. Structured pruning corresponds to removing an entire row or column from the weight matrix. This results in reduced dimension weight matrices directly reducing the number of operations and runtime memory required for inference without any additional overhead or specialized techniques. However, imposing such structure during pruning generally results in higher degradation in task performance.
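The difference between the two pruning styles can be made concrete with a small NumPy sketch. It is only an illustration on a random fully connected weight matrix; the sizes, the 50% pruning level, and the variable names are assumptions chosen for the example, not settings from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))   # 6 output units (rows), 4 inputs (columns)
x = rng.normal(size=4)

# Unstructured pruning: zero the 50% of weights with the smallest magnitude.
# The matrix keeps its shape, so a dense matmul does the same amount of work.
threshold = np.quantile(np.abs(W), 0.5)
W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)

# Structured pruning: remove the 3 rows (units) with the smallest L2 norm.
# The matrix itself becomes smaller, and so does the matmul.
row_norms = np.linalg.norm(W, axis=1)
keep = np.sort(np.argsort(row_norms)[-3:])
W_structured = W[keep]

print(W_unstructured.shape, (W_unstructured @ x).shape)  # (6, 4) (6,)
print(W_structured.shape, (W_structured @ x).shape)      # (3, 4) (3,)
```

The unstructured result only saves computation if a sparse kernel is used, whereas the structured result is an ordinary, smaller dense matrix.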

Iterative vs. One-shot Pruning
Iterative pruning trains a network to convergence, computes the importance of each element in the network and removes a small number of least important elements. This is followed by retraining to recover from any loss in task performance due to pruning. This process is repeated until the desired network size and task performance trade off is achieved [Han et al., 2015].
Conversely, in one-shot pruning, the training algorithm is modified in such a way that at the end of training most of the weights (or neurons) are zero. Thus, these weights or neurons can be removed at the end of training without any additional steps or substantial loss of accuracy. The cost function of DNNs over the weight space has many local minima. One-shot pruning schemes that train from scratch employ some form of regularization to prefer DNNs with sparse weights over dense ones with equivalent cost. This sparsity inducing characteristic of the regularization helps in retaining the network performance when the network is pruned. Several deterministic sparse regularization techniques have been proposed for one-shot pruning of neural networks [Yoon and Hwang, 2017, Wen et al., 2017].

Sensitivity vs. Magnitude based Pruning
When pruning a model, it can be helpful to quantify the importance of weights or neurons in the network. This importance metric is intended to quantify the degradation in performance brought about by removing the weight or neuron.
In sensitivity based pruning methods, the sensitivity of the cost function with respect to weights or neurons is directly approximated using the derivatives of the cost function with respect to each network element. Optimal Brain Damage (OBD) [LeCun et al., 1989] is a technique that uses a second order Taylor approximation of the cost function with respect to individual weights. This approximation requires the computation of the Hessian of the cost function with respect to individual weights. The Hessian is then used as a proxy for the sensitivity of the cost function with respect to the weights.
Instead of using Hessian-based methods, the importance of a network element can be approximated using the magnitude of the network element. A higher magnitude means that the element contributes more to the output of the network than a low-magnitude element [Han et al., 2015]. In a large scale empirical study, Gale et al. [Gale et al., 2019] found that simple magnitude-based importance criteria perform better than other, more complex criteria for pruning deep neural networks.
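For reference, the two importance criteria can be written side by side. The notation below is an assumption, since the text does not fix symbols: $E$ is the cost function, $w_i$ an individual weight, $h_{ii}$ the corresponding diagonal entry of the Hessian, and $s_i$ the importance score. The OBD form relies on the usual simplifications of that method, namely a diagonal Hessian and a trained network whose gradient is approximately zero.

```latex
\begin{align}
  % Second-order Taylor expansion around the trained weights
  % (first-order term dropped, off-diagonal Hessian terms ignored)
  \delta E &\approx \tfrac{1}{2} \sum_i h_{ii}\, \delta w_i^{2},
  \qquad h_{ii} = \frac{\partial^{2} E}{\partial w_i^{2}} \\
  % OBD saliency: estimated cost increase when w_i is set to zero (\delta w_i = -w_i)
  s_i^{\mathrm{OBD}} &= \tfrac{1}{2}\, h_{ii}\, w_i^{2} \\
  % Magnitude criterion: no derivatives required
  s_i^{\mathrm{mag}} &= |w_i|
\end{align}
```

The magnitude score needs only the trained weights, which is one reason it is attractive for large networks.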
In this work we are concerned with one-shot, structured magnitude-based pruning of filters in convolutional neural networks.

Sparse Regularization for Pruning
The purpose of magnitude-based pruning techniques is to remove small magnitude weights or neurons from the network with as little performance drop as possible. It is logical then to train neural networks, from scratch, in such a way that most of the weights or neurons are close to zero except a few weights that are critical to the performance of the network. The cost function of DNNs often exhibits a large number of local minima [Safran and Shamir, 2016]. It is, therefore, probable that two local minima have almost equal cost but correspond to very different configurations of the parameters; for example, one could belong to a highly dense DNN, whereas the other could belong to a very sparse, and thus compact, DNN. To this end, several deterministic regularization techniques have been explored to train DNNs so that the sparse configurations of the parameters are selected, which, in turn, can aid in model pruning.
Alvarez and Salzmann [Alvarez and Salzmann, 2016] have used group sparsity to learn the number of neurons per layer in a deep neural network. They added an $L_{2,1}$ penalty term to the cost function in order to force groups of parameters belonging to a single neuron close to zero. Evaluating the $L_{2,1}$ penalty on Imagenet, the authors reported better compression performance compared to an $L_1$ penalty. The grouped sparsity removes complete neurons and thus this technique is a structured pruning method.
Scardapane et al. [Scardapane et al., 2017] augmented the $L_{2,1}$ penalty with an $L_1$ penalty to obtain additional compression for fully connected neural networks using both structured ($L_{2,1}$) and unstructured ($L_1$) pruning at the same time. Yoon and Hwang [Yoon and Hwang, 2017] combined group sparsity ($L_{2,1}$ penalty) with exclusive sparsity ($L_{1,2}$ penalty) to achieve more compact fully connected and convolutional neural networks.
In the previous methods, $L_1$-like penalties were used to drive weights towards zero during training. While the $L_1$ penalty prefers sparser weights, it does not make the weights exactly zero. A more representative penalty for non-zero weights is the $L_0$ norm, which is defined as the number of non-zero elements in a vector. However, training with an $L_0$ norm regularizer based on gradient descent is not feasible because the $L_0$ norm is not differentiable. The $L_1$ norm, as described previously, is used as a convex relaxation of the $L_0$ norm. A few algorithms for training neural networks with an $L_0$ penalty have been proposed in the literature recently. Louizos et al. [Louizos et al., 2018] proposed a method to minimize the expected $L_0$ norm of the weights during training followed by unstructured pruning. Bui et al. [Bui et al., 2019] derived a method for minimizing the group $L_0$ norm in order to perform structured pruning.
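The penalties discussed above can be compared on a tiny weight matrix. The sketch below assumes that each row holds the incoming weights of one neuron and is treated as one group; this grouping, and the convention of summing the per-group $L_2$ norms for the $L_{2,1}$ penalty, are assumptions for illustration, since conventions differ between papers.

```python
import numpy as np

# Each row is one group (for example, the incoming weights of one neuron).
W = np.array([[0.0, 0.0,  0.0],
              [3.0, 0.0, -4.0],
              [0.0, 1.0,  0.0]])

l0 = np.count_nonzero(W)                    # number of non-zero weights
l1 = np.abs(W).sum()                        # sum of absolute values
group_norms = np.linalg.norm(W, axis=1)     # L2 norm of each row: [0, 5, 1]
l21 = group_norms.sum()                     # group penalty: sum of row norms
group_l0 = np.count_nonzero(group_norms)    # number of rows that are not all zero

print(l0, l1, l21, group_l0)                # 3 8.0 6.0 2
```

Only the first row contributes nothing to every penalty, which is why group penalties favor solutions where whole neurons or filters can be removed.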
Stochastic regularization such as Dropout has been utilized to aid in pruning performance as well. Variational Dropout (VD) uses an individual dropout rate for each parameter during training. Parameters with a large Dropout probability at the end of training can thus be discarded. Molchanov et al. [Molchanov et al., 2017] reported good compression performance of VD on LeNet and VGG architectures.
Dropout regularization promotes robustness to the loss of individual neurons during training. During each forward pass, Dropout randomly prunes neurons. If the least useful neurons are known a priori, based on some criterion such as magnitude, Dropout could be applied only to these neurons. Such targeted application of Dropout enables the network to be robust to post-hoc pruning. This idea was proposed by Gomez et al. [Gomez et al., 2019], who named it Targeted Dropout. Gomez et al. showed that Targeted Dropout resulted in superior performance to other, more complex pruning strategies. We chose Targeted Dropout as the primary baseline for comparison in our study due to its state-of-the-art performance.
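The targeting idea can be sketched in a few lines of NumPy. This is a simplified, weight-level illustration rather than the implementation of Gomez et al.: the function name `targeted_dropout`, the arguments `gamma` (fraction of lowest-magnitude weights that are candidates) and `drop_prob` (probability of dropping a candidate), and the choice to rank weights over the whole matrix without rescaling are all assumptions made for the example.

```python
import numpy as np

def targeted_dropout(W, gamma, drop_prob, rng):
    """Randomly zero weights, but only among the `gamma` fraction with the
    smallest magnitude. Returns the perturbed weights and the candidate mask."""
    n_target = int(gamma * W.size)
    order = np.argsort(np.abs(W), axis=None)      # flat indices, ascending magnitude
    targeted = np.zeros(W.size, dtype=bool)
    targeted[order[:n_target]] = True             # mark the low-magnitude candidates
    targeted = targeted.reshape(W.shape)
    drop = targeted & (rng.random(W.shape) < drop_prob)
    return np.where(drop, 0.0, W), targeted

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 8))
W_out, targeted = targeted_dropout(W, gamma=0.75, drop_prob=0.3, rng=rng)

# The largest-magnitude 25% of the weights are never touched.
print(np.array_equal(W_out[~targeted], W[~targeted]))   # True
```

Because the candidate set is recomputed from the current magnitudes at every call, a weight that grows during training can leave the targeted set.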
While Dropout zeros out neurons during training and promotes robustness, in expectation, it does not minimize a cost function that promotes sparsity. Therefore, our proposition is that an alternative stochastic regularization scheme, that explicitly induces sparsity, should better retain model accuracy after pruning.

Proposed Method
Previous work has used deterministic sparse regularization techniques ($L_1$, $L_{2,1}$, $L_0$) to improve pruning performance. In addition, stochastic methods such as Dropout have been shown to promote robustness to pruning. Therefore, we expect that sparse stochastic regularizers such as Batch Bridgeout can combine the benefits of both sparsity and robustness for obtaining efficient and compact DNNs through pruning.

Batch Bridgeout
Batch Bridgeout generates a Bernoulli random mask $M$ for each mini-batch in the training set, for each hidden layer in the network. During training, Batch Bridgeout applies the following perturbation to the weights $W^l$ of the $l$-th layer:

$$\tilde{W}^l = W^l + |W^l|^{q/2} \odot \left(\frac{M}{p} - 1\right) \tag{1}$$

where the absolute value and the power are taken elementwise, $\odot$ denotes the elementwise product, $p$ is the probability of the Bernoulli mask, determining the strength of regularization, and $q$ is the norm of the penalty, determining the sparsity of the weights. The output of the $l$-th layer is then calculated as

$$a^l = \sigma\left(\tilde{W}^l a^{l-1} + b^l\right) \tag{2}$$

where $a^{l-1}$ and $a^l$ are the activations of the previous and current layer, respectively, $b^l$ is the bias vector, and $\sigma$ is a non-linear activation function such as sigmoid or ReLU [Glorot et al., 2011].
The perturbation in Equation 1 is equivalent to an $L_q$ penalty on the weights of linear models [Khan et al., 2018]. $L_q$ weight penalties with $q < 2$ result in sparse weight matrices. Thus, Batch Bridgeout with $q < 2$ enables us to obtain results equivalent to applying sparsity penalties on the weights. At the same time, it is a stochastic technique. Therefore, it has the advantage of making DNNs not rely on any individual weights, which makes the networks robust to post-hoc pruning of the weights.
Generating a new set of perturbed weights per layer in a deep neural network for each example in a large dataset is prohibitively expensive. Per example perturbation also requires customization of the common GPU based implementations of CNNs such as cuDNN. Batch Bridgeout uses a single set of perturbed weights per mini-batch of examples. We show that using a single set of perturbed weights per layer per mini-batch makes Bridgeout only fractionally more expensive compared to Dropout while still keeping the sparsity inducing properties.
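The per-mini-batch scheme can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. It assumes the Bridgeout-style perturbation $\tilde{W} = W + |W|^{q/2} \odot (M/p - 1)$ with $M \sim \mathrm{Bernoulli}(p)$, a fully connected layer with weights of shape (outputs, inputs), and a ReLU activation; the function names are hypothetical.

```python
import numpy as np

def batch_bridgeout(W, p, q, rng):
    """Return one perturbed copy of W, to be shared by every example in the
    mini-batch. Assumed form: W + |W|^(q/2) * (M/p - 1), M ~ Bernoulli(p)."""
    M = rng.binomial(1, p, size=W.shape)
    return W + np.abs(W) ** (q / 2.0) * (M / p - 1.0)

def layer_forward(A_prev, W, b, p, q, rng, training=True):
    """Fully connected layer with ReLU; weights are perturbed once per call."""
    W_used = batch_bridgeout(W, p, q, rng) if training else W
    return np.maximum(0.0, A_prev @ W_used.T + b)

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(8, 4))      # mini-batch of 8 examples, 4 features
W = rng.normal(size=(3, 4))           # 3 output units
b = np.zeros(3)

A = layer_forward(A_prev, W, b, p=0.7, q=1.5, rng=rng)
print(A.shape)                        # (8, 3)
```

Since the mask has the shape of the weight matrix rather than one copy per example, the forward pass remains a single ordinary matrix product (or convolution), which is what allows standard optimized kernels to be reused. Under the assumed form the perturbation has zero mean, so the unperturbed weights are used at inference time.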

Filter Pruning
Once the network is trained with Batch Bridgeout, many of the weights in the filters will be close to zero due to the sparsity inducing property of the regularization. We use the $L_2$ norm of the filter weights as a surrogate for the importance of the filter to the network. In each layer we keep only the $k$ filters with the largest $L_2$ norms:

$$W^* = \underset{j}{\operatorname{arg\,max\text{-}k}} \; \|W_{:,j}\|_2$$

where $W$ is the matrix containing all the filters in the layer, one per column, and $W^*$ contains the $k$ most important filters, those with the largest $L_2$ norm, that we are interested in after pruning.
Batch Bridgeout could be applied to all the weights of a DNN layer. However, for the purposes of pruning, our goal is to find the optimal set of filter weights $W^*$ with respect to the cost function such that the number of non-zero filters is bounded, $|W^*_{:,j}| < k$. That is, we want to keep only the $k$ most important filters at the end of training. Rather than dropping weights deterministically at the end of training, it is more reasonable to target the application of Batch Bridgeout to only the less important weights, as is done in Targeted Dropout [Gomez et al., 2019]. If during training some less important weights become significant, Batch Bridgeout is not applied to them. Thus, Targeted Batch Bridgeout helps in distilling the weights of the $k$ most important filters during training. Targeted Batch Bridgeout is illustrated in Figure 1 for the case of fully connected layers, where it is applied only to a fraction $\gamma$ of the weights in the layer at each iteration of the training.
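The magnitude-based filter selection described in this section can be sketched with NumPy. The example assumes, following the column notation $W_{:,j}$, that each filter has been flattened into one column of a matrix; the function name `keep_top_k_filters` and the toy values are hypothetical.

```python
import numpy as np

def keep_top_k_filters(W, k):
    """W holds one flattened filter per column. Return the matrix of the k
    columns with the largest L2 norm and the indices of those columns."""
    norms = np.linalg.norm(W, axis=0)           # one norm per filter (column)
    keep = np.sort(np.argsort(norms)[-k:])      # k largest, in original order
    return W[:, keep], keep

W = np.array([[0.1, 3.0, 0.0, 1.0],
              [0.2, 4.0, 0.1, 1.0]])
W_star, keep = keep_top_k_filters(W, k=2)
print(keep)       # [1 3]
```

Sorting the kept indices preserves the original filter order, which keeps the pruned layer aligned with the layers around it.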

Experimental Results
In this section we describe the experiments that evaluate the computational cost, sparsity induction, and pruning performance of Batch Bridgeout. For evaluating the pruning performance, we benchmark Batch Bridgeout against the recently proposed Targeted Dropout [Gomez et al., 2019] technique, which has been shown to perform better than other deterministic and stochastic techniques. To assess whether the pruning results generalize across architectures and model sizes, we conducted experiments with three architectures: VGG, a very small ResNet, and a full sized Wide-ResNet.

Computational Cost
Batch Bridgeout is computationally efficient compared to Bridgeout. To evaluate the computational cost of Batch Bridgeout against Bridgeout, we train a fully-connected autoencoder with two hidden

Sparsity Characterization
This section evaluates whether or not Batch Bridgeout, that is, a single set of Bridgeout perturbed weights per mini-batch, induces sparsity in the weights of convolutional layers. We trained the VGG-16 model without the fully connected layers. The detailed architecture of the model and the training method are described in Section 4.3.
The network was trained with different regularization applied to all the convolutional layers. We use Hoyer's sparsity measure [Hoyer, 2004] as a metric for quantifying sparsity. Hoyer's measure of a $d$-dimensional vector $x$ is given by

$$\mathrm{Hoyer}(x) = \frac{\sqrt{d} - \|x\|_1 / \|x\|_2}{\sqrt{d} - 1}$$

It ranges from 0 for a vector whose entries all have equal magnitude to 1 for a vector with a single non-zero entry.
Figure 2 shows Hoyer's sparsity measure of the filters of the convolution layers at the end of training. As can be seen, Batch Bridgeout results in a higher sparsity measure compared to Dropout for almost all the layers. This shows that using Bridgeout with a single set of perturbed weights per mini-batch induces noticeable sparsity in the weights of the neural networks. We note that not applying any stochastic regularization results in higher sparsity for the layers near the input and output; this has been previously noted by Li et al. [Li et al., 2016]. When the stochastic regularization is targeted only to the lowest-magnitude 75% of the weights, there is a significant increase in the sparsity of the weights. This motivates us to target Batch Bridgeout and Dropout only to the lowest magnitude weights for the purposes of pruning.
Targeting frees up the important weights from the effects of regularization and thus those important weights could grow larger, creating an imbalance in the distribution of weights. That is, regularized weights shrinking smaller and important weights being updated as dictated by the gradient of the cost function. This imbalance could result in higher sparsity of the targeted versions of the regularization. It can be seen that among all the methods, Targeted Batch Bridgeout results in the highest sparsity in the VGG architecture.
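Hoyer's sparsity measure used in this section is straightforward to compute. The sketch below uses the standard definition, $(\sqrt{d} - \|x\|_1/\|x\|_2)/(\sqrt{d} - 1)$ for a $d$-dimensional vector $x$; the function name is hypothetical, and flattening a filter tensor into a vector before measuring is an assumption about how the measure would be applied to convolutional filters.

```python
import numpy as np

def hoyer_sparsity(x):
    """Hoyer's sparsity measure of a vector: 0 when all entries have equal
    magnitude, 1 when exactly one entry is non-zero.
    Undefined for the all-zero vector and for d = 1."""
    x = np.asarray(x, dtype=float).ravel()    # tensors are flattened first
    d = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(d) - l1 / l2) / (np.sqrt(d) - 1.0)

print(hoyer_sparsity([0, 0, 0, 5]))   # 1.0
print(hoyer_sparsity([1, 1, 1, 1]))   # 0.0
```

The measure is invariant to rescaling of the vector, so it compares the shape of the weight distribution across layers rather than the overall weight scale.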
Since we are using the $L_2$ norm as the importance criterion for keeping a filter during pruning, Figure 3 shows the slope of the $L_2$ norm of the filters in each layer of VGG-16 trained with Batch Bridgeout and Dropout. As can be seen, Batch Bridgeout results in larger slopes of the filters compared to Dropout. It has been observed by Li et al. that layers with larger slopes maintain their accuracy as filters are pruned in that layer [Li et al., 2016]. Figure 4 shows the training and validation loss and accuracy for models trained with different techniques. As shown in the figure, all three models converge, with Batch Bridgeout still showing a downward trend, whereas the validation loss starts ascending for the model without regularization. The training loss for Batch Bridgeout is higher compared to Dropout due to the large amount of perturbation applied during training.

Pruning VGGNet
In this section we describe the pruning performance of Batch Bridgeout for the VGG architecture [Simonyan and Zisserman, 2014]. The VGG architecture is one of the popular deep convolutional networks used in computer vision. VGG-16 consists of 13 convolutional layers with receptive fields of size 3 × 3, some of which are followed by max pooling, two fully connected layers of 4096 units, and a softmax layer of 10 units representing the class probabilities. In order to make a more challenging pruning task, we train VGG-16 without the two fully connected layers, which represent 90% of the parameters of the VGG-16 model as shown in Figure.
Unless stated otherwise, all the networks were trained on the CIFAR-10 [Krizhevsky and Hinton, 2009] image classification dataset for 230 epochs with stochastic gradient descent, an exponentially decaying learning rate of 0.1, momentum of 0.9 and weight decay of $5 \times 10^{-4}$. A Dropout probability of $p = 0.3$ and a Batch Bridgeout norm of $q = 1.5$ were used. Both Batch Bridgeout and Dropout were targeted to the lowest magnitude 75% of the weights in each layer except the last layer in the network.
After training the networks with different regularization techniques, we zero out a constant fraction of low-magnitude filters in every convolutional layer, with each layer treated independently. Figure 6 shows the average accuracy of the VGG-16 model when different percentages of filters are set to zero. As can be seen, the degradation in accuracy of the Batch Bridgeout trained network is negligible when about 40% of the filters in each layer are pruned, whereas the Dropout and backprop trained networks degrade significantly when even 10% of the filters in the network are set to zero.
In order to evaluate the computational cost of the pruned models (e.g., memory required by the model and the forward pass execution time) we remove the zeroed out filter rows from the weight matrices. Since removing a filter from the weight matrix in layer l results in the reduction of the input size to the layer l +1, we remove all the weights in the l +1 layer corresponding to the removed filter's activation maps. For the final fully connected softmax layer, we remove the weights corresponding to the pruned out filters. Table 2 shows the memory and the average forward pass execution time at different levels of pruning and the accuracy of the models trained with different regularization. Similar to Figure 6, in Table 2 Batch Bridgeout results in the least degradation as filters are removed. For the full model the backpropagation results in the highest accuracy indicating that the regularization techniques reduce the complexity of the model slightly in the absence of fully connected layers.
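The bookkeeping for physically removing filters can be sketched as follows. This is an illustration under assumed conventions, not the code used for Table 2: convolution weights are stored as (filters, input channels, height, width), the same fraction of lowest-norm filters is dropped, and the helper names are hypothetical.

```python
import numpy as np

def filters_to_keep(W, fraction):
    """Indices of the filters that survive after dropping the given fraction
    of filters with the smallest L2 norm."""
    norms = np.linalg.norm(W.reshape(W.shape[0], -1), axis=1)
    n_drop = int(fraction * W.shape[0])
    return np.sort(np.argsort(norms)[n_drop:])

def remove_filters(W_l, b_l, W_next, keep):
    """Drop pruned filters from layer l and the matching input channels
    from layer l+1, so the two layers stay shape-compatible."""
    return W_l[keep], b_l[keep], W_next[:, keep]

rng = np.random.default_rng(0)
W_l = rng.normal(size=(8, 3, 3, 3))       # layer l: 8 filters over 3 input channels
b_l = np.zeros(8)
W_next = rng.normal(size=(16, 8, 3, 3))   # layer l+1 consumes the 8 activation maps

keep = filters_to_keep(W_l, fraction=0.5)
W_l_p, b_l_p, W_next_p = remove_filters(W_l, b_l, W_next, keep)
print(W_l_p.shape, W_next_p.shape)        # (4, 3, 3, 3) (16, 4, 3, 3)
```

Removing half of the filters in layer l therefore also halves the weights of layer l+1, which is where much of the memory and run-time saving comes from.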
In order to determine which model is able to regain its original accuracy after pruning, we retrained the pruned models for 100 epochs without any regularization. As a baseline, we also trained an equivalent sized model from random initialization. Table 3 shows the classification performance of the retrained models. For up to 40% pruning, the Batch Bridgeout trained model is able to recover its own original accuracy of 92.79%. The model trained from scratch comes close to the accuracy of the retrained pruned models. This finding is consistent with previous work [Li et al., 2016] and calls into question the overall utility of post-training pruning vs. selecting a smaller size model a priori. This is discussed in Section 5 in the light of recent developments in the field.

Pruning ResNet and Wide-ResNet
In this section we describe experiments to evaluate whether the pruning results obtained for VGG-16 generalize to small sized models and other state-of-the-art deep learning models. For this purpose, we evaluated a very small ResNet [He et al., 2016] model and a large Wide-ResNet-28x10 [Zagoruyko and Komodakis, 2016] on the CIFAR-10 dataset. Unlike VGG, ResNet models include identity connections between alternating convolution layers in order to facilitate the training of very deep neural networks. In order to make for a more challenging pruning task, we selected a small all convolutional ResNet model with four residual blocks of 64, 128, 256 and 512 filters for a total of only 0.5 million parameters. Since the model is already fairly small, removing any filters after training is expected to degrade task performance and thus makes for a good test for evaluating the pruning techniques. Wide-ResNet was selected as the standard implementation with 36 million parameters. Figure 7 and Figure 8 show the classification accuracy of ResNet and Wide-ResNet, respectively, as an increasing percentage of filters is zeroed out uniformly in each layer. In both networks, we see the same relative results, with Batch Bridgeout achieving higher accuracy than Dropout and backpropagation.

Discussion
Higher performance on benchmark computer vision tasks has been achieved by increasingly larger and deeper convolutional neural networks, such as VGGNet [Simonyan and Zisserman, 2014], ResNet [He et al., 2016] and Wide-ResNet [Zagoruyko and Komodakis, 2016]. To deploy these models to resource limited devices, post-training model pruning is often required. For this reason, model pruning is an active area of research in deep learning. Structured pruning, where entire filters are removed, provides the best resource savings because removing filters reduces the dimension of the matrix multiplication. However, structured pruning results in more severe performance degradation compared to unstructured pruning of individual weights. We expect that regularization techniques that promote sparsity and result in robust networks will be useful in avoiding degradation in performance due to structured pruning. Regularization with Batch Bridgeout, proposed in this paper, provides both sparsity and stochasticity for robustness and is computationally efficient for use with large models. Our experiments demonstrated consistent improvement in pruned-model accuracy for Batch Bridgeout compared to Dropout or backpropagation with weight decay across a range of deep learning architectures. Our results for retraining pruned models demonstrated that Batch Bridgeout allowed models to better regain their original accuracy, even for 40% pruning on the fully convolutional version of VGG-16. However, we also observed that training a similarly sized model from scratch was typically within 1% of the retrained pruned model for CIFAR-10. This gives rise to the question of whether it is better to train a smaller model rather than training a large model and then performing post-training pruning.
Recently, several studies have tried to investigate this question. Frankle and Carbin [Frankle and Carbin, 2019] performed experiments with LeNet-5 [LeCun et al., 1990] to show that a sub-network obtained from pruning the original larger model could be trained to the same accuracy as the original model if the weights of the sub-network are initialized with the random values they had at initialization as part of the original model. They argue that training and pruning helps in discovering this special sub-network, which could perform as well as the original model. Liu et al. [Liu et al., 2018] experimentally showed that pruned sub-networks could be initialized with random weights and trained to achieve the performance of the original model. Gale et al. [Gale et al., 2019] performed large scale computer vision and neural machine translation experiments and concluded that for complex tasks unstructured pruned models cannot be trained from scratch to the same task performance as could be obtained from optimizing and pruning. Gale et al. investigated unstructured pruning, which could be seen as an upper bound on the performance of an equivalent sized structurally pruned model. That is, an unstructured pruned model still inherits the topology of the surviving weights, whereas structured pruned models initialized from scratch do not inherit any information from the pruning process. Thus, given the results of Gale et al., it is unlikely that for complex tasks small models equivalent to a structurally pruned model will attain the same performance as the pruned models. Furthermore, the fact that Batch Bridgeout is able to retain performance when pruning very small models indicates the utility of pruning even when small models are used for relatively simple computer vision tasks.
One reason for the good performance of the pruned models with random initialization could be the long-term use of standard benchmark datasets such as CIFAR for building computer vision models. As noted by Recht et al. [Recht et al., 2018], sampling a different test set from the CIFAR-10 dataset results in large changes in the generalization error of contemporary computer vision models. A plethora of hyperparameters and architecture designs have been tailored towards these benchmark datasets. This could mean that under these settings, the number of trainable parameters or model size could be reduced quite a bit without sacrificing the performance on the given test set. However, for solving novel tasks large models might still be needed. As future work, we plan to investigate this question with large scale complex tasks such as evaluation on Imagenet [Deng et al., 2009]. Nevertheless, our results indicate that, under identical conditions, Batch Bridgeout trained networks are more robust to structured pruning compared to Dropout and backpropagation across several architectures and models of different sizes. These results also hold when the architectures are thinned out in the first place, e.g. by removing 90% of the parameters of VGG-16 by omitting the two large fully connected layers, and by selecting a small ResNet model with less than 0.5M parameters.
In the present study, all layers in a DNN were pruned to the same degree. However, in a DNN, some layers are more sensitive to pruning than others, as shown in Figure 2. As future work, the amount of pruning performed could be adjusted on a layer-by-layer basis, which could potentially result in better accuracy for a given amount of pruning.

Summary
In this paper, we show that effective sparsity inducing regularization techniques are important for compressing large neural network models. We presented Batch Bridgeout, a computationally efficient extension of the Bridgeout regularization scheme, and demonstrated empirically that it is capable of inducing sparsity in the filters of convolutional neural networks. Batch Bridgeout was evaluated on structured filter pruning on a number of CNN architectures, including VGG, ResNet and Wide-ResNet. For all architectures, Batch Bridgeout was shown to outperform the recently proposed Targeted Dropout and weight decay regularization based pruning.