Automated Filter Pruning Based on High-Dimensional Bayesian Optimization

Filter pruning is necessary to efficiently deploy convolutional neural networks on edge devices that have limited computational resources and power budgets. With conventional filter pruning techniques, the same pruning rate is manually specified for all convolutional layers, which is suboptimal and time-consuming. Because a network extracts features from the coarse level to the fine level, the filters in each layer follow different distributions; therefore, it is unsuitable to apply the same pruning rate to layers with different functions. To address this issue, we propose a high-dimensional Bayesian optimization-based filter pruning (HDBOFP) algorithm, which automatically determines the most appropriate pruning rate for each convolutional layer. In addition, the proposed method identifies optimal pruning-rate combinations without a time-consuming retraining phase. Compared with conventional filter pruning methods, this automated technique is more efficient, improves accuracy, and reduces the required human labor. The effectiveness of our automated filter pruning algorithm is validated through two major computer vision applications, namely image classification and object detection. Specifically, when applied to ResNet-110 on the CIFAR-10 dataset, HDBOFP reduces the required number of floating-point operations (FLOPs) by more than 62% without affecting accuracy. Similarly, when HDBOFP is applied to the YOLOv5l framework for detection experiments on the MS-COCO 2017 dataset, FLOPs decrease by more than 43% with only a 1.2% loss in mean average precision, improving on previous studies.


I. INTRODUCTION
Convolutional neural networks (CNNs) with large model sizes and heavy computational costs have achieved remarkable performance in various computer vision applications; however, it is difficult to deploy these CNNs on edge devices, owing to the limited computational resources and power budgets of such devices. Even state-of-the-art high-efficiency architectures, such as those based on residual connections or inception modules, have millions of parameters and require billions of floating-point operations (FLOPs). Therefore, it is necessary to develop CNNs that can deliver high accuracies with a relatively low computational cost. Many recent studies have attempted to enable CNNs to use hardware resources more efficiently, which has resulted in various model compression strategies, including pruning [3], [4], tensor decomposition [14], [18], quantization [19], and knowledge distillation [20]. Among these strategies, pruning has attracted special attention owing to its effectiveness and ease of implementation.
The associate editor coordinating the review of this manuscript and approving it for publication was Shadi Alawneh.
Recent developments in pruning can be divided into two categories: weight pruning and filter pruning. Weight pruning aims to remove redundant elements in the weights of filters, which can achieve high model efficiency with no loss of accuracy. However, these algorithms result in an irregular sparsity pattern and require specialized hardware to obtain a speed-up. In contrast, filter pruning removes entire unimportant filters. Thus, the pruned filter weights are regular and can be directly accelerated using off-the-shelf hardware and libraries. In this study, we investigate filter pruning, which is the preferred method for CNN compression.
Conventional filter pruning algorithms comprise three steps: (1) training a large CNN on the target dataset; (2) pruning redundant filters from the pretrained CNN based on a particular pruning rate and pruning criteria; and (3) retraining the pruned CNN to recover its original performance. The fundamental task in filter pruning is determining the pruning policies, which include the pruning rate and pruning criteria. Determining the appropriate pruning rate for each layer is a crucial research point; however, many recent studies have focused instead on measuring the importance of each filter. Most existing methods manually specify a single pruning rate for all layers, as illustrated in Figure 1(a). Some previous studies have proposed rule-based algorithms to determine different pruning rates for each convolutional layer. However, because the convolutional layers in deep CNNs are not independent, these hand-crafted rule-based pruning rate selection algorithms are typically suboptimal and time-consuming, and do not transfer well from one model to another. CNN architectures are evolving rapidly, and an automated compression method is necessary to improve engineering efficiency.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Therefore, we propose a novel automated filter pruning method that automatically determines the appropriate pruning rate combination for an arbitrary network, while achieving better performance than conventional human-designed rule-based filter pruning methods. Figure 1(b) illustrates the proposed algorithm.
To attain the optimal pruning rate combination during the automated phase, it is necessary to evaluate the accuracy degradation of every pruning rate combination considered. However, as it is typically performed through a time-consuming retraining step, we require a more efficient method to measure accuracy degradation. Therefore, we propose an objective function that utilizes a soft filter pruning algorithm to estimate the expected accuracy degradation for a given pruning rate combination. The proposed objective function is based on the idea that the difference between a soft pruned network and its original structure is proportional to the accuracy degradation of the former. The proposed objective function correctly estimates the expected accuracy degradation of a given pruning rate combination; however, it is a non-differentiable and non-convex optimization problem. Therefore, we utilize Bayesian optimization, which is designed for black-box derivative-free global optimization. In addition, as the CNN deepens, the objective function space becomes a high-dimensional large-scale space, which calls for a trade-off between storage complexity, computational cost, and accuracy. Because standard Bayesian optimization suffers from a curse of dimensionality, the use of standard Bayesian optimization limits the applicability of the proposed algorithm to deep CNNs with many convolutional layers. To overcome this limitation, the proposed algorithm utilizes a low-dimensional embedding-based Bayesian optimization, called high-dimensional Bayesian optimization.
In summary, the main contributions of this study are as follows:
• We propose an effective automated filter pruning algorithm, called high-dimensional Bayesian optimization-based filter pruning (HDBOFP), which automatically determines the most appropriate pruning rate for each convolutional layer in a CNN.
• The HDBOFP can concurrently consider all convolutional layers, owing to the characteristic of the proposed objective function. In addition, it can automatically determine optimal pruning rate combinations without a time-consuming retraining phase.
• The use of a high-dimensional Bayesian optimization enables HDBOFP to theoretically provide a global optimal pruning rate combination and effectively solve the high dimensionality problem of the proposed objective function.
• To demonstrate the wide and general applicability of HDBOFP, we evaluate its performance through four image classification and object detection benchmarks. The results of extensive experiments suggest that HDBOFP performs better than conventional filter pruning algorithms.
The remainder of this paper is organized as follows. Section II briefly introduces filter pruning, Bayesian optimization, and notation. Section III discusses our proposed automated filter pruning algorithm, focusing on the objective function and high-dimensional Bayesian optimization. The experimental results are presented and analyzed in Section IV. Finally, the study is summarized in Section V.

A. FILTER PRUNING
Current pruning methods can be categorized into weight pruning (unstructured pruning) and filter pruning (structured pruning). Weight pruning aims to remove the fine-grained weight of filters, which leads to unstructured sparsity in pruned CNNs. In contrast, filter pruning can achieve structured sparsity, enabling the pruned network to fully utilize highly efficient basic linear algebra subprogram libraries to achieve improved acceleration.
An active research topic in filter pruning is the quantification of filter importance, that is, defining the pruning criteria. Following [2], pruning criteria can be divided into two categories: weight-based criteria, which evaluate filter importance based on the filter weights, and activation-based criteria, which utilize training data and filter activations to quantify the importance of filters. Regarding weight-based criteria, [3] and [4] utilized the l1- and l2-norm of filters, respectively; [8] imposed sparsity on the scaling parameters of batch normalization layers to prune the CNN; and [5] proposed that the filters near the geometric median should be pruned. In terms of activation-based criteria, [7] introduced principal component analysis to identify the part of the network that should be pruned; [6] used information from the next layer to assist with filter pruning; and [9] explored the linear relationship between different activations to eliminate unimportant filters. Most of these studies utilized the same pruning rate for all layers and did not consider that different layers have different characteristics and filter distributions.
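As a minimal illustration of a weight-based criterion, the following Python sketch ranks the filters of one layer by their l2-norm and returns the indices that would be pruned at a given pruning rate. The function names and the toy layer are hypothetical and are not taken from [4]; filters are stored as flat lists of weights for simplicity.

```python
import math

def l2_norm(filter_weights):
    """l2-norm of one filter given as a flat list of weights."""
    return math.sqrt(sum(w * w for w in filter_weights))

def select_filters_to_prune(layer_filters, pruning_rate):
    """Return the indices of the lowest-norm filters of one layer.

    layer_filters: list of filters, each a flat list of floats.
    pruning_rate:  fraction of filters to prune, in [0, 1).
    """
    num_pruned = int(len(layer_filters) * pruning_rate)
    ranked = sorted(range(len(layer_filters)),
                    key=lambda i: l2_norm(layer_filters[i]))
    return sorted(ranked[:num_pruned])

# Hypothetical layer with four tiny filters.
layer = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.5], [-0.05, 0.0]]
pruned = select_filters_to_prune(layer, pruning_rate=0.5)
```

With a single shared pruning rate, the same fraction would be removed from every layer regardless of how its norms are distributed, which is the limitation noted above.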

B. BAYESIAN OPTIMIZATION
Bayesian optimization is a class of machine learning-based optimization methods consisting of two major components: a Bayesian statistical model, which models the objective function, and an acquisition function, which determines the next sampling points. The Bayesian statistical model, which is a Gaussian process, quantifies the uncertainty of the objective function at unobserved data points. The acquisition function measures the predicted improvement at unobserved points to determine the next sampling point [11], [12]. Therefore, Bayesian optimization is well suited for black-box derivative-free global optimization for the following reasons: (1) it does not require the structural information of the objective function (black box); (2) it does not require the derivatives of the objective function (derivative-free); and (3) it determines the global optimum by calculating the uncertainty of the objective function at unobserved points (global optimization). Bayesian optimization is extremely versatile owing to its ability to optimize expensive black-box derivative-free functions. Recently, it has been applied to determine optimum hyper-parameters for machine learning algorithms, particularly deep neural networks [13]-[16].
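To make the role of the Gaussian process concrete, the sketch below computes the posterior mean and variance of a zero-mean Gaussian process at a single query point. It is only an illustration: the squared-exponential kernel, its length scale, the jitter value, and all function names are assumptions made here, not choices stated in this paper.

```python
import math

def rbf(a, b, length_scale=0.3):
    """Squared-exponential kernel between two points (lists of floats)."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq / length_scale ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def gp_posterior(observed_x, observed_y, query, jitter=1e-8):
    """Posterior mean and variance of a zero-mean GP at one query point."""
    n = len(observed_x)
    gram = [[rbf(observed_x[i], observed_x[j]) + (jitter if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    k_star = [rbf(x, query) for x in observed_x]
    mean = sum(k * a for k, a in zip(k_star, solve(gram, observed_y)))
    var = rbf(query, query) - sum(
        k * v for k, v in zip(k_star, solve(gram, k_star)))
    return mean, max(var, 0.0)

# Three hypothetical noise-free observations of a 1-D objective.
xs = [[0.1], [0.5], [0.9]]
ys = [1.0, 0.2, 0.7]
mean_obs, var_obs = gp_posterior(xs, ys, [0.5])   # at an observed point
mean_far, var_far = gp_posterior(xs, ys, [5.0])   # far from all observations
```

At an observed point the posterior reproduces the observation with near-zero variance, whereas far from the data it reverts to the prior; this uncertainty is what the acquisition function exploits.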

C. NOTATION AND DEFINITION
In this study, the mathematical notation follows [2]. For a neural network with $L$ layers, the weight of the $l$th convolutional layer is denoted as $W^{(l)} \in \mathbb{R}^{K \times K \times C_I^{(l)} \times C_O^{(l)}}$, where $K$ is the kernel size and $C_I^{(l)}$ and $C_O^{(l)}$ are the numbers of input and output channels; $W_{(\cdot)}$ and $H_{(\cdot)}$ are the width and height of the feature maps, respectively. $p \in \mathbb{R}^{L}$ is the pruning rate combination, and $p^{(l)} \in [0, 1 - \epsilon]$ is the pruning rate of the $l$th convolutional layer, where $\epsilon$ is a very small number. $\hat{W}^{(l)} \in \mathbb{R}^{K \times K \times (1 - p^{(l-1)}) C_I^{(l)} \times (1 - p^{(l)}) C_O^{(l)}}$ represents the weight of the $l$th hard-pruned convolutional layer. The weight of the $l$th soft-pruned convolutional layer is given by the sparse tensor $\tilde{W}^{(l)} \in \mathbb{R}^{K \times K \times C_I^{(l)} \times C_O^{(l)}}$, which indicates that soft filter pruning zeroizes the filters that are selected to be pruned.
$F$ is the filter set comprising all the filters in the CNN. $\hat{F}$ and $\tilde{F}$ are the filter sets of the hard-pruned and soft-pruned CNNs, respectively.
Filter pruning minimizes the value of the loss function under sparsity constraints on the filters. Given a dataset $D = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n$ is the $n$th input and $y_n$ is the corresponding label, the constrained optimization problem can be formulated as follows:

$$\min_{\hat{F}} \; \mathcal{L}(\hat{F}; D) \quad \text{s.t.} \quad \frac{C_M(\hat{F})}{C_M(F)} \le \tau_M, \quad \frac{C_F(\hat{F})}{C_F(F)} \le \tau_F,$$

where $\mathcal{L}(\cdot)$ is a standard loss function (e.g., cross-entropy loss, mean squared error), $C_M(\cdot)$ and $C_F(\cdot)$ are the storage and computational costs of the network, respectively, and $\tau_M$ and $\tau_F$ are the ratios between the storage and computation costs of the pruned CNN and those of the original CNN, respectively.
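The computational-cost constraint can be illustrated with a short Python sketch that evaluates the ratio of pruned to original FLOPs for a given pruning rate combination. The sketch assumes a plain sequential network without skip connections, counts multiply-accumulate operations only, and uses hypothetical function names and layer shapes; it is not the cost model used in the experiments.

```python
def conv_flops(kernel, c_in, c_out, out_h, out_w):
    """Multiply-accumulate count of one convolutional layer (bias ignored)."""
    return kernel * kernel * c_in * c_out * out_h * out_w

def flops_ratio(layers, rates):
    """Ratio of pruned to original FLOPs for per-layer pruning rates.

    layers: list of (kernel, c_in, c_out, out_h, out_w) tuples.
    rates:  one pruning rate per layer; layer l keeps (1 - rates[l]) of its
            filters, which also shrinks the input channels of layer l + 1.
    """
    original = pruned = 0
    prev_keep = 1.0  # the network input is never pruned
    for (k, c_in, c_out, h, w), rate in zip(layers, rates):
        keep = 1.0 - rate
        original += conv_flops(k, c_in, c_out, h, w)
        pruned += conv_flops(k, round(c_in * prev_keep),
                             round(c_out * keep), h, w)
        prev_keep = keep
    return pruned / original

# Hypothetical two-layer network and pruning rate combination.
layers = [(3, 3, 16, 32, 32), (3, 16, 32, 16, 16)]
ratio = flops_ratio(layers, rates=[0.5, 0.25])
satisfies_budget = ratio <= 0.5  # compare against a chosen tau_F
```

Because the pruning rate of one layer also reduces the input channels of the next, the cost of a layer depends on two consecutive pruning rates, which is one reason the layers cannot be treated independently.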

III. PROPOSED ALGORITHM
The aim of our proposed HDBOFP is to automatically determine an effective pruning rate for each of the layers in a convolutional network, rather than manually determining them, as is the case in previous filter pruning studies. HDBOFP consists of two major components: (1) a novel objective function, which estimates the accuracy degradation given a pruning rate combination $p$ without retraining the pruned CNN; and (2) a high-dimensional Bayesian optimization, which provides the optimal pruning rate combination $p^*$ by minimizing the proposed high-dimensional non-convex non-differentiable loss function $\mathcal{L}$.

A. PROPOSED OBJECTIVE FUNCTION
Calculating the accuracy degradation given the pruning rate combination $p$ typically requires retraining the pruned CNN, which is a time-consuming process whose outcome varies from run to run. To address this problem, we propose a novel objective function that can estimate the relative accuracy loss for a given pruning rate combination $p$ without the need for a retraining phase. The proposed objective function is defined through the Frobenius norm $\|\cdot\|_F$, that is, the square root of the sum of the squares of the elements, of the difference between the original and the soft-pruned filters. The main idea behind this objective function is that the relative accuracy loss is proportional to the reconstruction loss between the original and the soft-pruned filters. Therefore, our algorithm can approximate the expected accuracy loss given the pruning rate combination, while not being affected by the retraining process when searching for an optimal pruning rate combination $p^*$. The proposed objective function has the form of a constrained high-dimensional combinatorial optimization, resulting in a non-differentiable non-convex optimization problem that can be addressed via Bayesian optimization.
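The idea of measuring the reconstruction loss between the original and soft-pruned filters can be sketched as follows. Since the exact formula is not reproduced in this text, the normalized per-layer form below is an assumption made for illustration only, as are the function names, the l2-norm ranking used for soft pruning, and the toy network.

```python
import math

def frobenius(filters):
    """Frobenius norm of a layer stored as a list of flat filters."""
    return math.sqrt(sum(w * w for f in filters for w in f))

def soft_prune(filters, rate):
    """Zero out the lowest l2-norm filters; the tensor shape is unchanged."""
    num_pruned = int(len(filters) * rate)
    order = sorted(range(len(filters)),
                   key=lambda i: sum(w * w for w in filters[i]))
    zeroed = set(order[:num_pruned])
    return [[0.0] * len(f) if i in zeroed else list(f)
            for i, f in enumerate(filters)]

def reconstruction_loss(network, rates):
    """Sum over layers of ||W - W_soft||_F / ||W||_F (assumed form)."""
    total = 0.0
    for filters, rate in zip(network, rates):
        soft = soft_prune(filters, rate)
        diff = [[a - b for a, b in zip(f, g)] for f, g in zip(filters, soft)]
        total += frobenius(diff) / frobenius(filters)
    return total

# Hypothetical network: two layers with two 2-weight filters each.
network = [[[3.0, 4.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 2.0]]]
loss = reconstruction_loss(network, rates=[0.5, 0.5])
```

Evaluating such a function requires only the pretrained weights, so each candidate pruning rate combination can be scored without retraining; the cost constraints would be checked separately.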
Therefore, we can define the potential values of the proposed objective function at a pruning rate combination $p$ that has not been evaluated using a conditional distribution, which is conditioned on the $k$ observed pruning rate combinations $p_{1:k}$. After fitting the Gaussian process, we utilize the expected improvement acquisition function to determine the next pruning rate combination. The expected improvement is measured relative to the best pruning rate combination, i.e., that which yields the lowest loss among the already observed pruning rate combinations. Because the proposed objective function computes the loss of the observed pruning rate combinations without noise, the best current pruning rate combination is $p_{best} = \arg\min_{p \in p_{1:k}} \mathcal{L}(p)$.
The expected improvement acquisition function compares the loss of the current best pruning rate combination $\mathcal{L}(p_{best})$ with the approximate objective value of the candidate pruning rate combination $\mathcal{L}(p)$ to quantify the improvement yielded by the candidate pruning rate combination $p$. Thus, the expected improvement can be defined as follows:

$$\mathrm{EI}_k(p) = \mathbb{E}_k\left[\max\left(\mathcal{L}(p_{best}) - \mathcal{L}(p),\, 0\right)\right],$$

where $\mathbb{E}_k(\cdot) = \mathbb{E}_k(\cdot \mid p_{1:k}, \mathcal{L}(p_{1:k}))$ denotes the expectation under the posterior distribution given evaluations of $\mathcal{L}$ at $p_{1:k}$. The expected improvement acquisition function is primarily used because of its closed form under a Gaussian process.
To determine the next pruning rate combination $p_{k+1}$, which is expected to be closer to the global optimum $p^*$ than the current best $p_{best}$, the expected improvement acquisition function selects the candidate pruning rate combination $p$ with the largest expected improvement:

$$p_{k+1} = \arg\max_{p} \mathrm{EI}_k(p).$$

After this step, if the Bayesian optimization termination condition is satisfied, then the current pruning rate combination $p_{k+1}$ is considered as the solution of HDBOFP; otherwise, it is used to construct a posterior distribution through the Gaussian process.
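For a Gaussian posterior with mean $\mu$ and standard deviation $\sigma$ at a candidate, the expected improvement for minimization has the standard closed form $(\mathcal{L}(p_{best}) - \mu)\Phi(z) + \sigma\phi(z)$ with $z = (\mathcal{L}(p_{best}) - \mu)/\sigma$, where $\Phi$ and $\phi$ are the standard normal distribution and density functions. The sketch below evaluates this form and picks the candidate with the largest value; the function names and the candidate values are hypothetical.

```python
import math

def expected_improvement(mean, std, best):
    """Closed-form EI for minimization under a Gaussian posterior."""
    if std <= 0.0:
        return max(best - mean, 0.0)
    z = (best - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mean) * cdf + std * pdf

def next_candidate(candidates, best):
    """Pick the candidate with the largest EI.

    candidates: list of (name, posterior_mean, posterior_std) tuples.
    """
    return max(candidates,
               key=lambda c: expected_improvement(c[1], c[2], best))

# Hypothetical posterior summaries for three candidate combinations.
candidates = [("a", 0.35, 0.01), ("b", 0.28, 0.02), ("c", 0.40, 0.20)]
chosen = next_candidate(candidates, best=0.30)
```

In this toy example, candidate "c" is chosen even though its posterior mean is the worst, because its large uncertainty leaves more room for improvement; this is how the acquisition function trades exploration against exploitation.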
Despite its remarkable effectiveness, Bayesian optimization is limited by the curse of dimensionality: the Gaussian process produces poor predictions for dimensions larger than 15-20. In HDBOFP, the dimension of the objective function corresponds to the number of convolutional layers $L$; therefore, with standard Bayesian optimization, the applicability of the proposed filter pruning algorithm is restricted to CNNs with no more than 15-20 convolutional layers. A common framework used to mitigate this problem is to consider a high-dimensional Bayesian optimization task as a standard Bayesian optimization in a low-dimensional embedding, where the embedding can be either linear [1], [28] or nonlinear [29], [30]. For HDBOFP, we utilized linear embedding for high-dimensional Bayesian optimization, as in [1], which exhibited superior performance compared with other algorithms in various applications.
When using linear embedding for high-dimensional Bayesian optimization, we assume the existence of a low-dimensional linear subspace that includes all the variations in $\mathcal{L}: \mathbb{R}^{L} \to \mathbb{R}$. Specifically, let $\mathcal{L}_e: \mathbb{R}^{e} \to \mathbb{R}$ with $e \ll L$, and let $T \in \mathbb{R}^{e \times L}$ be a projection from the $L$-dimensional to the $e$-dimensional space; then the linear embedding assumption is as follows: $\mathcal{L}(p) = \mathcal{L}_e(Tp)$. Following [1], we generate the linear projection matrix $T$ by sampling $L$ points from the hypersphere $S^{e-1}$, because this sampling method empirically has a better high-dimensional Bayesian optimization performance than randomly sampling $L$ points. To prevent distortion in the linear embedding, HDBOFP constrains the optimization to points that do not project outside the bounds, by converting Equation 6 as follows:

$$p_e^{k+1} = \arg\max_{p_e \in \mathbb{R}^{e}} \mathrm{EI}_k(p_e) \quad \text{s.t.} \quad 0 \le T^{\dagger} p_e \le 1 - \epsilon,$$

where $p_e = Tp$, $T^{\dagger}$ denotes the matrix pseudo-inverse, and $[0, 1 - \epsilon]$ are the bounds on each pruning rate, with $\epsilon$ a very small number. The constraints in the above equation are all linear; thus, they form a polytope and can be addressed using off-the-shelf optimization tools. In addition, because HDBOFP utilizes the Mahalanobis kernel in the Gaussian process, the projection is entirely linear and can be effectively modeled within this constrained space. Consequently, HDBOFP can be applied to deep CNNs using high-dimensional Bayesian optimization.
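The embedding step can be sketched in plain Python as follows: the columns of the projection matrix are sampled uniformly on the hypersphere, and a point in the embedding is accepted only if its pseudo-inverse image stays inside the pruning-rate bounds. This is a simplified illustration rather than the implementation of [1]; the helper names, the seed, and the small dimensions are assumptions, and the pseudo-inverse formula used here is valid only when the projection matrix has full row rank.

```python
import math
import random

def sample_projection(e, num_layers, rng):
    """e x L matrix whose L columns are uniform points on the sphere S^(e-1)."""
    columns = []
    for _ in range(num_layers):
        g = [rng.gauss(0.0, 1.0) for _ in range(e)]
        norm = math.sqrt(sum(v * v for v in g))
        columns.append([v / norm for v in g])
    return [[columns[j][i] for j in range(num_layers)] for i in range(e)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse(m):
    """Gauss-Jordan inverse of a small square matrix."""
    n = len(m)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def pseudo_inverse(t):
    """Right pseudo-inverse T^T (T T^T)^-1, valid when T has full row rank."""
    tt = transpose(t)
    return matmul(tt, inverse(matmul(t, tt)))

def is_feasible(t_pinv, p_e, upper):
    """True if the ambient point T^+ p_e stays inside the box [0, upper]."""
    p = [sum(row[i] * p_e[i] for i in range(len(p_e))) for row in t_pinv]
    return all(0.0 <= v <= upper for v in p)

rng = random.Random(0)
T = sample_projection(e=3, num_layers=8, rng=rng)   # hypothetical sizes
T_pinv = pseudo_inverse(T)                           # shape 8 x 3
origin_ok = is_feasible(T_pinv, [0.0, 0.0, 0.0], upper=0.99)
```

In an actual optimizer, the feasibility test would be passed as linear constraints to the routine that maximizes the acquisition function, instead of being checked point by point.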

IV. EXPERIMENTAL RESULTS
To demonstrate the wide and general applicability of HDBOFP, we evaluate its performance using two image classification and one object detection benchmark datasets.
Results from extensive experiments verify the effectiveness of the proposed filter pruning method.

A. EXPERIMENTAL SETTINGS
HDBOFP aims to automatically provide a pruning rate combination, given a CNN and filter pruning criteria. In the experiments herein discussed, the pruning criteria are set to the L2-norm [4], which is based on the ''less norm, less information'' assumption and has exhibited superior performance in two major computer vision applications, namely image classification and object detection. In the proposed algorithm, we set the desired FLOPs reducing ratio τ F and the desired memory compression ratio τ M according to the hardware efficiency of state-of-the-art pruning algorithms.
In the tables in this section, HDBOFP-A and HDBOFP-B denote two pruned networks with different values of $\tau_F$ and $\tau_M$, with HDBOFP-B having lower threshold parameters than HDBOFP-A. The embedding dimension $e$ in high-dimensional Bayesian optimization is initialized as half the number of convolutional layers $L$ in the baseline model. We compare HDBOFP with previous CNN acceleration algorithms [3]-[5], [31]-[42]. For a fair comparison between pruning from scratch and pruning pre-trained models, we use the same number of training epochs to train or fine-tune the network, following [5], [31]. All the experiments were conducted with PyTorch [44] on NVIDIA RTX 3090 GPUs.

B. IMAGE CLASSIFICATION
The proposed automatic filter pruning algorithm based on high-dimensional Bayesian optimization is tested on two image classification benchmark datasets, specifically CIFAR-10 [21] and ImageNet (ILSVRC-2012) [22]. The CIFAR-10 dataset contains 50k training images and 10k testing images, providing a total of 60k 32 × 32 color images belonging to 10 different classes. The large-scale ImageNet dataset (ILSVRC-2012) comprises 1.28M training images and 50k validation images, pertaining to 1k categories. In these image classification experiments, the effectiveness of the proposed algorithm is verified using ResNet [23] with different depths. ResNet consists of several residual modules that implement skip connections; owing to this, previous studies claim that ResNet has less redundancy and is hence more difficult to compress than VGGNet [24]. Thus, we follow [10] and focus on filter pruning of the challenging ResNet.

1) COMPARISON ON CIFAR-10
The proposed method is compared with state-of-the-art network pruning algorithms using ResNet on the CIFAR-10 dataset. Table 1 reports the accuracy degradation and reduction of FLOPs of the pruned models compared to the corresponding baseline network. As can be seen, our method can substantially reduce the computational cost for a given network without significant accuracy degradation. For example, the proposed HDBOFP-B can reduce FLOPs in ResNet-110 by 62.8% with an accuracy degradation of merely 0.52%. Compared with previous pruning algorithms, our method achieves pruned networks with less computational cost but higher test accuracy. Static methods, which utilize the same pruning rate on every convolutional layer, are clearly inferior to HDBOFP; e.g., DSA [37] applied to ResNet-56 only decreases FLOPs by 52.2% and obtains a pruned network with 92.91% accuracy, while the proposed HDBOFP-A can deliver a network with an accuracy degradation of only 0.02% and a greater reduction of FLOPs of 64.5%. Our method also proves superior to existing dynamic pruning methods, which utilize a different pruning rate for each convolutional layer. For instance, ManiDP [31] implemented with ResNet-56 achieves 0.06% accuracy loss with 62.4% FLOPs pruned, which is less satisfactory than our method. We can infer that the proposed HDBOFP algorithm can adequately exploit the redundancy of networks to obtain compact but high-performing structures.

2) COMPARISON ON ImageNet (ILSVRC-2012)
For the ImageNet dataset, we test our HDBOFP on ResNet-34 and ResNet-56 with different pruning rates and without pruning the projection shortcuts for simplification, similar to [4]. As shown in Table 2, our HDBOFP achieves the best performance among the compared methods. For example, in the case of ResNet-34, FBS [32] results in a 51.2% speed-up ratio and a 1.65% top-1 accuracy drop, while our HDBOFP-A achieves a 56.1% speed-up ratio with just a 0.13% top-1 accuracy degradation. Similarly, considering ResNet-56, compared to FPGM [5], which prunes 42.2% of FLOPs with a top-5 accuracy loss, our HDBOFP-A has a lower 0.19% top-5 accuracy loss and a higher 53.4% FLOP reduction. These experimental results demonstrate that HDBOFP can produce a more compressed model with comparable or even better performance.

3) LAYER-WISE ANALYSIS
The remaining filter density of each layer of ResNet-34 on CIFAR-10 and ImageNet is illustrated in Figure 2. The peaks and troughs indicate that HDBOFP automatically prunes a 3 × 3 convolutional layer with a large pruning rate because it generally has significant redundancy; however, it prunes a more compact 1 × 1 convolutional layer with lower sparsity. We can observe that the remaining filter distribution of HDBOFP is notably different from the human experts' results shown in Table 3.8 of [17]. This suggests that the proposed automated filter pruning can fully explore the optimization space and allocate sparsity in an improved way.

TABLE 1.
Comparison of the performance of ResNet on CIFAR-10 after pruning with different methods. "Acc. ↓" is the accuracy degradation between the pruned and baseline models; the smaller, the better. "FLOPs ↓" is the reduction of FLOPs from the baseline to the pruned model; the larger, the better.

TABLE 2.
Comparison of the performance of ResNet on ImageNet after pruning with different methods. "Acc. ↓" is the accuracy degradation between the pruned and baseline models; the smaller, the better. "FLOPs ↓" is the reduction of FLOPs from the baseline to the pruned model; the larger, the better.

C. OBJECT DETECTION
We further validate the effectiveness of our automated filter pruning method using one object-detection benchmark dataset, MS-COCO 2017 [25]. This dataset contains 118k images in the training set, 5k in the validation set, and 20k in the test set, i.e., a total of 143k color images belonging to 80 object classes. To evaluate the object detection performance, we used the average precision (AP) over multiple intersection-over-union (IOU). AP@.5:.95 corresponds to the average AP for IOU from 0.5 to 0.95 with a step size of 0.05. For the MS-COCO 2017, AP is the average over 10 IOU levels on 80 categories. In these object detection experiments, we applied our automated filter pruning method on YOLOv5 [27], which is a state-of-the-art object detection model that has exhibited remarkable performance in various computer vision applications.
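The IoU levels behind AP@.5:.95 can be illustrated with a short sketch that computes the IoU of two boxes and lists the levels at which a prediction would count as a true positive. The box format (x1, y1, x2, y2), the function names, and the example boxes are assumptions made for illustration; the sketch does not compute a full AP, which additionally requires ranking detections by confidence.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# The 10 IoU levels behind AP@.5:.95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def matched_levels(pred_box, gt_box):
    """IoU levels at which a prediction counts as a true positive."""
    overlap = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if overlap >= t]

# Hypothetical prediction shifted by 2 pixels from the ground truth.
pred = (0.0, 0.0, 10.0, 10.0)
truth = (2.0, 0.0, 12.0, 10.0)
levels = matched_levels(pred, truth)
```

A detection with an IoU of about 0.67 therefore contributes at only four of the ten levels, which is why AP@.5:.95 is considerably stricter than AP at a single IoU of 0.5.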

1) COMPARISON ON MS-COCO 2017
The experimental results of the pruning of YOLOv5s on MS-COCO 2017 are presented in Table 3, in which the AP on the validation set is reported. Compared with previous work [42], our method shows notable superiority. Therefore, these experimental results verify that selecting a proper pruning rate combination via HDBOFP can make the conventional filter pruning algorithm more powerful, leading to high model efficiency with low performance degradation.

D. ABLATION STUDY
The experimental results of an ablation study analyzing the effect of the embedding method on filter pruning when applied to ResNet-18 on CIFAR-10, are summarized in Table 4.

1) INFLUENCE OF EMBEDDING
To verify the effectiveness of high-dimensional Bayesian optimization, we tested HDBOFP without an embedding method, which is equivalent to implementing the proposed algorithm with standard Bayesian optimization. As shown in Table 4, the ResNet-18 pruned with standard Bayesian optimization has a poor filter pruning performance, with an accuracy of only 88.97%. These experimental results imply that high-dimensional Bayesian optimization is required, the major reason being that the Gaussian process is known to produce poor predictions for dimensions larger than 15-20 [28], [43], [45]. Table 4 also shows the performance of different high-dimensional Bayesian optimization methods on automated filter pruning. ALEBO [1] delivered the best performance among the tested embedding methods, which also included REMBO [28] and HeSBO [43]. REMBO specifies the embedding via a random projection matrix with each element i.i.d. N(0, 1). Bayesian optimization is performed in the embedding to identify a point to be evaluated, which is then assigned an objective value. Without box bounds, REMBO comes with a strong guarantee: if the embedding dimension is at least the effective dimensionality of the objective, then the embedding contains an optimum with probability 1 [28]. Unfortunately, things become complicated when there are box bounds in the ambient space; therefore, as the proposed automated filter pruning algorithm is based on constrained Bayesian optimization, REMBO has a poorer performance than ALEBO. HeSBO avoids the challenges of REMBO related to box bounds. However, the embedding method in HeSBO is not guaranteed to contain an optimum with high probability, and such probability may be low. Relative to REMBO, HeSBO improves the ability to model and optimize the embedding but reduces the chance of the embedding containing an optimum. This experimental evidence supports the effectiveness of ALEBO in the proposed algorithm.
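The difficulty that box bounds create for a REMBO-style embedding can be illustrated with a small sketch: an embedded point is mapped to the ambient space with a random Gaussian matrix, and any coordinate that leaves the box must be clipped, so that the evaluated point no longer matches the linear map. The function names, the dimensions, the seed, and the box [0, 0.99] are assumptions made here for illustration and do not reproduce the setup of [28].

```python
import random

def rembo_matrix(num_layers, e, rng):
    """L x e random projection with i.i.d. N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(e)]
            for _ in range(num_layers)]

def project_up(matrix, y, lower, upper):
    """Map an embedded point y to the ambient space and clip it to the box.

    Returns the clipped point and the number of coordinates that had to be
    clipped, i.e. where the linear map and the evaluated point disagree.
    """
    raw = [sum(a * v for a, v in zip(row, y)) for row in matrix]
    clipped = [min(max(v, lower), upper) for v in raw]
    num_clipped = sum(1 for r, c in zip(raw, clipped) if r != c)
    return clipped, num_clipped

rng = random.Random(0)
A = rembo_matrix(num_layers=20, e=4, rng=rng)   # hypothetical sizes
point, num_clipped = project_up(A, [0.5, -0.2, 0.1, 0.3],
                                lower=0.0, upper=0.99)
```

Every clipped coordinate introduces a nonlinearity that the Gaussian process in the embedding has to model, which is the kind of distortion that the linear constraints used by ALEBO are intended to avoid.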

V. CONCLUSION
In this study, we propose a novel automated filter pruning method for deep CNN compression, named HDBOFP. Unlike conventional methods, HDBOFP explicitly considers the difference between layers and adaptively selects a specific pruning rate for each of them. To efficiently and effectively produce a pruning rate combination, HDBOFP approximates the expected accuracy degradation given a pruning rate combination without incurring a time-consuming retraining process. In addition, the proposed method utilizes high-dimensional Bayesian optimization to solve the non-convex non-differentiable minimization problem, which provides the optimal high-dimensional combinatorial distribution of pruning rates. Extensive experiments on image classification and object detection demonstrated that HDBOFP outperforms conventional filter pruning methods, delivering a higher compression ratio while better preserving accuracy and reducing the demand for human labor. In future work, we may consider utilizing more efficient Bayesian optimization algorithms, such as those based on parallel processing and sparse approximation.