An Ordered Aggregation-Based Ensemble Selection Method of Lightweight Deep Neural Networks With Random Initialization

Due to the popularity of 5G connectivity and Internet of Things sensors, deep learning algorithms are being extended to edge devices. Compared with AI (Artificial Intelligence) cloud platforms, the deployment of deep neural networks on edge devices must focus on low power consumption, low latency, stability and reliability. In recent years, the development of lightweight deep neural network architectures has provided a basis for the deployment of deep neural networks on edge devices. However, the shortcomings of deep neural networks, such as overconfidence, vulnerability to adversarial attacks, and a tendency to overfit when samples are insufficient, still limit their applications in many fields. One way to compensate for these defects is to use deep ensembles. An ordered aggregation-based ensemble selection algorithm is proposed, which uses soft-margin as the importance assessment metric to take full advantage of the diversity and complementarity of lightweight deep neural networks trained from different initializations, so as to improve the overall performance across multiple edge devices. The experimental results show that this algorithm significantly improves generalization performance compared with random ensemble and with ordered aggregation algorithms based on accuracy or diversity, and provides a new complementary idea for the deployment of lightweight deep neural networks on edge devices.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Mahdi Zareei.
At present, deep learning models with multi-layer processing architectures show better performance than shallow or traditional classification models. With the development of 5G connectivity and IoT (Internet of Things) sensors, deep learning algorithms are expanding to edge devices, where the deployment of DNNs (Deep Neural Networks) must focus on low power consumption, low latency, stability and reliability. In recent years, the development of lightweight deep neural network architectures has provided a basis for the deployment of DNNs on edge devices. However, the shortcomings of DNNs, such as overconfidence, vulnerability to adversarial attacks, and a tendency to overfit when samples are insufficient, still limit their applications in many fields. One way to compensate for these defects is to use deep ensembles. Deep ensemble learning combines the advantages of deep learning and ensemble learning to improve generalization performance and robustness by training multiple models and aggregating their predictions. The members of a good ensemble should be both accurate and error-independent. The loss surfaces of DNNs are non-convex and depend on millions of parameters, and the geometry of these loss surfaces is not well understood. Even for simple networks, the number of local optima and saddle points is large and can grow exponentially in the number of parameters [1], [2]. Moreover, the loss is high along a line segment connecting two optima [3], [4]. These two observations suggest that the local optima are isolated. Meanwhile, SGD (Stochastic Gradient Descent) [5] and its variant Adam [6] are the most common optimization algorithms for training DNNs.
The random noise of mini-batch data sampling in SGD-like algorithms and the random initialization of deep neural networks, coupled with the existence of many local minima in high-dimensional optimization problems, mean that DNNs trained with different random seeds can converge to very different local minima even though they have similar error rates [7], [8], [9]. That is, DNNs trained with different random seeds usually do not make the same errors on the test set; even if the models have converged, they may produce inconsistent predictions for the same input [10]. In theory, it is therefore feasible to ensemble DNNs trained with different initializations to improve prediction performance. Furthermore, the individual models in an ensemble can be distributed to multiple end devices, which can further speed up inference and potentially simplify the design of specialized hardware.
MobileNetV2 is a classic lightweight deep neural network architecture designed to perform well on mobile devices. In this paper, 30 MobileNetV2 models with different initializations were trained on the training sets of the CIFAR-10 and CIFAR-100 datasets. Then 2, 3, 4, ..., 29, 30 models were randomly selected from the 30 models to form ensembles, and the accuracy of each ensemble was calculated on the test set. This process was repeated 100 times. In Fig. 1(a) and Fig. 1(b), the solid line represents the mean of the 100 random ensemble results at each ensemble scale, and the shaded area represents their range of variation. The spread shows that random ensemble results still leave considerable room for optimization.
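The random ensemble experiment described above can be sketched as follows. This is a minimal illustration and not the authors' code: the synthetic probability outputs produced by `make_toy_pool` stand in for trained MobileNetV2 models, combining members by averaging their class probabilities is an assumption, and all function names are hypothetical.

```python
import random


def make_toy_pool(n_models, n_samples, n_classes, seed=0):
    """Build synthetic probability outputs; a stand-in for trained networks."""
    rng = random.Random(seed)
    labels = [rng.randrange(n_classes) for _ in range(n_samples)]
    pool = []
    for _ in range(n_models):
        outputs = []
        for y in labels:
            scores = [rng.random() for _ in range(n_classes)]
            scores[y] += 0.5  # bias each toy model toward the true label
            total = sum(scores)
            outputs.append([s / total for s in scores])
        pool.append(outputs)
    return pool, labels


def ensemble_accuracy(pool, labels, members):
    """Accuracy of the averaged (soft-voted) predictions of the chosen members."""
    hits = 0
    for i, y in enumerate(labels):
        n_classes = len(pool[0][i])
        avg = [sum(pool[m][i][c] for m in members) / len(members)
               for c in range(n_classes)]
        if avg.index(max(avg)) == y:
            hits += 1
    return hits / len(labels)


def random_ensemble_stats(pool, labels, size, repeats=100, seed=0):
    """Mean, minimum and maximum accuracy over repeated random ensembles."""
    rng = random.Random(seed)
    accs = []
    for _ in range(repeats):
        members = rng.sample(range(len(pool)), size)
        accs.append(ensemble_accuracy(pool, labels, members))
    return sum(accs) / len(accs), min(accs), max(accs)


pool, labels = make_toy_pool(n_models=30, n_samples=200, n_classes=10)
for size in (2, 5, 10):
    mean_acc, low, high = random_ensemble_stats(pool, labels, size)
    print(size, round(mean_acc, 3), round(low, 3), round(high, 3))
```

The gap between the minimum and maximum at a given size corresponds to the shaded band in Fig. 1, which is the room that a selection method tries to exploit.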
In this paper, the selective ensemble of lightweight DNNs based on ordered aggregation is studied, and an ordered aggregation algorithm based on soft-margin is proposed to improve ensemble performance. The experimental results show that the ensemble model obtained by this algorithm significantly improves generalization performance compared with random ensemble and with ordered aggregation algorithms based on accuracy and diversity. It makes an ensemble of multiple lightweight DNNs perform better at the same computational cost. The algorithm is simple to use and does not require architectural tuning, which provides a new idea for the ensemble design and deployment of lightweight DNNs. The main contributions of this study are as follows:
1) The ensemble results of lightweight DNNs obtained by random initialization training are analyzed at different ensemble scales, and it is proposed that the diversity and complementarity brought by random initialization can be fully utilized to optimize the final ensemble performance.
2) According to the margin theory, soft-margin is proposed as an importance assessment metric for selecting the base lightweight DNNs.
3) Greedy ordered aggregation algorithms based on the soft-margin, accuracy and diversity assessment metrics are proposed and compared with random ensemble; the algorithm based on soft-margin achieves the best generalization performance.
The rest of this article is organized as follows. The second section introduces the related work on deep neural network ensemble and selective ensemble methods. In the third section, the ordered aggregation-based ensemble selection method of lightweight DNNs with random initialization is given in detail. In the fourth section, the experiment is carried out and the results of ordered aggregation method based on different metrics are analyzed objectively. The fifth section is the conclusion of this article.

II. RELATED WORK
A. DEEP NEURAL NETWORK ENSEMBLE
In implicit ensembles, the parameters of the models are shared, and the single unthinned network at test time approximates the ensemble averaging. However, explicit ensembles do not share model parameters, and the ensemble output is a combination of the predictions of the ensemble models using different approaches, such as majority vote, averaging and so on.
During the training of the network, dropout [11] removes hidden nodes from the network in order to create an ensemble network. During the testing phase, all nodes are active. Using dropout, the network is regularized to avoid overfitting, and the output vectors become sparse. Dropconnect [12] provides a generalization of dropout. It randomly drops each connection, unlike dropout, which drops output units. This causes sparsity in the weight parameters of the model. Both dropout and dropconnect require a lot of training time. As a solution, deep networks with stochastic depth [13] were designed to reduce the network depth during training while keeping it unchanged during testing. Stochastic depth is an improvement on ResNet [14], where residual blocks are removed randomly during training and these transformation blocks are bypassed via skip connections. Swapout [15] drops individual units or skips randomly through blocks, and is a generalization of dropout and stochastic depth.
VOLUME 10, 2022
All the aforementioned methods create an ensemble of networks by sharing the weights. Some researchers have explored explicit ensembles that do not share weights between models. Huang et al. [16] exploit good and bad local minima by letting SGD converge M times to local minima along the optimization path and taking a snapshot each time the model reaches a minimum; the ensemble output for object recognition is the average of the snapshot models' outputs. The training time of the ensemble model is the same as that of a single model. The random vector functional link network [17] has also been explored to create explicit ensembles, in which different random initializations of the hidden layer weights in the hierarchy diversify the ensemble predictions.
Fast Geometric Ensembling (FGE) [18] shows that, using cyclic learning rates, it is possible to collect models that are spatially close to each other but produce different predictions. The collected models are then ensembled, with no computational overhead compared with training a single DNN. An effective method of Bayesian neural network model averaging is also discussed in [19]. SWA (Stochastic Weight Averaging) [20] is inspired by the FGE scheme. Its purpose is to find a single model that approximates the FGE ensemble while providing stronger interpretability, convenience and scalability during testing.
The above DNN ensemble methods mainly consider how to reduce training costs and do not screen the base models, so they are essentially random ensembles. As can be seen in Fig. 1, random ensemble results vary over a wide range, and it is difficult to ensure optimal results. This paper explores the selective ensemble of different base models obtained by random initialization, and makes full use of the diversity and complementarity brought by the random initialization of DNNs to obtain better ensemble performance, so that lightweight DNNs achieve better overall performance when deployed on multiple edge devices.

B. SELECTIVE ENSEMBLE
To achieve the desired generalization performance, an ensemble learning algorithm usually generates a large number of base models to form an ensemble system. However, including more base models does not necessarily improve the generalization performance of the ensemble system, for the following reasons: (1) Some generated models may have lower accuracy, and their participation will reduce the generalization performance of the final ensemble system. (2) Some generated models may be similar to each other, that is, they usually give the same results for the same samples, and ensembling similar models will not improve generalization performance. (3) Ensembling a large number of models requires considerable storage and computing overhead and reduces the prediction speed of the ensemble system.
For the above reasons, the base models need to be screened. Ensemble selection chooses an approximately optimal subset from the initial base model pool according to some performance evaluation metric, and uses that subset as the final ensemble system. In the past ten years, scholars have carried out a series of studies and proposed many ensemble selection algorithms. In general, these algorithms can be divided into three categories: (1) Ensemble selection based on ordered aggregation [21], [22], [23]; (2) Ensemble selection based on clustering algorithms [24], [25], [26]; (3) Ensemble selection based on optimization [27], [28], [29]. Optimization-based algorithms can usually select an ideal subset, but at the cost of significant computational and time overhead, especially for ensembles of DNNs. Therefore, this paper mainly discusses the ensemble selection method based on ordered aggregation.
Ensemble methods in traditional ensemble learning mostly use simple base models such as decision trees, and mostly combine them by majority voting. These methods are not fully applicable to DNNs: the DNN base model is more complex, and its output takes the form of probabilities. This work therefore studies ordered aggregation based on three importance assessment metrics and compares them with random ensemble; the method based on soft-margin gives the best performance.

III. ORDERED AGGREGATION-BASED ENSEMBLE SELECTION OF LIGHT DNNs WITH RANDOM INITIALIZATION
A. ORDERED AGGREGATION AND IMPORTANCE ASSESSMENT
The basic idea of ensemble selection methods based on ordered aggregation is as follows. First, the performance of each base model in the initial ensemble system is evaluated separately according to some importance assessment metric. Then, the base models are reordered according to the assessment values obtained, giving a new model sequence in which the base models evaluated as performing well are ranked at the front. Finally, some strategy is used to select the top k (0 < k < T) models from the new sequence of T base models to form the final ensemble subset.
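The three steps above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the validation accuracies used as importance values are made-up numbers rather than results from the paper.

```python
def order_models(importance):
    """Return model indices sorted from most to least important."""
    return sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)


def select_top_k(importance, k):
    """Keep the first k models of the reordered sequence, with 0 < k < T."""
    if not 0 < k < len(importance):
        raise ValueError("k must satisfy 0 < k < T")
    return order_models(importance)[:k]


# Hypothetical validation accuracies of T = 5 base models.
val_acc = [0.91, 0.94, 0.89, 0.93, 0.92]
print(select_top_k(val_acc, 3))  # [1, 3, 4]
```

Ranking each model on its own score is the simplest form of ordered aggregation; it ignores how the selected models interact once they are combined.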
For an ordered aggregation-based ensemble selection algorithm, the importance assessment metric determines the performance the algorithm can achieve. Accuracy and diversity are the two most commonly used importance assessment metrics. To measure ensemble diversity, a classical approach is to measure the pairwise similarity or dissimilarity between two learners, and then average all the pairwise measurements to obtain the overall diversity. Representative pairwise metrics include Disagreement, Q-Statistic, Correlation Coefficient, Kappa-Statistic and Double-Fault. Non-pairwise measures try to assess the ensemble diversity directly, rather than by averaging pairwise measurements. Representative non-pairwise metrics include Kohavi-Wolpert Variance, Interrater Agreement, Entropy, Difficulty, Generalized Diversity and Coincident Failure. Although there are many diversity metrics, the exact form and measurement of diversity remains an open problem, and optimizing existing diversity metrics cannot guarantee that the learner obtains good generalization performance.
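As one concrete instance of a pairwise metric, the Disagreement measure can be sketched as below: for two models it is the fraction of examples on which exactly one of them is correct, and the ensemble value is the average over all pairs. The correctness table is hypothetical toy data, and the paper does not state which diversity metric its experiments use, so this is an illustration only.

```python
from itertools import combinations


def disagreement(correct_a, correct_b):
    """Fraction of examples on which exactly one of the two models is correct."""
    differing = sum(1 for a, b in zip(correct_a, correct_b) if a != b)
    return differing / len(correct_a)


def mean_pairwise_disagreement(correct):
    """Average the pairwise disagreement over all pairs of models."""
    pairs = list(combinations(range(len(correct)), 2))
    return sum(disagreement(correct[i], correct[j]) for i, j in pairs) / len(pairs)


# Hypothetical per-example correctness of three models on four examples.
correct = [
    [True, True, False, True],
    [True, False, True, True],
    [True, True, False, True],
]
print(disagreement(correct[0], correct[1]))   # 0.5
print(mean_pairwise_disagreement(correct))
```

Models 0 and 2 make identical errors, so they contribute zero disagreement, which is the situation reason (2) in Section II-B warns about.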

B. MARGIN THEORY
Schapire et al. [30] introduced the margin-based explanation of AdaBoost. Formally, in the context of binary classification, i.e., $f(x) \in \{-1, +1\}$, the margin of the classifier $h$ on the instance $x$ is defined as $f(x)h(x)$, and the margin of the ensemble $H(x) = \sum_{i=1}^{T} \alpha_i h_i(x)$ is
$$f(x)H(x) = \sum_{i=1}^{T} \alpha_i f(x) h_i(x). \tag{1}$$
Based on the definition of the margin, Schapire et al. [30], Breiman [31], and Gao and Zhou [32] successively gave upper bounds on the generalization error of the ensemble model.
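Theorem 1 (Schapire et al. [30]): For any $\delta > 0$ and $\theta > 0$, with probability at least $1 - \delta$ over the random choice of sample $S$ with size $m$, every voting classifier $f \in C(H)$ satisfies the following bound:
$$\Pr_D[yf(x) < 0] \le \Pr_S[yf(x) \le \theta] + O\left(\frac{1}{\sqrt{m}}\left(\frac{\ln m \ln |H|}{\theta^2} + \ln\frac{1}{\delta}\right)^{1/2}\right), \tag{2}$$
where $\Pr_D$ and $\Pr_S$ denote probabilities over the underlying distribution $D$ and over the sample $S$, respectively.
Theorem 2 (Breiman [31]): For any $\delta > 0$, with probability at least $1 - \delta$ over the random choice of sample $S$ with size $m$, every voting classifier $f \in C(H)$ satisfies the following bound:
$$\Pr_D[yf(x) < 0] \le R\left(\ln(2m) + \ln\frac{1}{R} + 1\right) + \frac{1}{m}\ln\frac{|H|}{\delta}, \tag{3}$$
where $\theta = \hat{y}_1 f(\hat{x}_1) > 4\sqrt{2/|H|}$ is the minimum margin over the sample and $R = \frac{32\ln(2|H|)}{m\theta^2} \le 2m$.
Theorem 3 (Gao and Zhou [32]): For any $\delta > 0$, with probability at least $1 - \delta$ over the random choice of sample $S$ with size $m \ge 5$, every voting classifier $f \in C(H)$ satisfies the following bound:
$$\Pr_D[yf(x) < 0] \le \frac{2}{m} + \inf_{\theta \in (0,1]}\left[\Pr_S[yf(x) < \theta] + \frac{7\mu + 3\sqrt{3\mu}}{3m} + \sqrt{\frac{3\mu}{m}\Pr_S[yf(x) < \theta]}\right], \tag{4}$$
where $\mu = \frac{8}{\theta^2}\ln m \ln(2|H|) + \ln\frac{2|H|}{\delta}$.
In 2019, A. Grønlund, L. Kamma et al. published a paper at the NeurIPS conference [33] proving that the bound of Gao and Zhou is almost the tightest possible upper bound on the generalization error: it can be improved by at most a logarithmic factor. Because this upper bound nearly matches the lower bound, a substantially better result is theoretically impossible.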
The margin theory is a very effective theoretical tool for analyzing the generalization performance of ensemble models. From Eqs. (2)-(4), when other variables are fixed, the larger the margin over the training examples, the better the generalization performance. Therefore, if a base model contributes more to increasing the margin of the ensemble model on the training samples, it is more conducive to improving the generalization performance of the ensemble model. Inspired by this theory, this paper uses margin as an importance assessment metric for the ordered aggregation of individual DNNs and compares it with the accuracy and diversity assessment metrics.
Because the final output of a deep convolutional neural network is a vector of class probabilities (or confidences), the soft voting method is usually used to combine the base models. Each individual classifier $h_i$ outputs an $L$-dimensional vector $(h_i^1(x), \cdots, h_i^L(x))^T$ for the example $x$, and $v_j$ denotes the prediction probability for the $j$-th label on example $x_i$ combined by soft voting, with label $y \in \{1, 2, \cdots, L\}$.
Based on the above theoretical analysis and the ensemble method of DNNs, the soft-margin of the given example x i can be written as 1 M v
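A per-example soft-margin under soft voting can be sketched as follows. The exact formula is an assumption here: the sketch takes the soft-voted probability of the true label minus the largest soft-voted probability among the other labels, with soft voting implemented as a plain average over the M base models. The function names and the probability values are hypothetical.

```python
def soft_vote(probs_per_model):
    """Average the probability vectors that M models assign to one example."""
    m = len(probs_per_model)
    n_classes = len(probs_per_model[0])
    return [sum(p[j] for p in probs_per_model) / m for j in range(n_classes)]


def soft_margin(probs_per_model, label):
    """Assumed soft-margin: true-label vote minus the best competing vote."""
    v = soft_vote(probs_per_model)
    best_other = max(v[j] for j in range(len(v)) if j != label)
    return v[label] - best_other


# Hypothetical outputs of M = 2 models for one example with L = 3 labels.
outputs = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
print(soft_margin(outputs, label=0))  # positive: the ensemble is correct
print(soft_margin(outputs, label=1))  # negative: the ensemble would be wrong
```

Under this assumed definition the value lies in [-1, 1], is positive exactly when the soft-voted prediction is correct, and grows as the ensemble becomes more confident in the true label.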

C. THE ORDERING AGGREGATION OF GREEDY HEURISTICS BASED ON DIFFERENT IMPORTANCE ASSESSMENT METRICS
The selection of an optimal subensemble from a given base model pool is a difficult combinatorial optimization problem. With limited computational resources, only approximate solutions are accessible for ensembles of realistic size. Reference [22] pointed out that the generalization performance of an ensemble cannot be improved by pruning techniques based only on individual attributes of the ensemble members, and confirmed by exhaustive search that the greedy ordering heuristics devised there can efficiently identify near-optimal subensembles of increasing size.
In this paper, the ordered aggregation of lightweight DNNs based on a greedy algorithm is studied. First, the most accurate base model on the validation set is selected. Then, in each round, the base model among the remaining ones that gives the ensemble the best accuracy, diversity or soft-margin assessment is added to the ensemble. For details of the greedy ordered aggregation algorithm, see Table 1. Compared with the random ensemble method of DNNs, the ensemble selection method needs to train more base models, but more base models bring more diversity and complementarity, which can improve the overall ensemble performance. This work studies three importance assessment metrics for lightweight DNN selection and finds that the soft-margin-based method proposed in this paper selects better base models than the methods based on accuracy and diversity. Compared with random ensemble, it significantly improves generalization performance. This supports the validity of the margin theory and provides a basis for further research on soft-margin in lightweight DNN ensembles.
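The greedy procedure described above can be sketched as follows. This is a hedged illustration rather than a reproduction of Table 1: the soft-margin formula (true-label vote minus best competing vote, averaged over the validation examples) is an assumption, only the accuracy and soft-margin criteria are shown, and the function names and toy probabilities are hypothetical.

```python
def ensemble_probs(probs, members):
    """Soft voting: average the members' class probabilities per example."""
    n_samples, n_classes = len(probs[0]), len(probs[0][0])
    return [[sum(probs[m][i][c] for m in members) / len(members)
             for c in range(n_classes)] for i in range(n_samples)]


def accuracy(probs, members, labels):
    """Fraction of examples whose soft-voted prediction matches the label."""
    avg = ensemble_probs(probs, members)
    hits = sum(1 for p, y in zip(avg, labels) if p.index(max(p)) == y)
    return hits / len(labels)


def mean_soft_margin(probs, members, labels):
    """Assumed criterion: average of (true-label vote - best other vote)."""
    avg = ensemble_probs(probs, members)
    total = 0.0
    for p, y in zip(avg, labels):
        total += p[y] - max(v for j, v in enumerate(p) if j != y)
    return total / len(labels)


def greedy_order(probs, labels, metric):
    """Start from the most accurate model, then add the best candidate each round."""
    remaining = list(range(len(probs)))
    first = max(remaining, key=lambda m: accuracy(probs, [m], labels))
    order = [first]
    remaining.remove(first)
    while remaining:
        best = max(remaining, key=lambda m: metric(probs, order + [m], labels))
        order.append(best)
        remaining.remove(best)
    return order


# Hypothetical validation outputs: 3 models, 3 examples, 2 classes.
labels = [0, 1, 1]
probs = [
    [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]],
    [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]],
    [[0.8, 0.2], [0.1, 0.9], [0.6, 0.4]],
]
order = greedy_order(probs, labels, mean_soft_margin)
print(order)       # [0, 2, 1]
print(order[:2])   # the size-2 subensemble is the first two models in the order
```

Because each round evaluates the candidate together with the models already chosen, the ordering rewards complementarity and not only individual quality; a subensemble of size k is simply the first k entries of the returned order.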

A. EXPERIMENTAL ENVIRONMENT AND CONFIGURATION
In the experiments, the public datasets CIFAR-10 and CIFAR-100 are used to test the proposed method. In order to reduce the size and computation cost of the ensemble model, the classic lightweight DNN MobileNetV2 is used as the base model for training. Four groups of experiments were carried out, as shown in Table 2, and each group was conducted 50 times. In order to reduce the random influence caused by different distributions of the validation set and test set, the test set in the original dataset is randomly divided into a validation set and a test set. CIFAR-10 and CIFAR-100 each contain 60,000 samples in total, of which 50,000 form the training set of the original dataset. The training set samples were used to train multiple randomly initialized lightweight DNNs. The purpose of this paper is to make full use of the diversity brought by random initialization, so the samples in the training set did not change in the experiment.
The remaining 10,000 samples were divided into a validation set with 5,000 samples and a test set with 5,000 samples. The ensemble selection algorithm used the validation set to screen base models. Since the validation set has a great impact on the results of base model selection, in order to ensure the reliability of the experimental results and the effectiveness of the proposed method, in each experiment the validation set and test set samples were mixed and then randomly re-divided at a ratio of 1:1. In total, 50 experiments were conducted, and the validation set and test set differed in each experiment. Very common hyperparameter settings were used to train the models, because the difference in accuracy between ensemble models deserves more attention than the accuracy of a single model. The initial learning rate was 0.1 with warm-up and was divided by 5 at the 60th, 120th and 160th epochs; the models were trained for 200 epochs with a batch size of 256, a weight decay of 5e-4, and a Nesterov momentum of 0.9.
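The repeated 1:1 re-division of the 10,000 held-out samples can be sketched as below. The function name and the use of the run index as the random seed are assumptions made for illustration; the paper does not state how the randomness was controlled.

```python
import random


def split_holdout(n_samples, seed):
    """Shuffle hold-out indices and split them 1:1 into validation and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    half = n_samples // 2
    return indices[:half], indices[half:]


# One (validation, test) index pair per experiment; the seed choice is hypothetical.
splits = [split_holdout(10_000, seed=run) for run in range(50)]
val_idx, test_idx = splits[0]
print(len(val_idx), len(test_idx))  # 5000 5000
```

Selection is then run on the validation indices and accuracy is reported on the test indices of the same run, so no example is used for both selecting and scoring within one experiment.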
Thirty MobileNetV2 models with different initializations were trained separately on the training sets of CIFAR-10 and CIFAR-100 using the same hyperparameter settings. Then, the algorithm in Table 1 and the random ensemble method were used to build ensemble models of different sizes, and the accuracy of each ensemble on the test set was calculated at each scale. After that, the validation set and test set were randomly re-divided and the above ensemble selection steps were repeated, 50 times in total, and the results of the 50 ensembles at each scale were collected. Fig. 2(a) and Fig. 2(b) show the average accuracy of the random ensemble method and of the ordered aggregation methods based on accuracy, diversity and soft-margin at each ensemble scale. For CIFAR-10, since the accuracy of a single base model is already relatively high, the diversity among the base models is smaller, and Fig. 2(a) shows that screening base models by diversity is not as good as screening by accuracy. For CIFAR-100, because the accuracy of a single base model is not high, the diversity among the base models is relatively large, and Fig. 2(b) shows that screening base models by diversity is slightly better than screening by accuracy. However, the accuracy and diversity assessment metrics cannot fully guarantee the overall generalization performance of the ensemble model, so the ensemble performance of these two methods is not significantly improved compared with the random ensemble method. In the range where a large ensemble gain can be achieved, that is, when the number of base models is about ten or fewer, the soft-margin method proposed in this paper achieves the best performance both on CIFAR-10, where single base models have high accuracy, and on CIFAR-100, where single base models have low accuracy.
In order to compare the different methods more comprehensively, boxplots of the prediction results of the random ensemble method and of the ordered aggregation methods based on accuracy, diversity and soft-margin were drawn for ensemble scales 2-10, as shown in Fig. 3 and Fig. 4.
A boxplot is a statistical graph that describes the dispersion of a set of data, and it can reflect both the overall quality and the stability of different algorithms. Fig. 3 and Fig. 4 show that the IQR (interquartile range) differences between the ordered aggregation methods and the random ensemble method are not obvious, which indicates that the four algorithms differ little in dispersion when the validation set and test set are randomly divided. Over 50 runs with 2-10 ensembled base models, on the CIFAR-10 dataset the maximum, upper quartile (Q3), median, lower quartile (Q1) and minimum of the soft-margin-based ordered aggregation algorithm are greater than those of the other three algorithms, except for the maximum in the ensemble of 5 models; on the CIFAR-100 dataset the same holds except for the median in the ensemble of 2 models and the maximum in the ensembles of 6-10 models. This shows that, at the same ensemble scale, the soft-margin-based ordered aggregation algorithm can select the subensemble with the best generalization performance. The specific statistics are given in Table 3 and Table 4, where the best values are shown in bold.
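The boxplot statistics discussed above can be computed from the per-run accuracies as sketched below. The accuracy values are made up, the function name is hypothetical, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, since plotting tools differ in how they interpolate quartiles.

```python
import statistics


def five_number_summary(accs):
    """Minimum, Q1, median, Q3, maximum and IQR of a list of accuracies."""
    q1, median, q3 = statistics.quantiles(accs, n=4, method="inclusive")
    return {
        "min": min(accs),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(accs),
        "iqr": q3 - q1,
    }


# Hypothetical test accuracies of one method over five runs.
run_accs = [0.90, 0.91, 0.92, 0.93, 0.94]
summary = five_number_summary(run_accs)
print(summary["median"], summary["iqr"])
```

Comparing these summaries method by method at a fixed ensemble size mirrors the row-wise comparison reported in Table 3 and Table 4.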

V. CONCLUSION
This paper proposes a soft-margin-based selective ensemble method for lightweight DNNs. The ensemble selection strategy rests on the argument that the soft-margin of the ensemble model on the validation set is effective for building an ensemble model with stronger generalization ability. First, the most accurate base model on the validation set is selected; then the greedy ordered aggregation algorithm sequentially adds the base model that maximizes the soft-margin of the ensemble, forming a new, larger ensemble model. The method is compared with the ensemble selection methods based on accuracy and diversity and with the random ensemble method. The experimental results on the CIFAR-10 and CIFAR-100 datasets show that, before the ensemble gain begins to converge, the soft-margin-based ensemble selection method achieves the best generalization results among the four methods, regardless of whether the base models have high or low accuracy.
Compared with the random ensemble method of deep neural networks, the method in this paper needs to train more base models, which consume more training resources. However, more base models bring more diversity and complementarity, which can improve the overall ensemble performance. Therefore, the method in this paper is well suited to the ensemble selection of lightweight models, but it may not be suitable for large models with very high training costs.
As a next step, the research team intends to further explore how to combine soft-margin with diversity to improve the generalization performance of lightweight deep ensemble models. The analysis and experiments in this paper suggest that a proper combination of soft-margin and diversity may achieve good results, which may provide new research ideas for ensemble selection.