Auto-Ensemble: An Adaptive Learning Rate Scheduling based Deep Learning Model Ensembling

Ensembling deep learning models is a shortcut to deploying them in new scenarios, since it avoids tuning network architectures, losses and training algorithms from scratch. However, it is difficult to collect sufficiently accurate and diverse models in a single training run. This paper proposes Auto-Ensemble (AE), which collects checkpoints of a deep learning model and ensembles them automatically through an adaptive learning rate scheduling algorithm. The advantage of this method is that it makes the model converge to various local optima by scheduling the learning rate within one training run. When the number of local optimal solutions found tends to saturate, all collected checkpoints are used for the ensemble. Our method is universal and can be applied in various scenarios. Experimental results on multiple datasets and neural networks demonstrate that it is effective and competitive, especially on few-shot learning. In addition, we propose a method to measure the distance among models, with which we can ensure the accuracy and diversity of the collected models.


Introduction
Optimizing a network structure and loss function is an NP-hard process [1]. To enhance the generalization capability of models, different network structures have been designed for different scenarios; manually designed architectures are therefore often highly task-specific. For a new task, the network structure often has to be redesigned or deeply optimized to maintain generalization performance, which consumes a large amount of manpower and computing resources. Neural Architecture Search (NAS) was therefore proposed as a new way to construct powerful models: it searches the space of network structures automatically and frees up expert time. However, NAS consumes a large training budget to find the best network structure, and it cannot guarantee the performance and generalization of the model with respect to the loss function and training algorithm. As a result, it is mainly applied to routine supervised learning with large amounts of labeled data [2,3,4,5] and rarely appears in other machine learning fields such as semi-supervised learning or few-shot learning. Fine-tuning can also improve generalization, but it requires significant expertise. None of these methods is universal enough. For the same problem, ensemble learning is widely used to improve accuracy and generalization in machine learning applications [6]. Traditional ensemble methods such as Random Forest and AdaBoost can hardly extract features for a given task, and their feature engineering relies heavily on manual selection. To avoid a huge training budget and complicated feature engineering, this paper provides a simple, automatic, deep-learning-based ensemble method, called Auto-Ensemble (AE), to improve performance and generalization. The key idea is to schedule the learning rate so as to automatically collect checkpoints of a model and ensemble them within one training run.
Adaptive ensembles such as AdaNet make use of NAS models and automatically search over a space of candidate ensembles [7]. Auto-Ensemble (AE) differs from AdaNet but is also an automatic search process: by scheduling the learning rate, AE searches the loss surface to collect model checkpoints for the ensemble. Compared to AdaNet, it requires fewer computing resources, yields considerable improvement and is easier to implement.
AE uses an adaptive cyclic learning rate strategy. A cyclic learning rate exploits SGD's ability to avoid or even escape false saddle points and local minima: by simply scheduling the learning rate, the model can converge to a better local optimum within a shorter iteration period [8]. However, a cyclic learning rate requires many manually set hyperparameters and cannot guarantee the diversity of the collected checkpoints. We therefore use an adaptive learning rate strategy with fewer hyperparameters, which automatically collects as many checkpoints as possible with high accuracy and diversity in a single training run. To ensure the diversity of the checkpoints collected in each cycle, we propose a method to measure the distance between checkpoints, which ensures that the model converges to different local optima as training continues.
The main contributions of this paper are: an easy-to-implement methodology for ensembling models automatically in one training run. In addition to traditional supervised learning, our experiments on few-shot learning demonstrate that Auto-Ensemble applies to other deep learning scenarios.
We propose a method to measure the diversity among models, with which we can ensure the accuracy and diversity of models during training.
Our experiments demonstrate the efficiency of Auto-Ensemble: the accuracy of the classifier is significantly improved and greatly exceeds that of a single model, which greatly reduces the workload of manually designing and optimizing network models.
We organize the paper as follows: we first describe the significance and give an overview of the Auto-Ensemble method. The following section briefly reviews related work. Section 3 explains each part of the Auto-Ensemble method in detail. Section 4 presents our experimental setup and results. Section 5 concludes with the advantages of the method and future work.

Related Work
Ensemble learning techniques for classification have demonstrated a powerful capacity to improve upon the accuracy of a base learning algorithm [6]. A common feature of these approaches is to obtain multiple classifiers by repeatedly applying a basic learning algorithm to the training data. To classify a new sample, the results of each classifier are aggregated, e.g. by voting, into a final classification, which typically achieves significantly better performance than an individual learner [9]. In some scenarios, an ensemble of simple models achieves results comparable to complex models while greatly reducing the computational cost.
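As a concrete illustration of the voting aggregation described above, a minimal majority-vote combiner might look like the following (a generic sketch, not code from this paper):

```python
import numpy as np
from collections import Counter

def majority_vote(predictions):
    """predictions: (num_classifiers, num_samples) array of predicted labels.
    Returns the per-sample majority label (ties go to the label seen first)."""
    preds = np.asarray(predictions)
    return np.array([Counter(col).most_common(1)[0][0] for col in preds.T])
```

For example, with three classifiers voting on two samples, each sample receives the label that most classifiers agree on.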
Training a deep neural network means minimizing a loss function over a corpus of feature vectors and accompanying labels. Li et al. [10] proposed a method for visualizing the loss surface of deep neural networks and found that the more complex the network, the more chaotic the loss surface. It has also been found that large loss surfaces contain many local optima [11,10]. Ensemble learning makes full use of these different local minima [12].
By scheduling the learning rate, a model can converge to different optimal solutions; once it encounters a saddle point during training, it can quickly jump out by increasing the learning rate. Work on cyclic learning rates (CLR) shows that a CLR schedule makes convolutional neural network training more efficient and eliminates the need for numerous experiments to find the best values and schedule [13].
Research shows that gathering the outputs of neural networks from different epochs at the end of training can stabilize final predictions [14]. Checkpoint ensembles provide a method to collect abundant models within a single training process [15]; this greatly shortens the training time that an ensemble requires and achieves better improvements than traditional ensembles.
Snapshot Ensemble combines the cyclic learning rate with checkpoint ensembling: it adopts a warm restart method [12][16][17], where in each restart the learning rate is initialized to some value and scheduled to decrease following a cosine function. At the end of each cycle, a snapshot of the model is saved, so multiple snapshot models can be collected in one training run, greatly reducing the training budget. Wen et al. [17] proposed a new Snapshot Ensemble method and a log-linear learning rate test; it combines Snapshot Ensemble with an appropriate learning rate range and outperforms the original methods [12][13]. FGE [18] proposed a method for quickly collecting models, which finds paths between two local optima along which the training loss and test error remain low. A smaller cycle length and a simpler learning rate curve can then be used to collect models with high accuracy and diversity along the learning curve.
Since ensemble accuracy depends on the number, diversity, and accuracy of the individual models, adaptive ensembling aims to find an optimal condition for ensemble learning. Inoue [19] proposed an early-exit condition for ensembles based on a confidence level; it automatically selects the number of ensembled models, which reduces the computational cost. This inspired us to collect models automatically during training.
According to Bengio [8], by scheduling the learning rate the model can find as many different local optima as possible while exploring the loss surface. The Auto-Ensemble method builds on the cyclic learning rate schedule: every time a checkpoint is collected, the learning rate rises so the model escapes the local optimum, and finally all collected checkpoints are ensembled to improve the generalization performance of the classifier. The training epochs, the range of the learning rate and the number of collected models are adaptive and not fixed in advance.
Ju et al. [20] compared the relative performance of various ensemble methods and found that a particular one, the Super Learner, achieved the best performance among them. It is a cross-validation-based ensemble method that uses the validation set of the neural networks to compute the Super Learner's weights. Our AE method draws on this idea and proposes a weighted averaging method, which helps improve the predictions.

Auto-Ensemble
The Auto-Ensemble (AE) proposed in this paper schedules the learning rate to control how the model explores the loss surface. After a checkpoint is collected, the learning rate rises so the model escapes from the corresponding local optimum and starts a new search; finally, all collected checkpoints are used for the ensemble. Our method explores as many models as possible with sufficiently high accuracy and diversity. The main challenge is model diversity. Huang et al. [12] discussed the correlation of collected snapshots and chose the snapshot models for combination accordingly. Auto-Ensemble ensures the diversity of collected checkpoints: the adaptive learning rate schedule automatically finds a local optimum during training, and after a checkpoint is collected, a steep increase of the learning rate makes the model jump out of that optimum and continue searching for others. Figure 1(a) shows the snapshot-collection procedure of Snapshot Ensemble (SSE) [12]. In Figure 1(b), the blue arrowed line is the convergence process of Auto-Ensemble: the model escapes sharply from a local optimum and then converges to a different local minimum. The short arrowed line is the convergence process of traditional SGD, which finds a local optimum slowly and inefficiently. We found that the weights and biases of the last dense layer (the last-but-one dense layer of the Siamese network) can be extracted to measure the distance between models. We record two Euclidean distances d1 and d2, where d1 is the distance between the weights when the model converges to a local optimum and the weights when the learning rate reached its highest value in the previous cycle, and d2 is the distance between the weights of the checkpoint at the local optimum and the weights when the learning rate reaches its maximum in the current cycle (Figure 2).
The arrowed lines in Figure 2 show the two distances measured while searching the loss surface. To ensure sufficient separation among the collected checkpoints, d2 should be much greater than d1.

Metric of Model Diversity

We compared the Euclidean distance between model weights with the traditional correlation-coefficient method on ResNet models. Figure 3(a) shows the distance among the models' last dense layers. To make the comparison more intuitive, we normalized the distances, mapping a distance of 0 to 1 and the maximum distance to 0.9 (corresponding to the maximum and minimum values of the coordinate axis). The distance value y is computed as

y = y_max − (y_max − y_min) · (x − x_min) / (x_max − x_min)

where y_max and y_min are the coordinate-axis boundaries of the correlation plot (Figure 3(b)), x is the actual distance between the weights, and x_max, x_min are its maximum and minimum, respectively. Figure 3(b) shows the correlation coefficients among the models. We observe that the farther apart two models are, the greater their distance (the smaller the normalized distance value) and the smaller their correlation, which preliminarily confirms that the distance between weights can be used to measure model diversity.
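The distance computation and normalization described above can be sketched as follows (a minimal NumPy illustration; the layer-extraction details are assumptions, not the paper's exact implementation):

```python
import numpy as np

def weight_distance(w_a, w_b):
    """Euclidean distance between two flattened weight vectors,
    e.g. the last dense layers of two checkpoints."""
    return float(np.linalg.norm(np.ravel(w_a) - np.ravel(w_b)))

def normalize_distances(dists, y_max=1.0, y_min=0.9):
    """Map the smallest raw distance to y_max and the largest to y_min,
    so that farther-apart models get smaller normalized values."""
    x = np.asarray(dists, dtype=float)
    x_min, x_max = x.min(), x.max()
    return y_max - (y_max - y_min) * (x - x_min) / (x_max - x_min)
```

With the defaults y_max = 1 and y_min = 0.9, a distance of 0 maps to 1 and the maximum distance maps to 0.9, matching the normalization above.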

Learning Rate Schedule
The learning rate schedule uses a piecewise linear cyclic schedule following Garipov et al. [18]. We set learning rate bounds α1 and α2 (α1 > α2), with quite different values (generally two orders of magnitude apart): α1 speeds up gradient descent, while α2 lets the model converge to a wide local optimum. Within the cycle that collects one checkpoint, the learning rate first decreases linearly with change rate β = (α1 − α2)/N, where N is the number of decline iterations, and is then held at the minimum until the model converges. Formally, at iteration n the learning rate lr has the form

lr(n) = max(α1 − βn, α2).

The learning rate then increases linearly. The rise is divided into two parts, a rapid-rise phase and a loss-surface-exploring phase:

lr(n) = β1(n − M) + α2, for M < n ≤ M + m,
lr(n) = β2(n − M − m) + lr_now, for n > M + m,

where M is the iteration at which the checkpoint was collected, m is the length of the rapid-rise phase, and lr_now is the learning rate at the end of the rapid-rise phase.
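The two phases can be written as a pair of small functions (a sketch consistent with the formulas above; iteration bookkeeping is simplified):

```python
def decline_lr(n, alpha1, alpha2, N):
    """Decline phase: drop linearly from alpha1 toward alpha2 over N
    iterations, then hold at alpha2 until the model converges."""
    beta = (alpha1 - alpha2) / N
    return max(alpha1 - beta * n, alpha2)

def rise_lr(n, M, m, alpha2, beta1, beta2):
    """Rise phase after a checkpoint is collected at iteration M:
    rapid rise (rate beta1) for m iterations, then a slower
    exploratory rise (rate beta2) starting from lr_now."""
    if n <= M + m:
        return beta1 * (n - M) + alpha2
    lr_now = beta1 * m + alpha2  # LR at the end of the rapid-rise phase
    return beta2 * (n - M - m) + lr_now
```

With the VGG bounds from Section 4 (α1 = 0.4, α2 = 0.01, N = 25), the decline phase moves from 0.4 to 0.01 and then holds until convergence.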

Algorithm 1 Auto-Ensemble Algorithm
Require: learning rate bounds α1, α2; change rates β, β1, β2; decline length N; rapid-rise length m; diversity ratio α (the threshold for d2/d1)
Pretrain phase: run a standard learning rate schedule for 75% of the pretraining epochs
repeat
  repeat {decline phase}
    lr = α1 − βn
    if n > N and the model has not converged then lr = α2
  until the model converges; collect the checkpoint and record the iteration count M
  repeat {rise phase}
    for n in (M, M + m] {rapid-rise phase} do lr = β1(n − M) + α2
    record the current learning rate lr_now
    lr = β2(n − M − m) + lr_now
  until d2 > α · d1 {the current cycle is over}
until the stopping condition for training is satisfied
Ensemble phase: for each collected checkpoint θ0, ..., θT, obtain the predicted softmax output h_θ(x), where T is the total number of collected checkpoints and x is the training data. Define a fully connected network H(x) to train the weights; the weighted averaging result is H(x) = Σi wi · h_θi(x).

Here M is the number of iterations from the beginning of training until the checkpoint is collected, m is the length of the rapid-rise phase, and lr_now is the learning rate at the end of the rapid-rise phase. The change rate β1 of the rapid-rise phase is the largest, which makes the model jump out of the current local optimum quickly; the subsequent change rate β2 is smaller, so the loss surface is explored more carefully.

Auto-Ensemble
The procedure is summarized in Algorithm 1. Before scheduling the learning rate, we adopt a pretrain phase. A warm start plays a key role in machine learning: if the model is pretrained for a period before the main training, the experimental results tend to improve significantly. After several checkpoints have been collected, the ensemble prediction at test time is the average of every model's softmax outputs. In addition, to improve the efficiency of the ensemble, we designed a weighted averaging method in which each model receives a different weight.
Simple averaging averages the softmax output of each model, while weighted averaging assigns a weight to each model's output. The weights are learned from a validation set, which is generated from the training set through data augmentation or similar means; the test set is then used to measure accuracy. Our algorithm automatically builds a small one-layer fully connected network to learn the weights. It is worth mentioning that the bias of this network is fixed to zero. The weight is initialized as a one-dimensional vector of length T, where T is the number of collected checkpoints. When training the network, the input is the one-dimensional array of model outputs and the label is the original label. After training, the layer's weights are used to ensemble the checkpoints.
This method improves the ensemble accuracy noticeably and smooths the uneven distribution of model accuracies.
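A minimal version of the weighted-averaging step can be sketched with a closed-form least-squares fit in place of the trained one-layer network (an assumption for brevity; the paper trains a small dense layer with zero bias by gradient descent):

```python
import numpy as np

def learn_ensemble_weights(val_outputs, val_labels):
    """Fit one weight per checkpoint (zero bias) on a validation set.
    val_outputs: (T, N, C) softmax outputs of T checkpoints.
    val_labels: (N,) integer class labels."""
    T, N, C = val_outputs.shape
    X = val_outputs.reshape(T, N * C).T        # (N*C, T) design matrix
    y = np.eye(C)[val_labels].reshape(N * C)   # flattened one-hot targets
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def weighted_predict(test_outputs, w):
    """Weighted average of per-model softmax outputs: H(x) = sum_i w_i h_i(x)."""
    return np.tensordot(w, test_outputs, axes=1)
```

An accurate checkpoint receives most of the weight, while an uninformative one is suppressed, which is the smoothing effect described above.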

Experiments
We demonstrate the effectiveness of Auto-Ensemble on different datasets and networks and compare our method with the related state of the art. All experiments are run with Keras.

Dataset
CIFAR The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 Million Tiny Images dataset [21]. CIFAR-10 consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. We used the standard data augmentation scheme from the Keras documentation. The augmentation scheme for VGG differs from that used for ResNet and Wide ResNet.
Omniglot The Omniglot dataset was collected by Brenden Lake and his collaborators at MIT via Amazon's Mechanical Turk to produce a standard benchmark for learning from few examples in handwritten character recognition [22]. It consists of 1,623 handwritten characters with only 20 samples per class, divided into a training set of 964 classes and a test set of 659 classes.

Architecture
We test several classic neural networks, including residual networks (ResNet) [23], Wide ResNet [24] and VGG16 [25]. For ResNet we use the original 110-layer network; for Wide ResNet we use a 28-layer network with widening factor 10. We use the same standard data augmentation scheme on CIFAR-10 and CIFAR-100.

Hyperparameters
Our experiments used the following hyperparameters. For Wide ResNet we set the learning rates α1 = 0.5, α2 = 0.5 × 10^−3; for ResNet, α1 = 0.5, α2 = 0.01; for VGG, α1 = 0.4, α2 = 0.01. For few-shot learning the learning rate boundary is 0.005-0.03. For the distance measurement, α was set to 1.5 for VGG, Wide ResNet and ResNet, and to 2 for few-shot learning. ResNet, Wide ResNet and VGG have 110, 45 and 100 pretraining epochs respectively; in few-shot learning, the pretraining phase contains 45,000 tasks. We discuss the effects of these parameters on the experimental results in the following sections.

Comparison in Model Collection
Auto-Ensemble uses a unique learning rate schedule to collect models. To illustrate its advantage, we implemented several alternative schedulers: checkpoint ensemble (CE) and random initialization ensemble (RIE) with traditional learning rate schedules [15], as well as the cosine cyclical learning rate scheduler (SSE, or CCLR) [12], Max-Min Cosine CLR (MMCCLR) [17] and triangular CLR (FGE) [18]. The baseline is a single independently trained network (Ind) using SGD with stepwise decay.
For a fair comparison with the state of the art, we reimplemented the above methods using the same architectures and data augmentation, following the parameters given in the respective papers. For CE and RIE, we reimplemented the methods on our datasets and models. For SSE, we fully reproduced the experiment according to its paper. We did not reproduce MMCCLR because its learning rate scheduler is similar to SSE's. For FGE, we only adopted its triangular CLR as stated in its paper (its curve-finding experiments change the model's gradient updates to some extent).

Comparison in Ensemble Method
After the ensemble models have been collected, there are several rules for combining them into predictions. We reimplemented the adaptive ensemble (CIs) of Inoue [19], which automatically selects the number of models for the ensemble: we used the confidence-level-based early-exit condition with a 95% confidence level for all datasets.
We also compared the weighted averaging method with the Super Learner (SL). We referred to Ju et al. [20] and extracted the experimental results directly from their paper. The results show that our weighted averaging method makes the ensemble more robust.

Ensemble Result
All results are summarized in Table 1. For each method we used simple averaging to ensemble the models, and we list the accuracy improvement over the single model (Ind). For AE we show the weighted averaging result, obtained by a fully connected network trained on the validation set; we set the learning rate in the range 0.01-0.001 and selected the value that maximizes ensemble accuracy.
Our Auto-Ensemble (AE) results are compared with SSE, FGE, CE, RIE, the confidence-interval-based adaptive ensemble (CIs) and independently trained networks (Ind). The best ensemble results are shown in bold in Table 1. The experiments show that in most cases the ensemble accuracies of Auto-Ensemble are better than those of the other methods, and the improvement over the single model is considerable. SSE, FGE and CE sometimes show no improvement because of poor diversity.
We also add several SL results to evaluate the effectiveness of our ensemble method; these are extracted directly from the original paper (Table 2). Clearly, our weighted averaging method improves on SL and achieves the best combination. To illustrate the higher diversity of our models, we calculated the correlation of the softmax outputs of each pair of models. Figure 4 shows the correlation coefficients among models collected by different methods: Figure 4(a) shows Auto-Ensemble (AE) with adaptive learning rate cycles, Figure 4(b) shows Snapshot with cosine annealing cycles, and Figure 4(c) shows RIE with independently trained networks. The correlations for Snapshot are higher than for the other two methods, indicating less diversity between its models. Although AE's model diversity is not as high as RIE's, AE reduces the training budget while still collecting models with sufficient diversity.
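The pairwise correlations used for Figure 4 can be computed directly from the flattened softmax outputs (a straightforward sketch):

```python
import numpy as np

def output_correlations(softmax_outputs):
    """Pearson correlation matrix over checkpoints.
    softmax_outputs: (T, N, C) softmax outputs; returns a (T, T) matrix."""
    flat = np.asarray(softmax_outputs).reshape(len(softmax_outputs), -1)
    return np.corrcoef(flat)
```

Identical checkpoints correlate at 1, while checkpoints that disagree everywhere correlate negatively, so lower off-diagonal values indicate higher diversity.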

Training Budget
Auto-Ensemble has non-fixed time and storage complexity: the storage complexity depends on the diversity of the collected models. As stated above, the hyperparameter α adjusts the distance among models, so the number of collected models cannot be specified in advance. The training budget of AE is likewise adaptive: the farther apart the models are required to be, the more epochs are needed to collect each one.
These quantities are fixed for SSE and FGE, where the number of epochs per collected model can be specified. In our experiments the ensemble size of SSE is 5: we ensemble the last 5 snapshots of one training run. For CE the storage complexity is large, since a checkpoint is saved at every epoch for later ensembling. RIE requires running a single model separately several times; following Chen, Lundberg, and Lee [15], we ensemble 5 models, so its time complexity is the largest. The adaptive ensemble (CIs) selects among the RIE models, so its time and storage complexity are less than RIE's. Table 3 shows the average number of epochs required to train a ResNet model on CIFAR-10; the comparison of computational expense is almost the same for other models and datasets.
Overall, the order of ensemble size is SSE ≈ CIs < RIE < AE < FGE < CE, and the order of time budget is Ind = CE = SSE < FGE < AE < CIs < RIE. It is worth mentioning that the average epoch count of SSE is not fixed: the cycle length is 40, but we collect 10 snapshots and ensemble the last 5 for better convergence, so the average number of epochs exceeds 40 and reaches 80. For CE, we select the first M best models.

Learning Rate Boundary
The learning rate boundary of the decline phase lets the model converge well, but the learning rate is not limited during the rise phase, so we must set the boundary and its change rate. The choice of learning rates follows the method proposed by Smith [13]: let the learning rate increase linearly within a rough boundary and plot accuracy versus learning rate. Note the minimum learning rate at which training accuracy improves significantly, and the learning rate at which the accuracy slows, becomes ragged, or even starts to fall. This gives a good boundary: set α1 to the maximum learning rate and α2 to the minimum. In Figure 5, the model starts converging right away, so it is reasonable to set α2 = 0.01; the accuracy curve becomes rough after the learning rate increases past 0.4, so α1 = 0.4. An inappropriate learning rate range affects the accuracy of the collected models: Table 4 shows the distribution of collected VGG model accuracies on CIFAR-10 with different learning rate boundaries. An appropriate boundary clearly yields the best accuracy, and 0.4-0.01 is the best boundary for VGG.
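The range test described above can be sketched as a simple rule over an accuracy-versus-learning-rate curve (the thresholds below are illustrative assumptions, not values from the paper):

```python
import numpy as np

def suggest_lr_bounds(lrs, accs, rise_eps=0.01, drop_eps=0.005):
    """From a linear LR sweep, pick alpha2 at the first noticeable accuracy
    gain and alpha1 where accuracy stops improving or starts to fall."""
    lrs, accs = np.asarray(lrs, float), np.asarray(accs, float)
    gains = np.diff(accs)
    alpha2 = None
    for i, g in enumerate(gains):
        if g > rise_eps:       # first significant improvement
            alpha2 = lrs[i]
            break
    alpha1 = lrs[-1]
    for i, g in enumerate(gains):
        if alpha2 is not None and lrs[i] > alpha2 and g < -drop_eps:
            alpha1 = lrs[i]    # accuracy starts to fall here
            break
    return alpha2, alpha1
```

On a synthetic sweep where accuracy first improves at lr = 0.01 and degrades past lr = 0.4, the rule recovers the 0.4-0.01 boundary used for VGG.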
The whole learning rate schedule is divided into three phases: the decline phase, rise phase 1 and rise phase 2. In the decline phase, the change rate of the learning rate is (α1 − α2)/N, where N is usually set to 20-30; we use N = 25 in the decline phase and N = 5 in the rise phase. Denoting the change rates in the decline phase, rise phase 1 and rise phase 2 by β, β1 and β2 respectively, we have β ≈ β2 < β1.

Pretrain
The number of pretraining epochs notably affects model collection: if the pretrained model is under-fitted, collecting the first convergent checkpoint becomes harder, which affects the subsequent ensemble; if we start from an over-fitted model, the accuracy and diversity of the collected models are also poor. Table 5 shows the effect of different pretraining epoch counts on the ensemble accuracy of VGG models; 80 turns out to be a suitable number of pretraining epochs.

Parameter of Diversity
The learning rate stops increasing when d2 > α · d1. Table 6 shows the effect of different values of α on the collection of VGG models. Experiments show that the model is not sensitive to the value of α, but if the required distance is too large, the model escapes too far to converge and the loss and accuracy become unpredictable. We therefore limit 1 < α < 2.

Fig. 6 Training process of ResNet and Wide ResNet models

Conditions for Stop Training
The condition for stopping the experiment is worth discussing, since the number of checkpoints to collect during training is unknown in advance.
Experiments show that VGG and ResNet110 collect models easily: as long as training continues, an unlimited number of models can be collected, so the number of checkpoints can serve as the stopping condition. Figure 6(a) shows the training curve of ResNet; the two nearly coincident curves are the loss and accuracy on the training and test sets. Collecting Wide ResNet and Siamese Network models is comparatively tough. Figure 6(b) shows the Wide ResNet training process: as the number of epochs grows, the learning rate must rise to a very high level to escape the local optimum, and the accuracy of the collected checkpoints drops considerably. In this case, training can stop once the learning rate has risen past a certain value, since the checkpoints collected afterwards are no longer accurate enough.
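The two stopping conditions described above, a checkpoint budget for the easy cases and a learning rate cap for the hard ones, can be combined into a single predicate (the thresholds are illustrative assumptions, not values from the paper):

```python
def should_stop(num_checkpoints, current_lr, max_checkpoints=10, lr_cap=1.0):
    """Stop when enough checkpoints have been collected (easy cases such as
    VGG/ResNet110) or when the escape learning rate has grown past a cap
    (hard cases such as Wide ResNet)."""
    return num_checkpoints >= max_checkpoints or current_lr > lr_cap
```

The training loop would evaluate this predicate at the end of each collect-and-escape cycle.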
Inoue [19] proposed an early-exit method based on the confidence level of local predictions; we reimplemented it and found that the adaptive ensemble result is not as good as ensembling all models (RIE). The results are shown in Table 1.

Auto-Ensemble on Few-shot Learning

Our main contribution also includes an application to few-shot learning, which shows that the AE method applies to some non-traditional supervised problems. Few-shot learning is not a new research topic and many state-of-the-art methods exist. To verify the effectiveness of Auto-Ensemble, we chose the Siamese Network as a baseline because of its relatively poor performance; the experimental results show that the AE method brings a significant improvement.
We used the Siamese Neural Network [26] but added a dense layer before the last neural layer, since the weights of this dense layer are used to measure the distance between models during training. The learning rate boundary is 0.005-0.03, the distance threshold α was set to 2, and the pretraining phase contains 45,000 tasks. We experimented on the Omniglot dataset, divided into training and test sets following Vinyals et al. [27]. Table 7 illustrates the performance of Auto-Ensemble: the accuracy obtained with weighted averaging (92.2%) is a large improvement over the original Siamese Network model and almost catches up with the more complex Matching Networks [27] (93.8%). Our experiment collected 10 checkpoints of the model.

Table 7: Matching Networks [27]: 93.8%; Siamese Network [26]: 88.0%; Siamese Network with our ensemble method: 92.2%.

Figure 7 compares the two ways of collecting checkpoints (Fig. 7: piecewise linear cyclic learning rate schedule versus the adaptive learning rate schedule). Using a cyclic learning rate schedule with 10,000 tasks per cycle, the accuracy of the model changes little with the learning rate. Due to the instability of the model, it often gets stuck at a saddle point during training, where it is difficult to converge; the cyclic learning rate can fail when the model reaches a saddle point (Figure 8(a)). In Auto-Ensemble, however, the learning rate jitters to let the model jump out of the saddle point automatically (Figure 8(b)). The adaptive learning rate schedule can thus fully mine different checkpoints of the model.

Conclusions
This paper proposed an adaptive learning rate schedule for ensemble learning: by scheduling the learning rate, the model can converge to and then escape local optima. We focus on the improvement in performance rather than absolute performance. By collecting model checkpoints, the ensemble accuracy can greatly exceed that of a single model. We also proposed a method to measure the diversity among models, which guarantees the diversity of the collected models. In non-traditional supervised problems such as few-shot learning, the method improves model performance simply and quickly. We compared our method with various related works and analyzed the results. To further verify its effectiveness, more experiments are needed on other networks, such as DenseNet and the Matching Networks used for few-shot learning. Future work will focus on optimizing Auto-Ensemble, in particular shortening the unpredictable training process: since the goal is to collect as many models as possible but the required time and resources are unpredictable, the training process needs to be simplified to save computing resources.