dropCyclic: Snapshot Ensemble Convolutional Neural Network Based on a New Learning Rate Schedule for Land Use Classification

Ensemble learning provides robustness and higher accuracy than a single model. The snapshot ensemble convolutional neural network (CNN) has been successful and widely used in many domains, such as image classification, fault diagnosis, and plant image classification. The advantage of the snapshot ensemble CNN is that it combines a cyclic learning rate schedule into the algorithm to snap the best model in each cycle. In this research, we propose the dropCyclic learning rate schedule, which applies a step decay to decrease the learning rate value in every learning epoch. The dropCyclic can reduce the learning rate and find a new local minimum in the subsequent cycle. We evaluated the snapshot ensemble CNN method based on three learning rate schedules: cyclic cosine annealing, the max-min cyclic cosine learning rate scheduler, and dropCyclic, using three backbone CNN architectures: MobileNetV2, VGG16, and VGG19. The snapshot ensemble CNN methods were tested on three aerial image datasets: UCM, AID, and EcoCropsAID. The proposed dropCyclic learning rate schedule outperformed the other learning rate schedules on the UCM dataset and obtained high accuracy on the AID and EcoCropsAID datasets. We also compared the proposed dropCyclic learning rate schedule with other existing methods. The results show that the dropCyclic method achieved higher classification accuracy than the other existing methods.

majority voting, Bayesian optimal, and stacked generalization. Kulkarni and Kelkar [8] used ensemble learning methods to classify multispectral satellite images. Three ensemble learning methods were compared (bagging, boosting, and AdaBoost), and it was found that the ensemble learning methods achieved better classification results than the single model. Cao et al. [9] segmented the building areas from remote sensing images using a stacking ensemble deep learning model. In their method, images were first segmented using three models: FCN-8s, U-Net, and SegNet, followed by optimizing the prediction results using a fully connected conditional random field (CRF). Next, multilayer features were extracted using a sparse autoencoder. Finally, the prediction results were computed using the Euclidean distance weighting method.
Additionally, several researchers have proposed ensemble learning methods for classifying satellite images. Minetto et al. [10] proposed an ensemble of convolutional neural networks (CNNs) for geospatial land classification. In their method, the geospatial images were first sent to CNNs to predict the output. Then, the predicted outputs from the CNNs were combined into the final output using a majority voting method. Diengdoh et al. [11] used weighted and unweighted ensemble learning for land cover classification from the predicted outputs of various machine learning methods. Huang et al. [12] proposed an ensemble learning method for urban land use mapping based on satellite images, street-view images, building footprints, points of interest, and social sensing data to explain the associations among land cover, socioeconomic activities, and land use categories.
Furthermore, a new kind of ensemble learning, called snapshot ensemble learning [13], avoids the expensive computation of training multiple networks; it incurs no additional training cost while training a single neural network model. The snapshot ensemble learning method aims to discover several local minima in one training run. While training the model, we define the number of cycles in which we desire to snap the best model. For example, defining three cycles will return the three best models, one from each cycle, called snapshots. The best model is snapped at the minimum loss value of each cycle. Additionally, a learning rate schedule, the cyclic cosine annealing function, was used to quickly reduce the training loss value. In addition, Wen et al. [14] proposed a new max-min cosine cyclic learning rate scheduler invented to find acceptable ranges of maximum and minimum learning rates used in training.
Contribution. In this research, we focus on proposing a new cosine cyclic learning rate schedule, called dropCyclic, which adds a step decay function to reduce the learning rate and thus directly decreases the training loss to converge to a local minimum in each cycle. For the dropCyclic learning rate schedule, the learning rate starts at the maximum learning rate, and the training loss decreases to converge on a local minimum while training in the first cycle. In the next cycle, a new maximum learning rate, smaller than the previous one, is defined using the dropCyclic method. Consequently, the dropCyclic method narrows the learning rate range from the first cycle until the last. The snapshot ensemble CNN based on the dropCyclic learning rate schedule is proposed for aerial image classification. The proposed method is evaluated on three aerial image datasets: UCM, AID, and EcoCropsAID, and achieves good performance.
Outline of the paper. This paper is organized into six sections, as follows. Surveys of the related work are presented in Section 2. Section 3 presents the snapshot ensemble CNN for aerial image classification and the new learning rate schedule. In Section 4, three aerial image datasets are briefly described. Section 5 presents the experimental results and discussions. The conclusion and future work are presented in Section 6.

II. RELATED WORK
In this section, we briefly review the research related to ensemble learning and snapshot ensemble CNNs, including ensemble learning strategies, the snapshot ensemble CNN, and learning rate schedules for the snapshot ensemble CNN.

A. ENSEMBLE LEARNING
Ensemble learning methods have been a growing research area in recent years. In this study, we surveyed ensemble learning methods with only two strategies: decision and ensemble.

1) THE DECISION STRATEGY
The outputs of several classifiers are combined and fused into the final output with various strategies, such as unweighted average, weighted average, majority vote, Bayes optimal, and stacked generalization [7]. Kim and Lim [15] proposed an ensemble CNN method to learn on a large vehicle type dataset. The dataset contained more than 500,000 images in 11 classes. The bagging method was used to randomly select the training data because the image distribution in each class was imbalanced. In the ensemble CNNs, the training images selected using the bagging method were transferred to three CNNs. While training the CNNs, data augmentation techniques (flip, rotation, AR-fixed, AR-fixed rotation) were applied. The weighted average method was applied for the final prediction and achieved high performance. Minetto et al. [10] used state-of-the-art CNNs (ResNet50 and DenseNet161) and a majority voting method for geospatial land classification on multispectral images. In the first step, 12 CNN models were created using various settings, such as data augmentation, image crop style, and class weighting. The output of this step was the probabilities obtained from the 12 CNN models. In the second step, the output probabilities were classified using the majority voting method; a prediction was accepted as correct when more than five of the models output the correct class. Their proposed method achieved an accuracy of 94.51% on the FMOW dataset.
Moreover, Diengdoh et al. [11] classified land cover using the ensemble learning method based on satellite imagery.
Their study classified the land cover images into six classes using the unweighted ensemble prediction method. First, four machine learning techniques, K-nearest neighbor (KNN), naive Bayes (NB), random forest (RF), and support vector machine (SVM), were used to predict probability outputs. Second, the probability outputs were classified using the unweighted ensemble learning method for the final output. Sefrin et al. [16] used three voting methods, unison vote, absolute majority, and no majority, to detect land cover change from time-sequence Sentinel-2 images. The main architecture was a combination of a fully convolutional neural network (FCN) and long short-term memory (LSTM), called the FCN+LSTM architecture. The time-sequence images were first classified by the FCN+LSTM model, which output six classification maps, and the final output was classified using the voting methods. The results showed that the final predicted class using the unison or absolute majority method achieved high accuracy.

2) THE ENSEMBLE STRATEGY
The ensemble strategy uses weak learners to create a stronger learner and minimize errors while training. There are various ensemble strategies. For instance, the bagging ensemble randomly selects independent subsets of the data of the same size. Then, the first, second, and subsequent subsets are trained using the first, second, and subsequent base classifiers, respectively. Finally, the outputs of the base classifiers are fused with the majority voting method to predict the final output [7], [15]. In the boosting strategy, the original data is first classified using a weak classifier. The samples misclassified by the weak classifier are weighted, producing the weighted data, in order to decrease the bias obtained while training the weak classifier. The weighted data is then sent to the second weak classifier, and the misclassified samples are weighted again. Training with weak classifiers can be repeated many times until the best weak classifier is obtained [8].
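The bagging-and-vote pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any cited paper; `bootstrap_samples` and `majority_vote` are hypothetical helper names.

```python
import random
from collections import Counter

def bootstrap_samples(data, n_bags, seed=0):
    """Draw n_bags random subsets of the same size as the original
    data, sampled with replacement, as in the bagging strategy."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_bags)]

def majority_vote(predictions):
    """Fuse the outputs of the base classifiers into the final output."""
    return Counter(predictions).most_common(1)[0][0]
```

Each bootstrap subset would be used to train one base classifier, and `majority_vote` then fuses the per-sample predictions of those classifiers.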
The idea of combining an ensemble strategy (bagging and boosting learning) with a CNN-based method was proposed for short-term load forecasting [17]. In [17], a CNN model was first trained on an existing dataset to create a pre-trained CNN model. Then, a fine-tuned model was created by training the pre-trained CNN model from the first phase on the new dataset. Finally, the weak CNN models from phases one and two were combined to create a robust model, and the weighted average method was used to compute the prediction result.
Korzh et al. [18] proposed the bagging ensemble and the stacking of CNNs to classify remote sensing imagery. In their method, image processing techniques were first applied to the original images to reduce noise and increase sharpness. Then, the set of original images was sent to CNN models (AlexNet, GoogLeNet, and VGG19) to extract the first feature set, and the set of processed images was sent to the CNN models to extract the second feature set. The first and second feature sets, totaling more than 16,000 features, were concatenated before being sent to a machine learning technique. In their experiments, many machine learning techniques were compared, including SVMs with different kernels (linear, radial basis function, and polynomial), random forest, and logistic regression. As a result, the SVM with a linear kernel obtained the highest performance on the Brazilian coffee scenes dataset with an accuracy of 96.11%.

B. SNAPSHOT ENSEMBLE CNN
The snapshot ensemble method was first proposed by Huang et al. [13]. The main purpose of snapshot ensembles is to train a CNN once and obtain multiple CNN models. When training an original CNN model, the model converges to a single minimum training loss value at the end; hence, only one CNN model is obtained. In contrast, cyclic cosine annealing was used to converge to multiple training loss minima. The best CNN model in each cycle, called a snapshot, was kept. The output probability of each CNN model was calculated using the softmax function, and the unweighted average method was used for the final prediction: the output probabilities were averaged and the maximum probability was selected. The snapshot ensemble method was evaluated on various image classification datasets and achieved the best performance compared with a single CNN model.
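As a rough sketch of the snapshot mechanism (assuming, for illustration, T = 100 training epochs split evenly into M = 5 cycles), the epochs at which a model is snapped are simply the last epoch of each cycle:

```python
import math

def snapshot_epochs(T=100, M=5):
    """Return the (0-indexed) epochs at the end of each of the M
    cycles, where the cyclic schedule reaches its lowest learning
    rate and the model is saved as a snapshot."""
    cycle_len = math.ceil(T / M)
    return [cycle * cycle_len - 1 for cycle in range(1, M + 1)]
```

In a real training loop, a callback would save the model weights at each of these epochs, yielding M models from a single run.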
In 2019, Wen et al. [14] proposed a new snapshot ensemble CNN for fault diagnosis. The max-min cosine cyclic learning rate scheduler (MMCCLR) was proposed instead of cyclic cosine annealing. The log-linear learning rate testing (LogLR) method was invented to search for the fitting range of the max-min learning rate when encountering new datasets. The MMCCLR method was evaluated on three datasets (the bearing dataset of Case Western Reserve University, a self-priming centrifugal pump dataset, and a bearing dataset) and achieved very high accuracy of 99.9% on all three datasets.
Moreover, Babu and Annavarapu [19] modified the snapshot ensemble method to classify COVID-19 from chest X-ray images. For training the CNN model, the pre-trained ResNet50 model was used and trained on the chest X-ray images. Data augmentation techniques (rotation, zoom, flip, and shift) were also applied while training. Subsequently, the weighted average method was used for ensemble learning instead of the unweighted average method; the weight parameters were updated until accuracy no longer improved. The modified snapshot ensemble method achieved 95.18% accuracy on the COVID-19 XCR dataset and outperformed existing methods. Puangsuwan and Surinta [20] used the snapshot ensemble method to classify plant leaf diseases. Four CNN architectures (VGG16, MobileNetV2, InceptionResNetV2, and DenseNet201) were used as the backbone architectures of the snapshot ensemble method. The rotation method was used as the data augmentation technique while training the CNN models. In the snapshot ensemble method, training the DenseNet201 model using four cosine annealing cycles achieved the highest accuracy of 69.51% on the PlantDoc dataset compared to the other ensemble methods (unweighted ensemble and weighted ensemble).
For the aerial images, Dede et al. [21] studied various ensemble strategies (including homogeneous, heterogeneous, and snapshot ensemble) to classify aerial scene images. In their experiments, two pre-trained CNN models were used: Inception and DenseNet. The snapshot ensemble method with Inception as a backbone architecture achieved an accuracy of 96.01% on the RESISC45 dataset. However, the snapshot ensemble method did not attain the best accuracy on the AID dataset. The best algorithm on the AID dataset was the heterogeneous strategy combining Inception and DenseNet and classified using the multi-layer perceptron (MLP). It achieved an accuracy of 97.15% on the AID dataset.

C. THE CYCLICAL LEARNING RATE FOR SNAPSHOT ENSEMBLE CNN
The popular optimization algorithm used while training a CNN model is stochastic gradient descent (SGD). SGD is used to update the parameters of the CNN model until it converges to a local minimum. In the original snapshot ensemble method, the SGD optimizer and cyclic cosine annealing were combined to quickly decrease the training loss to converge to a local minimum [13]. The training loss decreased very fast compared to the original CNN model. Wen et al. [14] proposed a new snapshot ensemble method that used the MMCCLR method to find the range of learning rates. Petrovska et al. [22] used an adaptive learning rate schedule with a triangular policy to train the snapshot ensemble method. Furthermore, Hung et al. [23] proposed a two-stage cyclical learning rate method using triangular policies: the triangular and triangular2 methods were used in the first and second stages to find the best stable model, requiring few iterations while training.

III. PROPOSED SNAPSHOT ENSEMBLE CNN FOR AERIAL IMAGE CLASSIFICATION
The snapshot ensemble CNN was first proposed by Huang et al. [13] with the simple concept of finding many local minima and snapping the best CNN model at the local minimum in each cycle. Subsequently, the outputs of the CNN models are combined and computed using the ensemble method. The essence of the snapshot ensemble CNN method is the cyclic learning rate schedule, namely the cyclic cosine annealing (CCA) schedule. The CCA schedule allows the learning rate to decrease quickly, stimulating the CNN model to reach a local minimum after a few epochs. The snapshot ensemble CNN produces lower errors than a single CNN model. This section briefly describes 1) the cosine cyclic learning rate schedules, including cyclic cosine annealing, the max-min cosine cyclic scheduler, and the proposed dropCyclic, and 2) the snapshot ensemble method.

1) CYCLIC COSINE ANNEALING
The cyclic cosine annealing (CCA) is the primary learning rate schedule of the snapshot ensemble CNN method used while training the CNN model. CCA allows the CNN to lower the learning rate faster than the traditional CNN model and converge to diverse local minima [13]. The CCA curve can be plotted, for example, for training with 100 epochs and five cycles (M = 5). The schedule is computed as Equation (1):

α(t) = (α₀ / 2) · (cos(π · mod(t − 1, ⌈T/M⌉) / ⌈T/M⌉) + 1)   (1)

where α(t) stands for the learning rate of the current iteration, α₀ is the initial learning rate, t is the current iteration number, T is the total number of iterations, and M is the number of cycles.
In the CCA, only the initial learning rate (α₀) is required to be adjusted. As a result, a wrong initial learning rate will cause the training process not to converge to a local minimum at the end of each cycle.
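The CCA schedule above can be sketched as a small Python function. This is an illustrative reimplementation (the function and parameter names are ours, with t counted from 1):

```python
import math

def cca_lr(t, lr0=0.001, T=100, M=5):
    """Cyclic cosine annealing: the learning rate restarts at the
    initial value lr0 at the beginning of each of the M cycles and
    anneals towards zero by the end of the cycle."""
    cycle_len = math.ceil(T / M)
    return (lr0 / 2.0) * (math.cos(math.pi * ((t - 1) % cycle_len) / cycle_len) + 1.0)
```

With the defaults, the learning rate is 0.001 at iterations 1 and 21 (the start of the first and second cycles) and 0.0005 halfway through a cycle.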

2) MAX-MIN COSINE CYCLIC LEARNING RATE SCHEDULER
Wen et al. [14] proposed the max-min cosine cyclic learning rate scheduler (MMCCLR), in which upper and lower boundaries are introduced to adjust the range of the learning rate. In the MMCCLR, the log-linear learning rate test (LogLR Test) method was proposed to find this learning rate range, namely the maximum and minimum learning rates, by sweeping the learning rate log-linearly during a short test run (Equation (2)). The MMCCLR schedule is then calculated as Equation (3):

α(t) = α_min + ((α_max − α_min) / 2) · (cos(π · mod(t − 1, ⌈T/M⌉) / ⌈T/M⌉) + 1)   (3)

where α_min is the minimum learning rate and α_max is the maximum learning rate, both found using the LogLR Test of Equation (2).
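Under the same cosine form, the MMCCLR schedule can be sketched as follows. This is an illustrative reconstruction; the `lr_min` and `lr_max` values are assumed to come from a LogLR Test:

```python
import math

def mmcclr_lr(t, lr_min=1e-5, lr_max=1e-3, T=100, M=5):
    """Max-min cosine cyclic scheduler: the cosine wave oscillates
    between lr_max (cycle start) and lr_min (cycle end) instead of
    annealing all the way to zero."""
    cycle_len = math.ceil(T / M)
    cos_term = math.cos(math.pi * ((t - 1) % cycle_len) / cycle_len) + 1.0
    return lr_min + ((lr_max - lr_min) / 2.0) * cos_term
```

Because the schedule never drops below `lr_min`, the model keeps a small amount of learning ability even at the end of each cycle.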

3) PROPOSED DROP CYCLIC COSINE LEARNING RATE SCHEDULE
In this study, we propose a drop cyclic cosine learning rate schedule, called dropCyclic. The dropCyclic is a systematic reduction of the learning rate over a specific time during training. This research aims to decrease the learning rate by cutting it by a constant factor, called the drop parameter, every constant number of epochs (see Figure 2), in the same manner as the step decay schedule [24]. The dropCyclic is also efficient in discovering diverse local minima in each cycle using the c parameter.
Moreover, in dropCyclic, the maximum learning rate in each cycle is changed according to the drop parameter. Because the learning rate range is narrowed, the CNN model can converge to the local minimum faster. The dropCyclic schedule is computed as Equation (4):

α(t) = (α₀ · D^⌊t/c⌋ / 2) · (cos(π · mod(t − 1, ⌈T/M⌉) / ⌈T/M⌉) + 1)   (4)

where α₀ is the initial learning rate, D is the step decay parameter that drops the learning rate every c epochs, and c is a constant number that lets the model change to a new local minimum in the next cycle.
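The dropCyclic schedule described above can be sketched in Python. This is an illustrative reconstruction with epochs counted from 0; the names are ours:

```python
import math

def dropcyclic_lr(epoch, lr0=0.001, D=0.95, c=10, T=100, M=5):
    """dropCyclic: cosine annealing whose per-cycle maximum is cut
    by the drop parameter D once every c epochs (step decay)."""
    cycle_len = math.ceil(T / M)
    lr_max = lr0 * D ** (epoch // c)  # step-decayed cycle maximum
    return (lr_max / 2.0) * (math.cos(math.pi * (epoch % cycle_len) / cycle_len) + 1.0)
```

With D = 1.0 the schedule reduces to CCA (every cycle restarts at lr0); with D = 0.95 and c = 10, the maximum at the start of the second cycle (epoch 20) drops to 0.001 × 0.95² ≈ 0.0009.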

B. SNAPSHOT ENSEMBLE METHODS
The ensemble method is the final step of the snapshot ensemble CNN to enhance accuracy based on diverse CNN models from a single training run. We trained the CNN model using the proposed dropCyclic method and snapped the best CNN model from each cycle in the previous step. The output probabilities of each CNN, computed using the softmax function, were combined and classified using the unweighted average ensemble method. Indeed, the last models tend to have the lowest errors, so we typically ensemble the last m models. The ensemble method is calculated as Equation (5):
p̄(y|x) = (1/m) · Σ_{i=1}^{m} p_i(y|x)   (5)

where m is the number of CNN models and p_i(y|x) is the output probability of the i-th CNN model computed using the softmax function.
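The unweighted average ensemble described above can be sketched in pure Python (illustrative only; the function names are ours):

```python
import math

def softmax(logits):
    """Convert one model's raw outputs into class probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unweighted_ensemble(logits_per_model):
    """Average the softmax probabilities of the m snapshot models
    and return the class with the highest mean probability."""
    probs = [softmax(l) for l in logits_per_model]
    n_models, n_classes = len(probs), len(probs[0])
    mean = [sum(p[j] for p in probs) / n_models for j in range(n_classes)]
    return mean.index(max(mean))
```

A confident model therefore contributes more probability mass to its predicted class than an uncertain one, even though all models receive equal weight.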

A. UC MERCED LAND USE (UCM) DATASET
Yang and Newsam [25] first proposed the UCM dataset for land use classification tasks, collected from the USGS national map urban area imagery. The aerial images were extracted from large images and divided into 21 classes, such as agricultural, forest, golf course, beach, harbor, buildings, medium residential, sparse residential, and dense residential. The images are stored in the RGB color space with 256×256×3 pixels. The UCM dataset contains 2,100 images, and some examples of the UCM dataset are shown in Figure 3. However, the challenge of the UCM dataset is that the medium residential, sparse residential, and dense residential classes are similar and difficult to classify, as shown in Figure 4.

B. AERIAL IMAGE DATASET (AID)
The AID [26] was proposed for aerial scene classification tasks. It has 10,000 images and contains 30 different aerial scene classes, for example, dense residential, medium residential, sparse residential, stadium, industrial, bridge, and baseball field. Each class has approximately 200 to 400 images of 600×600 pixels. The AID was collected from the Google Earth application at spatial resolutions ranging from 8 m to about 0.5 m. Examples of the AID are shown in Figure 5.

C. ECOCROPSAID DATASET
Thailand's economic crops aerial image dataset (EcoCropsAID) was proposed by Noppitak and Surinta [27] for land use classification. The EcoCropsAID dataset was collected according to the information on the cultivation of economic crops in different regions between 2014 and 2018 obtained from Agri-Map Online. The economic crops aerial images were collected from the Google Earth application at spatial resolutions ranging from 30 m to 0.2 m. The images were stored in the RGB format with 600×600 pixels. The dataset has 5,400 aerial images in five classes: rice, sugarcane, cassava, rubber, and longan. Example images of the EcoCropsAID dataset are shown in Figure 6. The challenges of the EcoCropsAID dataset are that the patterns of different classes are quite similar (see Figure 7) and that various patterns occur within the same class (see Figure 8). The details of the three aerial image datasets used in our experiments are summarized in Table I.

V. EXPERIMENTAL RESULTS AND DISCUSSION
We demonstrated the effectiveness of the proposed cyclic learning rate (dropCyclic) and compared it with two existing cosine cyclic learning rate methods: cyclic cosine annealing (CCA) and the max-min cyclic cosine learning rate scheduler (MMCCLR) on three aerial image datasets. For the backbone of the snapshot ensemble, we compared three CNN architectures: MobileNetV2, VGG16, and VGG19.
All the experiments were trained and evaluated on a Linux operating system using Intel(R) Core-i9-9900K CPU @ 3.60GHz x 16, RAM 32GB, and GPU GeForce GTX 1080Ti with RAM 11GB GDDR5x. We implemented all snapshot ensemble methods based on the TensorFlow deep learning framework with the Keras library.

A. EVALUATION METRICS
In this experiment, we used K-fold cross-validation with K = 5 over the training set to prevent overfitting problems; the overall accuracy (%) and standard deviation were reported on the training set. Further, test accuracy was used to evaluate the classification performance, and the results were compared with existing snapshot ensemble methods. The accuracy was computed as shown in Equation (6):
Accuracy = ((TP + TN) / (TP + TN + FP + FN)) × 100   (6)

where TP is a true positive (the positive samples that are correctly classified), TN is a true negative (the negative samples that are correctly classified), FP is a false positive (the negative samples that are misclassified), and FN is a false negative (the positive samples that are misclassified).
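The accuracy metric above in code form (a trivial but concrete sketch):

```python
def accuracy(tp, tn, fp, fn):
    """Overall accuracy: correctly classified samples (positive and
    negative) as a percentage of all samples."""
    return (tp + tn) / (tp + tn + fp + fn) * 100.0
```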
In order to prevent overfitting problems and compare the different learning rate methods, we used the loss difference (LD) metric to evaluate the snapshot ensemble methods when the different learning rate policies were applied. The LD is an evaluation metric that indicates the robustness of the model against overfitting [28]. Overfitting appears when the loss value is low on the training set but high on the test set, resulting in low accuracy. The smallest LD value shows the robustness of the model; LD is computed as Equation (7):
LD = |Loss_train − Loss_val|   (7)

where Loss_train and Loss_val are the loss values obtained on the training set and validation set, respectively.
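As a sketch, the LD metric is simply the absolute gap between the two loss values:

```python
def loss_difference(train_loss, val_loss):
    """Loss difference (LD): a small value indicates robustness
    against overfitting (the model behaves similarly on both sets)."""
    return abs(train_loss - val_loss)
```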

1) DATA RATIO AND NUMBER OF EXPERIMENTS
We report the data ratio and the number of experiments for each dataset. We divided each dataset into training, validation, and test sets, with a ratio of 4:1:5 for the AID dataset and 7:1:2 for the UCM and EcoCropsAID datasets. Due to the randomness of the training and validation sets, we ran each experiment three times and reported the mean accuracy and standard deviation on the validation set. Further, we trained the model again on the training and validation sets with the best setting and evaluated it on the test set.
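The split described above can be sketched as follows (an illustrative helper, not the exact code used in the experiments):

```python
import random

def split_dataset(items, ratios=(7, 1, 2), seed=0):
    """Shuffle and split into training/validation/test sets by the
    given ratio (7:1:2 for UCM and EcoCropsAID, 4:1:5 for AID)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

For the 2,100-image UCM dataset this yields 1,470 training, 210 validation, and 420 test images.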

2) BACKBONE CNN ARCHITECTURES
In our previous study [27], several CNN architectures, including InceptionResNetV2, DenseNet201, Xception, ResNet152V2, NASNetLarge, MobileNetV2, VGG16, and VGG19, were experimented with. We found that VGG16 and VGG19 achieved the highest accuracy, whereas MobileNetV2 showed worse accuracy than VGG16 and VGG19. Hence, in this study, we mainly experimented with the snapshot ensemble CNN using three state-of-the-art architectures as backbone CNNs: VGG16, VGG19 [29], and MobileNetV2 [30], to prove that the snapshot ensemble could manage both the best and the worst CNN architectures and also enhance the classification performance on the land use images.

[Figure 9. The dropCyclic learning rate curves with drop parameters of 1.0, 0.95, 0.85, 0.75, 0.65, and 0.50 and (a) c = 5, (b) c = 10, and (c) c = 15.]

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.

3) SNAPSHOT ENSEMBLE METHODS
We compared the proposed drop cyclic cosine learning rate schedule (dropCyclic) with two existing learning rate schedules: CCA and MMCCLR. We trained the snapshot ensemble for 100 epochs with the snapshot parameter set to M = 5 cycles.

4) DROPCYCLIC LEARNING RATE SCHEDULE METHOD
Figure 9 illustrates the learning rate curves of the dropCyclic learning rate schedule. The learning rate was computed using Equation (4). A small c value made the model capable of escaping a local minimum to find a new local minimum, as shown in Figure 9(a). With larger c values, the model had enough energy to discover a new local minimum, as shown in Figures 9(b) and 9(c). In the case of D = 1.0, at the first cycle, the maximum learning rate was 0.0010 and the learning rate decreased in each epoch until zero; in the second cycle, the learning rate started again at 0.0010. Further, in the case of D = 0.95, the maximum learning rate in the following cycles was slightly dropped, until the last cycle, according to the D parameter. In our dropCyclic experiments, the D and c values were 0.95 and 10, respectively.
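The per-cycle maxima quoted above can be checked with a short sketch (assuming a step-decay form in which the cycle maximum is lr0 scaled by D once per c elapsed epochs, with cycle starts at epochs 0, 20, 40, ...):

```python
def cycle_max_lrs(lr0=0.001, D=0.95, c=10, cycle_len=20, n_cycles=5):
    """Maximum learning rate at the start of each cycle under the
    dropCyclic step decay (illustrative reconstruction)."""
    return [lr0 * D ** ((k * cycle_len) // c) for k in range(n_cycles)]
```

With D = 1.0 every cycle restarts at 0.0010; with D = 0.95 and c = 10 the second cycle starts at 0.001 × 0.95² ≈ 0.0009.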
The details of the hyperparameter settings on CNN architectures and the learning rate schedule are summarized in Table II.

C. CLASSIFICATION RESULTS ON THE UCM DATASET
We first observed the optimal learning rate values to confirm that the proposed dropCyclic learning rate schedule performs well when using the optimal learning rate. In this experiment, we trained the snapshot ensemble learning using MobileNetV2 as a backbone CNN and the proposed dropCyclic learning rate schedule on the UCM dataset. We illustrate the accuracies of various learning rates, including 0.1, 0.01, 0.001, and 0.0001. The validation accuracies are shown in Figure 10. The scatter plot in Figure 10 shows that validation accuracies close to approximately 98% were achieved when the learning rate value approached zero. However, we observed that the validation accuracy dropped when the learning rate was high: it dropped from approximately 80% to 40% when the learning rates were in the range of 0.0008 to 0.0010. Figure 11 presents the test error (%) of the snapshot ensemble CNN with the different learning rate schedule methods. We trained the snapshot ensemble CNN with five cycles and then reported the test error of the ensemble method when combining one, two, three, four, and five models, respectively. As seen in the third column of Figure 11(a), the snapshot ensemble CNN using MobileNetV2 with the dropCyclic learning rate schedule obtained the lowest test error when ensembled with only three models. In comparison, the CCA method achieved the lowest test error when ensembled with five models (see the first column of Figure 11(a)).
Furthermore, when training the snapshot ensemble CNN using VGG19 with the MMCCLR and dropCyclic learning rate schedule methods, as shown in the second and third columns of Figure 11(c), the lowest error was achieved by using only one model. Hence, VGG19 discovered the optimal local minimum in the first cycle, likely because the UCM dataset contains only 2,100 aerial images. The results are presented in Table III. We evaluated the snapshot ensemble CNN using three evaluation metrics: LD, validation performance (mean accuracy and standard deviation), and test accuracy. Note that a low LD value indicates a method that prevents the overfitting problem. On examining Table III, we discovered that all learning rate schedule methods, CCA, MMCCLR, and the proposed dropCyclic, can address the overfitting problem because they all achieved low LD values; consequently, the validation and test accuracies did not differ greatly. As a result, the proposed dropCyclic learning rate schedule outperformed the existing learning rate schedules (CCA and MMCCLR) in terms of test accuracy when training the CNN model with MobileNetV2 and VGG19. The proposed dropCyclic method also outperformed them on the validation set when training with MobileNetV2 and VGG16. In conclusion, the snapshot ensemble CNN using MobileNetV2 as a backbone CNN and the proposed dropCyclic learning rate schedule (D = 0.95 and c = 10) achieved the highest test accuracy of 97.38% on the UCM dataset.
We illustrate the confusion matrix to show that the snapshot ensemble CNN method can learn from many aerial image patterns, even similar patterns between two or more classes, such as the residential classes: sparse, medium, and dense. The medium residential class was misclassified as sparse residential (two misclassified images) and dense residential (one misclassified image), as shown in Figure 12. We also compared the snapshot ensemble CNN using the proposed dropCyclic learning rate schedule with existing methods. The experimental results in Table IV show that our method achieved 97.38% on the UCM dataset and outperformed the other methods, except only the IRELBP+SDSAE method, which obtained slightly better accuracy of 97.61%.

D. CLASSIFICATION RESULTS ON THE AID DATASET
We experimented on the snapshot ensemble CNN using three CNNs (MobileNetV2, VGG16, and VGG19) and three learning rate schedules (CCA, MMCCLR, and dropCyclic). The test errors of each experiment are illustrated in Figure 13. The graphs show that combining more models obtained better performance than using a single model. We obtained the lowest test error when using MobileNetV2 as a backbone CNN, as shown in Figure 13(a). Furthermore, the CCA learning rate schedule outperformed the other learning rate schedules on both the validation and test sets. The overall performance is shown in Table V. Table V shows the performance of the snapshot ensemble CNN methods on the AID dataset, which is an unbalanced dataset because each class has between 220 and 420 aerial images (see Figure 14). We used the LD value to measure the overfitting that can be found when training the CNN model. We found LD values between 0.1 and 0.3 in all experiments, and the test accuracies of all experiments were higher than the validation accuracies; hence, the snapshot ensemble CNN can address the overfitting problem. As a result, using MobileNetV2 and the CCA learning rate schedule achieved 94.86% accuracy and outperformed the other methods on the AID dataset. Three models of the snapshot ensemble CNN with various learning rate schedules were selected for receiver operating characteristic (ROC) comparison, as shown in Figure 15. The snapshot ensemble CNN using MobileNetV2 and the CCA learning rate schedule attained an AUC value of 0.9982; using VGG16 with the CCA learning rate schedule and VGG19 with the dropCyclic learning rate schedule achieved AUC values of 0.9982 and 0.9981, respectively.
Consequently, the dropCyclic learning rate schedule outperformed the other learning rate schedules when training with VGG19. Moreover, we conclude that the snapshot ensemble CNN using MobileNetV2 addressed the unbalanced data better than the VGG16 and VGG19 architectures.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.
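The ensemble predictions behind these results are typically formed by averaging the softmax outputs of the snapshot models (soft voting). A minimal sketch, with made-up probabilities for three snapshot models, four test images, and three classes:

```python
import numpy as np

# Hypothetical softmax outputs of three snapshot models (rows = test images,
# columns = classes). The numbers are illustrative, not the paper's results.
p1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]])
p2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.3, 0.2], [0.1, 0.3, 0.6]])
p3 = np.array([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]])

# Unweighted average of the snapshot probabilities, then argmax per image.
ensemble_probs = np.mean([p1, p2, p3], axis=0)
pred = ensemble_probs.argmax(axis=1)
print(pred)
```

The averaged probability matrix is also what a ROC/AUC computation (as in Figure 15) would consume, one column per class in a one-vs-rest setting.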

E. CLASSIFICATION RESULTS ON THE ECOCROPSAID DATASET
This experiment showed that combining only one or two models could achieve the lowest test error, as shown in Figure 16. Because the EcoCropsAID dataset has only five classes and each class contains approximately 1,000 aerial images, the CNN models can cope well with the many patterns in each class and classify them accurately. Table VII compares the performance of the snapshot ensemble CNN methods on the EcoCropsAID dataset. All snapshot ensemble CNN models were trained using five cycles. We found that all experiments prevented overfitting, with small LD values of approximately 0.006 to 0.01. In this experiment, training VGG16 with the dropCyclic and CCA learning rate schedules achieved the highest validation accuracy of 99.60%; however, the MMCCLR learning rate schedule slightly outperformed them on the test set with an accuracy of 99.26%. Figure 17 shows the confusion matrix of the snapshot ensemble CNN based on dropCyclic and MobileNetV2 on the EcoCropsAID dataset; only a few misclassifications occurred. In addition, we compared the snapshot ensemble CNN based on the dropCyclic learning rate schedule and MobileNetV2 with the existing method [27]. Our proposed method achieved an accuracy approximately 6% higher than the existing method, which achieved only 92.80%.

1) LOSS ERROR CURVE OF THE DIFFERENT LEARNING RATE SCHEDULES
We discovered that the learning rate parameter directly affects the accuracy of the CNN model; hence, much existing research has focused on tuning the learning rate parameter [28], [35], [36]. However, the learning rate parameter was not the primary priority in our experiment because the dropCyclic method changes the maximum learning rate in each cycle: the maximum learning rate is decreased every cycle according to the drop parameter, as presented in Figure 9.

Figure 18 shows the training loss values of each learning rate schedule (CCA, MMCCLR, and dropCyclic) when training the snapshot ensemble CNN using MobileNetV2 on the UCM (Figure 18(a)), AID (Figure 18(b)), and EcoCropsAID (Figure 18(c)) datasets. The hyperparameters were set as shown in Table II, and the snapshot ensemble CNN was run for five cycles. The loss value started high and quickly decreased to its lowest point, the local minimum. Subsequently, the loss increased and then decreased again to a new lowest value in the next cycle, finding another local minimum. We then snapped the best CNN model at the local minimum of each cycle and used it in the ensemble method. However, only when training on the UCM dataset did the loss value not increase much, because the UCM dataset contains only 2,100 aerial images; the CNN model can learn and create a model that is appropriate for a small number of aerial images.
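The cyclic loss behavior described above is driven by the cosine annealing restarts of the CCA schedule (Huang et al.'s snapshot ensembles). A minimal sketch of that schedule follows; the maximum learning rate of 0.01 is an assumed value for illustration, not necessarily the setting in Table II.

```python
import math

def cca_lr(epoch, total_epochs, cycles, lr_max):
    """Cyclic cosine annealing: within each cycle the learning rate decays
    from lr_max toward zero, then restarts at lr_max for the next cycle."""
    per_cycle = math.ceil(total_epochs / cycles)
    return lr_max / 2.0 * (math.cos(math.pi * (epoch % per_cycle) / per_cycle) + 1.0)

# Five cycles over 100 epochs, matching the five-cycle snapshot setting.
lrs = [cca_lr(e, total_epochs=100, cycles=5, lr_max=0.01) for e in range(100)]

# The rate is near zero at the end of a cycle (where the model is snapped)
# and restarts at lr_max at epoch 20, 40, ... for the next cycle.
print(lrs[0], lrs[19], lrs[20])
```

Snapping the model at the end of each cycle (when the learning rate is near zero) captures a converged local minimum; the restart to `lr_max` then kicks the optimizer out toward a different minimum.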

2) COSINE CYCLIC LEARNING RATE SCHEDULE WITH MAX AND MIN VALUES
In our proposed dropCyclic learning rate schedule, we utilized the idea of a step decay schedule to drop the learning rate in every epoch by adding the step decay process into the cosine cyclic learning rate schedule (see Equation (4)). The dropCyclic method requires only two parameters: drop and c. The drop parameter acts as a step decay factor that reduces the learning rate by a constant factor every epoch, while the c parameter allows the model to shift to a new local minimum in the next cycle, as shown in Figure 9(b). As a result, the best dropCyclic parameters were drop = 0.95 and c = 10. Consequently, the proposed dropCyclic learning rate schedule outperformed the other learning rate schedules on the UCM dataset and achieved high accuracy on the AID and EcoCropsAID aerial image datasets. In conclusion, the proposed dropCyclic learning rate schedule has the advantage that it restricts the maximum learning rate in each cycle using the step decay parameters drop and c; hence, the maximum learning rate is not a priority parameter that requires tuning.
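Since Equation (4) is not reproduced in this excerpt, the sketch below is one plausible reading of the dropCyclic idea: a cosine cycle whose amplitude is scaled by a step-decay envelope governed by drop and c. The function name, the choice to decay the envelope every c epochs, and the base learning rate are assumptions for illustration; the paper's exact formulation may differ.

```python
import math

def drop_cyclic_lr(epoch, total_epochs, cycles, lr_max, drop=0.95, c=10):
    """Sketch of a dropCyclic-style schedule: the cosine cycle of CCA,
    with its maximum scaled down by a step-decay envelope drop**(epoch // c),
    so each successive cycle restarts at a lower maximum learning rate."""
    per_cycle = math.ceil(total_epochs / cycles)
    envelope = drop ** (epoch // c)  # step decay controlled by drop and c
    cosine = (math.cos(math.pi * (epoch % per_cycle) / per_cycle) + 1.0) / 2.0
    return lr_max * envelope * cosine

lrs = [drop_cyclic_lr(e, total_epochs=100, cycles=5, lr_max=0.01) for e in range(100)]

# Successive cycle maxima shrink (0.01, then ~0.009, ~0.0081, ...),
# nudging later snapshots toward nearby, progressively finer minima.
print(lrs[0], lrs[20], lrs[40])
```

This captures the stated advantage: the per-cycle maximum is bounded by the decaying envelope, so the practitioner does not need to hand-tune a maximum learning rate for each cycle.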

VI. CONCLUSION AND FUTURE WORK
This research proposed a new learning rate schedule called dropCyclic. We developed the concept of the step decay schedule, which decreases the learning rate value in every epoch by a factor called drop, and incorporated it into the cosine cyclic learning rate schedule. The schedule contains two parameters: drop and c. The benefit of the dropCyclic learning rate schedule is that the learning rate is dropped in each subsequent cycle according to the drop parameter, allowing the convolutional neural network (CNN) model to discover a new local minimum in the subsequent cycle using the c parameter. We evaluated the proposed dropCyclic learning rate schedule against the existing methods, cyclic cosine annealing (CCA) and max-min cyclic cosine learning rate scheduler (MMCCLR), on three aerial image datasets: UCM, AID, and EcoCropsAID. Three backbone CNN architectures were compared: MobileNetV2, VGG16, and VGG19. The proposed dropCyclic learning rate schedule achieved the best results on the UCM dataset and very high results on the AID and EcoCropsAID datasets. In comparison with other methods, the proposed dropCyclic learning rate schedule outperformed all methods on the AID and EcoCropsAID datasets; on the UCM dataset, only IRELBP+SDSAE slightly outperformed the dropCyclic method.
In future work, we will first continue to investigate learning rate schedules, such as the adaptive learning rate [37] and the cyclical learning rate with triangular, triangular2, and exp_range policies [23], [38]. Second, we will consider extracting spatial and temporal features instead of extracting features using only CNN architectures [38]-[40]. Finally, unbalanced data remains a major challenge to address in order to enhance classification performance [41]-[43].