Weighted Residual Dynamic Ensemble Learning for Hyperspectral Image Classification

Recently, collaborative representation classifiers have been extensively studied as an essential method for hyperspectral image classification. However, how to comprehensively utilize the classification advantages of multiple collaborative classifiers has not been well investigated. In this article, two new dynamic ensemble learning methods using the local weighted residual (LWR-DEL) and the double-weighted residual (DWR-DEL) of multiple collaborative representation classifiers are proposed. First, a clustering-based dynamic ensemble learning method is utilized to introduce prior knowledge for the collaborative representation classifier. Then, with this prior knowledge, the local weights of each classifier for the different regions of competence are obtained. To consider the global information of hyperspectral data, the K-nearest neighbor algorithm is adopted to obtain validation samples carrying global information. The global weights of each classifier can then be obtained and used to constrain the locally weighted residuals. In DWR-DEL, the global information is likewise used to constrain the residuals, and a double-weighted constrained residual fusion yields the final classification result. The effectiveness of the proposed methods is validated using three hyperspectral data sets. The experimental results show that both LWR-DEL and DWR-DEL outperform their single-classifier counterparts. In particular, the proposed methods provide superior performance compared with state-of-the-art methods.


I. INTRODUCTION
HYPERSPECTRAL images have abundant spectral information in hundreds of contiguous narrow spectral bands [1]. Owing to these properties, hyperspectral images have applications in many fields [2], [3], [4], [5], [6]. Among them, hyperspectral image classification is one of the most critical tasks in real applications. With high spectral resolution, fine classification can be achieved. However, the vast amount of hyperspectral data, data redundancy, few labeled samples, and correlation between bands have become essential factors restricting the classification performance of hyperspectral images [7], [8], [9], [10], [11], [12], [13]. To solve the above problems, some advanced hyperspectral image (HSI) classification algorithms have been proposed. Song et al. [14] proposed a new band selection method to deal with the redundant information problem in HSI classification. In this progressive band selection method, classification is performed incrementally in multiple stages. The experimental results show that this method performs better than other HSI classification methods that use the full set of bands. Yu et al. [15] investigated feedback attention modules in an HSI classification network and proposed a spatial-spectral dense convolutional neural network (CNN) framework with a feedback attention mechanism. Experimental results based on real HSIs demonstrate the superiority of the proposed methods over other state-of-the-art algorithms. Meanwhile, some advanced algorithms were developed to solve the limited-samples problem in HSI classification [16], [17], [18], [19]. However, a single classifier often cannot solve the above problems comprehensively. Therefore, how to develop a classifier or classifier ensemble that can overcome the above limitations is a crucial problem in hyperspectral image classification. Ensemble learning can be divided into two categories according to whether prior information is used to measure classifier competence.
The first category is the static ensemble [20], [21], [22], [23], [24], [25], [26]. It assumes that each base classifier is independent and more accurate than random guessing. A specific strategy is then adopted to combine multiple classifiers for higher classification accuracy. The most famous static ensemble algorithms include boosting [27], [28], bagging [29], and random subspace [30], which have many applications in hyperspectral image classification [31], [32], [33], [34]. Su et al. [35] proposed a new ensemble fusion strategy that first uses collaborative representation (CR) based classifiers as base classifiers for hyperspectral image classification. The results show that traditional ensemble strategies such as bagging and boosting are also suitable for CR-based models. Bao et al. [36] first proposed ensemble learning from the perspective of the feature layer and applied it to hyperspectral image classification. The results show that ensemble learning from the feature perspective is effective for hyperspectral image classification. Pan et al. [37] first proposed using an ensemble strategy to combine hierarchical guidance filtering and the matrix of spectral angle distance for hyperspectral image classification, which can effectively improve classification accuracy. However, there are many types of ground objects, and some classifiers have higher classification accuracy for a specific object but lower overall accuracy (OA). Moreover, all static ensemble learning requires high accuracy of the base classifiers. These restrictions prevent the static ensemble from fully exploiting the advantages of classifiers with higher local accuracy.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

[Table I: Sixteen classes of the Indian Pines data set. Table II: Nine classes of the University of Pavia data set. Table III: Thirteen classes of the Yellow River data set.]
The second category is dynamic ensemble selection (DES) [38], which obtains prior information about classification by using a specific region division and classifier selection strategy [39], [40], [41]. Based on these priors, the best-fit classifier is assigned to the unknown samples in each region. DES assumes that each classifier, including weak classifiers with very low overall accuracy, is an expert for specific testing samples [42], [43], [44], [45]. Therefore, compared with the static ensemble, the DES method can better utilize the local classification advantages of weak classifiers. Recently, DES has also been introduced into hyperspectral image classification. Damodaran et al. [46] first proposed using DES for hyperspectral image classification and further improved the method [47]. It first combines a dimensionality reduction process with the dynamic selection method to construct a DES framework. Then, the random subspace method, Markov random field, and extreme learning machine are introduced into DES. The experimental results show that the two proposed methods can obtain better classification performance than traditional ensemble learning methods. However, the typical DES method is a classifier selection strategy, and some very weak classifiers with high classification accuracy for specific regions will still be eliminated. Therefore, the local advantage of each classifier is still not fully utilized. Meanwhile, DES models directly fuse the classification results without considering the residual differences of these classifiers.
Based on the above analysis, the static ensemble offers some improvement in classification accuracy over a single-classifier method. However, most traditional ensemble methods directly fuse the classification results without considering local performance. Typical DES methods often use clustering or K-nearest neighbor (K-NN) methods to divide the classification target into different regions, but they do not consider local and global information jointly. Ensemble learning based on a representation learning model is concise, computationally efficient, and has strong generalization ability.
However, the existing representation learning-based ensemble methods are still built on traditional ensemble strategies, which do not fully exploit the intrinsic principles of the models. Therefore, how to fully consider the possible local classification accuracy of extremely weak classifiers and the diversity of CR-based classifiers in ensemble learning is still an open problem. The goal of this article is thus to make full use of the ability of DES to obtain a priori information, and then to improve the CR classifier (CRC) directly by weighting its residuals. This approach accounts for the inherent discrepancies among different representation learning models through residual analysis. The idea of DES is introduced to use prior information about classifier behavior. Meanwhile, compared with traditional ensemble learning, the unique advantages of each base classifier are fully utilized by double-weighting the residuals directly, taking into account both local and global information about the classifiers. The methods proposed in this article have the following advantages. First, the two methods make full use of the representation learning model for the ensemble. Second, local and global prior information is used for weighting. Finally, the proposed methods also consider the inherent differences among the classifiers and use the specific advantages of each base classifier, which yields better classification results. Notably, unlike the traditional DES method, the region of competence (RoC) used in this article serves only to obtain prior information about the classifiers, without a selection process. Since the residuals of different CR-based classifiers differ, this prior information is used to construct a weight matrix that constrains the final ensemble result. That is, a new classifier-weighted learning strategy is proposed based on the DES strategy.
The two methods proposed in this article do not perform classifier selection but instead weight each classifier according to prior knowledge. Two classifier ensemble strategies, called local weighted residual (LWR) and double-weighted residual (DWR) dynamic ensemble learning, are proposed. The major contributions are summarized as follows: 1) Multiple CR-based classifiers are combined through the use of residuals. The prior behavior information of each classifier is obtained by constructing validation samples to constrain the residuals of CR. Combining classifiers from the perspective of residuals makes better use of the differences between individual classifiers. Meanwhile, the misclassification problem caused by insignificant residual differences can be avoided by using the prior information to weight the residuals. 2) The article proposes to obtain prior information about the classifiers and use it directly in a residual ensemble to better exploit the unique advantage of each classifier. Prior information about classification behavior in multiple target regions is obtained by K-NN and clustering. The residuals of each classifier are then constrained with local prior weights, so the behavior of each classifier is constrained by weighting its residuals. Unlike the traditional ensemble method, which fuses classification results, this method directly weights the residuals of the individual classifiers, making better use of the local classification advantages of each classifier.
3) The article proposes to use both local and global information to doubly constrain the classifiers and obtain the ensemble result. Local weights are used to constrain the behavior of each classifier while global information is also considered: the clustering method provides the local information of each classifier, and K-NN provides its global behavior information.
Finally, a more reliable double-weight constrained classifier fusion result is obtained. This method simultaneously considers each classifier's regional and global behavior on the unknown testing samples while ensuring that weak classifiers can also be fully utilized. The remainder of this article is organized as follows. Section II introduces related work. Section III proposes the LWR-DEL and DWR-DEL algorithms. In Section IV, experiments and analyses on three real hyperspectral data sets are presented. Finally, Section V concludes this article.

II. RELATED WORK

A. Residuals of CR-Based Classifiers
The basic idea of sparse representation comes from compressed sensing: it is assumed that the testing data can be represented by the fewest possible samples. When the samples are highly correlated, the projection of the sample y onto each class may be roughly the same, and the result of the sparse representation classifier is unstable. Therefore, CRC is proposed, which is described in detail as follows. Given a matrix of training samples X = [X_1, X_2, ..., X_k] ∈ R^{m×n} for k classes and a testing sample y ∈ R^m, the objective is to solve the ℓ2-minimization problem

α̂ = arg min_α ||y − Xα||_2^2 + λ||α||_2^2.  (1)

The residual of each class is computed as

r_i(y) = ||y − X_i α̂_i||_2,  i = 1, ..., k  (2)

where X_i and α̂_i denote the training samples of the ith class and the corresponding coefficients. Finally, y can be classified as

class(y) = arg min_i r_i(y).  (3)
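As a concrete illustration of the steps above, a minimal NumPy sketch of CRC is given below: the ℓ2 problem is solved in closed form and the class-wise residual rule makes the decision. The toy dictionary and function names are illustrative, not the authors' implementation.

```python
import numpy as np

def crc_classify(X, labels, y, lam=0.1):
    """Collaborative representation classifier (CRC) sketch.

    X      : (m, n) dictionary whose columns are training samples
    labels : (n,)   class id of each column of X
    y      : (m,)   testing sample
    lam    : ridge regularization parameter lambda
    """
    n = X.shape[1]
    # Closed-form l2-minimization: alpha = (X^T X + lam*I)^(-1) X^T y
    alpha = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
    classes = np.unique(labels)
    # Class-wise residuals r_i(y) = ||y - X_i alpha_i||_2
    residuals = np.array([np.linalg.norm(y - X[:, labels == c] @ alpha[labels == c])
                          for c in classes])
    return classes[int(np.argmin(residuals))], residuals
```

On a two-class toy dictionary, a testing vector close to the class-0 samples yields a smaller class-0 residual and is labeled accordingly.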
Based on CRC, improved variants have been developed from kernel tricks (KCRC) [48], a probabilistic interpretation (ProCRC) [49], and other perspectives. Fundamentally, the basic principle of all CR-based classifiers is still the ℓ2-minimization problem.
However, the significant differences among the improved CR-based classifiers make their residuals dissimilar in value. Therefore, a suitable residual fusion method for an optimal CR-based classifier ensemble is sought in this article. Normalized residuals are used throughout because of the large differences among the residuals obtained by the multiple CR-based classifiers.
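Since the article does not spell out the normalization scheme, the sketch below assumes a simple sum-to-one normalization of each classifier's residual vector, which is one common way to make residuals of different CR variants comparable before fusion.

```python
import numpy as np

def normalize_residuals(r):
    """Rescale one classifier's raw residual vector to sum to 1.

    r : (n_classes,) raw class residuals of a single CR-based classifier.
    The relative ordering of classes is preserved; only the scale changes,
    so residuals from CRC, KCRC, and ProCRC can be fused on equal footing.
    (Assumed scheme, for illustration only.)
    """
    r = np.asarray(r, dtype=float)
    return r / r.sum()
```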

B. Dynamic Selection
DES is an ensemble strategy that chooses an expert classifier for each testing region in the feature space. Unlike traditional ensemble methods, the DES method assumes that each classifier has its own advantage: even a weak classifier may have better classification performance for some classes and some testing samples. Therefore, the most competent classifiers can be chosen for specific instances. For DES, the most essential concept is competence. It should be noted that competence here refers to the classification ability of each base classifier in a specific area. The general process of DES can be divided into three main steps: RoC definition, competence estimation, and selection strategy. RoC definition is the critical step, and it is also the main idea introduced into the algorithms proposed in this article. It is summarized as follows: 1) RoC Definition Based on Clustering: The first common RoC definition method is based on clustering. Given validation and testing samples, they are assumed to share the same classes and feature space. The numbers of validation and testing samples and the corresponding classes are described in detail in Tables I-III, respectively. Then, according to the concept of multiview clustering, the validation set and the testing set can be divided into the same homogeneous regions according to a specific pattern similarity measure. As shown in Fig. 1, the validation and testing sets are divided into the same homogeneous regions, and each base classifier can be considered to have the same classification ability for the corresponding region roc_i. Therefore, the prior competence information of each base classifier can be obtained through the validation set. Finally, the most suitable base classifier is assigned to each RoC of the testing set. In fact, by using the clustering method to divide the RoC, the local prior information of the samples in each cluster can be obtained.
This method has been shown to be effective in selecting classifiers with local classification ability.
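The clustering-based RoC definition amounts to estimating a per-region accuracy for every base classifier on the validation set. A minimal sketch follows, assuming the cluster (region) index of each validation sample has already been produced by some clustering algorithm; the array names are illustrative.

```python
import numpy as np

def local_weight_matrix(val_preds, val_labels, val_regions, n_regions):
    """Prior local competence of each classifier in each RoC.

    val_preds   : (n_clf, n_val) predicted labels of each base classifier
                  on the validation set
    val_labels  : (n_val,) true labels of the validation samples
    val_regions : (n_val,) cluster (RoC) index of each validation sample
    Returns W with W[i, j] = accuracy of classifier i inside region j.
    """
    W = np.zeros((val_preds.shape[0], n_regions))
    for j in range(n_regions):
        mask = val_regions == j
        if mask.any():
            # fraction of region-j validation samples each classifier got right
            W[:, j] = (val_preds[:, mask] == val_labels[mask]).mean(axis=1)
    return W
```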
2) RoC Definition Based on K-NN: The K-NN method is another standard RoC partitioning method in DES. It uses K-NN to construct the k nearest validation samples for each instance in the testing set. Unlike the clustering method, which defines N homogeneous regions RoC = [roc_1, ..., roc_N], the regions obtained through K-NN can be regarded as a partition carrying global information, because k validation samples are constructed for every instance in the testing set. As shown in Fig. 2, given a testing set X_t = [x_t^1, ..., x_t^m], the k global validation samples of each testing sample can be obtained through the K-NN method. Similarly, these k validation sample sets can be considered homogeneous with the testing set. As with the clustering-based method, the classification accuracy of each base classifier is obtained from the validation set. The difference is that the K-NN method obtains the global prior classification information of each base classifier in the pool.
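The K-NN-based definition can be sketched analogously: for one testing sample, the k nearest validation samples (Euclidean distance is assumed here) supply a global accuracy estimate for each classifier in the pool. Names are illustrative, not the authors' code.

```python
import numpy as np

def global_weights(val_feats, val_labels, val_preds, x_test, k):
    """Accuracy of each base classifier on the k validation samples
    nearest to a single testing sample.

    val_feats  : (n_val, d) validation sample features
    val_labels : (n_val,)   true labels
    val_preds  : (n_clf, n_val) base-classifier predictions on validation set
    x_test     : (d,) testing sample
    """
    dist = np.linalg.norm(val_feats - x_test, axis=1)
    nn = np.argsort(dist)[:k]  # indices of the k nearest validation samples
    return (val_preds[:, nn] == val_labels[nn]).mean(axis=1)
```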

III. PROPOSED METHODS
The details of the proposed LWR-DEL and DWR-DEL algorithms are described from three aspects: classifier pool construction, weight matrix calculation, and residual weighting and fusing.

A. Necessity of Weighted Residual Fusion
As can be seen from Fig. 3, the normalized residual results of CRC (λ = 1e−1), ProCRC (λ = 1e−1, γ = 1e−1), and KCRC (λ = 1e−1) for the same pixel are quite different. For a single classifier, as shown in Fig. 3(a), the CRC residuals of classes 4, 7, and 12 differ only slightly. As can be seen from Fig. 3(b) and (c), the overall residual discrimination of ProCRC and KCRC is very low: if converted into probability outputs, the probabilities of the pixel belonging to each of the 16 classes are similar. Meanwhile, Fig. 3(d) shows that the residuals of the three classifiers also differ from one another, so the result of residual fusion is unreliable if the residuals are simply added. In summary, different CR-based classifiers are not discriminative in their individual classification results for certain pixels, and simple addition cannot be used for residual fusion. Therefore, we introduce the local and global prior information obtained by DES and propose the following two weighted residual fusion algorithms.

1) Classifier Pool Generation: Three different CR classifiers, i.e., CRC, KCRC, and ProCRC, are selected to ensure diversity among the models, and various parameter settings are used to further increase the differences among the base classifiers.

2) Weight Matrices Calculation: Based on the prior accuracy of each base classifier in each RoC, the local weight matrix is constructed, where w_ij represents the weight of the classification accuracy of the ith classifier in the pool for the jth region.

3) Residuals Weighting and Fusing:
According to (1), the representation coefficients of the samples in each RoC can be obtained using classifier clf_i, and the problem is solved in closed form. The coefficient matrix and the residual matrix of each classifier in the pool are then computed. Finally, using the weight matrix in (4), the LWR is obtained by weighting the residuals, where w_ij represents the weight of the classification accuracy of the ith classifier in the pool for the jth region and r_ij is the corresponding residual. The final classification result is then obtained according to (3) as class(y) = arg min LWR(y).
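The LWR fusion step above can be sketched as a weighted sum of residuals followed by the arg-min rule. The residuals are assumed already normalized, and w_region holds the weights w_ij of the region containing the sample (illustrative names, not the authors' code).

```python
import numpy as np

def lwr_classify(R, w_region):
    """Locally weighted residual fusion for one testing sample.

    R        : (n_clf, n_classes) normalized residuals of each classifier
    w_region : (n_clf,) local weights w_ij for the sample's region j
    Returns the index of the class with the smallest fused residual.
    """
    fused = (w_region[:, None] * R).sum(axis=0)  # weighted residual per class
    return int(np.argmin(fused))
```

A classifier with a high local weight dominates the fused decision, which is how the regional prior constrains the ensemble result.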
3) Residuals Weighting and Fusing: Similar to LWR-DEL, the double-weighted residual is obtained using the local weight matrix in (4) together with the global weight in (11), where w_ij represents the weight of the classification accuracy of the ith classifier in the pool for the jth region and r_ij is the corresponding residual. The final classification result is obtained as class(y) = arg min DWR(y).
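DWR-DEL adds the K-NN-based global factor on top of the local one. The sketch below assumes the two weights combine multiplicatively, which is one plausible reading of the double-weighted constraint; the exact combination is given by (4) and (11) in the article.

```python
import numpy as np

def dwr_classify(R, w_local, w_global):
    """Double-weighted residual fusion for one testing sample.

    R        : (n_clf, n_classes) normalized residuals of each classifier
    w_local  : (n_clf,) clustering-based local weights (sample's region)
    w_global : (n_clf,) K-NN-based global weights
    Returns the index of the class minimizing the double-weighted residual.
    """
    fused = ((w_local * w_global)[:, None] * R).sum(axis=0)
    return int(np.argmin(fused))
```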

A. Experiment Setup
All experiments are implemented on the platform of Python 3.6.13. To ensure comparability and fairness, the compared and proposed algorithms use the same training, validation, and testing samples. 2) Parameter Settings: The range of the cluster number n_r in LWR-DEL and DWR-DEL is set to {2, 3, 4, 5}. The range of the neighbor number n_k in DWR-DEL is set to {1, 2, 3, 4, 5}. It is worth noting that repeated sampling with replacement is used for the selection of training and testing samples. This method increases the randomness of the samples and avoids the impact of sample importance on classification accuracy.
3) Comparison Algorithms: To evaluate the performance of the proposed algorithms, multiple classification algorithms are used for comparison. The classic machine learning algorithms support vector machine (SVM) and random forest (RF) serve as baselines. Moreover, the advanced ensemble algorithms GBDT, CatBoost [28], LightGBM [50], and XGBoost [51] are also used for comparison. In addition, two state-of-the-art DES algorithms, namely the DES-MI and META-DES algorithms, are used as comparative algorithms in this article.

B. Hyperspectral Data Sets
The performance of the LWR-DEL and DWR-DEL is evaluated by three real HSI data sets.
The first data set is the Indian Pines data set, collected by the AVIRIS sensor. This data set contains 224 spectral bands with wavelengths ranging from 0.4 to 2.5 μm. After removing the water absorption bands, 200 effective bands remain. The spatial size is 145 × 145 pixels. The details of the 16 classes in this HSI are described in Table I, and the images are shown in Fig. 6.
The second image used in this article is the University of Pavia data set, acquired by the ROSIS sensor. This image contains 103 spectral bands with wavelengths ranging from 0.43 to 0.86 μm. The scene consists of nine classes, contains 512 × 614 pixels, and has a spatial resolution of 20 m. The descriptions of the classes in this data set are listed in Table II, and the images are shown in Fig. 7. The third data set is the real HSI Yellow River, collected on January 7, 2019, by the Gaofen-5 sensor [52], [53]. This image contains 285 spectral bands, and its spatial size is 1185 × 1324 pixels. There are 21 classes in this data set, as shown in Table III, and the false-color image and ground-truth image are shown in Fig. 8.

C. Ensemble Classification Performance Analysis
For an ensemble learning method, the key requirement is that the final classification result be better than that of every base classifier in the pool. Traditional ensemble learning methods often place high demands on the base classifiers; for example, their classification accuracy should exceed 50%. To verify whether the ensemble performance of the proposed methods is effective, we compare the accuracy of the base classifiers with that of the proposed methods. As shown in Fig. 9(a), for the first data set, the accuracy of the two proposed algorithms is higher than that of all base classifiers. For the second data set, the same conclusion can be drawn, and the final classification performance is even better than for the first. For the Yellow River data set, the classification accuracy of LWR-DEL and DWR-DEL is much higher than that of most base classifiers. Overall, the ensemble results of the two proposed models are higher than those of the base classifiers for the Indian Pines and University of Pavia data sets, and the proposed methods also obtain good performance on the last data set. According to Fig. 9, the proposed models can still achieve high accuracy even when the base classifier accuracy is low: even when the base classifier accuracy is below 85%, the final ensemble accuracy remains above 95% [see Fig. 9(c)]. These results show that the two ensemble models can fully utilize each base classifier's advantages in different regions; that is, the prior information effectively constrains the behavior of each base classifier. For the Indian Pines data set, the OA, average accuracy (AA), per-class accuracy, kappa statistic, F1-score, and running time (s) of the different models are shown in Table IV. The classification maps are shown in Fig. 10(a)-(j).
In the first experiment, the OA of LWR-DEL and DWR-DEL reached 88.89% and 89.13%, respectively. The two proposed algorithms achieve better classification performance than the other comparative models. Moreover, the proposed methods are superior to the state-of-the-art DES methods DES-MI and META-DES, while LWR-DEL and DWR-DEL do not require much running time. Compared with CatBoost and LightGBM, the OA of the proposed algorithms is higher, although the AA is not significantly improved. The time complexity of the proposed algorithms is much lower than that of LightGBM, and the classification accuracy of LWR-DEL and DWR-DEL is also much higher than that of the XGBoost algorithm. In terms of the F1-score, the performance of the several ensemble algorithms is relatively similar. Therefore, the two proposed dynamic ensemble algorithms outperform the existing baselines and state-of-the-art ensemble learning methods.
To evaluate the performance of the two new DEL models, the University of Pavia data set was used in the second experiment. The best parameters are described in Table V, and the thematic maps of various models are displayed in Fig. 11(a)-(j). Similar to the Indian Pines data set, LWR-DEL and DWR-DEL obtain the best classification performance compared with other methods. The best OA and AA for the ROSIS data set are obtained by the DWR-DEL algorithm, which can reach 98.41%. Compared with the classic ensemble learning method GBDT, LWR-DEL and DWR-DEL yield 15.43% and 15.57% improvements. Moreover, compared with the three DES methods, the two proposed algorithms also have great performance.
For the Yellow River data set, the classification performance of all classifiers is listed in Table VI, and the thematic maps are shown in Fig. 12(a)-(j). Compared with classic machine learning classifiers such as SVM and RF, our methods yield nearly 5% and 3% improvements, respectively. Compared with the CatBoost and LightGBM algorithms, the classification accuracy of LWR-DEL and DWR-DEL does not improve significantly, but the required running time is greatly reduced. Meanwhile, the classification accuracy of LWR-DEL and DWR-DEL is much better than that of the XGBoost algorithm. The experiments on the Yellow River data set demonstrate that the time cost and classification accuracy of LWR-DEL and DWR-DEL are superior to those of the other comparison algorithms.
To verify that weak classifiers contribute to the proposed methods, a set of comparative experiments is set up in this article. As shown in Fig. 18(a)-(c), for the three real HSI data sets, the ensemble of all classifiers achieves better classification accuracy than the ensemble with the weak classifiers removed. That is, the accuracy decreases when some weak classifiers are excluded. This result indicates that the proposed methods do take advantage of the classification abilities of the weak classifiers to a certain extent.

1) Sensitivity in Relation to Region Size for LWR-DEL:
For LWR-DEL, the number of RoCs (clusters) n_r significantly impacts the classification performance. To evaluate the influence of various values of n_r, Fig. 13(a)-(c) shows the accuracy of LWR-DEL as n_r varies over its range.
For the first data set, the classification accuracy first increases and then decreases as n_r increases; the best accuracy is obtained when n_r = 3. For the University of Pavia data set, as n_r gradually increases, the classification performance of LWR improves, and the optimal classification accuracy is reached when n_r = 4. Unlike the other two data sets, for the Yellow River data set, the effect of n_r on the classification accuracy shows no apparent regularity; the best results are obtained when n_r = 5.
In summary, the experimental results show that different values of the parameter n_r affect the accuracy on the three data sets to different degrees. However, as n_r changes, the accuracy on the three data sets does not vary much. This result shows that the algorithm is sensitive to n_r only within a limited range, with little effect on overall classification performance. Thus, the proposed method is robust, and the parameter does not need to be tuned carefully in practical applications.
2) Sensitivity in Relation to Region and k Size for DWR-DEL: Since the proposed DWR-DEL adds a global weight constraint based on K-NN, the combined effect of the parameters n_r and n_k on the classification accuracy is investigated in this article. First, it can be seen from Figs. 14(a)-(c) and 15 that after adding the prior global information, the variation trend of classifier accuracy with the parameters changes significantly compared with LWR-DEL (see Fig. 13). Simultaneously, the classification accuracy on the three data sets is significantly improved. This shows that the double-weight constrained method proposed in DWR-DEL effectively changes the behavior of the classifiers in different regions.
For the Indian Pines dataset, when the parameter n_r = 2, the classification accuracy is generally lower, and when n_r = 5, the classification accuracy is overall higher. Notably, this trend should be a combined effect of n_k and n_r. When n_k = 1 and n_r = 5, Algorithm 2 can get the highest accuracy. The results fully illustrate the impact of DWR-DEL on the behavior of the classifier.
Similar to the first dataset, when the parameter n_r = 2, the classification accuracy of the University of Pavia data set is overall lower. The difference is that when the parameter n_k = 2, DWR performs better for classifying the University of Pavia data set. When n_k = 2 and n_r = 5, Algorithm 2 can get the highest accuracy.
As with the first two datasets, the parameters n_r and n_k have a minor impact on the classification accuracy of the Yellow River dataset. The classifier accuracy has a slight trend of change. When n_k = 5 and n_r = 4, Algorithm 2 can get the highest accuracy.
To sum up, compared with LWR-DEL, the new method improves the classification accuracy, while the influence of parameter changes on the accuracy is not apparent. Overall, when the same parameters are selected for the three data sets, the variation in accuracy is small. These experimental results demonstrate the high robustness of the DWR-DEL algorithm.
3) Comparison of Original and Weighted Residuals: The two algorithms proposed in this article are based on the observation that the residual discrimination of multiple CR-based classifiers is not sufficiently obvious. Therefore, it is important to verify whether the proposed residual fusion algorithms can effectively increase the final residual discrimination. To validate the effectiveness of the proposed weighted residual ensemble methods, this article compares the residual distributions of the base classifiers with that of the final ensemble result. As shown in Fig. 16, the residual distribution of LWR-DEL is clearly more discriminative than those of the base classifiers. Meanwhile, it should be noted that the range of residuals after local weighting is also wider than that of all base classifiers. Similar conclusions can be drawn from Fig. 17: the DWR-DEL method can also increase the discrimination of the residuals. These experimental results fully demonstrate the effectiveness of the proposed LWR-DEL and DWR-DEL algorithms.

V. CONCLUSION
In this article, two new weighted residual ensemble learning strategies with CR are proposed, which introduce the idea of DES. First, a locally weighted dynamic ensemble algorithm is proposed, which uses the prior accuracy information of each classifier in different clusters as a constraint; residual weighting is then performed on the different CR-based classifiers. Furthermore, the article uses the nearest neighbors of the testing samples to obtain prior global information, and a dynamic ensemble method with local and global double-weighting constraints is proposed. Unlike traditional static and dynamic ensemble methods, the two algorithms weight the residuals to obtain more distinguishable classification results. The experiments show that both the LWR-DEL and DWR-DEL algorithms provide better classification performance than state-of-the-art classifiers. Compared with the base classifiers in the classifier pool, the ensemble results of the two proposed models also have better accuracy. The experimental results fully demonstrate the feasibility and effectiveness of the proposed algorithms. However, only the Euclidean distance between samples is used in the LWR-DEL and DWR-DEL algorithms, and the relationship between spectral features is not fully considered. Future research will focus on designing more suitable distance metrics that consider both spatial and spectral features.