GPU-Accelerated CatBoost-Forest for Hyperspectral Image Classification Via Parallelized mRMR Ensemble Subspace Feature Selection

In this article, the graphics processing unit (GPU)-accelerated CatBoost (GPU-CatBoost) algorithm for hyperspectral image classification is first introduced and comparatively studied using diverse features. To further improve the classification performance in terms of both accuracy and efficiency, an ensemble version of GPU-CatBoost, the GPU-accelerated CatBoost-Forest (GPU-CatBF) algorithm, is proposed by adopting the parallelized minimum redundancy maximum relevance (mRMR) ensemble (PmRMRE) feature selection (FS) algorithm. To evaluate the performance and suitability of mRMR and PmRMRE, 11 other state-of-the-art FS algorithms are comprehensively investigated. Experimental results on three widely acknowledged hyperspectral benchmarks show that: 1) GPU-CatBoost is an advanced ensemble learning (EL) algorithm for hyperspectral image classification using diverse features; 2) mRMR and PmRMRE have advanced properties for highly discriminative feature and band selection, with the best results achieved by PmRMRE in most cases in terms of both robustness and computational efficiency; and 3) GPU-CatBF always outperforms CatBoost and GPU-CatBoost, while comparable and even better results are reachable without losing much computational efficiency in contrast with the other selected decision-tree-based EL algorithms.


I. INTRODUCTION
Land cover mapping is one of the main applications of remote sensing (RS) data and is essential for understanding the patterns and driving factors of land cover changes on the earth's surface [1], [2]. Over the past 40 years, large numbers of supervised, unsupervised, and semisupervised shallow and deep classification methods have been developed to map land cover using RS data due to their superior robustness compared to model-based approaches [3]-[6]. Theoretically, and more practically, an ideal supervised classifier should be capable of addressing the following challenges.
As a novel and recent modification of ordered gradient boosting (OGB) but with categorical feature support, CatBoost has outperformed existing state-of-the-art algorithms such as gradient boosted DT (GBDT) [29], XGBoost [30], LightGBM [31], and H2O [32] on a diverse set of popular ML tasks in both central processing unit (CPU) and graphics processing unit (GPU) implementations [33]. However, according to our previous work in [34], the superior performance of CatBoost in terms of classification accuracy was observed in only a few cases with a large number of boosting iterations, and it was less computationally efficient than adaptive boosting (AdaBoost), GBDT, XGBoost with the classification and regression trees (CART) booster, and LightGBM in CPU implementation. This limitation could become an important challenge in the classification of large volumes of hyperspectral images with high dimensionality, especially in time-critical applications. Thanks to recent advances in high-performance computing techniques, accurate and efficient classification performance can be achieved for adopted classifiers by exploiting specialized devices, such as clusters and distributed computers, multicore CPUs, field-programmable gate arrays (FPGAs), and GPUs in hyperspectral image processing [35], [36]. Specifically, it is possible to greatly accelerate a classifier on a GPU-based parallel computing platform by benefiting from its capacity to perform many computationally intensive tasks in parallel [36]-[38]. Once the computational burden of the adopted classifier is greatly reduced, it is also possible to further boost the classification accuracy by constructing an EL system [36], [37], [39]. Hence, it is of interest to investigate the performance of the GPU-accelerated CatBoost (GPU-CatBoost) algorithm and its ensemble version in hyperspectral image classification using diverse features.
An EL system usually consists of two key components: a strategy to produce classifiers with high diversity and a rule to combine the results from multiple classifiers. The first key component, which is the cornerstone for constructing an effective EL system, can be achieved by resampling, label switching, feature partitioning, feature selection (FS), feature extraction (FE), model parameter shuffling, and hybridization techniques [5], [13], [40]-[42]. In contrast with other methods, FS- and FE-based techniques are capable of addressing the curse of dimensionality and tasks with high feature-to-instance ratios [43], [44]. In contrast with FS-based EL algorithms, the computational burden brought about by the FE procedure is always a challenge, particularly in the classification of large volumes of hyperspectral images with high dimensionality [15], [25]. Additionally, in the sense of maintaining the statistical and physical meanings of the original features, FS is superior in terms of better readability and interpretability. Hence, we selected the FS strategy to construct our proposed GPU-CatBF model.
Theoretically, any FS method can be adopted to produce classifiers with high diversity to construct an EL system. However, practically speaking, a robust and highly efficient FS method is always the best option. Dimensionality reduction via FS, including band selection, is one of the most popular techniques to remove noise and redundant features, improve the learning performance, reduce the computational cost, build models with better generalizability, and decrease the amount of storage required in the context of hyperspectral image processing [45]-[47]. Therefore, numerous FS algorithms have been introduced and proposed in this field in the past few decades.
Based on whether a labeled training set is available, FS algorithms can be grouped into supervised, unsupervised, and semisupervised algorithms [48]-[50]. For the classification problem, FS aims to select highly discriminative features that are capable of discriminating samples belonging to different classes. In this regard, supervised FS algorithms work better than unsupervised and semisupervised algorithms when sufficient labeled samples are available. Based on the relationship between an FS algorithm and the inductive learning method used to infer a model, supervised FS algorithms can further be broadly categorized into three types: filter-, wrapper-, and embedded-based methods [49]. Compared with wrapper-based methods, which use a single learner as a black box to evaluate subsets of features according to their predictive power, and embedded-based methods, which perform FS in the training process and are usually specific to a given learning algorithm, the filter-based method, which selects the subset of features as a preprocessing step independent of the induction algorithm, is advantageous for its low computational cost and good generalization ability [51]. Being one of the most powerful filter-based FS methods in the ML community, as shown by its high citation count (more than 8000), the minimum redundancy maximum relevance (mRMR) [52] algorithm has been extensively studied in the fields of DNA microarray data classification [53], protein classification [54], gene expression [55], water resource system management [56], 3-D facial expression recognition [57], lung cancer detection [58], real-time static voltage stability assessment [59], and many others. However, only a few works have introduced and investigated the performance of mRMR for band selection from multispectral and hyperspectral images [60]-[63].
To further boost the performance of maximal-relevance and minimal-redundancy FS according to the mutual information (MI)-based maximal statistical dependency criterion, mRMRE was developed as an extension of mRMR using an ensemble technique [59]. However, mRMR is a centralized method, and its cost scales quadratically with the number of features and grows linearly with respect to the sample size [52], [65]. As a result, the computationally expensive nature of mRMR is inherited and further amplified in mRMRE and becomes a serious challenge in high-dimensionality and large-sample scenarios. To tackle this limitation, proposals have been made for the acceleration of mRMR and mRMRE using efficient parallelization techniques [65]-[67]. To the best of our knowledge, parallelized mRMRE (PmRMRE) has not yet been studied for FS-based hyperspectral image classification, in either its CPU- or GPU-based implementation. Therefore, in this work, to satisfy the need for a robust and highly efficient FS algorithm for the construction of an FS-based EL system, PmRMRE is selected to construct the subspace ensemble version of GPU-CatBoost, the GPU-CatBF, for hyperspectral image classification.
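The general ensemble idea behind mRMRE/PmRMRE, running a base selector on several (possibly parallel) resamples of the data and aggregating the selected features, can be sketched in plain Python. This is an illustrative stand-in, not the mRMRE or PmRMRE implementation: the function names, the bootstrap resampling, the thread pool, and the vote-count aggregation are all assumptions made for the sketch.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def ensemble_feature_selection(select_one, data, labels, m, rounds, seed=0):
    """Ensemble FS in the spirit of (P)mRMRE: run a base selector on
    several bootstrap resamples in parallel and rank features by how
    often they are chosen.

    `select_one(X, y, m)` is any base FS routine returning m feature
    indices (mRMR in the paper; here it is a caller-supplied stand-in).
    """
    rng = random.Random(seed)
    n = len(data)

    def one_round(_):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap sample
        X = [data[i] for i in idx]
        y = [labels[i] for i in idx]
        return select_one(X, y, m)

    with ThreadPoolExecutor() as pool:                    # parallel rounds
        runs = list(pool.map(one_round, range(rounds)))

    votes = Counter(f for run in runs for f in run)       # aggregate by voting
    return [f for f, _ in votes.most_common(m)]
```

In the actual algorithms, `select_one` would be mRMR itself, and the parallel backend (multiple CPU cores, or a GPU) is what makes PmRMRE efficient on high-dimensional data.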
For the second key component of an effective EL system, popular fusion methods, including majority voting, weighted majority voting, error pruning, meta fusion, Bayesian fusion, fuzzy integral, D-S evidence theory, consensual theory, Borda count, and algebraic rule-based methods, can be used to combine results from multiple classifiers [41], [68]. However, on the one hand, advanced but complex fusion methods are not suitable for large data with large ensemble sizes from a computationally efficient point of view; on the other hand, simple voting-based methods could limit or even degrade algorithm performance in small ensembles and in large ensembles with lower classifier diversity [26]. To overcome such limitations, a metaensemble criterion that might yield the best solution was adopted.
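As a concrete example of the simplest fusion rule mentioned above, plurality (majority) voting over the label predictions of multiple classifiers can be sketched as follows; the function name and data layout are illustrative, and the paper itself adopts a meta-ensemble criterion rather than plain voting.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists by plurality voting.

    `predictions` is a list of lists: one list of predicted labels per
    classifier, all over the same samples.  For each sample, the label
    predicted most often across classifiers wins.
    """
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]
```

A weighted variant would simply multiply each classifier's vote by, e.g., its validation accuracy; with small or low-diversity ensembles, both variants can be limited, which motivates the meta-ensemble criterion used here.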
The main contributions of this article are summarized as follows.
1) GPU-CatBoost was introduced and comparatively investigated for hyperspectral image classification using diverse features. 2) mRMR and PmRMRE were introduced and comparatively evaluated for discriminative subspace FS-based hyperspectral image classification. 3) To improve the classification performance, a new ensemble version of GPU-CatBoost, GPU-CatBF, was proposed by combining multiple GPU-CatBoost models trained on discriminative subspace features from PmRMRE.

A. CatBoost
Many solid theoretical and empirical results indicate that gradient boosting is a powerful ML method, especially for dealing with noisy data, heterogeneous features, and complex dependencies [29]-[33]. However, similar to standard boosting, classical gradient boosting also suffers from overfitting caused by the prediction shift of the learned model, also known as a special kind of target leakage [88]. Furthermore, categorical features with discrete sets of values that are not necessarily comparable to each other cannot be directly handled by binary trees. A common solution for using categorical features in gradient boosting is converting them to numerical features. In this regard, one-hot encoding (OHE), gradient statistics (GS), target statistics (TS), greedy TS, holdout TS, and leave-one-out TS solutions have been identified. Unfortunately, this transformation procedure can also cause target leakage and prediction shifts [88]. Hence, to avoid the overfitting and target leakage issues caused by gradient boosting and categorical feature transformation, CatBoost was proposed as a combination of OGB and ordered TS [33], [88].
Suppose that a dataset D = {(X_i, Y_i)}, i = 1, . . . , n, is observed, where X_i = (x_{i,1}, . . . , x_{i,d}) is a vector of d features (some numerical and some categorical) and Y_i ∈ R is a label value. CatBoost substitutes the categorical feature x_{σ_p,k} with the ordered target statistic

x̂_{σ_p,k} = (Σ_{j=1}^{p-1} 1[x_{σ_j,k} = x_{σ_p,k}] · Y_{σ_j} + a · P) / (Σ_{j=1}^{p-1} 1[x_{σ_j,k} = x_{σ_p,k}] + a)    (1)

where 1[·] is the indicator function, σ_1, . . . , σ_s are s random permutations of the dataset, P is a prior value, and a > 0 is the weight of the prior. Then, CatBoost can be built by following the pseudocode steps in Algorithm 1.

Algorithm 1: Pseudocode for CatBoost [88]

In Algorithm 1, I is the number of boosting iterations, L is the loss function, M_r(i) is the support model from the rth permutation using instance x_i, and Mode selects between the plain and ordered boosting modes; the former mode is the standard GBDT algorithm with inbuilt ordered TS. Due to limited space, [33] and [88] are recommended to readers interested in more detailed algorithmic descriptions. Both CPU and GPU implementations of CatBoost are publicly available.1
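For intuition, the ordered TS substitution above can be sketched in pure Python for a single permutation: each instance is encoded using only the label "history" of earlier instances with the same category, which is what avoids target leakage. This is a minimal illustrative stand-in, not the CatBoost library code.

```python
from collections import defaultdict


def ordered_target_statistics(categories, labels, prior, a=1.0):
    """Encode one categorical column with ordered target statistics.

    `categories` and `labels` are assumed to already be in the order of
    one random permutation.  For the p-th instance, only instances
    0..p-1 sharing the same category contribute; `prior` plays the role
    of P and `a` is its weight, as in the CatBoost formulation.
    """
    sums = defaultdict(float)   # running sum of labels per category
    counts = defaultdict(int)   # running count per category
    encoded = []
    for cat, y in zip(categories, labels):
        encoded.append((sums[cat] + a * prior) / (counts[cat] + a))
        sums[cat] += y          # update history only AFTER encoding
        counts[cat] += 1
    return encoded
```

CatBoost averages such encodings over several permutations; a greedy (non-ordered) TS would instead use the whole column at once and leak the target.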

B. Minimum Redundancy Maximum Relevance
In pattern recognition applications, the definition of optimal characterization often means the minimum classification error. In an unsupervised case where the classifiers are not specified, minimal error requires the maximal statistical dependency of the target class c on the data distribution in the selected subspace R^m. However, it is often difficult to obtain an accurate estimation of the maximal dependency for a multivariate density, which often involves ill-posed problems. In addition, the computational complexity of maximal dependency is its most pronounced drawback, not only for continuous feature variables but also for discrete and categorical features. Alternatively, the maximal relevance, which is usually characterized in terms of correlation or MI, can be used to realize maximal dependency efficiently [53].
In terms of MI, the maximal dependency criterion tries to find a feature set S with m features that jointly has the largest dependency on the target class c [53]

max D(S, c), D = MI({x_i, i = 1, 2, . . . , m}; c)    (2)

where MI(x_i; c) represents the MI between feature x_i and class c. To approximate (2), the maximal relevance is measured by the mean value of all the MI values between the individual features and the target class c as follows:

max D(S, c), D = (1/|S|) Σ_{x_i ∈ S} MI(x_i; c).    (3)

However, features that are selected according to (3) could have rich redundancy, i.e., the dependency among these features could be large. As a result, the representative class-discriminative power would not change much if some of the features that are highly dependent on others were removed. To solve this issue, the minimal redundancy condition can be adopted for mutually exclusive features by minimizing the following [54]:

min R(S), R = (1/|S|^2) Σ_{x_i, x_j ∈ S} MI(x_i; x_j).    (4)

In practice, maximum relevance and minimum redundancy cannot always be achieved simultaneously. The optimization criterion that combines the above two constraints into a single one is called mRMR [54]

max Φ(D, R), Φ = D - R    (5)

where the MI terms can additionally be normalized by H(x_i) and H(x_j), the entropies of the ith and jth features, respectively.
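In practice the mRMR criterion is applied incrementally: pick the most relevant feature first, then repeatedly add the feature that maximizes relevance minus the mean redundancy with the already-selected set. This can be sketched with a simple plug-in MI estimator for discrete data; it is an illustrative sketch (the "difference" form of the criterion, empirical counts, no normalization), not the implementation from [52] or the one used in this article.

```python
import math
from collections import Counter


def mutual_info(xs, ys):
    """MI (in nats) between two discrete sequences, from empirical counts."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())


def mrmr(features, target, m):
    """Greedily pick m feature indices maximizing relevance - redundancy.

    `features` is a list of columns (each a sequence of discrete values).
    At each step, the candidate score is MI(x_j; c) minus the mean MI
    between x_j and the already-selected features.
    """
    relevance = [mutual_info(f, target) for f in features]
    selected = [max(range(len(features)), key=lambda j: relevance[j])]
    while len(selected) < m:
        def score(j):
            red = sum(mutual_info(features[j], features[i]) for i in selected)
            return relevance[j] - red / len(selected)
        rest = [j for j in range(len(features)) if j not in selected]
        selected.append(max(rest, key=score))
    return selected
```

On a toy set with a relevant feature, an exact duplicate of it, and a weakly redundant irrelevant one, the duplicate is correctly skipped, which is the behavior the redundancy term (4) exists to enforce.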

III. PROPOSED METHOD
Diversity is the cornerstone of constructing an effective EL system, and the underlying rationale is that diversified classifiers lead to uncorrelated errors, which in turn improve the classification accuracy. Although many diversification techniques, as mentioned above, are available, FS-based techniques are not only capable of addressing the curse of dimensionality and high feature-to-instance-ratio tasks but are also superior in terms of their computational efficiency and better feature readability and interpretability. Furthermore, the feature subset selection algorithm not only takes the performance of the ensemble into account but also directly supports the diversity of the subsets of features. Additionally, from the viewpoint of constructing an accurate and efficient EL system, a robust and highly efficient FS algorithm is the best practical option. Therefore, the GPU-CatBF algorithm is proposed by utilizing the GPU-CatBoost and PmRMRE algorithms for subset FS, as presented in Fig. 1.
Indeed, a robust and highly efficient FS algorithm is an ideal choice, but the performance of an EL system constructed simultaneously using a robust FS algorithm (e.g., PmRMRE) and a classifier (e.g., GPU-CatBoost) can be limited due to a lack of diversity. For example, it is highly possible that the advanced FS algorithm PmRMRE could return very similar and even exactly the same feature subsets from two independent runs. To overcome this limitation, an incremental FS strategy was adopted in the subspace FS phase, and a metafusion criterion that might be capable of yielding the best results was adopted in the ensemble phase. Finally, the proposed GPU-CatBF algorithm can be built by following the algorithmic steps described in Algorithm 2.

Algorithm 2: Algorithmic Code Steps for CatBoost-Forest
Inputs: training set D with feature dimensionality d, incremental FS range [t_min, t_max] (t_max ≤ d), and step υ.
Process:
1) for t = t_min to t_max by step υ:
   a) F_t ← the set of features selected by PmRMRE at round t;
   b) D_t ← the new data keeping only the features in F_t;
   c) h_t ← the GPU-CatBoost learner trained using D_t;
   d) ε_t ← the classification error of h_t;
2) h* ← the learner with the lowest classification error.
Output: the final decision function H(x) for CatBoost-Forest.
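The loop structure of Algorithm 2 can be sketched generically in Python, with `select` standing in for PmRMRE and `train`/`predict` standing in for the GPU-CatBoost learner; all names and the best-learner bookkeeping are illustrative assumptions, not the authors' implementation.

```python
def subspace_ensemble(train, predict, X, y, tmin, tmax, step, select):
    """Sketch of the incremental subspace-ensemble loop of Algorithm 2.

    Each round t selects its own feature subset F_t, restricts the data
    to D_t, trains one learner on it, and records the training error, so
    a meta criterion can later combine the learners (here we also return
    the index of the lowest-error learner).
    """
    learners = []
    for t in range(tmin, tmax + 1, step):
        feats = select(X, y, t)                        # F_t  (PmRMRE stand-in)
        Xt = [[row[j] for j in feats] for row in X]    # D_t
        model = train(Xt, y)                           # h_t  (learner stand-in)
        err = sum(predict(model, x) != yi
                  for x, yi in zip(Xt, y)) / len(y)    # ε_t
        learners.append((feats, model, err))
    best = min(range(len(learners)), key=lambda i: learners[i][2])
    return learners, best
```

Because each round uses a different subset size t (the incremental FS strategy), even a deterministic selector yields different subsets per round, which is exactly how the algorithm sidesteps the low-diversity problem discussed above.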

A. Datasets
To evaluate the performances of the considered methods, three hyperspectral benchmark datasets, i.e., the Pavia University, GRSS-DFC2013 Houston, and GRSS-DFC2018 Houston datasets, are utilized in our experiments.
1) Pavia University: This hyperspectral image was acquired with the reflective optics system imaging spectrometer (ROSIS) optical sensor, which provides 115 bands with spectral coverage ranging from 0.43 to 0.86 μm. The geometric resolution is 1.3 m. The image shown in Fig. 2(a) was captured over the Engineering School, University of Pavia, Pavia, Italy. It has 610×340 pixels with 103 spectral channels (a few original bands are very noisy and were discarded immediately after data acquisition). The validation data refer to nine land cover classes; Table I shows the details about the number of samples and the legend.
2) GRSS-DFC2013 Houston: The classes of interest are reported in Table I with the corresponding number of samples for both the training and validation sets.
3) GRSS-DFC2018 Houston: This hyperspectral image was collected by the NCALM at the University of Houston on February 16, 2017, between 16:31 and 18:18 GMT using an ITRES compact airborne spectrographic imager (CASI)-1500 sensor covering a 380-1050 nm spectral range with 48 bands at a 1-m ground sampling distance (GSD). This data cube has been orthorectified and radiometrically calibrated to spectral radiance units (milli-SRU). The data were distributed in radiance, and the image size is 4172×1202 pixels. The 20 classes of interest defined by the DFTC of the GRSS are reported in Table I. The corresponding numbers of samples for the training and validation sets account for 1% and 99% of the total ground-truth samples, respectively, as shown in Fig. 2(h).
Finally, the classification overall accuracy (OA), kappa (κ) statistic, and code running time in seconds were used to evaluate the classification performances of the considered classifiers and FS algorithms. All the experiments were conducted using Python 3.7.8 on a 64-bit Windows 10 system with an Intel Core i7-7820X 3.60-GHz CPU and 128 GB RAM, and with an NVIDIA Quadro P4000 card with CUDA toolkit version 10.2.
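For reference, the two accuracy metrics can be computed directly from a confusion matrix (rows = ground truth, columns = predictions) as below; libraries such as scikit-learn provide equivalent functions, so this is only a minimal sketch.

```python
def overall_accuracy_and_kappa(conf):
    """OA and Cohen's kappa statistic from a square confusion matrix."""
    k = len(conf)
    n = sum(sum(row) for row in conf)
    oa = sum(conf[i][i] for i in range(k)) / n              # observed agreement
    # expected chance agreement from the row/column marginals
    pe = sum(sum(conf[i]) * sum(row[i] for row in conf)
             for i in range(k)) / n ** 2
    return oa, (oa - pe) / (1 - pe)
```

Kappa discounts the agreement expected by chance, which is why it is reported alongside OA for imbalanced land cover classes.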

A. Evaluation of mRMR and PmRMRE
To evaluate the performance of mRMR and PmRMRE for FS on hyperspectral images with diverse features, we first present the OA values of an RaF model (ensemble size = 200, other parameters set to the default values) with an increasing number of selected features using the considered FS algorithms on the experimental datasets. For a more objective performance evaluation, the mean OA values from ten independent runs of the test were used to draw graphs.
According to the results in Fig. 3, it can easily be noticed that the performance of mRMR can be further boosted by the ensemble, which is in accordance with the results in [59], [65]-[67]; see the "x"-marked lines in light blue for PmRMRE and the "x"-marked lines in cyan for mRMR. Moreover, compared with all the other adopted FS algorithms, PmRMRE shows the fastest convergence speed in all cases using the original spectral, EMPs, EMPPR, EMSER-MPsM, SP-MPsM, and MRS-OO features. The advanced performance of mRMR also holds for hyperspectral image FS, as shown by the "x"-marked lines in cyan. For instance, a faster convergence speed than the ReliefF, CFS, TraceR, FishS, MIM, and CIFE algorithms can be observed for mRMR on the original spectral features from the Pavia University data [see Fig. 3(a)]. The second-fastest convergence speed from mRMR can be observed on the original spectral features from the GRSS-DFC2018 Houston data [see Fig. 3(k)]. In addition, obviously faster convergence speeds than the ReliefF, GiniI, and TraceR algorithms can be observed for mRMR using the EMPs [see Fig. 3(g) and (l)], EMPPR [see Fig. 3(h) and (m)], and MRS-OO [see Fig. 3(i) and (n)] features from the GRSS-DFC2013 Houston and GRSS-DFC2018 Houston datasets, respectively. Nevertheless, MIM and CMIM could be alternative choices to mRMR when dealing with highly correlated features of high dimensionality, such as the EMPs and EMPPR features from the test images, according to the results shown by the "x"-marked lines in pink and brown in Fig. 3(g), (h), (l), and (m).
Aside from the robustness, the computational efficiency of an FS algorithm is another key factor that needs to be considered in practice. Hence, in Fig. 4, we present the CPU-based computational time costs for the considered FS algorithms using diverse features from the experimental datasets.
According to the results, we can clearly see that the highest computational efficiency is achieved by PmRMRE on all the considered features from the considered datasets, while FishS and TraceR achieve the second-best efficiency, and the lowest computational efficiency is attained by CFS, especially using the MRS-OO features with the highest dimensionality (see the results shown by the green bars). Similar to the ReliefF, JMI, DISR, ICAP, GiniI, MIM, CMIM, and CIFE algorithms, mRMR ranks at the third level of computational efficiency. In addition, while the computational efficiencies of CFS, JMI, DISR, ICAP, GiniI, MIM, mRMR, CMIM, CIFE, and PmRMRE decrease with increasing data dimensionality, no obvious effects are observed for the ReliefF, FishS, and TraceR algorithms on all three datasets.
Based on the results shown in Figs. 3 and 4, we can conclude that PmRMRE is an ideal algorithm for constructing an EL system from the perspectives of both robustness and efficiency. However, there is a high possibility that a robust FS algorithm (i.e., PmRMRE) returns highly similar and even exactly the same features in independent runs, which results in low classifier diversity and could limit the performance of the EL system. To avoid this issue, an incremental FS criterion was adopted. Considering the heterogeneous properties of the considered datasets in terms of dimensionality and landscapes, the values of the incremental FS range ([t_min, t_max], t_max ≤ d), as shown in Algorithm 2, need to be determined empirically.
According to the results in Fig. 3, we can see that RaF reached OA values higher than 68% using ten raw spectral features (approximately one-tenth of the total number of spectral bands), and there were no obvious improvements when using more than 30 raw spectral features for the Pavia University data [approximately one-third of the total number of spectral bands; see Fig. 3(a)]. Similarly, practically acceptable OA values (>65%) are reached by RaF when using only five selected raw spectral features (roughly one-tenth of the total number of spectral bands), and no obvious improvements are obtained using more than 20 features for the GRSS-DFC2018 data [see Fig. 3(k)]. Additionally, for the EMPs, EMPPR, and EMSER-MPsM features, approximately 8-10 features for the lower range and 25-30 features for the upper range can be noticed, as shown in Fig. 3(b)-(d) for the Pavia University data, Fig. 3(g)-(i) for the GRSS-DFC2013 Houston data, and Fig. 3(l)-(n) for the GRSS-DFC2018 Houston data. Based on the above results and for simple implementation, we roughly set the range to t_min = d/10 and t_max = d/3 for GPU-CatBF in the following experiments, where d is the feature dimensionality.
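The empirical rule above translates directly into the candidate subspace sizes swept by Algorithm 2; the helper name and the integer-division choice are illustrative.

```python
def incremental_fs_range(d, step=1):
    """Candidate subspace sizes t_min = d//10 .. t_max = d//3 for a
    dataset with d features, as set empirically in the text."""
    t_min, t_max = max(1, d // 10), d // 3
    return list(range(t_min, t_max + 1, step))
```

For the Pavia University data (d = 103), this yields sizes from 10 up to 34, matching the roughly one-tenth to one-third band counts observed in Fig. 3.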

B. Evaluation of the Proposed Method
Usually, the performance of a classifier is evaluated in terms of its classification accuracy and computational complexity. Hence, we show the OA values versus the ensemble sizes of the considered classifiers using various features from the considered datasets in Fig. 5 and the corresponding computational time costs in Fig. 6.
According to the results for the Pavia University data shown in the first row of Fig. 5, HistGBT (Cls1) and LightGBM (Cls11) showed the fastest convergence speed compared with the other classifiers using the raw spectral features (see the learning curves in red and brown, respectively), while the worst results were attained by GPU-XGB-RaF (Cls10) (see the learning curve in olive green). Moreover, GPU-CatBF (Cls12) reaches the upper bound of the OA values when the ensemble size is greater than 150; see the learning curve in dark green. Compared with the results from CatBoost (Cls6) and GPU-CatBoost (Cls7), shown by the learning curves in magenta and green, respectively, the learning curve from GPU-CatBF is always higher. The same superiority of GPU-CatBF over CatBoost and GPU-CatBoost in terms of classification accuracy can also be found in Fig. 5(b)-(d) for the EMPs, EMPPR, and MRS-OO features, respectively, and the highest OA values are reached by GPU-CatBF using the EMPs and EMPPR features, as shown by the dark green learning curves in Fig. 5(b) and (c), particularly when the ensemble size is greater than 150. Additionally, CatBoost, GPU-CatBoost, and GPU-CatBF showed a better capability of avoiding the well-known overfitting issue compared with AdaBoost (Cls4), GBDT (Cls5), and GPU-XGB-RaF (Cls10), as shown by the learning curves in spring green, blue, and cyan, respectively.
From the results shown in the second row of Fig. 5 for the GRSS-DFC2013 Houston data, it can be clearly seen that: 1) higher OA values are always available for GPU-CatBF compared with CatBoost and GPU-CatBoost; 2) GPU-CatBF shows the highest OA values in the cases using the EMPs, EMPPR, and MRS-OO features; and 3) worse classification results caused by overfitting are obvious for XGBoost with the RaF booster (Cls9) using the raw spectral, EMPs, EMPPR, and MRS-OO features and for ExtraTrees (Cls3) using the EMPPR features, as shown by the learning curves in cyan and grey, respectively. Looking at the graphs in the last row of Fig. 5, the best results with the highest OA values are reached by HistGBT (Cls1), LightGBM (Cls11), and XGBoost with the CART booster (Cls8) on the raw spectral, EMPs, EMPPR, and MRS-OO features, while results better than those from RaF (Cls2), ExtraTrees (Cls3), AdaBoost (Cls4), GBDT (Cls5), CatBoost (Cls6), GPU-CatBoost (Cls7), XGBoost-RaF (Cls9), and GPU-accelerated XGBoost-RaF (Cls10) are obtained by GPU-CatBF (Cls12) on the raw spectral, EMPs, and EMPPR features. Additionally, the OA of GPU-CatBF is maximized only when the ensemble size is greater than 150.
According to the computational cost results shown in Fig. 6, the following can be observed. First and foremost, RaF (Cls2, orange curves) and ExtraTrees (Cls3, grey curves) are the most efficient algorithms in all cases compared with the other classifiers. Moreover, GPU-CatBoost (Cls7) is always at least ten times faster than the CPU-based implementation of CatBoost (Cls6), as expected, as depicted by the cost curves in green and magenta, respectively. Furthermore, greater computational efficiency than HistGBT (Cls1) is reachable for GPU-CatBF (Cls12) on the Pavia University and GRSS-DFC2018 Houston datasets, particularly when the ensemble size is less than approximately 150. GPU-CatBF is more efficient than LightGBM (Cls11) on the GRSS-DFC2013 Houston data only when the ensemble size is less than 100. Finally, HistGBT (Cls1, red curves) is at least 100 times faster than the original LightGBM version when large amounts of training data are available (see the graphs in the second row of Fig. 6), which is in accordance with the assumption in [73].
Summarizing the above results drawn from Figs. 5 and 6, we empirically set the ensemble size of the proposed GPU-CatBF algorithm to 150 for both accurate and efficient classification.

C. Classification Results Comparison
To further compare the proposed classification approach for land cover mapping using hyperspectral images, Tables II-IV show the OA, kappa, and prediction time values from the considered classifiers using the various features of the three test images. For a fair comparison, the ensemble size for all the classifiers is set to the same value of 150, which is the recommended ensemble size for CatBoost, GPU-CatBoost, and the proposed GPU-CatBF algorithm.
Again, it can be clearly seen that the HistGBT classifier reaches the highest classification values for the raw spectral features from all three test images; see the highlighted numbers in bold in the third column of Tables II-IV. However, HistGBT is slower than all the other classifiers in the prediction phase, specifically using the raw spectral, first ten principal components (PC10), EMPs, EMPPR, SP-MPsM, and EMSER-MPsM features from the Pavia University and GRSS-DFC2018 Houston test images and using all features from the GRSS-DFC2013 test image; see the underlined numbers in the first row of Tables II-IV.

TABLE IV: OA, KAPPA, AND PREDICTION TIME VALUES FROM THE CONSIDERED CLASSIFIERS USING THE VARIOUS FEATURES OF THE GRSS-DFC2018

GPU-CatBF achieves higher OA values than CatBoost and GPU-CatBoost, but its prediction is slower. This finding holds for all cases of using various features from the considered test images and is in accordance with the previous results shown in Figs. 5 and 6. In addition, compared with the results from the other classifiers, GPU-CatBF is capable of reaching comparable and even better classification accuracy. For instance, GPU-CatBF reached the highest OA values using the EMPs, EMPPR, SP-MPsM, EMSER-MPsM, and MRS-OO features from the GRSS-DFC2013 Houston data; see the numbers in bold in the last row of Table III. For the Pavia University data, although the best results are not always obtained by GPU-CatBF, OA values higher than those from several other classifiers are present. For example, HistGBT and LightGBM obtained OA values of 92.37% and 92.10%, respectively, when using the EMPs features, while an OA value of 92.75% was reached by GPU-CatBF in a shorter prediction time.
From the results from the GRSS-DFC2018 Houston test data, although the highest OA values are reached by either HistGBT or LightGBM in most cases, OA values higher than those from GBDT and XGB-CART can be observed for GPU-CatBF using the raw spectral, EMPs, EMPPR, and MRS-OO features.
In our previous works [25], [26], [71], and [72], the performances of the EMSER-MPsM, SP-MPsM, and MRS-OO features were separately investigated in comparison with the MPs, EMPs, and EMPPR features. Hence, it is worth comprehensively comparing their performances here. First, according to the results shown in Tables II-IV and Figs. 6 and 7, it is clear that better classification results with higher OA values can be obtained by the SP-MPsM and EMSER-MPsM features than with the EMPs and EMPPR features. For example, while all the considered classifiers reached OA values between 89.04% and 96.94% using the EMSER-MPsM features from the Pavia University data, the OA value ranges shown by the considered classifiers are between 85.33% and 93.42% and between 84.13% and 89.85% when using the EMPs and EMPPR features, respectively. Furthermore, when we compare the classification results from the SP-MPsM, EMSER-MPsM, and MRS-OO features, generally better classification results are obtained by EMSER-MPsM than by SP-MPsM for the Pavia University and GRSS-DFC2013 Houston test images, while the best results are obtained from the MRS-OO features for all the test images. For instance, the area in the lower part of the GRSS-DFC2013 Houston image, which is covered by dense cloud shadows, is more precisely classified using the MRS-OO features.

VI. CONCLUSION
In this article, the GPU-CatBoost algorithm for hyperspectral image classification was introduced and comparatively studied in terms of classification accuracy and computational efficiency using diverse features. To further boost the classification performance by exploiting its high acceleration with respect to the CPU-based implementation of CatBoost, an incremental subspace FS-based ensemble version, GPU-CatBF, was proposed. To evaluate the performance of the proposed approach, 11 popular DT-based EL algorithms, namely, the HistGBT, RaF, ExtraTrees, AdaBoost, GBDT, CatBoost, GPU-CatBoost, XGB-CART, XGB-RaF, GPU-XGB-RaF, and LightGBM algorithms, were selected in the experiments. Moreover, 11 popular FS algorithms, namely, the ReliefF, CFS, JMI, DISR, ICAP, GiniI, FishS, MIM, TraceR, CMIM, and CIFE methods, were selected to evaluate the performance of mRMR and PmRMRE. According to the experimental results on three widely used hyperspectral benchmarks, the main conclusions are as follows.
1) Compared with existing popular FS algorithms, the superior properties of mRMR and PmRMRE for highly discriminative subspace FS from hyperspectral images characterized by diverse feature sets are clear, and the best results are shown by PmRMRE in terms of both robustness and computational efficiency. 2) Compared with popular DT-based EL algorithms and the CPU-based implementation of CatBoost, GPU-CatBoost is also an advanced EL algorithm for hyperspectral image classification using various features. 3) The proposed GPU-CatBF clearly improves on CatBoost and GPU-CatBoost, and similar and even better classification accuracy than that of the other EL algorithms is reachable for GPU-CatBF in some cases. Although GPU-CatBF outperforms CatBoost and GPU-CatBoost in terms of classification performance, the additional computational cost from a larger ensemble size is also clear. Additionally, as an advanced FS algorithm, the computational efficiency of PmRMRE could be further enhanced by a GPU-based implementation. Therefore, we will focus on the self-adaptive selection of the ensemble size of GPU-CatBoost, the GPU acceleration of PmRMRE, and mixed-precision techniques in future work.