HOLP-DF: HOLP Based Screening Ultrahigh Dimensional Subfeatures in Deep Forest for Remote Sensing Image Classification

To overcome the high intramodel dimensionality and low ensemble diversity issues, which limit the classification performance of original deep forest (DF), a new version of DF, the high-ordinary least square projection (HOLP) DF, was proposed in this article by introducing model-based HOLP feature screening (FS), random subspace propagation, and reduced error pruning techniques. To evaluate the performance of the proposed HOLP-DF, total eleven popular FS algorithms and total six advanced deep learning methods are selected. Experimental results on three widely acknowledged hyperspectral and PolSAR image classification benchmarks showed that: 1) HOLP is an optimal choice for FS in contrast with other screeners in terms of high classification accuracy and execution efficiency; 2) HOLP-DF is capable of obtaining better results than the original DF, DF with confidence screening and feature screening; 3) optimum sets of model depth, propaganda ratio and screening ratio parameters are 30, 40%, and 40%, respectively; 4) performance of HOLP-DF can be further boosted by extra usage of patch-based pooling and morphological profiling techniques.


I. INTRODUCTION
L AND cover classification has always been a research hotspot in the fields of remote sensing (RS) image processing and applications, which has received general attentions from many socio-economic and environmental application fields [1], [2], [3]. Among the many methods, machine learning (ML) based RS image classification has always been an active topic in the RS image processing and application communities, mainly due to their superior robustness and efficiency compared to conventional model-based approaches [4], [5], [6]. And in contrast with the conventional shallow ML methods, neural networks (NNs) based deep learning (DL) methods has gradually outperformed shallow ones and become the mainstream solution for the most RS image classification problems owing to their strong intramodel feature extraction ability, complex model structure, and plentiful parameters [5], [7], [8], [9], [10].
Even though remarkable advances have been achieved, still neither of those deep NNS (DNNs) based models can serve as the one solution to solve all problems. Particularly when considering the scenarios of learning from small-scale samples, predetermination of network topology structure, and the wellknown difficulty of theoretical analysis of black-box features [11]. After many attempts of theoretical analysis, the learning mechanism of NNs based DL are not completely clear, but it has also been preliminarily proved that layer-to-layer processing, in-model feature representation and sufficient model complexity are the three basic principles that may underpin the success of DNNs [11], [12]. Based on this, a novel non-NN style DL model named multigrained cascade forest (gcForest), which realized by nondifferentiable modulus without backpropagation training for constructing deep forest (DF) was proposed by Zhou and Feng [12]. In contrast with NNs based DL models, gcForest comes with the appealing properties of the following: 1) easy to deploy and train with much fewer parameters; 2) can achieve high predictive accuracy on datasets across different domains by using almost the same setting of hyperparameters; 3) capable of capturing contextual or structure features; 4) the model complexity can be determined automatically in a data independent way which enabling gcForest (represented by DF thereafter) to perform well even on smallscale datasets [11], [12]. Despite the remarkable advantages previously that have been proven from wide range of applications, original DF is also limited by the high time cost and memory requirement with owes much to the aspect of: 1) cascade structure of DF leading to a linear increase in price of time complexity as the numbers of level increases; and 2) multigrained scanning procedure significantly increase the number of training instances not only, also produces a high-dimensional input for the cascade This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ procedure [11]. To address these issues, gcForest has been improved by introducing techniques like confidence screening (CS) [13], feature screening (FS) [14], feature pooling and error screening [15], patch-based pooling, morphological profiling, and pseudolabeling (PL) [16]. However, these solutions were conducted on sufficient sample training sets, and may not hold for small samples (n) with large numbers of features (p) training set scenarios, the so-called large-p-small-n problem in statistics. Because the classical ordinary least-squares (OLS) estimate used for linear regression is no longer applicable due to insufficient degrees of freedom. Additionally, the proven superior performances of patch based pooling and morphological profiling techniques could be further boosted and generalized by interacting with state-of-the-art FS techniques those designed for high and ultrahigh dimensional FS problems with sparsity.
Thanks to the rapid advances in earth observation technology, now it has brought us an unprecedented array of large, enrich, diverse and complex RS data. In the meantime, challenging issues from dimensionality that exists in statistics, ML, RS data processing, storage, and access arose more severely in this era of big data. Although, an explosion of developing approaches for handling large data sets with high dimensionality have been witnessed in recent years, the common assumption underlying these approaches, variables that affect the response is relatively small, may not hold for large-p-small-n problem, and the computation cost for large-scale LASSO-based optimization becomes a serious concern as well [17], [18]. Additionally, current common feature selection and extraction methods may also not work well due to the simultaneous challenges of computational expediency, statistical accuracy, and algorithmic stability, which often requires sophisticated estimation techniques, strong model assumptions, and advanced computing algorithms [19], [20]. For example, traditional variable selection methods like Akaike information criterion, Bayesian information criterion (BIC), and extended BIC always involve a combinatorial NP-hard optimization problem with computational time increasing exponentially with the data dimensionality. Hence, it is truly desirable if one can rapidly reduce the ultrahigh dimensionality before conducting a refined analysis. And a practical approach is to use a screening procedure to reduce the dimension of feature space to a moderate scale, and then apply variable selection methods in the second phase.
In contrast with the consistent variable selection, FS (also known as variable screening) deals with a much less ambitious goal of sure screening could be achieved by using some both conceptually and computationally simple method [22]. Thereby, marginal FS becomes indispensable for linear, generalized linear, and robust linear models on ultrahigh dimensional data and has received much attention, especially since the seminal work of sure independent screening (SIS) method was proposed by Fan and Lv for linear regression [22], [23], [24]. However, all these approaches based on sure screening protocol require the specification of a particular model structure, which is usually an impossible task under the ultrahigh dimensional setting. Thus, model-free FS methods are naturally more appealing. Toward this direction, Ball correction SIS (BcorSIS) [25], distance correlation SIS (DC-SIS) [26], sure independence ranking and screening (SIRS) [27], mean-variance SIS (MVSIS) [28], martingale difference correlation SIS (MDCSIS) [29], and Henze-Zirkler SIS [20] are the most undertaken ones in diverse studies.
Although the model-free FS methods are capable of avoiding the impossible task of a particular model structure specification, current methods are still based on some assumptions for the predictor and response variables on the one hand. For example, correlation metric-based SIRS and DCSIS are not robust to the predictors whose distributions are heavy tail, and DC-SIS requires both the predictors and response variables to satisfy the subexponential tail probability uniformly. On the other hand, model-free FS methods cannot simultaneously satisfy the following two important demands in designing a screening operator: 1) straightforward and efficient to compute; and 2) the resulting estimator must possess the sure screening property under reasonable assumption to assure that the most discriminate subfeatures are remained. In this sense, model-based screeners like high-ordinary least square projection (HOLP) [18] and sparsity-restricted maximum likelihood estimator (SMLE) [30], which possesses the sure screening property, gives consistent variables selection without strong marginal correlation assumption and computationally efficient, still will be practically appealing. And in contrast with the latter one which is in the context of ultrahigh dimensional generalized linear models, HOLP is more simple, easy to implement, and more flexible for both linear and generalized linear models with Ridge-regression. Thus, it is of great interest to adopt HOLP to solve the aforementioned issue of dimensionality in DF via screening out the most irrelevant features in the cascade structure construction phase.
Essentially being as novel decision trees based EL method, multigrained cascade structure based in-model feature representation and layer-to-layer processing enable DF with high predictive accuracy that competitive to diverse DNNs in wide range of tasks [11], [12], [13], [14], [15], [16]. However, this property not only limit the original DF by high time cost and memory requirement, as discussed earlier, but also can limit the predictive accuracy by using layer-to-layer processing with lower diversity and average fusion strategy by underfitting and/or overfitting. This is mainly due to the fact that original DF and its variants simply passing the original input features for concatenate with features from next layers [11], [13], [14], [15], [16]. In this case, the variance of the input will typically get smaller as learners get better and better at predicting the output and the remaining errors become increasingly difficult to correct. As a result, this multi-co-linearity can significantly limit the ability of the ensemble to improve upon the best score of the subsequent layer as there is too little variation in predictions for the ensemble to learn useful combinations. And this is true in small ensemble size with high classifier diversity and large ensemble size with low classifier diversity scenarios that particularly use average and majority voting fusion strategies [4], [31]. Noteworthy, there are only four learners, two random forest and two ExtraTrees, to enhance the diversities in the cascade layers of DF and have equal contributions to produce the final prediction. This is why there was no obvious improvement in accuracy and the increased depth of DF being larger than ten layers [16]. Moreover, this nondate-driven manual hard-definition of diversity from forests may raise the risk of overfitting and/or underfitting on smallscale or class-imbalanced data [32].
Diversity is the first key component of constructing an effective EL system, usually can be achieved by applying specific techniques on sample, label, feature, and model parameters in separate or hybrid ways [33], [34]. For the techniques from feature space, the most popular approaches are random rotation [35], regularized random rotation [36], random projection [37], random partition [38], and random subspace [39]. Theoretically, any of these approaches can be adopted to increase variations between propagated features from the original input and/or earlier layers in DF. However, compared with the random subspace approach, the most other existing works do not have strong theories support [40]. Most importantly, the random subspace method leverages the idea that neighborhoods of feature space have a specific local structure via sticking to the original features, and the minimal discriminative subsets could appear in some of the subspaces. This property can be very powerful when the local structure or the discriminative subset first needs to be extracted, before an estimator learning to generalize for features ranking, screening, and classification [39], [40]. In fact, most ML algorithms with convex optimization objectives are ill-equipped to solve the problem of multimodal probability distribution estimation from all feature spaces. The random subspace method can overcome this issue by allowing base estimators to fit one mode of the distribution at a time. Thus, random subspace based feature propagation technique is selected to increase the variations from features before screening to the next layers of our modified version of DF.
However, propagating and screening all generated random subspaces may not be a wise idea in practice. Especially in the sparse classification case as many random subspaces could contain low discriminative and even corrupted signals, and adopting an average voting-based fusion strategy in the original DF. Furthermore, the unnecessarily large random subspaces-based ensemble can lead to extra memory usage, computational costs, and may occasionally degrade the generalized performance [41]. A straightforward way to alleviate these shortcomings is the selection of a fraction of the base learners before combination, which is commonly called as ensemble pruning, ensemble selection, ensemble thinning, and selective ensemble [42], [43]. Some theoretical and empirical evidences have also shown that performance of an ensemble consisting of small ensembles could better than all [44], [45]. Motivated by these reasons, many ensemble pruning algorithms have been proposed in the last decades. However, while those algorithms report to greedy search not have theoretical or empirical quality guarantee, those based on swarm intelligence optimization algorithms are sensitive to noise, and those stand on Bayesian probabilistic distribution are usually require advanced computation techniques and may not applicable for ensemble pruning [41], [43]. Hence, due to the straightforward, effective, efficient, and easy to implement reasons, we select the reduced error pruning (REP) [46] before fusion in the proposed DF algorithm.
The main contributions of this article are summarized as follows.
1) A new version of DF was proposed for pixelwise RS image classification by adoption of HOLP based feature screening, random subspace propagation, and reduced error running techniques. 2) Total of twelve popular feature screening methods for high and ultrahigh dimensional settings were studied to solve the issues of memory requirement from the original DF. 3) Optimum choices for random subspace propagation and screening ratios were recommended for the proposed HOLP-DF algorithm.

II. RELATED WORK
In ultrahigh dimensional setting, SIS procedure was first introduced to significantly reduce the dimensionally by strongly rely on the assumption that the valuable features in the data have large marginal correlations with the response. But this assumption is often violated in reality, as predictors are often correlated. In further, valuable features that are jointly correlated to the response can be screened out simply because they are marginally uncorrelated to the response [18]. Nevertheless, as a seminal work for FS, SIS is still appealing in practice due to its sure screening, straightforward, and computationally efficient properties. And more appealing solutions might be reachable by loosening the restrictive marginal correlation assumption based on the OLS estimator and the Ridge regression, which are the HOLP and Ridge-HOLP [18].
Consider the familiar linear regression model where X ∈ R n×p is the design matrix composed of number of n samples with p variables, Y ∈ R n is the response vector, β = (β 1 , β 2 , . . . , β p ) T is a p-vector of parameters, and ε ∈ R n consists of independently identical distribution errors with ε i follows a distribution with zero mean and variance σ 2 . A general class of linear estimates of β can be formed as where A ∈ R n×p maps the response to an estimate and the SIS set A = X ⊥ [24], ⊥represents the matrix transpose, Aεconsist of linear combinations of zero mean random noises, and (AX)β is the signal. In order to preserve the signal part as much as possible, an ideal choice of A is it should satisfy AX = I. And if this choice is possible, the signal part would dominate the noise part Aεunder suitable conditions, which leads naturally to the OLS estimate where A = (X ⊥ X) However, when p is larger than n, X ⊥ X is degenerate and AX cannot be an identity matrix I . Fortunately, (X ⊥ X) −1 X ⊥ and X ⊥ (XX ⊥ ) −1 can be seen as the Moore-Penrose inverse of X for p < n and p > n, respectively [18]. In p > n case, nevertheless the AX = X ⊥ (XX ⊥ ) −1 X is no longer an identity matrix,β i (i / ∈ S) can take advantage of the large diagonal terms of AX to dominateβ i (i / ∈ S) that is just a linear combination of OFF-diagonal terms, as long as AX is diagonally dominant. Where S = {j : β j = 0, j = 1, 2, . . . , p} is the index set of the nonzero β j 's with cardinality s = |S| from the true model matrix, D is an n × n diagonal matrix, and U is an p × n matrix that belongs to the Stiefel manifold V n,p [47]. Then by It can be proven that AX = X T (XX T ) −1 X can reduces the impact from the high correlation of X by removing the random diagonal matrix D, will be diagonal dominating with overwhelming probability as well. In this regard, a very simple variable screening method can be obtained by rewriting the where M γ is a submodel of full model M, γ is a user specified threshold. This estimator also named as the HOLP, where the first term indicates that it can be seen as a projection β. And from the standpoint of matrix XX ⊥ is of full rank whenever p > n, HOLP is unique to high and ultrahigh dimensional data analysis. Furthermore, HOLP is easy to implement and can be efficiently computed with the complexity of O(n 2 p), and it is scale invariance in the signal part X ⊥ (XX ⊥ ) −1 Xβ [18].

III. PROPOSED METHOD
As a novel non-NNs based DL model, original DF has gained increasing attentions in recent years. Although its remarkable performances have been proven from wide range of studies and applications, also many modified versions have been proposed to overcome the high time cost and memory requirement limitations, which owes much to the aspect of cascade structure and multigrained scanning procedure, limited classification performance caused by high intramodel dimensionality and low ensemble diversity is still remains open. To solve this issue and also enlightened by the earlier works of [13], [14], and [15], an improved version of DF, the HOLP-DF, was proposed by adoption of HOLP based FS algorithm to solve the issue of high intramodel dimensionality on the one hand, adopting random subspace propagation, and REP techniques to increase the ensemble diversity on the other hand.
The architecture of the proposed HOLP-DF for RS image classification method is shown in Fig. 1, which consist of three major steps of: 1) contextual features extraction with multigrained scanning (overlapped image patching); 2) HOPL based screening the propagated random subfeatures of flattened features from step 1; and 3) layer-by-layer training DFs with screened features by following the basic structure, as shown in Fig. 1, but using a different features concatenation strategy and using REP ensemble selection strategy before fusion.
For a given RS image ↔ I ×N×d , where , N , and d represents the numbers of rows, columns, and channels, respectively, we first obtain the overlapped neighboring image patches P = {P w×w×d r=1 , P w×w×d r=2 , . . . , P w×w×d r= * N } ×N r=1 with the specified patch size of w. Then flattening procedure will be executed on image patches P to obtain vectorized image for random subspace propagation to the next level for screening by using HOLP according to (4). Let δ is the random subspace propagation ratio, ϕ is the screening ratio using HOLP, K is the number of total random subspaces, K independent random subspaces after screening are generated as where g(x k rs ) is the prediction function of original gcForest model using data x k rs , x k rs is the training set from the kth random subspace after HOLP based screening of original training set x, C is the number of class labels, [f T (x k rs )] c is the cth element of the label vector f T (x k rs ). At level t ∈ {1, . . . , T }, f t is the cascade of elements of forests f = {f i , . . . , f T } up to level t. Notably, when stacking several layers of forests in original gcForest, the variance of input [x, f t−1 (x)] will typically get smaller as the next layers of forests get better and better at predicting the output and the remaining errors become increasingly difficult to correct for. In result, ability of the ensemble to improve upon the best score of the subsequent layers can be significantly limited because there is too little variation in predictions for the ensemble to learn useful combinations. One way to increase this variation is to propagate features from the subset of original input and/or earlier layers. In this sense, the cascade of elements of forests f = {f i , . . . , f T } up to level t that defined in [14] will be rewritten as where [x k rs , ∀x =k rs , f t−1 (x k rs )] is the input of the ensemble of forest h t at level t.
Propagating portions of features at random could enhance the ability of ensemble by increasing diversities from the variations on feature space. Meanwhile, it could also limit and even degrade the generalized ability of ensemble in sparse classification and small ensemble cases, could also limited the performances on memory usage, computational cost, and occasionally degrade the generalized ability in unnecessarily large random subspaces scenarios. To solve aforementioned issues, we can use the REP technique, which was inspired by the decision tree pruning algorithm with the goal of choosing the set of Ψ * classifiers that give the best voted performance on the pruning set [54]. In general, REP use a sophisticated search method called back-fitting as follows: 1) initialize the set of classifiers Ψ to contain the one classifier h 1 t that has the lowest error on the pruning set; 2) add the classifier h 2 t such that the voted combination h 1 t and h 2 t has the lowest pruning set error; 3) adds the classifier h k⊆K,k =1,2 t such that the voted combination of all classifiers in Ψ has the lowest pruning set error; 4) revisit earlier decisions and deleting previously chosen classifiers and replacing them with best classifier continues until none of the classifiers changes or reaches the number of iterations. Then, the objective function presented in (5) can be rewritten as where Ψ * ⊆ {f T (x k rs )} K+1 k=1 and number of classifiers in Ψ * is much smaller than K.

IV. DATASETS AND SETUP
A. Datasets 1) Pavia University: This data was captured over the Engineering School, University of Pavia, Pavia, Italy, by the reflective optics spectrographic image system (ROSIS) sensor, which provides 103 spectral channels with a spectral coverage ranging from 0.43-0.86 μm and with the spatial resolution of 1.3 m.  Color images shown in Fig. 3(a) has 610 × 340 pixels size and the validation data refer to 9 land cover classes are shown in Table I with details about the number of samples and the legends.
2) AirSAR Flevoland: This data was obtained by the Airborne Synthetic Aperture Radar (AirSAR) over the Flevoland region (The Netherlands) in 1989. As part of National Aeronautics and Space Administration Earth Science Enterprise project, AirSAR was designed and built by the Jet Propulsion Laboratory and operating in full polarimetric mode L-band. The scene shown in Fig. 2(c) was extracted from the SIR-C education program, it has spatial resolution of 6.60 m in the slant range direction and 12.10 m in the azimuth direction with the size of 705 rows×1024 columns size, and covers a large agricultural area of flat topography and homogeneous soils. The ground truth map shown in Fig. 2(d) refer to 11 land cover classes are shown  Table I (row 2). also shown in Table I with details about the number of samples and the legends.  Table I reports the corresponding number of samples for both the training and validation sets.
As for the features, while six upper OFF-diagonal features of coherence matrix T3 stacked with Span feature from AirSAR Flevoland data was used, the first 10 principal component, which contain the most information at much lower volume size, were selected for Pavia University and DFC2013 Houston high dimensional hyperspectral datasets for all considered methods to avoid the out of memory issues in the running of experiment from high intermodel feature dimensional. Because even for 10 principal components, dimensionality of the original input features will be at 10 × (1 + 3 × 3 + 5 × 5 + 7 × 7 + 9 × 9 + 11 × 11) = 2860 at the first round of flattening procedure.
In all experiments, the overall accuracy (OA), average accuracy (AA), kappa coefficient (Ka), algorithm running time in seconds are used to evaluate the classification performances of the adopted classifiers. All the experiments are conducted by using Python 3.9.7 and PyTorch 1.9.0 installed on a machine with 64-bit Windows 10 system use an Intel (R) Core (TM) i7-7820X 3.60-GHz CPU and 128 GB RAM, and with an NVIDIA Quadro RTX8000 GPU card equipped with 4608 CUDA parallel processing cores and 48 GB of RAM memory.

A. Results of DF With Various Screeners
In this section, we first investigate the performance of the adopted feature screeners to handling high and ultrahigh dimensionality issue of original DF in RS image classification. To make a comparison among the different sample size and multiscreening ratios, we report the OA and the time costs in seconds from model training by using different sample sets presented in Figs. 3 and 4, respectively.
From the graphs in Fig. 3, various OA values can be observed for considered feature screeners with different sample size, screening ratios, and datasets setting. In small sample (10 samples per class) setting scenario that shown by the graphs in the first row of Fig. 3, HOLP and SIS are better than others on the Pavia University data, BcorSIS, Kfilter, MVSIS, and HOLP are better than others on the DFC2013 data, and SMLE is better than others on the AirSAR Flevoland test data, but the SMLE screener almost shows the worst results on DFC2013 data, as shown in Fig. 3(b). Moreover, rapid decreasing trend of OA values from the most screeners are obtained as the screening ratio is smaller than 40% for all three datasets. In using all sample training scenarios that shown by the graphs in the second row of Fig. 3, OA values of SMLE, MVSIS, HOLP, and Kfilter are always higher than OA values from FC, SIS, RRCS, BcorSIS, CSIS, MDCSIS, and SRIS screeners on all considered datasets. Additionally, the rapid decreasing trend of OA values from SMLE, MVSIS, HOLP, and Kfilter screener are obtained as the screening ratio is smaller than 30% for all three datasets. And in the most cases, decreasing ratios of MDCSIS, CSIS, SIRS are much faster than others along with increasing values of screening ratios, see the learning curves in cyan lines with left triangles, green lines with upper triangles, and black lines with cross, respectively.
According to the bar charts shown in Fig. 4, we can easily see that first the lowest computational efficiency is shown by FC and MVSIS screeners on Pavia University data, by MVSIS screener on DFC2013 Houston data, and by BcorSIS and MVSIS screeners on AirSAR Flevoland data. While the secondary low computational efficiencies are shown by MDCSIS, RRCS, and SMLE screeners and PmRMRE feature selector, the highest computational efficiencies are shown by screeners including SIS, Kfilter, HOLP, CSIS, and SIRS in the most cases. Moreover, screening ratio set does not have obvious influences on the computational efficiency of the considered feature screeners. Summing-up with the findings from the Fig. 3 in previous paragraph, it can be concluded that HOLP screener is the optimum one to handling the high and ultrahigh dimensionality issue of original DF in RS image classification task, from both high OA values and highly efficient point of view in contrast with the other considered the screeners.

B. Parameters Analysis for HOLP-DF
According to the methodology details presented in previous Section III, and the working mechanism of conventional DF, subspace feature propagation ratio, feature screening ratio, sample size, and model depth are the critical parameters could influence the performance of the proposed HOLP-DF. Hence, we show the OA and model training time values versus the propagation ratio and screening ratio of HOLP-DF using all samples from considered datasets in Fig. 5 with the depth of 50 layers.
From the results shown in Fig. 5, it can be clearly seen that both propagation ratio and screening ratio values have considerable influences on both classification accuracy and model training efficiency. In further, influence from the propagation ratio on OA values is more critical than influence from the screening ratio. For example, there not exits obvious changes in OA values by increasing screening ratio values after the propagation ratio is higher than 40%, especially on the Pavia University and AirSAR Flevoland datasets, as shown in Fig. 5(a) and (c). On the contrary, influence from the propagation ratio and screening ratio is almost the same on computational complexity, and an optimum choice of screening ratio is less than 50% for efficiently model training. Summing-up with the previous findings, combination of 40% of propagation ratio with 40% of screening ratio is recommended for the proposed HOLP-DF from high classification accuracy, high model training efficiency, and small size of subspace data volume (only occupy 40% × 40% = 16% of original data volume) points of view.
In the Fig. 6, we present the OA values versus the number of samples for each class and number of layers (depth) of the considered DFs with 40% of propagation ratio with 40% of screening ratio set as recommended earlier. To comprehensively evaluate the performance of the proposed HOLP-DF, the original DF and DF with patch-based pooling [DF(PP)], morphological  Based on the results shown in the first row of Fig. 6, it can be clearly seen that as follows.
1) OA curves from the proposed HOLP-DF is always stays upper place than OA curves from original DF. And similarly, OA curves with the lowest values are always shown either by the original DF or by the DF-HOLP without using random subspace propagation strategy, as shown by sky blue line marked with asterisk and magenta lines marked with diamond. This proofs again that random subspace propagation can benefits the classification accuracy of HOLP-DF by increasing the diversity of ensemble.

C. Classification Results Comparison
To show the performance of the proposed HOLP-DF, we show OA, AA, and Ka values from the original gcForest (represented as DF for short), gcForest with confidence screening (DF-CS), and feature screening (DF-FS), and the proposed versions of DF with patch-based pooling [DF(PP)], morphological profiling (MP), and PL, which were proposed in our previous work [8] in Table II. Also results from the DL classifiers including double-branch, multiattention mechanism network (DBMA), double-branch dual attention mechanism network (DBDA), contextual deeper convolutional neural network (CDCNN), fast dense spectral-spatial convolutional network (FDSSCN), spectral-spatial residual network (SSRN), and random patches network (RPNet) are considered. Classification maps with OA values corresponding to the underlined numbers in Table II are presented in Figs. 7, 8, and 9 for the considered datasets. Notably, critical parameters including the number of layers, propagation ratio and screening ratio are set by 30, 40%, and 40%, respectively, as recommended by the previous results, whereas the parameters for other considered classifiers are set by default values as recommended in previous work [16].
Based on the results from Table II, again it is clear that compatible and even better classification results can be obtained by the proposed HOLP-DF by only using 16% of features from considered datasets compared with original DF. For example, while the original DF reached an OA value of 86.57% on DFC2013 Houston data, an OA value of 89.14% is reached by HOLP-DF. And in contrast with the classification results from DF-CS and DF-FS, AA, OA, and KAPPA values from the HOLP-DF are universally higher on Pavia University and DFC2013 Houston hyperspectral datasets. Take the DFC2013 Houston data as an example, the area at the lower part, which is covered by dense cloud shadows is more precisely classified by HOLP-DF [see Fig. 9(g)] compared with the maps from DF-CS and DF-FS, as shown in Fig. 9 Table II, the

VI. CONCLUSION
In this article, model-based HOLP feature screening method is introduced to overcome the intramodel high data dimensionality drawback of the conventional DF model (the gcForest) in hyperspectral and PolSAR image classification, where the random subspace propagation and REP techniques are also adopted to further boost the classification performance by increasing the ensemble diversity and decreasing the forests ensemble redundancy. To comparatively evaluate the performance of the proposed method, popular feature screeners, and state-of-the-art DL methods are selected in the experiments. According to the results from three widely used hyperspectral and PolSAR image classification benchmarks, the following results are concluded. 1) HOLP is an optimum solution to reduce the high and ultrahigh dimensionality in contrast with feature screening algorithms like FC, SIS, RRCS, SMLE, BcorSIS, CSIS, Kfilter, MDCSIS, MVSIS, and SIRS, from both highly accurate and efficient execution points of view. 2) HOLP-DF capable of obtaining better classification results than original DF and its modified versions of DF-CS and DF-FS, and the optimum sets of model depth, propagation ratio and screening ratio parameters are 30, 40%, and 40, respectively. 3) Classification performance of HOLP-DF can be further boosted by extra using PP and MP techniques. Although the proposed HOLP-DF algorithm show advanced performance in terms of classification accuracy, computational efficiency, and intramodel feature reduction effectiveness, the classification accuracy of HOLP-DF is still limited in some scenarios in contrast with the DF, which fused usage of PP and MP with PL techniques and in contrast with the state-of-the-art DL algorithms. Additionally, the HOLP feature screening algorithm may not be the best choice for those originally lower dimensional data such as PolSAR and even multispectral imageries. Therefore, we will focus on the study of other advanced feature screening algorithms for originally low dimensional but highly redundant intramodel dimensional RS image classification cases.