Remote Sensing Scene Classification Using Heterogeneous Feature Extraction and Multi-Level Fusion

Understanding the scenes captured in remote sensing (RS) images has drawn increasing attention in the last decade. The task is to automatically classify an RS scene image through feature extraction and label assignment. Although much effort has been devoted to developing discriminative feature extraction and automatic classification techniques, the task remains very challenging owing to the complex distributions of ground objects in high-spatial-resolution scenes. To enhance the ability to represent RS scenes, integrating multiple types of features is considered an effective strategy. Nevertheless, different kinds of features possess different characteristics, and how to fuse them is a critical problem. In this paper, to fuse three different but complementary types of features that characterize the global attributes of scenes, a feature-level fusion approach based on discriminant correlation analysis is proposed, which maximizes the correlation of corresponding features across two feature sets and, in addition, de-correlates features that belong to different classes within each feature set. Furthermore, to improve the scene classification performance, another form of information fusion, called decision-level fusion, is adopted in the classification stage. In the proposed decision-level fusion technique, the final results of multiple classifiers are combined via majority voting. Extensive experimental results on the well-known SIRI-WHU dataset, the WHU-RS dataset, and the UC Merced Land Use dataset demonstrate that the proposed remote sensing scene classification method based on heterogeneous feature extraction and multi-level fusion is superior to many state-of-the-art scene classification algorithms.


I. INTRODUCTION
Remote sensing (RS) scene classification, which aims at classifying scene images into various semantic categories, has been a subject of increasing interest in recent years and is widely used in remote sensing image retrieval and segmentation, and especially in image understanding. Compared with classifying pixels or targets, scene classification is a challenging task due to the highly complex geometrical structures, spatial patterns, and presence of various objects inherent in the scenes [1]-[4].
The scene classification pipeline typically contains two main stages: feature extraction and scene classification [5]. Good features with strong discriminability for different categories of image scenes can maximize inter-class differences and, at the same time, minimize intra-class variations [6]. Many methods have been proposed to extract various features from RS images. For instance, the histogram of oriented gradients (HOG) and the scale-invariant feature transform (SIFT) are the two most widely used descriptors for shape information extraction [7], [8]. Gabor filtering, the gray-level co-occurrence matrix (GLCM), and the local binary pattern (LBP) are popular descriptors for texture description [9]-[11]. First-order and second-order statistical representations are also used to extract spectral information [12]. However, most of the above features are pixel-based cues, which may perform well on images with uniform structures and spatial arrangements, but become less effective for scenes with complex geometrical structures and high spatial variations [1].
(The associate editor coordinating the review of this manuscript and approving it for publication was Jon Atli Benediktsson.)
In [13], Bo et al. constructed compact kernel descriptors, which turn pixel attributes (gradient, color, etc.) into compact patch-level features. Later, bag-of-words (BOW), which represents an image by a histogram of local patches over a visual vocabulary, attracted intensive attention in image classification [14]. However, the weakness of BOW for RS images is that it ignores spatial information, limiting its descriptive ability. To overcome this limitation, the spatial pyramid matching (SPM) approach was proposed [15]. It first partitions an image into fine subregions, then computes BOW histograms of the features in each subregion, and finally concatenates the histograms from all subregions to form the SPM representation of the image. As SPM only considers the absolute spatial arrangement, the resulting features are usually sensitive to rotation variations. In [16], a simple yet effective coding scheme called locality-constrained linear coding (LLC) was introduced for feature representation. It utilizes a locality constraint to project each descriptor onto a local coordinate system and integrates the projected coordinates by max pooling to generate the final representation. In addition, topic models such as probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) have also been introduced for remote sensing scene classification [17]-[19]. In recent years, with the development of artificial intelligence, deep learning has been applied to image feature extraction. In [20], deconvolution networks were investigated to learn a set of feature maps for remote sensing images. In [21], a convolutional neural network based method was presented to extract discriminative features for scene classification.
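As an illustration of the SPM scheme described above, the following minimal NumPy sketch histograms the visual words of local patches inside each grid cell of every pyramid level and concatenates the cell histograms. The function name is ours, and the usual per-level weighting of SPM is omitted for brevity:

```python
import numpy as np

def spm_histograms(coords, words, img_shape, vocab_size, levels=(1, 2, 4)):
    """Spatial pyramid matching sketch: histogram the visual words of
    local patches inside each grid cell of every pyramid level, then
    concatenate all cell histograms into one image representation."""
    h, w = img_shape
    hists = []
    for g in levels:                         # g x g grid at this level
        for gy in range(g):
            for gx in range(g):
                in_cell = ((coords[:, 0] // (h / g) == gy) &
                           (coords[:, 1] // (w / g) == gx))
                hists.append(np.bincount(words[in_cell], minlength=vocab_size))
    return np.concatenate(hists)
```

With a vocabulary of size V and levels (1, 2, 4), the resulting vector has (1 + 4 + 16) x V dimensions, which is why SPM features grow quickly with the pyramid depth.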
Since a certain kind of feature tends to represent images from only one aspect, ignoring other information, the fusion of complementary features is usually preferred for describing complex scenes; it has been shown to be effective for acquiring richer information about the input data [22]-[26]. However, how to properly select those features is still one of the main research topics in pattern recognition and machine learning. As pointed out in [5], [6], [8], [17], rather than adding features with similar attributes, one should combine those with complementary information in order to achieve better classification performance. In addition, the heterogeneous feature fusion strategy is another key step in constructing the final informative representation of RS image scenes. Serial feature fusion and parallel feature fusion are two typical feature fusion schemes. In [27], features were connected in series for hyperspectral image classification. In [28], a feature-level fusion method based on joint sparse representation was proposed. In [29], a deep fusion framework was designed to combine multiple semantic cues for complex event recognition. However, these fusion methods neither maximize the correlation of corresponding features across two feature sets nor de-correlate features belonging to different classes within each feature set, and are therefore not very effective for heterogeneous feature fusion in RS scene classification.
In the classification stage, many classifiers have been proposed, which can be mainly divided into two categories: generative and discriminative strategies. Taking the generative ones as examples, in [30], a Markov random field (MRF) model was built for image classification, and in [31], a land-use classifier based on a stacked autoencoder was presented. Among discriminative models, the K-nearest neighbor (KNN) classifier is classical and simple, but it tends to have high classification errors when the distribution of training samples is uneven. In [32], a remote sensing image classifier was proposed based on the support vector machine (SVM). In [33], sparse representation (SR) was adopted to classify images in complex situations. In [34], a kernel-based extreme learning machine (KELM) was employed for scene classification.
Since a single type of classifier has its own advantages and limitations, the idea of decision-level fusion has been pursued in the literature to enhance classification performance. For instance, in [35], a multiple-classifier fusion method via weighted decision templates was presented. It used a statistical vector to measure each classifier's performance and applied a weighted transform to each classifier according to the reliability of its output. In [36], a spatially joint and post-partitioning collaborative representation based classifier was presented for hyperspectral image classification. In [37], a decision-level fusion scheme via linear weight assignment was proposed.
Based on the aforementioned analysis, in this paper, we propose a scene classification method via heterogeneous feature extraction and multi-level fusion.
The main contributions of this paper are as follows.
• We do an in-depth investigation on feature extraction and present three different but complementary kinds of features, called DS-SURF-LLC, Mean-Std-LLC, and MO-CLBP-LLC, for scene images. These different features can describe image information from different angles.
• We present a heterogeneous feature fusion scheme to integrate different features based on discriminant correlation analysis. In this way, we can both reduce the dimensionality of heterogeneous features so as to speed up the method and obtain powerful fused features for subsequent scene classification.
• We build a decision-level fusion technique to enhance the discrimination power in the classification stage.
The rest of this paper is organized as follows. Section 2 describes the proposed method in details. Section 3 evaluates the proposed method against various state-of-the-art algorithms on several public RS scene datasets. Finally, Section 4 concludes this paper.

II. PROPOSED METHOD
The overall framework of the proposed method is shown in Fig. 1. As depicted in Fig. 1, the method mainly consists of three parts, i.e., heterogeneous feature extraction, feature-level fusion, and decision-level fusion.
VOLUME 8, 2020

A. HETEROGENEOUS FEATURE EXTRACTION
In our previous work [6], we introduced three heterogeneous features, DS-SURF-LLC, Mean-Std-LLC, and MS-CLBP, for remote sensing images. Among them, the former two features were designed by us, while the last one is a well-known descriptor. In our later work, however, we found that MS-CLBP, i.e., the multiscale completed local binary pattern, which is based on Gabor filtering, has two intrinsic limitations: a lack of spatial information and a high computational cost. To overcome these shortcomings, in this paper we propose a novel feature, MO-CLBP-LLC, to replace MS-CLBP for extracting the texture information of scenes. Thus, DS-SURF-LLC, Mean-Std-LLC, and MO-CLBP-LLC are used as the three heterogeneous features for RS image characterization in this paper. Since the detailed procedures of DS-SURF-LLC and Mean-Std-LLC have been introduced in [6], here we only review these two features in brief.

1) DS-SURF-LLC AND MEAN-STD-LLC FEATURES
The DS-SURF feature is the combination of dense SURF (DSURF) and sparse SURF (SSURF) features. For DSURF feature extraction, a regular grid is first used to divide the image I into a number of image patches. Then, taking each patch center as a key point, a feature vector su_d_i (1 ≤ i ≤ M) is computed for each key point, and the dense SURF feature set is given as Su_d = {su_d_1, ..., su_d_M}. The sparse SURF features are computed in the same way at detected interest points. After obtaining the DSURF and SSURF features, they are each coded by locality-constrained linear coding so as to obtain more powerful cues: DSURF-LLC and SSURF-LLC. Finally, DSURF-LLC and SSURF-LLC are concatenated to form the augmented DS-SURF-LLC feature.
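Since LLC coding is applied to every feature in this paper, a minimal sketch of the approximated LLC scheme with max pooling may be useful. The helper name and parameter defaults below are illustrative, not the authors' implementation:

```python
import numpy as np

def llc_encode(descriptors, codebook, k=5, beta=1e-4):
    """Approximate LLC: encode each descriptor over its k nearest
    codebook atoms (constrained least squares with sum-to-one weights),
    then max-pool the codes into one image-level vector."""
    n_atoms = codebook.shape[0]
    codes = np.zeros((descriptors.shape[0], n_atoms))
    for i, x in enumerate(descriptors):
        # k nearest codebook atoms in Euclidean distance
        d = np.linalg.norm(codebook - x, axis=1)
        idx = np.argsort(d)[:k]
        z = codebook[idx] - x                # shift atoms to the origin
        C = z @ z.T                          # local covariance
        C += beta * np.trace(C) * np.eye(k)  # regularization for stability
        w = np.linalg.solve(C, np.ones(k))
        w /= w.sum()                         # enforce the sum-to-one constraint
        codes[i, idx] = w
    return codes.max(axis=0)                 # max pooling over descriptors
```

The locality constraint means each descriptor only activates a few nearby codebook atoms, which is why the pooled code remains sparse and discriminative.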
In addition, to extract the Mean-Std-LLC feature, the image I is first transformed from the RGB space into two other color spaces: HSV and XYZ. In each space, the mean and standard deviation of each channel are calculated as the spectral features. Then, by stacking each channel's feature values Mean_i and Std_i together, the Mean-Std feature set Sp can be obtained. Similar to DS-SURF-LLC, LLC is also applied to the Mean-Std features so as to obtain the enhanced Mean-Std-LLC features.
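As a rough sketch of this spectral feature, assuming the HSV and XYZ conversions are performed beforehand (e.g., with skimage.color.rgb2hsv and rgb2xyz), the per-channel mean/std stacking could look like the following; the function name is hypothetical:

```python
import numpy as np

def mean_std_feature(channels):
    """Stack per-channel mean and standard deviation into one spectral
    feature vector. `channels` is a list of 2-D arrays, e.g. the
    H, S, V, X, Y, Z planes of the color-converted image."""
    feats = []
    for ch in channels:
        feats.append(ch.mean())   # Mean_i of this channel
        feats.append(ch.std())    # Std_i of this channel
    return np.asarray(feats)
```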

2) MO-CLBP-LLC FEATURE
In this section, a novel feature descriptor named MO-CLBP-LLC is designed based on multi-scale monogenic filtering. The monogenic signal, built on the Riesz transform, is an extension of the analytic signal [38]. For an input scene image I(z), z = (x, y), the Riesz transform of I(z) in the spatial domain is I_R(z) = h_R(z) * I(z), where * represents convolution and h_R is the Riesz kernel. The monogenic signal is defined as the combination of I(z) and I_R(z): I_M(z) = I(z) + i·I_x(z) + j·I_y(z), where i and j denote the imaginary units. Then, I_M(z) is orthogonally decomposed into three components, i.e., the local amplitude A = sqrt(I^2 + I_x^2 + I_y^2), the local phase ϕ = atan2(sqrt(I_x^2 + I_y^2), I), and the local orientation θ = atan(I_y / I_x), where I_x(z) and I_y(z) denote the i-imaginary and j-imaginary components of I_M(z), respectively. In practice, since an RS scene image is of finite extent, its spectrum in the frequency domain is unbounded. To capture broad spectral information with compact support, Log-Gabor filters are adopted here to modify the computation of the monogenic signal, with the image band-pass filtered by the Log-Gabor kernel h_LG before the decomposition. Based on an S-scale Log-Gabor filter bank, the multi-scale monogenic signals of I(z) can be produced as I_M^k (k = 1, ..., S), with components A^k, ϕ^k, and θ^k (k = 1, ..., S). Compared with single-scale monogenic filtering, the multi-scale monogenic filtering results are more robust to noise, directional changes, etc.
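A compact FFT-domain sketch of the monogenic decomposition may make the three components concrete. The Log-Gabor pre-filtering is omitted for brevity, and the function name is ours:

```python
import numpy as np

def monogenic(img):
    """Monogenic signal of a 2-D image via the FFT-domain Riesz
    transform; returns local amplitude, phase, and orientation."""
    rows, cols = img.shape
    u = np.fft.fftfreq(cols)
    v = np.fft.fftfreq(rows)
    U, V = np.meshgrid(u, v)
    radius = np.sqrt(U**2 + V**2)
    radius[0, 0] = 1.0                       # avoid division by zero at DC
    F = np.fft.fft2(img)
    # Riesz-transform frequency responses: -j*u/|w| and -j*v/|w|
    Ix = np.real(np.fft.ifft2(F * (-1j * U / radius)))
    Iy = np.real(np.fft.ifft2(F * (-1j * V / radius)))
    amplitude = np.sqrt(img**2 + Ix**2 + Iy**2)
    phase = np.arctan2(np.sqrt(Ix**2 + Iy**2), img)
    orientation = np.arctan2(Iy, Ix)
    return amplitude, phase, orientation
```

In the full MO-CLBP pipeline, this decomposition would be applied once per Log-Gabor scale, giving the S sets of (A^k, ϕ^k, θ^k) components.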
Subsequently, CLBP is applied to each monogenic component to extract the multi-scale monogenic filtering based completed local binary pattern features, i.e., MO-CLBP. Specifically, the monogenic component image is first partitioned into a number of patches. Then, for each patch, the CLBP codes CLBP_S, CLBP_M, and CLBP_C are extracted as follows: CLBP_S_{P,R} = Σ_{p=0}^{P-1} s(g_p − g_c) 2^p, with s(x) = 1 if x ≥ 0 and 0 otherwise; CLBP_M_{P,R} = Σ_{p=0}^{P-1} s(|g_p − g_c| − c) 2^p, where c is the mean of |g_p − g_c| over the whole image; and CLBP_C = s(g_c − c_I), where c_I is the mean gray level of the image. Here, g_p is the gray value of a neighborhood pixel, g_c is the gray value of the center pixel, P is the number of pixels in the neighborhood, R is the radius of the neighborhood, and N is the number of patches into which the component image is divided. CLBP_S, CLBP_M, and CLBP_C together constitute the completed LBP framework [10]. By applying this coding procedure to each A^k, ϕ^k, and θ^k (k = 1, ..., S), we obtain the MO-CLBP features. Finally, to bridge the semantic gap between low-level visual features and high-level image semantics, the simple yet effective LLC technique is applied to MO-CLBP, so that the final MO-CLBP-LLC features are obtained.
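For concreteness, the CLBP sign code for the common 8-neighbour, radius-1 configuration can be sketched as follows; CLBP_M and CLBP_C follow the same thresholding pattern on the magnitudes and the center gray level. This is an illustrative re-implementation, not the authors' code:

```python
import numpy as np

def clbp_s(img):
    """CLBP sign code with the 8-neighbour, radius-1 pattern:
    threshold each neighbour against the centre pixel and pack
    the 8 sign bits into one code per interior pixel."""
    # neighbour offsets (dy, dx); their order defines the bit weights 2^p
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    core = img[1:-1, 1:-1]
    code = np.zeros_like(core, dtype=np.uint8)
    for p, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:img.shape[0] - 1 + dy,
                        1 + dx:img.shape[1] - 1 + dx]
        code |= ((neighbour >= core).astype(np.uint8) << p)
    return code
```

The histogram of these codes over a patch forms the local CLBP_S descriptor; in practice a rotation-invariant or uniform mapping is usually applied to the raw codes.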

B. FEATURE-LEVEL FUSION
Each type of features (i.e., DS-SURF-LLC, Mean-Std-LLC and MO-CLBP-LLC) reflects various properties and has its own meanings. To enhance the ability of feature characterization, we propose to fuse them together by discriminant correlation analysis (DCA). As an improvement of canonical correlation analysis [39], DCA has excellent characteristics in maximizing the pairwise correlations across two feature sets and meanwhile reducing redundant information between the features, so that compact but discriminative feature representations can be obtained. The fusion process is as follows.
Given one kind of heterogeneous feature set X_1, its n columns are divided into C classes, with n_i columns corresponding to the i-th class. Let x_{i,j} be the feature vector of the j-th sample in the i-th class. First, define the inter-class scatter matrix as S_bx = Σ_{i=1}^{C} n_i (x̄_i − x̄)(x̄_i − x̄)^T = Φ_bx Φ_bx^T, where x̄_i is the mean of the x_{i,j} vectors in the i-th class and x̄ is the mean of the whole feature set. Second, by computing the eigenvectors of Φ_bx^T Φ_bx, the dimensionality of X_1 can be effectively reduced, and the projection of X_1 into the reduced space is obtained as X_1' = W_bx^T X_1, where W_bx is the transformation derived from S_bx (chosen such that W_bx^T S_bx W_bx = I). In the reduced space, the classes are well separated. Similarly, given another type of heterogeneous feature set X_2, its projection X_2' can be obtained by following the same steps. Third, to let the features in one set have nonzero correlation only with their corresponding features in the other set, X_1' and X_2' are further transformed as X_1* = W_cx1^T X_1' and X_2* = W_cx2^T X_2', where W_cx1 and W_cx2 are the transformations obtained from the singular value decomposition of the between-set covariance matrix S'_{x1x2} = X_1' X_2'^T. Fourth, concatenate the transformed feature vectors: Z = [X_1*; X_2*]. According to the above steps, we can fuse any two heterogeneous features together, so that three integrated features, named DS-SURF-LLC+Mean-Std-LLC, DS-SURF-LLC+MO-CLBP-LLC, and Mean-Std-LLC+MO-CLBP-LLC, are obtained. Compared to the original heterogeneous features, the dimensionalities of the fused features are greatly reduced. Finally, to further enhance the characterization ability of the features, we adopt a tandem fusion strategy to combine the above three fused features together. Thus, a fourth type of fused feature is obtained, referred to as DS-SURF-LLC+Mean-Std-LLC+MO-CLBP-LLC. These four fused features are fed into the subsequent classification stage.
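The four DCA steps above can be sketched in NumPy as follows. This is a minimal illustration under the stated definitions (features stored as columns, reduced dimensionality r ≤ C − 1), not the exact implementation used in the paper:

```python
import numpy as np

def dca_fuse(X1, X2, y, r):
    """Minimal DCA feature-fusion sketch (samples in columns).
    Each set is projected so its between-class scatter is whitened,
    the between-set covariance is diagonalized by an SVD, and the
    transformed sets are concatenated (serial fusion)."""
    def between_class_transform(X):
        classes = np.unique(y)
        mean = X.mean(axis=1)
        # Phi has one column per class: sqrt(n_i) * (class mean - mean),
        # so that S_b = Phi @ Phi.T
        Phi = np.column_stack([np.sqrt((y == c).sum()) *
                               (X[:, y == c].mean(axis=1) - mean)
                               for c in classes])
        # eigen-decompose the small C x C matrix Phi^T Phi instead of S_b
        vals, vecs = np.linalg.eigh(Phi.T @ Phi)
        order = np.argsort(vals)[::-1][:r]
        vals, vecs = vals[order], vecs[:, order]
        return Phi @ vecs / np.sqrt(vals)    # W with W^T S_b W = I
    W1 = between_class_transform(X1)
    W2 = between_class_transform(X2)
    X1p, X2p = W1.T @ X1, W2.T @ X2
    # diagonalize the between-set covariance with an SVD
    U, s, Vt = np.linalg.svd(X1p @ X2p.T)
    X1s = (U / np.sqrt(s)).T @ X1p
    X2s = (Vt.T / np.sqrt(s)).T @ X2p
    return np.vstack([X1s, X2s])             # serial (tandem) fusion
```

After the transform, the cross-set covariance X1s @ X2s.T is the identity, i.e., each transformed feature correlates only with its counterpart in the other set, which is exactly the DCA property described above.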

C. DECISION-LEVEL FUSION
Driven by the success of decision-level fusion in computer vision communities, we investigate an effective and efficient decision-level fusion method here for RS scene classification.
First, the widely used SVM is selected as the basic classifier. Let x_i and y_i (i = 1, ..., N) denote the feature vectors and corresponding labels of the training samples, respectively. Based on the SVM, a classification decision can be obtained by f(x) = sign(Σ_{i=1}^{N} a_i y_i k(x_i, x) + b), where a_i and b are the parameters to be optimized; the physical meaning of a_i is the support-vector sample weight, and k(x_i, x) is the kernel function. By applying a set of SVM classifiers to the above four fused features respectively, we obtain a set of pre-classification results. Second, the pre-classification results of the different SVM classifiers are fused. For an image sample, let p_ij denote the probability output of the j-th SVM classifier for the i-th class, and let α_j be the corresponding weight for the j-th SVM classifier (Σ_{j=1}^{J} α_j = 1, where J is the number of pre-classification results to be fused). Given a set of α_j, the final probability p_i of the sample belonging to the i-th class is calculated by p_i = Σ_{j=1}^{J} α_j p_ij. After obtaining the probabilities p_i, i = 1, ..., C, we select p_m = max {p_1, p_2, ..., p_C} according to the maximum rule, and then assign the sample to the m-th class.
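A hedged sketch of this weighted probability fusion, assuming per-classifier probability outputs (e.g., from scikit-learn's SVC with probability=True), could be:

```python
import numpy as np

def fuse_probabilities(prob_list, alphas):
    """Weighted decision-level fusion: combine the per-class probability
    outputs of several classifiers (each an (n_samples, n_classes) array)
    and pick the class with the maximum fused probability."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.isclose(alphas.sum(), 1.0)     # weights must sum to one
    fused = sum(a * p for a, p in zip(alphas, prob_list))
    return fused.argmax(axis=1), fused       # predicted classes, fused probs
```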
It is worth pointing out that, if the number of pre-classification results to be fused is large, it is computationally expensive to traverse all possible α_j combinations. Therefore, in this paper, we perform decision fusion on pair-wise pre-classification results: for each pair, we traverse all possible α_j values over the training samples and keep the combination that yields the best training performance.
At last, using the trained decision-level fusion framework, for each test sample, we can get several decision-level fusion results. Based on the majority voting rule, the final classification result can be effectively obtained.
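The pair-wise weight search and the final majority voting described above might be sketched as follows; the helper names and the grid step are our own choices:

```python
import numpy as np
from collections import Counter

def best_pair_weight(p_a, p_b, labels, step=0.1):
    """Grid-search the weight alpha for one pair of classifiers,
    maximizing training accuracy of alpha*p_a + (1-alpha)*p_b."""
    best_alpha, best_acc = 0.0, -1.0
    for alpha in np.arange(0.0, 1.0 + 1e-9, step):
        pred = (alpha * p_a + (1 - alpha) * p_b).argmax(axis=1)
        acc = (pred == labels).mean()
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha

def majority_vote(predictions):
    """Fuse several per-sample label predictions by majority voting."""
    votes = np.asarray(predictions)          # shape (n_fusions, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

At test time, each pair-wise fusion result contributes one label vote per sample, and majority_vote produces the final class assignment.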

III. EXPERIMENTS

A. DATASETS
In order to test the performance of the proposed method, three widely used benchmark datasets are selected to conduct scene classification experiments.
The first dataset used in our experiments is the SIRI-WHU dataset [15]. It consists of 12 classes: agriculture, commercial, harbor, idle land, industrial, meadow, overpass, park, pond, residential, river, and water. Each class contains 200 images with a size of 200 × 200 pixels. A typical example of each class is shown in Fig. 2. In the experiments, 100 training samples are randomly selected from each category, and the remaining ones are used for testing.
The second dataset is the WHU-RS dataset [7], which is composed of 19 classes: airport, beach, bridge, commercial, desert, railway station, viaduct, football field, forest, industrial, meadow, mountain, residential, river, farmland, park, parking, pond, and port. The size of each image is 600 × 600 pixels. Fig. 3 displays a sample of each class. There are 50 images for each class, and we randomly select 30 images per category as training data and use the rest as test data.
The third dataset, the UC Merced Land Use dataset [40], consists of 21 scene categories, i.e., agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis court. Some examples from the dataset are shown in Fig. 4. Each class contains 100 images with a size of 256 × 256 pixels. For each category, we randomly select 80 images for training, and the other 20 images are used for testing.

B. EXPERIMENTAL SETUP

In order to quantitatively evaluate the scene classification performance, three metrics are utilized: overall accuracy (OA), standard deviation (SD), and the confusion matrix [6]. Among them, OA measures the proportion of scenes classified accurately, and SD measures the stability of the method; to compute SD, a number of individual runs should be conducted. The confusion matrix is a table layout in which each column denotes the instances in a predicted class, while each row denotes the instances in an actual class.
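As a rough sketch, OA and the confusion matrix with the row/column convention above can be computed as follows (SD would be the standard deviation of OA over the repeated random train/test splits):

```python
import numpy as np

def evaluate(y_true, y_pred, n_classes):
    """Overall accuracy and confusion matrix; rows are actual classes,
    columns are predicted classes (the convention used in the paper)."""
    oa = (y_true == y_pred).mean()
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                        # cm[actual, predicted]
    return oa, cm
```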
First, we compare our heterogeneous feature extraction method with some closely related methods, including our previously presented approach (referred to as DS-SURF-LLC+Mean-Std-LLC+MS-CLBP) in [6], SIFT-LLC+MS-CLBP in [41], SIFT+Mean-Std+GLCM in [42], MS-CLBP in [10], GCLBP in [43], SIFT-ScSPM in [44], and SIFT-LLC in [45]. The reader is referred to those papers for detailed information about their features. To make the comparisons as fair and meaningful as possible, we use the same feature-level fusion strategy introduced in this paper for multiple-feature integration, and for classification, an SVM classifier without our decision-level fusion scheme is used for simplicity.
Second, for scene classification, the feature dimensionality influences both computation time and classification performance. A higher feature dimensionality usually costs more computation time, so many works propose to reduce the dimensionality of the features. However, reduced features may yield lower classification accuracy, since the discriminative capability of the generated image representation becomes lower. Different from the existing algorithms, in this work, we adopt discriminant correlation analysis as our feature-level fusion scheme, which maximizes the pairwise correlations across two heterogeneous feature sets while reducing redundant information between the features, so that compact but discriminative feature representations can be obtained. As a result, computation time is well reduced and, at the same time, the feature discriminative capability is enhanced rather than weakened. To verify the above advantages of our feature-level fusion scheme, we compare it with the following technique: the three heterogeneous features, DS-SURF-LLC, Mean-Std-LLC, and MO-CLBP-LLC, are combined in series, and the combined feature is directly fed into an SVM classifier without dimensionality reduction. We investigate the computation time as well as the classification accuracies of different multiple-feature combinations on the various datasets.
Third, to evaluate our decision-level fusion scheme, we compare it with the method without using decisionlevel fusion. Specifically, the classification method used for comparison here only utilizes the DS-SURF-LLC+Mean-Std-LLC+MO-CLBP-LLC feature as input and adopts a SVM classifier for image categorization. On the contrary, our decision-level fusion scheme uses DS-SURF-LLC+Mean-Std-LLC, DS-SURF-LLC+MO-CLBP-LLC, Mean-Std-LLC+MO-CLBP-LLC, and DS-SURF-LLC+ Mean-Std-LLC+MO-CLBP-LLC as inputs, and at the same time adopts four SVM classifiers. Then, the outputs of these classifiers are fused to compute the final classification results.
Fourth, to verify the overall performance of our whole method, we compare it with many state-of-the-art classification algorithms. For instance, the first one is the heterogeneous feature multiple kernel learning method (referred to as HF-MKL) proposed in [6]. In [17], Zhu et al. presented a scene classification method via a semantic-feature-fusion fully sparse topic model (referred to as SFF-FSTM). In [18], a deep feature fusion method was presented: it first utilized VGG-Net to extract informative features from the original VHR images, then adopted discriminant correlation analysis (DCA) as the feature fusion strategy, and finally used a support vector machine (SVM) as the classifier for scene classification (referred to as DCA-fusion). Note that, for all competing algorithms, the numbers of training and testing samples for each dataset are the same as ours, except in the following situations: first, for the 12-class dataset, both SICNNs and Pre-trained-AlexNet-SPP-SS randomly select 160 samples for training and the other 40 for testing; second, for the 19-class dataset, Pre-trained-AlexNet-SPP-SS randomly selects 25 samples for training and the other 25 for testing. To make the comparisons fair, for all competing methods we quote the best classification results reported in their papers.

C. EXPERIMENTS ON 12-CLASS SIRI-WHU DATASET

1) EVALUATION OF OUR FEATURE EXTRACTION METHOD
All techniques are performed on the 12-class dataset with 10 trials of random partitioning of the dataset. The average overall accuracies and standard deviations obtained by the various feature algorithms are presented in Table 1. From it, we can see that our proposed DS-SURF-LLC+Mean-Std-LLC+MO-CLBP-LLC method achieves the best classification performance, while our previous DS-SURF-LLC+Mean-Std-LLC+MS-CLBP method in [6] achieves the second-best results. The explanation is that our newly proposed MO-CLBP-LLC feature is more discriminative than the previous MS-CLBP descriptor.

2) EVALUATION OF OUR FEATURE-LEVEL FUSION SCHEME
To verify the advantages of our feature-level fusion scheme, we compare it with the following technique: First, three heterogeneous features, DS-SURF-LLC, Mean-Std-LLC, MO-CLBP-LLC, are combined in series. The combined feature is directly fed into a SVM classifier without dimensionality reduction.
The comparison results are shown in Table 2. As shown, the dimensionality of the fused features obtained by our method is much smaller than that of the competing approach. Hence, both the training time and the testing time of our method are much lower. In addition, although the heterogeneous features are reduced by our method, its overall accuracy is still superior to that of the method without feature reduction. This reflects that, by using our feature-level fusion, the feature discriminative capability is enhanced rather than weakened. Note that the reported training and testing times are for the whole training and testing sets, respectively.

3) EVALUATION OF OUR DECISION-LEVEL FUSION APPROACH

Table 3 shows the average overall accuracies and standard deviations for two algorithms. One is based on our proposed decision-level fusion scheme, while the other only utilizes the DS-SURF-LLC+Mean-Std-LLC+MO-CLBP-LLC feature as input and adopts an SVM classifier for image categorization. As can be seen, our classification method delivers better performance: its average overall accuracy is higher than that of the comparison method, and its performance is more stable.
Subsequently, the confusion matrix for our method via heterogeneous feature extraction and multi-level fusion on the 12-class dataset is given in Fig. 5. We can see that most classes are easily distinguished from the others, as the classification accuracies of almost all classes are above 0.90 with our method.
The major confusion occurs between class 12 (i.e., water) and class 3 (i.e., harbor), and between class 9 (i.e., pond) and class 11 (i.e., river). This is not surprising, since harbor and pond areas are filled with water, making these categories easy to confuse.

4) COMPARISON WITH STATE-OF-THE-ART METHODS
A comparison of our method's performance with previously reported results in the literature is carried out on the 12-class dataset. First, we present the classification performance of several non-deep-learning based methods, as shown in Table 4. The results demonstrate that our method produces the best performance among them. In addition, we also give results for three deep-learning based algorithms in Table 4. Among them, Pre-trained-AlexNet-SPP-SS achieves better performance than ours; however, compared with SICNNs and SRSCNN, our approach achieves superior performance.

D. EXPERIMENTS ON 19-CLASS WHU-RS DATASET

1) EVALUATION OF OUR FEATURE EXTRACTION METHOD
The classification performances of the various feature methods on the 19-class dataset are reported in Table 5. As can be seen, the results of our proposed DS-SURF-LLC+Mean-Std-LLC+MO-CLBP-LLC approach are much better than those of the other types of features, which indicates that our proposed heterogeneous features are more valuable for WHU-RS scenes.

2) EVALUATION OF OUR FEATURE-LEVEL FUSION SCHEME

Table 6 shows the experimental results of the different feature-level fusion schemes on the 19-class dataset. As shown in Table 6, our approach outperforms the competing method in both efficiency and accuracy. This confirms that our feature fusion strategy indeed contributes to scene recognition, even in the challenging situation with a smaller feature dimensionality.

3) EVALUATION OF OUR DECISION-LEVEL FUSION SCHEME
The comparison results between the traditional classification technique and the decision-level fusion based technique are shown in Table 7. For the 19-class dataset, our classification algorithm is more satisfactory. This indicates that the decision-level fusion is able to provide discriminative recognition for WHU-RS scenes.
An overview of our method's performance is shown in the confusion matrix in Fig. 6. It can be seen that many scene categories, including beach, bridge, desert, farmland, football field, and viaduct, can be fully recognized by our method. There is only a little confusion between certain scenes. For example, a scene belonging to port is recognized as bridge, which can be explained by the fact that these two classes have similar spatial patterns.

4) COMPARISON WITH STATE-OF-THE-ART METHODS

Table 8 summarizes the results of the proposed method as well as some state-of-the-art algorithms on the 19-class scenes. Among them, HF-MKL, L1R-LR, KELM, MS-CLBP+FV+KELM, and KLR are five representative non-deep-learning based methods, while Pre-trained-AlexNet-SPP-SS, VGG-VD-16, and DCA-fusion are three deep-learning based algorithms. We observe that our method is superior to many approaches, while only KLR, VGG-VD-16, and DCA-fusion achieve better performance than ours. Hence, we will seek more informative features to further enhance the representation power in the future. A feasible way is to adopt deep convolutional neural network (CNN) models as feature extractors to extract more informative features from the original remote sensing images.

E. EXPERIMENTS ON 21-CLASS UC MERCED LAND USE DATASET

1) EVALUATION OF OUR FEATURE EXTRACTION METHOD

Table 9 shows the classification results of the different feature methods on the 21-class dataset. It is clear that our feature method can dramatically improve the classification accuracy, and its performance remains stable compared to the others. This suggests that our strategy can obtain a more holistic and discriminative representation of scenes, which is beneficial to effective scene classification.

2) EVALUATION OF OUR FEATURE-LEVEL FUSION SCHEME
In Table 10, the dimensionality of the fused feature without feature reduction is 19200, while that obtained by our scheme is only 120. Also, the training time of the former is about 110 times that of the latter, and the testing time of the former is 67.50 s, which is much higher than that of the latter. By contrast, the accuracy of the latter is superior to that of the former. This again confirms that, although our method involves feature reduction, it still performs well because the reduced features retain good discriminative power.

3) EVALUATION OF OUR DECISION-LEVEL FUSION SCHEME
We conduct two different classification experiments, using the traditional classifier and our decision-level fusion based classification technique, on the 21-class dataset, and the corresponding results are shown in Table 11. It is clear that the overall performance of ours is better than that of the comparison algorithm. The results verify the benefits of the decision-level fusion approach we design.
An overview of the best performance from one run of our method over all 21 categories is given by the confusion matrix presented in Fig. 7. Performance is measured as the average classification accuracy per class. In total, 12 classes of the UC Merced Land Use scenes achieve 100% accuracy with our method. Of course, there are a few confusions between some categories. The most difficult categories are medium residential and tennis court, which can be explained by the fact that the two scenes share similar spatial distributions.

4) COMPARISON WITH STATE-OF-THE-ART METHODS

Table 12 shows the classification performance of all competing methods on the 21-class dataset. From the results, we can observe that the proposed method delivers better performance than other algorithms, including HF-MKL, SFF-FSTM, DMTM, MS-CLBP+FV+KELM, L1R-LR, UFLA, SICNNs, SRSCNN, and VGG-VD-16, while LGFBOVW, Pre-trained-AlexNet-SPP-SS, and DCA-fusion achieve slightly higher accuracies.
Finally, as reported in Tables 4, 8, and 12, we find that in some situations our proposed method achieves better performance than some deep-learning based algorithms, while in other situations it is inferior to certain deep-learning approaches. After careful analysis, we believe that the reason some deep learning methods are inferior to ours may lie in the characteristics of deep learning, such as the need for a large amount of training data and high-performance hardware. Hence, in our future work, we plan to explore deep-learning based methods and try to combine the proposed method with them to further enhance the scene classification performance.

IV. CONCLUSION
In this paper, we propose an effective scene classification method based on heterogeneous feature extraction and multi-level fusion for remote sensing images. In our method, the heterogeneous feature extraction strategy enhances the ability of feature characterization. The feature-level fusion scheme reduces the dimensionality of the heterogeneous features so as to speed up the method, while obtaining powerful fused features for the subsequent scene classification task. The decision-level fusion technique further benefits effective scene classification. Experimental results on three challenging scene datasets demonstrate that our method achieves performance that is better than or comparable to state-of-the-art algorithms.