Aggregating Features From Dual Paths for Remote Sensing Image Scene Classification

Scene classification is an important and challenging task in the understanding of remote sensing images. Convolutional neural networks have been widely applied to remote sensing scene classification in recent years, boosting classification accuracy. However, with improvements in resolution, the categories of remote sensing images have become ever more fine-grained. High intraclass diversity and interclass similarity are the main characteristics that differentiate remote sensing scene classification from natural image classification. To extract discriminative representations from images, we propose an end-to-end feature fusion method that aggregates features from dual paths (AFDP). First, lightweight convolutional neural networks with fewer parameters and calculations are used to construct a feature extractor with dual branches. Then, in the feature fusion stage, a novel feature fusion method that integrates the concepts of bilinear pooling and feature connection is adopted to learn discriminative features from images. The AFDP method was evaluated on three public remote sensing image benchmarks. The experimental results indicate that the AFDP method outperforms current state-of-the-art methods while offering a simple form, strong versatility, fewer parameters, and less calculation.


I. INTRODUCTION
The development of satellite technology and Earth observation systems in recent years has led to the capture of massive amounts of remote sensing images. Accurate classification and understanding of these images enable visualization of the developments and changes of Earth's surface to improve the living environments of human beings [1]-[3]. Remote sensing scene classification is the basic, and one of the most challenging, tasks of remote sensing image interpretation. Its main purpose is to recognize aerial or satellite imagery according to predefined category information [1]. Therefore, the classification of high-resolution remote sensing image scenes has important and extensive applications, such as forest and land investigation, urban planning, object detection, map updating, and monitoring of geological disasters [3], [4]. However, with improvements in image resolution and the availability of increasing amounts of data, the categories of remote sensing images are becoming ever more diversified and fine-grained. Further, owing to the top view of the images and the variance in resolution, objects in remote sensing imagery have multiple orientations and scales. These factors present significant challenges to the accurate classification of remote sensing image scenes.
The essence of scene classification is image classification, which includes two main steps: feature extraction from the image and classification of the extracted feature. Feature extraction is the key step, and the design of a stable and reliable method for feature extraction is the main research direction for improving classification accuracy. Before the breakthrough of deep learning, scene classification of remote sensing images mainly relied on handcrafted features. These handcrafted features usually include color [5], texture [6], gradient histograms [7], and SIFT [8]. These features are simple and easy to obtain; however, their direct use cannot achieve satisfactory performance, mainly because they are low-level features that cannot effectively represent the global semantic information of images. By feature engineering or recoding low-level features, local information of images can be integrated and optimized to obtain middle-level features, which improves the representation ability of the global information of images. Typical methods include bag of visual words (BOVW) [9], spatial pyramid matching (SPM) [10], spatial co-occurrence kernel (SCK) [11], latent Dirichlet allocation (LDA) [12], the Fisher kernel (FK) [13], and the vector of locally aggregated descriptors (VLAD) [14]. When the categories of remote sensing images are relatively simple and easily distinguishable, these methods are capable of obtaining good results. However, they have a weak ability to deal with high-resolution remote sensing images with diverse and fine-grained categories. Moreover, methods using handcrafted features are usually difficult to run end-to-end and in real time, which also limits their practicability for massive volumes of remote sensing images.
In recent years, deep learning, especially convolutional neural networks (CNNs), has made significant progress in image classification and recognition [15]-[17]. With massive amounts of data and high-performance hardware, the accuracies of CNNs such as AlexNet [15], VGGNet [16], and ResNet [17] in image classification have far exceeded those of handcrafted features. The application of CNNs to scene classification and object extraction in remote sensing imagery also significantly improves the performance of remote sensing image interpretation [1], [2], [18]. However, unlike natural images in ImageNet [19], remote sensing images have three main characteristics [1]: arbitrary orientations, intraclass difference, and interclass similarity. A remote sensing image is a top view of Earth, which differs from a natural image, and the orientation of objects in the image is arbitrary. Furthermore, compared with natural images, remote sensing images have lower resolution, more scale variance, and a wider spatial range. Moreover, multiple objects usually coexist, which gives remote sensing images complicated and diverse backgrounds. Consequently, the differences between images within the same category are enhanced, while the similarity between images in different categories also increases significantly. Overall, these factors lead directly to difficulties in extracting strongly robust and discriminative features with general CNNs.
Based on the above discussion, to extract robust and distinguishable features from remote sensing images, we designed a novel end-to-end feature fusion method for remote sensing image scene classification with the idea of low-rank bilinear pooling [20] and feature connection. First, we use the pre-trained CNN to construct a dual-branch architecture to extract features from images. To reduce the parameters and computation of the network as much as possible, we prefer lightweight CNNs. Then, to obtain high-order features from convolutional neural networks, the corresponding elements from the double-branched feature maps are multiplied and normalized. Finally, the high-order features and low-order features are fused by connection. The major contributions of this paper are as follows.
(1) Based on lightweight convolutional neural networks, we designed a dual-path feature fusion method, which improves the distinguishability and richness of features extracted by convolutional neural networks with fewer parameters and calculations.
(2) In the process of feature fusion, we fused the high-order features and low-order features of remote sensing images. After fusion, the semantic information is much richer compared with feature addition or connection only.
(3) Extensive experiments on three widely used remote sensing image data sets show that, compared with the existing methods, our method not only has obvious advantages in performance but also has fewer parameters and calculations.
The remainder of this paper is organized as follows. Section II summarizes the latest research on remote sensing image scene classification using convolutional neural networks. Section III describes the architecture used for scene classification in detail. The proposed method is evaluated, and extensive experiments conducted on three widely used data sets are outlined in Section IV. Finally, we conclude in Section V.

II. RELATED WORK
The application of CNNs for remote sensing image scene classification has been extensively studied. The methods employed can be summarized into three categories: transfer learning, improvement on the structure of CNN, and optimization of the loss function.
Transfer learning usually relies on pre-trained CNNs, whose parameters have been obtained by training on ImageNet. All or some parameters of the CNNs are fixed, and the CNNs are regarded as feature extractors. Subsequently, the features extracted by the CNNs are used for classification. Alternatively, the parameters can be employed as initial values, finetuning all or some parameters of the CNN using remote sensing images. For example, Penatti et al. [21] applied CNNs to the classification task of remote sensing images and demonstrated their promising performance. Hu et al. [22] explored the ability of features from different layers of CNNs. Cheng et al. [23] constructed a bag of convolutional features extracted by a CNN. Through extensive experiments, Nogueira et al. [24] found that among the three strategies of retraining, finetuning, and using the CNN as a feature extractor, the latter two achieve better results than the first. The transfer learning approach mainly appeared in the early stages of scene classification using CNNs. Its advantage is that it does not require the design of a new network and is still capable of achieving favorable results. However, because transfer learning does not change the main structure of the CNNs, and remote sensing images have their own particularities, CNNs designed for natural image classification may not be well suited to remote sensing images. Consequently, the capability of a CNN is not completely exploited with remote sensing images.
Improving the structure of an existing CNN can significantly boost its accuracy in remote sensing image scene classification, which is also the currently employed mainstream approach. Ma et al. [25] used multi-objective neural evolution to search architectures for scene classification. Xie et al. [26] designed a scale-free CNN that can support any image size. Wang et al. [27] attempted an orientation response network to deal with the orientation problem of objects in remote sensing imagery. Zhang et al. [28] introduced positional attention in ResNet50. Zhang et al. [29] and other researchers [30]-[32] used channel attention, and Cao et al. [33] adopted self-attention, to enhance the features from CNNs. Wang et al. [34] employed recurrent attention, and Bi et al. [35] used residual attention in the CNN. Besides attention, feature fusion is also a good choice. For example, Li et al. [36] and Yu et al. [37] designed a dual-path architecture to extract features for fusion. He et al. [38] obtained the second-order features of an image using covariance pooling, and Yao et al. [39] adopted multiple pyramid pooling to generate a fixed-length representation regardless of image size. Chaib et al. [40] adopted discriminant correlation analysis (DCA) for feature fusion. Lu et al. [41] integrated the features from different layers of a CNN. Du et al. [42] and Xue et al. [43] combined features extracted by multiple CNNs. Improving the structure of a CNN, particularly via feature fusion, has demonstrated outstanding results in scene classification. Among the methods based on CNNs, feature fusion is a popular and efficient approach to obtaining robust and discriminative representations. However, many remote sensing image scene classification methods adopt deep CNNs such as VGG and ResNet; although good results can be obtained, their parameters and calculations are huge. Therefore, designing lightweight scene classification methods that achieve high accuracy remains quite challenging.
Further, some scholars have begun to focus on the loss function used for classification. They replaced the cross-entropy loss function with other loss functions originally used in face recognition during training. Wei et al. [44] adopted both marginal-center loss and cross-entropy loss to minimize the distance of features within a class in the feature space. Liu et al. [45] used three loss functions on triplet networks to improve classification accuracy. Cheng et al. [46] used metric learning to further compact the features extracted by the CNN. In the existing literature, improvements to the loss function did not lead to a significant improvement in performance. Designing a novel CNN architecture with an efficient feature fusion method is therefore still key to improving the representation capability for remote sensing image scene classification.

III. PROPOSED METHOD

A. DUAL BACKBONES FOR FEATURE EXTRACTION AND FEATURE TRANSFORMATION
A complete CNN for image classification consists of convolutional layers, pooling layers, and fully connected layers. Dozens of convolutional and pooling layers can be regarded as a feature extractor, which converts the input image into a set of feature maps. By global pooling of each channel in the feature maps, a global descriptor can be obtained, thus transforming a set of feature maps into a feature vector for classification. Because of the different performances of various CNNs, the distinguishability of the extracted features also varies. Multiple CNNs can be used to extract multiple features from a single image. These features can be fused in a certain way, and the fused feature usually has stronger robustness and richer semantic information than the feature from a single CNN. However, the simultaneous use of multiple CNNs means that the parameters and computation are multiplied; hence, performance and computational complexity must be balanced. To achieve this goal, lightweight and efficient CNNs are preferred. In the existing research (e.g., [37], [42], [43]), the use of multiple networks for feature extraction has proven to be an effective way to improve the accuracy of scene classification.

In the feature extraction stage, we aim to reduce the parameters and calculation of the model as much as possible while maintaining acceptable performance. MobileNetv2 [47] and MobileNetv3 [48] are the best choices for this purpose. The top-1 accuracy of MobileNetv2 on ImageNet is 72.0%, with 3.4 million parameters, and the top-1 accuracy of MobileNetv3 on ImageNet is 75.2%, with 5.4 million parameters. In contrast, VGG16 achieves a top-1 accuracy of 72.0% on ImageNet with about 130 million parameters, and ResNet50 has more than 25 million parameters. Nevertheless, VGG16 and ResNet50 remain the main structures adopted by most scene classification methods at present. MobileNetv2 and MobileNetv3 achieve a good balance between accuracy and calculation, largely owing to the use of depth-wise separable convolutional layers. Further, MobileNetv2 adopts an inverted residual and linear bottleneck, and MobileNetv3 inherits the main structure of MobileNetv2 while employing channel attention, a novel activation function, and other improvements.

The final convolutional layer and classification layer of MobileNetv2 and MobileNetv3 are removed, leaving the rest of the structure as the backbones of our architecture. When the size of the input image is 256 × 256 pixels, the feature extraction layers of MobileNetv2 produce a set of feature maps with a size of 320 × 8 × 8, and the feature extraction layers of MobileNetv3 produce a set of feature maps with a size of 160 × 8 × 8. To ensure that the two sets of feature maps have the same number of channels, we add a transformation layer after each backbone. Each transformation layer contains a convolutional layer, a batch normalization layer, and a ReLU activation layer. The number of convolution kernels of the convolutional layer in each transformation layer is 512, and the size of the convolution kernel is 1 × 1, such that both sets of feature maps are transformed to the same size of 512 × 8 × 8.
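The following minimal PyTorch sketch (not the authors' released code) illustrates the dual backbones and the 1 × 1 convolutional transformation layers described above; the torchvision model names and the truncation points are assumptions consistent with the channel sizes given in the text.

```python
# Sketch of the dual backbones and 1x1-conv transformation layers (assumed torchvision backbones).
import torch
import torch.nn as nn
from torchvision import models

class TransformLayer(nn.Module):
    """1x1 convolution + batch norm + ReLU mapping a backbone's feature maps to 512 channels."""
    def __init__(self, in_channels, out_channels=512):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Backbones with the final conv/classifier removed (feature extractors only).
mv2 = models.mobilenet_v2(pretrained=True).features[:-1]        # ends with 320 channels
mv3 = models.mobilenet_v3_large(pretrained=True).features[:-1]  # ends with 160 channels
t2, t3 = TransformLayer(320), TransformLayer(160)

x = torch.randn(1, 3, 256, 256)
f2 = t2(mv2(x))   # (1, 512, 8, 8)
f3 = t3(mv3(x))   # (1, 512, 8, 8)
```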

B. FEATURE AGGREGATION
Currently, there are three widely used feature fusion methods: addition, connection, and bilinear pooling. Feature addition refers to the addition of the corresponding values of two features, which are usually extracted by CNNs. This fusion method requires the sub-features to have the same dimension, and the fusion process does not change the dimension of the sub-features. Feature connection refers to the concatenation of multiple features along a certain dimension, and the dimension of the fused vector is the sum of the dimensions of all sub-features. Feature addition and feature connection are the simplest and most widely used methods for feature fusion. For example, Chaib et al. [40] attempted to conduct feature fusion using addition, concatenation, and discriminant correlation analysis (DCA). The DCA fusion method exhibited good performance with low-dimensional features, in contrast to the high-dimensional features produced by addition or concatenation. This indicates that there may be much redundant information or noise in high-dimensional features after simple addition and concatenation. Bilinear pooling [49] is another feature fusion method applied to fine-grained visual classification; its core idea is the outer product of features. The main process of bilinear pooling includes a series of operations, specifically, the outer product, pooling, signed square root, and normalization, to obtain features with second-order information. Bilinear pooling is usually performed between the feature maps from the convolutional layers of two different CNNs. The dimension of the fused feature is the product of the dimensions of the two feature maps, which causes a dimensional explosion and extensive calculation. Kim et al. [20] replaced the outer product in bilinear pooling with the Hadamard product and obtained robust features for fine-grained visual classification, significantly reducing the calculation.
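To make the dimensionality difference concrete, the following small illustration (not taken from the paper) contrasts the outer product used in classic bilinear pooling with the Hadamard product for two 512-dimensional pooled features.

```python
# Feature-fusion dimensionality: outer product vs. Hadamard product.
import torch

v1 = torch.randn(512)                    # pooled feature from branch A
v2 = torch.randn(512)                    # pooled feature from branch B

outer = torch.outer(v1, v2).flatten()    # classic bilinear pooling: 512 * 512 = 262,144 dimensions
hadamard = v1 * v2                       # Hadamard-product variant: stays 512-dimensional

print(outer.shape, hadamard.shape)       # torch.Size([262144]) torch.Size([512])
```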
Inspired by the above research, we designed a novel feature fusion method integrating the ideas of bilinear pooling and feature connection. Suppose that the two sets of feature maps extracted from the image I by A and B (A and B are both CNNs) are denoted as X and Y, with $X \in \mathbb{R}^{h \times w \times a}$ and $Y \in \mathbb{R}^{h \times w \times b}$, where h and w are the height and width of the feature maps, and a and b denote the numbers of channels of the feature maps (usually A and B are two different CNNs; in the proposed AFDP architecture, a = 320, b = 160, h = 8, and w = 8). First, a projection transformation is applied to the two sets of feature maps to map them to the same dimension. This is achieved by the transformation layers in our architecture:

$$\tilde{X} = XQ, \qquad \tilde{Y} = YR$$

where $\tilde{X} \in \mathbb{R}^{h \times w \times o}$ and $\tilde{Y} \in \mathbb{R}^{h \times w \times o}$; Q and R represent the projection matrices realized by the transformation layers, with $Q \in \mathbb{R}^{a \times o}$ and $R \in \mathbb{R}^{b \times o}$; and o is the number of channels of the deep features after the transformation (o = 512 in the proposed AFDP architecture).
The transformation layer has two main purposes. The first is to transform the dimension of the feature maps and eliminate the adverse effects during feature fusion caused by the different dimensions of the two features. The second is that the convolutional layer of the transformation layer can be regarded as a re-weighting of the two features before fusion, realizing an automatic adjustment of the weights between the different features. The transformed feature maps $\tilde{X}$ and $\tilde{Y}$ both have a size of h × w × o. Then, the feature maps are converted into feature vectors using global average pooling:

$$Z_X = F_{Avg}(\tilde{X}) = \frac{1}{h \times w}\sum_{i=1}^{h \times w}\tilde{X}_i, \qquad Z_Y = F_{Avg}(\tilde{Y}) = \frac{1}{h \times w}\sum_{i=1}^{h \times w}\tilde{Y}_i$$

where $F_{Avg}$ denotes the global-average-pooling function, $\tilde{X}_i \in \tilde{X}$ and $\tilde{Y}_i \in \tilde{Y}$ are the channel vectors at the i-th spatial position, and $Z_X$ and $Z_Y$ are the feature vectors obtained from the feature maps $\tilde{X}$ and $\tilde{Y}$, respectively. A fully connected layer is used to perform a nonlinear transformation on each of the feature vectors $Z_X$ and $Z_Y$ (the fully connected layer does not change the dimension of the feature), and the results are denoted as $V_X$ and $V_Y$. By performing an element-wise product operation and a square-root operation on the feature vectors $V_X$ and $V_Y$, a new feature vector $V_{XY}$ is generated:

$$V_{XY} = \sqrt{V_X \odot V_Y}$$

where $\odot$ denotes the element-wise (Hadamard) product. The final vector U, whose size is 1 × 3o (U is a 1536-dimensional feature vector in the proposed AFDP architecture), is obtained by normalizing $V_X$, $V_Y$, and $V_{XY}$ and connecting the normalized vectors:

$$U = \left[\,\mathcal{N}(V_X),\ \mathcal{N}(V_Y),\ \mathcal{N}(V_{XY})\,\right]$$

where $\mathcal{N}(\cdot)$ denotes the normalization operation.
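As a hedged sketch of the aggregation described above (the layer names and the choice of ReLU after the fully connected layers are assumptions, not the authors' exact implementation), the fusion can be written in PyTorch as follows.

```python
# Sketch of the dual-path feature aggregation: GAP -> FC -> Hadamard product + sqrt -> normalize + concat.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFDPFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.fc_x = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(inplace=True))
        self.fc_y = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(inplace=True))

    def forward(self, fx, fy):
        # fx, fy: (B, 512, 8, 8) feature maps from the two transformation layers
        zx = F.adaptive_avg_pool2d(fx, 1).flatten(1)   # global average pooling -> (B, 512)
        zy = F.adaptive_avg_pool2d(fy, 1).flatten(1)
        vx, vy = self.fc_x(zx), self.fc_y(zy)          # nonlinear transform, dimension unchanged
        vxy = torch.sqrt(vx * vy + 1e-8)               # Hadamard product + square root (epsilon for stability)
        # Normalize each vector (L2 here) and concatenate -> (B, 1536)
        u = torch.cat([F.normalize(vx, dim=1),
                       F.normalize(vy, dim=1),
                       F.normalize(vxy, dim=1)], dim=1)
        return u
```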

C. FEATURE CLASSIFICATION
After feature fusion, a feature vector with a dimension of 1536 is obtained, which encodes the global semantic information of the remote sensing scene image. The obtained feature vector includes both second-order information (i.e., that obtained by bilinear fusion of the feature vectors from the dual-path CNN) and first-order information (i.e., that obtained by normalization of the feature vectors from the dual-path CNN), and it performs well in image representation. In the training process, we adopt the cross-entropy loss function.
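A minimal sketch of this classification stage, assuming a single linear classifier over the fused 1536-dimensional vector (the exact classifier layout is not spelled out in the text):

```python
# Classification head over the fused feature and the cross-entropy training criterion.
import torch.nn as nn

num_classes = 45                         # e.g., NWPU-RESISC45; dataset-dependent
classifier = nn.Linear(3 * 512, num_classes)
criterion = nn.CrossEntropyLoss()

# logits = classifier(u)                 # u: (B, 1536) from the fusion stage
# loss = criterion(logits, labels)
```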

IV. EXPERIMENTAL EVALUATION
The proposed method is evaluated in this section. First, the data sets employed and the experimental setup are introduced in detail. Subsequently, the performance of our method is compared with that of state-of-the-art methods. Further, the proposed feature fusion method is also applied to other CNNs, such as ResNet18 and ResNet34, to verify its universality.

A. DATA SET
We carried out experiments on three widely used data sets for remote sensing image scene classification, namely UC Merced [9], AID [50], and NWPU-RESISC45 [2]. The basic information of the three data sets is as follows.
The UC Merced data set has 21 categories, and each category contains 100 color RGB images with a size of 256 × 256 pixels. There are 2,100 images in total, with a spatial resolution of approximately 1 m. Some image examples of the UC Merced data set are displayed in Fig. 2. The UC Merced data set was released in 2010, and its images were extracted from the United States Geological Survey National Map, covering more than 20 regions of the United States. There is great similarity between different categories, such as sparse residential, medium residential, and dense residential areas, which are annotated according to the density of residential buildings in the image. In addition, buildings and mobile home parks have many similarities. The confusion between these categories is the main reason why it is difficult to improve classification accuracy further. In the experiment, 20%, 50%, and 80% of the images (i.e., training ratios) in each category were selected as training images, and the remaining images were used as test images.
The AID data set consists of 30 categories and 10,000 color RGB images in total. The number of images in each category ranged from 220 to 420. The image size is 600 × 600 pixels, and the image spatial resolution is approximately 0.5-8 m.
Some image examples of the AID data set are displayed in Fig. 3. The AID data set was released in 2017 with images from Google Earth covering numerous countries and regions. In the experiment, 20% and 50% of the images in each category were selected as the training images, and the rest were the test images (i.e., training ratios were 20% and 50%).
The NWPU-RESISC45 data set expands the number of categories to 45, where each category contains 700 images, amounting to a total of 31,500 RGB images. The images in NWPU-RESISC45 cover more than 100 countries and regions under different weather conditions and seasons. The image size is 256 × 256 pixels, and the spatial resolution of the images varies from 0.2 to 30 m. Therefore, the NWPU-RESISC45 data set is one of the most challenging data sets at present. In the experiment, 10% and 20% of the images were selected as the training set, and the remaining data were used as the test set (i.e., training ratios were 10% and 20%). Some image examples of the NWPU-RESISC45 data set are displayed in Fig. 4.
To avoid the influence of random factors, the training images were randomly selected for each data set and each training proportion, and mean values were calculated over five experiments. To fully exploit the limited training images and avoid overfitting, some data augmentation techniques were employed. The images in the training set were rotated clockwise by 90°, 180°, and 270°, and flipped horizontally and vertically.
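A sketch of this augmentation scheme (one possible reading in which the variants are generated offline with Pillow; the original implementation details are not given):

```python
# Offline augmentation: 90/180/270-degree rotations plus horizontal and vertical flips.
from PIL import Image

def augment(img: Image.Image):
    """Return the original image together with its rotated and flipped variants."""
    variants = [img]
    # PIL rotates counterclockwise; for multiples of 90 degrees the resulting set of
    # images is the same as with clockwise rotation.
    variants += [img.rotate(angle, expand=True) for angle in (90, 180, 270)]
    variants.append(img.transpose(Image.FLIP_LEFT_RIGHT))   # horizontal flip
    variants.append(img.transpose(Image.FLIP_TOP_BOTTOM))   # vertical flip
    return variants
```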

B. EVALUATION OF PROTOCOLS AND PARAMETER SETTING
For the remote sensing image scene classification task, there are currently two evaluation protocols: the overall accuracy (OA) and the confusion matrix. Overall accuracy is defined as the number of correctly classified images divided by the total number of test images. Furthermore, we report the standard deviation over five runs. The confusion matrix shows the details of the correct and incorrect classifications of each category in a visual form. Each row of the matrix represents the true category, and each column represents the predicted category.
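For reference, both protocols reduce to a few lines of NumPy (a minimal sketch, not tied to the paper's evaluation code):

```python
# Overall accuracy (OA) and confusion matrix (rows = true class, columns = predicted class).
import numpy as np

def overall_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

def confusion_matrix(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```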
We used PyTorch, a popular deep learning library, to build the architecture. The parameters of the architecture in the training process were set as follows. The pretrained weight of MobileNetv2 and MobileNetv3 was loaded as the initial value. The initial learning rate of the feature extractor was 0.01, and the initial learning rate of other parts of the architecture was 0.1. The learning rate was reduced 0.5 times every ten epochs, and the epoch number was 30. The model was optimized using stochastic gradient descent (SGD). The weight decay was 0.00001, and the momentum value was 0.9. The batch size was 32.
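These settings translate into the following PyTorch sketch; `backbone` and `head` are placeholders standing in for the two MobileNet branches and the remaining layers (transformation, fusion, and classification), respectively.

```python
# Training setup sketch matching the reported hyperparameters (placeholder modules).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 16, 3))   # placeholder for the pretrained MobileNet branches
head = nn.Sequential(nn.Linear(1536, 45))       # placeholder for transformation/fusion/classifier layers

optimizer = torch.optim.SGD(
    [
        {"params": backbone.parameters(), "lr": 0.01},  # feature extractor
        {"params": head.parameters(), "lr": 0.1},       # newly added layers
    ],
    momentum=0.9,
    weight_decay=1e-5,
)
# Multiply the learning rate by 0.5 every 10 epochs; train for 30 epochs with batch size 32.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... one pass over the training set (batch size 32) with cross-entropy loss ...
    scheduler.step()
```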

C. COMPARISON WITH STATE-OF-THE-ART METHODS

1) CLASSIFICATION ON THE UC MERCED DATA SET
In the task of remote sensing image scene classification, the performance of methods using CNNs has far exceeded that of methods using handcrafted features. On the UC Merced data set, we compared the performance of the proposed AFDP with other state-of-the-art methods from the last four years, as shown in Table 2. Table 2 shows that when 50% and 80% of the images were used for training, the proposed method was superior to almost all of the latest algorithms, with overall accuracies reaching 98.81% and 99.27%, respectively. We found that when the training ratio was 80%, the overall accuracies of the latest methods reached saturation (approximately 99%), and there was little room for further improvement. This is mainly because, when the UC Merced data set was released, deep learning had not yet made such a breakthrough; the data set was mainly used to evaluate handcrafted features in remote sensing image classification, so its categories and images are relatively simple. On the other hand, there are far more training images than test images (80% vs. 20%), and this setting is inappropriate: even when an algorithm achieves a high classification accuracy at the training ratio of 80%, it is still difficult to determine whether the algorithm is overfitting. Therefore, a new evaluation benchmark is needed, which we address by introducing training ratios of 20% and 50%. Compared with ARCNet, D-CNNs, FACNN, GBNet, and CapsNet, whose classification accuracies were close to ours when the training ratio was 80%, our method exceeded the performance of ARCNet, GBNet, and CapsNet by 2.00%, 1.76%, and 1.22%, respectively. D-CNNs and FACNN did not report their accuracies at the training ratio of 50%, but they both adopted VGG16 as the backbone, whereas our method adopts MobileNet with fewer parameters and calculations; thus, our method has evident advantages. Although MSDFF had a performance similar to that of AFDP, it adopted three deep CNNs to extract features and fused them into a 6144-dimensional vector for classification, whereas our method employs lightweight CNNs as feature extractors and obtains a 1536-dimensional vector. Few methods used a 20% training ratio to validate their generalization, as deep CNNs are prone to overfitting with such a small number of training images. For example, the overall accuracy of ResNet50 in the literature [53] was only 74.11%, and that of Siamese ResNet50 was only 76.50%. Fusion by addition used pre-trained CNNs to extract features and subsequently an SVM for classification; although the trainable parameters were significantly reduced, the accuracy was only 92.96%, which is far lower than the 97.19% of our method, as expected. Using a pre-trained model to extract features from an image can reduce the difficulty of training; however, it cannot fully release the potential of the CNN in remote sensing scene classification. At the same time, the simple addition of features also introduces noise, and the effect of this fusion method is inevitably inferior to that of the proposed method, which integrates second-order and first-order information. Fig. 5 shows the confusion matrices obtained by AFDP at training ratios of 20% and 50%. At the training ratio of 20%, the classification accuracy of 13 categories exceeded 99%, and fewer than one sample (80 × 0.01 = 0.8) on average was misclassified in each category. Of the 21 categories, four were easily misclassified.
They were dense and medium residential areas, mobile home parks, and storage tanks; 12% of dense residential area images were misclassified as medium residential areas, 5% of medium residential areas were misclassified as dense residential areas, and 4% of mobile home parks were misclassified as medium residential areas. Three percent of storage tank images were misclassified as buildings and intersections. At the training ratio of 50%, there were 20 categories whose accuracies were above 96% and 15 categories whose accuracies reached 100%.
The three categories with the lowest accuracy were dense residential areas (90%), medium residential areas (96%), and storage tanks (98%). Dense and medium residential areas are prone to confusion, mainly because they have very similar features and semantic information; they are annotated according to the density of buildings, and even human beings cannot distinguish them reliably. SCCov16 adopted covariance pooling to obtain the second-order information of the image. However, even at the training ratio of 80%, the three categories with the lowest accuracy of SCCov16 were dense residential areas (90%), medium residential areas (85%), and storage tanks (95%), values lower than those obtained by our method. This demonstrates that, compared with other second-order information, the semantic information obtained by AFDP has a stronger representation ability.

2) CLASSIFICATION ON THE AID DATA SET
The performance comparison between AFDP and the state-of-the-art methods on the AID data set is shown in Table 3. Table 3 shows that, regardless of whether the training ratio is 20% or 50%, the proposed AFDP exhibited a better classification performance than the other compared methods, with overall accuracies reaching 96.36% and 97.49%, respectively, when the input size was 256 × 256 pixels. When the training ratio was 50%, although the accuracies of numerous methods exceeded 96%, such as DCNN (96.89%), SF-CNN (96.66%), and MDPMNet (97.14%), they were still lower than that of our method (97.49%), and DDRL-AM (92.36%) lagged considerably further behind. Furthermore, the image size of the AID data set is 600 × 600 pixels, and current methods usually resize the images to 256 × 256 or 224 × 224 pixels before training. Image scaling inevitably leads to the loss of detailed information, which harms feature extraction and classification. When we resized the AID images to 448 × 448 pixels for our experiments (limited by computing capability, the batch size was set to 16), the overall accuracies were boosted to 96.78% and 97.86%, respectively. The classification accuracy of the AFDP with a training ratio of 20% exceeded the accuracies of most methods with a training ratio of 50%. When the input size was 448 × 448 pixels, the accuracies were improved by 0.42% and 0.37%, respectively, compared with 256 × 256 pixels. Fig. 6 shows the confusion matrices of the AFDP when the training ratios were 20% and 50%. When the training ratio was 20%, the four most easily confused categories had the lowest accuracies, namely center (85%), resort (86%), school (86%), and square (89%). When the training ratio was 50%, their accuracies were boosted to 92%, 88%, 93%, and 91%, respectively, while the four categories with the lowest accuracies for SCCov16 were church (94.2%), resort (76.6%), school (88.7%), and square (83.6%). These four categories were easily confused mainly because their images usually contain similar objects, such as buildings and trees, which results in images with similar features.
Therefore, improving the model's performance to recognize these easily confused categories is key to improving the overall accuracy.

3) CLASSIFICATION ON NWPU-RESISC45 DATA SET
For the NWPU-RESISC45 data set, the performance of the state-of-the-art methods and AFDP is shown in Table 4. At training ratios of 10% and 20%, the overall accuracy of AFDP reached 93.32% and 95.07%, respectively, which exceeded all other methods. Further, Table 4 shows that when the training ratio was 10%, the accuracy of AFDP had already significantly surpassed that of most methods at a training ratio of 20%. At the training ratio of 20%, only three methods had an overall accuracy of over 94%: ADSSM, MDPMNet, and Hydra; our method improved on them by 0.78%, 0.96%, and 0.56%, respectively. At the training ratio of 10%, our method improved on them by 1.63%, 1.52%, and 0.88%, respectively. Apart from MDPMNet, all methods listed in Table 4 adopted deep CNNs, such as VGG16 and ResNet50. The disadvantages of deep CNNs include too many parameters and calculations, low speed, and high time consumption. MDPMNet can be regarded as a lightweight model; however, it did not achieve a desirable result at a training ratio of 10%, and even at the training ratio of 20%, it remained inferior to our method. ADSSM used multilayer stacked covariance pooling in VGG16 and adopted an SVM as the classifier, which is not an end-to-end method. Hydra, whose performance was closest to that of AFDP, used multiple deep CNNs (ResNet50 and DenseNet-161) to achieve its classification accuracy, so its parameters and calculations are excessive. In terms of model complexity and performance, the proposed AFDP based on lightweight CNNs therefore offers a favorable balance. Fig. 7 shows the confusion matrices of AFDP on the NWPU-RESISC45 data set at training ratios of 10% and 20%. The classification accuracy of most categories was above 90% for both training ratios, and the two categories with the lowest accuracy were church (78%, 83%) and palace (73%, 79%). As these have similar textures and buildings, they are easily confused, and even human beings find them difficult to distinguish. In comparison, when the training ratio was 20%, church and palace achieved accuracies of 83% and 79% with our method, whereas the accuracies of these two categories were 75% and 64% for the VGG16 method, 79% and 73% for DCNN, and 81.1% and 77.3% for MDPMNet, respectively.

D. ABLATION STUDIES FOR THE PROPOSED METHOD

1) COMPARISON OF DIFFERENT FEATURE FUSION METHODS
To further study the effect of fusion methods on the performance of CNNs, we compared Concat, Addition, and AFDP with the baselines. Finetuning MobileNetv2 on the three data sets was named MV2, and finetuning MobileNetv3 was named MV3; MV2 and MV3 were the baselines, as we did not change any structure of MobileNetv2 or MobileNetv3. Concatenating the two 1280-dimensional feature vectors extracted by MobileNetv2 and MobileNetv3 was named Concat, adding the two 1280-dimensional feature vectors was named Addition, and feature fusion with AFDP based on MobileNetv2 and MobileNetv3 was named AFDP. The performances of the five methods on the three data sets are shown in Table 5. Table 5 indicates that MV3 exhibits better performance than MV2, as expected; MobileNetv3 is an improved version of MobileNetv2, and its classification performance is significantly better. Particularly in the case of no data augmentation, the accuracy of MV3 was approximately 2% higher than that of MV2 on the three data sets on average. For the UC Merced data set (TR = 20%), the improvement was more obvious: the accuracy of MV3 reached 91.92%, whereas that of MV2 was 84.95%, and the accuracy standard deviation of MV2 was larger (the standard deviation of the five MV2 results was 2.07%). This demonstrates that when training data are insufficient, the performance of MV2 is far inferior to that of MV3 and is strongly influenced by random factors such as the selection of the training images and the training process. Data augmentation can reduce the performance gap between MV3 and MV2. With data augmentation, the accuracy of MV2 was improved to 94.19% on the UC Merced (TR = 20%) data set, and the gap with MV3 was significantly reduced. With data augmentation on the three data sets, the accuracy of MV3 was approximately 1% higher than that of MV2 on average. This shows that simple data augmentation can effectively improve the accuracy of remote sensing image scene classification.
Concat and Addition, two commonly used feature fusion methods, did not show significant improvements in accuracy on any of the data sets compared with MV3. Although the accuracies of Concat and Addition were higher than that of MV2, they were only similar to that of MV3. This indicates that these two fusion methods did not fully exploit the advantages of MobileNetv3 and MobileNetv2; the more robust features from MobileNetv3 were dominant in the fused feature. Furthermore, when two features from two CNNs are simply connected or added without any processing, noise may be introduced and the representation ability and robustness may be reduced. On all three data sets, the classification accuracy of AFDP outperformed that of all other methods. In the case of data augmentation, the accuracies of AFDP were 1.34%, 1.10%, and 1.39% higher than those of MV3 on the three data sets. Even without image augmentation, AFDP achieved classification accuracies of 95.39% (TR = 20%), 95.35% (TR = 20%), and 91.34% (TR = 10%), which already exceeded many of the latest methods. This indicates that the designed feature fusion method provides a stronger representation ability, particularly when training data are limited.
To verify further the generalization of the feature fusion method proposed in this study on other CNNs, ResNet18 and ResNet34 were also adopted for ablation experiments. In the structure of ResNet18 and ResNet34, a feature transformation layer was added. The fine-tuned ResNet18 was named R18 and the fine-tuned ResNet34 was named R34. R18 and R34 are the baselines. Concatenating two 512-dimensional feature vectors extracted by R18 and R34 was named Concat. Adding two 512-dimensional feature vectors extracted by R18 and R34 was named Addition. Feature fusion with AFDP based on R18 and R34 was named AFDP. The performances of these five methods on the three data sets are indicated in Table 6. The same conclusion can be drawn from Tables 5 and 6. Comparing the performance of R18, R34, Concat, Addition, and AFDP, we found that R34 was better than R18, which was as expected.
Compared with R34, Concat and Addition showed no evident improvement in accuracy and even a decrease in some cases, which confirms that redundant and noisy information remains after simple concatenation and addition. The AFDP method outperformed the other four methods. The experimental results on ResNet18 and ResNet34 are thus consistent with those in Table 5. This shows that the feature fusion method proposed in this study has strong universality for the task of remote sensing image scene classification and exhibits better performance than ordinary feature fusion methods. Fig. 8 shows the accuracy comparison of the different methods on the three data sets with data augmentation; the AFDP method significantly improves the accuracy of the CNNs.

2) EFFECTIVENESS OF AFDP WITH SINGLE CNN
Typically, two types of features are selected for fusion: features from multiple CNNs, or features from different layers of a single CNN. The advantage of the former is that it maximizes the diversity of features from the same image, and the advantage of the latter is that most parameters and calculations can be shared. The feature fusion method proposed in this study can also be applied to a single CNN: the parameters of the backbone of the feature extraction layer are fully shared, and the feature transformation layers are used to convert the feature into two different features. AFDP using MobileNetv2 is named MV2+, AFDP using MobileNetv3 is named MV3+, AFDP using ResNet18 is named R18+, and AFDP using ResNet34 is named R34+; MV2, MV3, R18, and R34 are the baselines. The performance comparison on the data sets used is shown in Fig. 9. From the previous experimental results, we know that a simple concatenation or addition of two features from different CNNs cannot effectively boost the performance of the CNNs, and the performance achieved by these two feature fusion methods is close to that of MV3 and R34; further, MV3 is significantly superior to R34. From Fig. 9, we can observe that using AFDP on a single CNN also results in a remarkable improvement over the CNN without AFDP in remote sensing image scene classification, and the improvement was more evident when the training data were limited. For example, for the UC Merced data set (TR = 20%), the accuracies of MV2+, MV3+, R18+, and R34+ were approximately 2.60%, 1.00%, 1.51%, and 1.91% higher than those of MV2, MV3, R18, and R34, respectively. For the AID data set (TR = 20%), the accuracy improvements were 1.19%, 0.73%, 1.14%, and 0.63%, respectively. For the NWPU-RESISC45 data set (TR = 10%), the accuracy improvements were 1.29%, 1.07%, 1.26%, and 1.07%, respectively. MV2+, MV3+, R18+, and R34+ all exhibited better performance than MV2, MV3, R18, and R34. Among the eight methods, MV3+ showed the best results, with overall accuracies of 96.85%, 95.99%, and 93.00% for UC Merced (TR = 20%), AID (TR = 20%), and NWPU-RESISC45 (TR = 10%), respectively. However, MV3+ based on a single CNN was still inferior to AFDP based on MobileNetv2 and MobileNetv3, which achieved overall accuracies of 97.19%, 96.36%, and 93.32% on the three data sets. This is as expected, because the information in features from multiple CNNs is richer than that in features from different layers of a single CNN.
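A hedged sketch of this single-backbone variant (e.g., MV3+), reusing the TransformLayer and AFDPFusion classes from the earlier sketches; the layer names and wiring are assumptions consistent with the description above.

```python
# Single-backbone AFDP: shared feature extractor, two transformation layers, then AFDP fusion.
import torch.nn as nn
from torchvision import models

class SingleBackboneAFDP(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = models.mobilenet_v3_large(pretrained=True).features[:-1]  # fully shared parameters
        self.trans_a = TransformLayer(160)   # two different projections of the same feature maps
        self.trans_b = TransformLayer(160)
        self.fusion = AFDPFusion(512)
        self.classifier = nn.Linear(3 * 512, num_classes)

    def forward(self, x):
        f = self.backbone(x)
        u = self.fusion(self.trans_a(f), self.trans_b(f))
        return self.classifier(u)
```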

3) EFFECTIVENESS OF AFDP COMPARED WITH BILINEAR CNN MODEL
The idea of the AFDP algorithm is similar to that of the BCNN algorithm [49], but AFDP has a simpler form.
To compare the performance of AFDP and BCNN, we used different backbones for feature extraction, and the shape of the features was 512 × 8 × 8. Table 7 shows the accuracies of AFDP and BCNN when using MV2 and MV3 for feature extraction. As can be seen from Table 7, whether using a single backbone (MV2 or MV3) or both backbones (MV2 and MV3) for feature fusion, AFDP shows a significant accuracy improvement over BCNN. In addition, when BCNN adopts both backbones, its classification accuracy is lower than that obtained using MV3 only. This indicates that when two different CNNs are used for feature extraction, BCNN cannot take full advantage of both; BCNN is better suited to a single CNN. AFDP not only contains the core idea of BCNN but also introduces fusion by connection, which makes the most of the two models. Table 8 shows the accuracy comparison of AFDP and BCNN when using R18 and R34 as the backbones; the conclusion is consistent with Table 7. The experimental results on multiple CNNs show that AFDP has better performance than BCNN. In addition, BCNN adopts the outer product during feature fusion, and the dimension of its output feature is 512 × 512, whereas the dimension of AFDP's output feature is 512 × 3. This means that our method not only involves less computation but is also less likely to overfit.

E. PARAMETERS AND CALCULATION
With remote sensing image data growing explosively, the demand for fast and efficient scene classification can be met only by adopting lightweight and low-complexity algorithms. However, almost all methods listed in Section IV.C are based on deep CNNs, such as VGG16 [16] and ResNet50 [17], which focus on improving classification accuracy and ignore the computational complexity of the model. Although lightweight CNNs have been extensively explored in natural image recognition, only a few researchers have applied them to remote sensing image scene classification tasks. We compared VGG16, GoogleNet [62], MDPMNet, SCCov16, MV2, MV3, and our method in terms of model size, parameters, and floating-point operations (FLOPs), as shown in Table 9. Generally, the more parameters a CNN has, the more training data are required, and the easier it is to overfit with a small amount of training data. The parameters of VGG16 reach 134 million, with 15 GFLOPs. Numerous remote sensing image scene classification algorithms based on VGG16, such as DCNN [39], did not change the main structure of VGG16, and most parameters were retained. Our method is superior to VGG16 in terms of parameters, model size, and calculation, and it is thus superior to most scene classification algorithms based on VGG16. Based on MobileNetv2, Zhang et al. [29] designed a lightweight scene classification algorithm called MDPMNet by introducing channel attention and a multi-dilation pooling module; although its parameters were significantly reduced compared to VGG16, its calculations were expanded almost 10-fold compared to MobileNetv2. The number of parameters of our method is 5.3% of that of VGG16, and the number of FLOPs is approximately 5.6% of that of VGG16. Compared with MDPMNet, MV2, and MV3, our method achieves a better balance between accuracy and model complexity. Table 9 also indicates an interesting phenomenon: simply by finetuning MobileNetv2 and MobileNetv3 (MV2 and MV3), we can achieve better performance than numerous state-of-the-art methods. This result is consistent with the conclusions of MDPMNet and shows that the structures of MobileNetv2 and MobileNetv3 are more suitable for remote sensing image scene classification tasks than those of other CNNs. Methods based on MobileNetv2 and MobileNetv3 have better scene classification performance, which provides an important benchmark for scene classification.
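Parameter counts such as those reported in Table 9 can be reproduced with a few lines of PyTorch (a small sketch using the torchvision reference models; FLOPs require a separate profiler and are omitted here).

```python
# Counting trainable parameters of the backbones compared in Table 9.
import torch.nn as nn
from torchvision import models

def count_parameters(model: nn.Module) -> float:
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

print(count_parameters(models.mobilenet_v2()))        # ~3.5 M
print(count_parameters(models.mobilenet_v3_large()))  # ~5.5 M
print(count_parameters(models.vgg16()))               # ~138 M
```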

V. CONCLUSION
In this paper, we proposed an efficient and lightweight end-to-end architecture for remote sensing image scene classification. The proposed architecture adopts lightweight CNNs to extract the deep features of images and a novel feature fusion method that aggregates features from dual paths. The features obtained by the proposed method have strong representation ability and distinguishability because of the integration of second-order and first-order information. Experiments were conducted on three widely used benchmarks. The results show that, in comparison with other state-of-the-art methods, the proposed architecture not only has better classification performance but also has fewer parameters and calculations. Further, the proposed architecture extracts more robust and discriminative features that are more resistant to overfitting with small remote sensing scene data sets. The ablation experiments show that the designed feature fusion method is effective on multiple different CNNs, and we provide a novel baseline for remote sensing scene classification. Although we achieved better performance in remote sensing image scene classification with fewer parameters and calculations than other methods, the parameters and calculations of AFDP can be further compressed. The backbones of AFDP still use existing convolutional neural networks; in future work, we will design backbones more suitable for remote sensing image scene classification. In addition, unsupervised or weakly supervised scene classification with convolutional neural networks will be explored.