Gated Recurrent Multiattention Network for VHR Remote Sensing Image Classification

With the advances of deep learning, many recent CNN-based methods have yielded promising results for image classification. In very high-resolution (VHR) remote sensing images, the contributions of different regions to image classification can vary significantly, because informative areas are generally limited and scattered throughout the whole image. Therefore, how to pay more attention to these informative areas and how to better aggregate them over long distances are the two main challenges to be addressed. In this article, we propose a gated recurrent multiattention neural network (GRMA-Net) to address these problems. Because informative features generally occur at multiple stages in a network (i.e., local texture features at shallow layers and global profile features at deep layers), we use multilevel attention modules to focus on informative regions and extract more discriminative features. Then, these features are arranged as spatial sequences and fed into a deep gated recurrent unit (GRU) to capture long-range dependencies and contextual relationships. We evaluate our method on the UC Merced (UCM), Aerial Image Dataset (AID), NWPU-RESISC (NWPU), and Optimal-31 (Optimal) datasets. Experimental results demonstrate the superior performance of our method compared to other state-of-the-art methods.

Fig. 1. Visualization of the attention maps produced by GRMA-Net for different VHR RS images. The informative and irrelevant areas are highlighted in red and blue, respectively. GRMA-Net can assign discriminative weights to informative areas and suppress the irrelevant ones.
necessary to develop a discriminative method for VHR RS image classification.
As shown in Fig. 1, RS images generally have complex spatial structures. They usually cover a large area containing many types of objects, while the informative areas usually occupy only a small part of the image. Although classic CNNs (e.g., ResNets [9]) can generate a global representation through cascaded convolutions, they fail to assign discriminative weights to the informative local areas, so the irrelevant areas cannot be well suppressed. This problem easily leads to misclassification. Moreover, because of the long imaging distance, informative areas are generally scattered across the whole image and exhibit complex spatial distributions. How to effectively aggregate these widely distributed features is the other problem to be solved.
The attention mechanism is widely used to allocate available processing resources toward the most informative components of an input signal [11]. It has achieved promising results in natural language processing (NLP) [12], [13] and image recognition [11], [14]-[17]. However, existing attention methods in the RS field [18], [19] mainly concentrate on enhancing the description ability of global features. It has been shown that multiscale local features are also important for RS image classification [20]-[24]. Intuitively, different layers have different regions of interest, as shown in Fig. 2. As the network goes deeper, the regions of interest grow from local textures to global profiles. These multiscale features are all essential to RS image classification. Therefore, it is important to incorporate the attention mechanism into multiscale feature extraction for more powerful representations. To achieve effective aggregation of informative areas, pioneering works either directly concatenate multiscale features sequentially [25] or impose an adaptive factor on these features [26] to perform weighted summation. These methods do not fully exploit the spatial relationship and contextual dependency of these features. In fact, these widely distributed areas generally have rich spatial relationships and contextual dependencies, which are essential for accurate classification.
Fig. 2. Visualization of the change of interest regions using 10 randomly selected images from the AID dataset [10]. S-Feature, M-Feature, and D-Feature represent the features on the shallow, middle, and deep layers, respectively. These features are from the last convolutional layer of the conv2-x, conv3-x, and conv4-x blocks in the ResNet50 networks [9]. As the neural network goes deeper, the interest regions change from local texture to global profile.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To address the first problem, we design a multilevel attention module to focus on regions of interest at multiple scales, as shown in Fig. 3. High-level semantic information extracted by global features can be used to guide local features to focus on informative cues. If we directly add the multiscale local features and global features to generate the attention map, the huge magnitude difference between them will weaken the guidance of the global features. Therefore, we introduce an adaptive convolution to adjust local features during feature aggregation. To address the second problem, and inspired by the effectiveness of recurrent neural networks (RNNs) in modeling long-range dependency [12], [13], we introduce an RNN to exploit the relationship among different locations. We rearrange multiscale features as spatial sequences and then sequentially process them using a deep RNN.
In summary, the contributions of this article are as follows.
1) We propose a gated recurrent multiattention neural network (GRMA-Net) to address the problems of weak representation of local informative areas and weak dependency among widely distributed informative features.
2) A multilevel attention module and a gated recurrent unit (GRU)-based feature aggregation module are proposed to assign discriminative weights to multiscale local features and to exploit the spatial dependency of features at different locations, respectively. As shown in Fig. 1, our method can increase the response of informative areas while suppressing other areas.
3) We demonstrate that our GRMA-Net achieves state-of-the-art performance on the UC Merced (UCM), AID, NWPU, and Optimal datasets.
The remainder of this article is organized as follows. Section II discusses the related work on VHR RS image classification and attention mechanism. Section III introduces the details of our GRMA-Net. Section IV presents the experimental results. Section V gives the conclusion.

II. RELATED WORK
In this section, we briefly review the related work for VHR remote sensing scene classification and attention mechanism.
A. Scene Classification for VHR Remote Sensing Images
1) Hand-Crafted Feature-Based Methods: Hand-crafted feature-based methods were extensively investigated before the wide application of deep learning. These methods mainly focus on human-designed feature extractors. Typical features include histogram of oriented gradient (HOG) [27], scale invariant feature transformation (SIFT) [28], local binary pattern (LBP) [29], and median robust extended local binary pattern (MRELBP) [30]. Then, post-encoding methods were proposed to improve the discriminativeness of low-level semantic descriptors, including hierarchical coding vector (HCV) [31], spatial pyramid match kernel (SPMK) [32], and randomized spatial partition (RSP) [33].
Although these methods have achieved good performance, they are essentially low-level descriptors. Compared to deep features extracted by pretrained CNNs, these features lack high-level semantic information and suffer from limited performance.
2) Deep Learning-Based Methods: Hu et al. [34] first used pretrained networks (e.g., VGG [35], AlexNet [36]) to extract high-level semantic features. Cheng et al. [37] and Li et al. [38] proposed multiple post-encoding methods (e.g., bag of visual words, Fisher vector) to optimize the extracted features. Afterward, Castelluccio et al. [39] adopted a pretrained GoogLeNet [40] and then fine-tuned it on the target RS dataset. Similarly, Li et al. [41] activated baseline CNNs layer by layer to search for the optimal activation strategy. These methods [34], [37]-[39], [42] transfer existing baseline CNNs to the target RS dataset without any modification. Hence, they are inferior in high-level semantic representation compared to recent deep-learning-based methods [43]-[45].
Subsequently, more complex networks have been developed for deep-learning-based RS image classification. Zhao et al. [43] proposed a multilayer perception structure to reduce the over-fitting problem. Liu et al. [44] adopted adaptive deep pyramid matching to enhance the multiscale representation ability. In [45], the cross entropy loss was replaced by a metric learning regularization to make baseline CNNs more discriminative. Because RS datasets are small, it is hard to train very deep networks with only thousands of images. Many deep networks (e.g., DenseNet [46], InceptionNet [47]) that perform well on the ImageNet dataset cannot be well transferred to RS image classification.
Apart from traditional VHR RS image classification, some new subfields have drawn increasing attention recently, e.g., ship species classification [48], tree species classification [49] in fine-grained image classification, and high-dimension RS images retrieval [50] in multilabel image classification. These methods further explore rich details in RS images, which may ultimately contribute to RS image coarse classification.

B. Attention Mechanism in CNNs
Pioneering work on the attention mechanism was developed for natural language processing (NLP). Later, attention mechanisms were introduced to solve different computer vision tasks such as image classification [16], [51], [52], fine-grained visual categorization [53], and image super-resolution [54]-[58]. Generally, attention mechanisms in computer vision can be divided into three main categories: spatial, channel, and hybrid attention. Jaderberg et al. [15] proposed the first spatial attention-based learning method, named spatial transformer network (STN). Although STN is simple and shallow, it performs patch-level attention to achieve significant improvements over traditional classification methods [59], [60]. Wang et al. [61] proposed a refined pixel-level spatial attention network, in which nonlocal operations are employed to capture long-range dependencies to achieve further improvements over STN. Afterward, Hu et al. [11] proposed the first channel attention-based method, i.e., squeeze and excitation networks (SENet), to adaptively recalibrate channel-wise feature responses by explicitly modeling interdependencies among different channels. Subsequently, several attention mechanisms were developed to fuse both spatial and channel information. Wang et al. [51] proposed the first hybrid attention-based method (i.e., residual attention network). Specifically, residual attention learning was used in both spatial and channel domains to achieve further improvements over SENet. Similarly, Woo et al. [14] proposed a more general hybrid attention module, i.e., convolutional block attention module (CBAM), which can be integrated into any CNN architecture. CBAM consists of a channel attention module and a spatial attention module. It helps the CNN to learn what and where to emphasize or suppress in images. Therefore, CBAM achieves further improvements over SENet. Although sophisticated attention modules have achieved better performance, they inevitably increase model complexity.
Recent works [62], [63] pay more attention to lightweight design. Wang et al. [62] proposed a local cross-channel interaction-based method, i.e., efficient channel attention (ECANet), to generate channel attention through a fast 1-D convolution. In this way, the trade-off between network performance and complexity is achieved. Then, Hou et al. [63] proposed coordinate attention (CA) to factorize channel attention into two fast 1-D feature encoding processes, which aggregate features along the two spatial directions. In this way, CA achieves significant improvements with nearly no computational overhead.
The attention mechanism also achieves excellent performance in RS image classification. Wang et al. [18] imposed a spatial attention map on the last feature map of backbone CNNs to improve their global representation ability and thus obtained significant improvements over traditional classification methods [34], [38]. Afterward, Tong et al. [64] proposed a channel attention-based learning method, i.e., channel attention-based densenet (CAD). They used DenseNet121 as the backbone and adopted a channel attention module to strengthen the important channels. Following CAD, Zhao et al. [65] proposed a hybrid attention-based method, i.e., enhanced attention module (EAM). They used ResNet101 as the backbone and adopted spatial and channel attention modules to enhance the features in both domains. Therefore, EAM achieves further improvements over [18]. Different from these works, which plug attention modules into the backbone networks, some works use attention as a post-processing module at the end of the network. Li et al. [66] proposed multiinstance learning (MIL) by adding a spatial attention pooling module to the end of the network. Chen et al. [19] proposed an attention-guided sparse filter (SGSF) by embedding a spatial attention module into deep sparse filter networks. These methods achieved substantial performance improvements.
Although the performance is continuously improved by recent attention-based methods, the weak representation of local informative areas and weak dependency among widely distributed informative features have not been well addressed in literature. Therefore, our GRMA-Net first combines multilevel attention module and deep GRUs to both selectively enhance informative local features and capture contextual relationship of these widely distributed features. In this way, the informative areas can be given more attention and meanwhile the long-range dependency of these widely distributed features can be captured.

III. METHODOLOGY
In this work, we develop a multilevel attention module to enable the network to pay more attention to informative areas and suppress irrelevant areas. Besides, we propose a recurrent module to exploit the spatial relationship and contextual dependency among informative areas of an RS image. The overall architecture of the proposed method is shown in Fig. 4.

A. Overall Architecture
Section III-B introduces our multilayer feature extraction approach. Input images are first preprocessed and then fed into the backbone CNN to extract multiscale local features $L^s \in \mathbb{R}^{C_s \times H_s \times W_s}$ and the global feature $G \in \mathbb{R}^{C_g \times 1 \times 1}$. Section III-C presents the multilevel attention module. Features $L^s \in \mathbb{R}^{C_s \times H_s \times W_s}$ ($s \in \{1, 2, 3, \dots, S\}$) at a single scale are fed into a transition convolution to generate $L^s_0$. The global feature $G \in \mathbb{R}^{C_g \times 1 \times 1}$ is fed into a 1 × 1 convolution to generate $G_0 \in \mathbb{R}^{C_s \times 1 \times 1}$ and is then stretched to the size of $G_1 \in \mathbb{R}^{C_s \times H_s \times W_s}$. After the element-wise sum of $L^s_0$ and $G_1$, the obtained score map $F^s$ is fed into a softmax operation to generate the corresponding attention map $\alpha^s$ at scale $s$. After the element-wise multiplication $L^s_{en} = \alpha^s \otimes L^s$, the enhanced multiscale features $L_{en} = \{L^1_{en}, L^2_{en}, \dots, L^S_{en}\}$ are obtained. Section III-D shows the GRU optimization. Multiscale features are arranged as spatial location sequences $L_{en} = \{\ell_1, \ell_2, \dots, \ell_{N_{all}}\}$. These sequences are fed into deep GRUs to search for the optimal spatial relationship and contextual dependency. The image label is obtained by $Y = \mathrm{GRU}(L_{en})$.

B. Multiscale Feature Extraction
The multiscale feature extraction module consists of several cascaded layers. As shown in Fig. 2, as the neural network goes deeper, the interest region of the network changes from local textures to global profiles. Because these features are all important to RS image classification, we design a multilevel attention module to improve the multiscale representation ability of backbone networks.
In our module, we first extract multiscale local features as the input of the attention operation. Here, the local feature at scale $s$ is given as
$$L^s = \{l^s_1, l^s_2, \dots, l^s_{N_s}\} \in \mathbb{R}^{C_s \times H_s \times W_s}$$
where $C_s$, $H_s$, and $W_s$ denote the number of channels, height, and width of $L^s$, respectively, and $l^s_n$ represents the value of the local feature $L^s$ at spatial location $n \in \{1, 2, 3, \dots, N_s\}$ of a given convolutional layer $s \in \{1, 2, 3, \dots, S\}$. Then, the global feature $G \in \mathbb{R}^{C_g \times 1 \times 1}$ is generated by the first nonconvolutional layer before the softmax layer, where $C_g$ denotes the number of channels of $G$.

C. Multilevel Attention Module
Assume that $L$ denotes the local coarse feature and $G$ is the global discriminative feature. High-level semantic information extracted by global features can be used to guide local features to focus on informative cues. If we directly add multiscale local features and global features to generate the attention map, the huge magnitude difference between them will weaken the guidance of the global features. Therefore, we first feed the local features $L^s \in \mathbb{R}^{C_s \times H_s \times W_s}$ into a transition convolution Conv_t to adaptively adjust their magnitudes at scale $s$, resulting in
$$L^s_0 = \mathrm{Conv\_t}(L^s).$$
The global feature $G$ is fed to a 1 × 1 convolution to generate $G_0 \in \mathbb{R}^{C_s \times 1 \times 1}$. Then, $G_0$ is stretched to the size of $G_1 \in \mathbb{R}^{C_s \times H_s \times W_s}$. After the element-wise sum of $L^s_0$ and $G_1$, the score map $F^s$ at scale $s$ can be generated according to
$$F^s = \sigma(L^s_0 + G_1)$$
where $\sigma$ is the ReLU activation function. Once $F = \{F^1, F^2, \dots, F^S\}$ is generated, a softmax layer is used to obtain the normalized attention map
$$\alpha^s_n = \frac{\exp(f^s_n)}{\sum_{j=1}^{N_s} \exp(f^s_j)}$$
where $f^s_n$ denotes the value of the score map $F^s$ at location $n$, at a given scale $s \in \{1, 2, 3, \dots, S\}$.
Finally, we perform element-wise multiplication between the normalized attention weight $\alpha^s_n$ and the corresponding local feature $l^s_n$. That is, $L^s_{en} = \{\ell_1, \ell_2, \dots, \ell_{N_s}\}$, with $\ell_n = \alpha^s_n l^s_n$, is generated as the final descriptor for the image at each scale $s$.
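The attention steps described above (transition convolution, projection and stretching of the global feature, ReLU score map, softmax normalization, and reweighting) can be sketched for one scale in NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the transition convolution is assumed to be a 1 × 1 convolution (written as a matrix product), the softmax is assumed to run over the spatial locations of each channel, and the names `multilevel_attention`, `w_t`, and `w_g` are hypothetical.

```python
import numpy as np

def multilevel_attention(local_feat, global_feat, w_t, w_g):
    """Attention at a single scale (illustrative sketch).

    local_feat:  (C_s, H_s, W_s) local feature at scale s
    global_feat: (C_g,) global feature
    w_t: (C_s, C_s) weights of the transition convolution (assumed 1x1)
    w_g: (C_s, C_g) weights of the 1x1 convolution applied to the global feature
    """
    c, h, w = local_feat.shape
    flat = local_feat.reshape(c, h * w)        # (C_s, N_s) with N_s = H_s * W_s
    l0 = w_t @ flat                            # transition convolution output
    g0 = w_g @ global_feat                     # projected global feature, (C_s,)
    g1 = g0[:, None]                           # broadcasting plays the role of "stretching"
    score = np.maximum(l0 + g1, 0.0)           # ReLU of the element-wise sum
    # Softmax over spatial locations (assumption: applied per channel).
    score = score - score.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(score)
    alpha = e / e.sum(axis=1, keepdims=True)
    enhanced = alpha * flat                    # element-wise reweighting
    return enhanced.reshape(c, h, w), alpha.reshape(c, h, w)

# Toy usage with random tensors (shapes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
L_s = rng.standard_normal((4, 3, 3))
G = rng.standard_normal(8)
W_t = rng.standard_normal((4, 4))
W_g = rng.standard_normal((4, 8))
L_en, alpha = multilevel_attention(L_s, G, W_t, W_g)
```

Because the projected global feature is added at every location, a location receives a large weight only when its adjusted local response agrees with the global cue, which is the guidance effect the module relies on.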

D. Feature Aggregation Using GRU
In the multilevel attention module, we have extracted sufficient multiscale features, which are scattered throughout the images over long spatial ranges. How to better fuse these widely distributed features is a problem to be solved. An RNN can naturally capture the mutual dependencies of information. As a special kind of RNN, as shown in Fig. 5, the GRU can memorize long-range information to achieve better performance than normal RNN structures. To fully exploit the long-range dependency among this local and global information, we use GRUs in our network to sequentially process these multiscale features and automatically find the optimal combination through continuous iteration.
Similar to the application of GRU in NLP, which arranges features as time series, the features extracted by the multiattention module can be considered as spatial series. As shown in Fig. 5, we first use a 1 × 1 convolution to squeeze the channels of the multiscale features $L_{en} = \{L^1_{en}, L^2_{en}, \dots, L^S_{en}\} \in \mathbb{R}^{C_{en} \times H_{en} \times W_{en}}$ into a single channel, generating $L_{en} \in \mathbb{R}^{1 \times H_{en} \times W_{en}}$. Then, the single-channel features are stretched into a one-dimensional sequence $L_{en} = \{\ell_1, \ell_2, \dots, \ell_{N_{all}}\} \in \mathbb{R}^{1 \times (H_{en} W_{en})}$. For the feature $\ell_n$ at the $n$th spatial location, $m$th recurrence, and $l$th layer, the GRU updates its hidden state through an update gate $u$ and a reset gate $r$. Here, $h^m_{n,l}$, $\ell^m_{n,l}$, and $o^m_{n,l}$ are the hidden state, input feature, and output feature at the $n$th spatial location, the $m$th recurrence, and the $l$th layer, respectively. In each spatial step, the gates determine whether the hidden state $h^m_{n,l}$ should be memorized or forgotten.
Then, as shown in Fig. 6, the hidden state $h^m_{n,l}$ is passed through all the layers and spatial locations to generate the last-layer hidden state $h^m$ and output $o^m$ at the $m$th recurrence, where the last-layer hidden state $h^m$ at the $m$th recurrence is treated as the initial hidden state at the $(m+1)$th recurrence. After $M$ iterations, the output $o^M$ at the $M$th recurrence is generated. Finally, $o^m$ from all $M$ iterations are summed and passed through a fully connected layer to generate the final output
$$Y = \mathrm{FC}\left(\sum_{m=1}^{M} o^m\right).$$
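The recurrent aggregation described above can be sketched with a single-layer GRU in NumPy. The gate equations below follow the standard GRU formulation and may differ in notation or detail from the formulation used in this article. Further assumptions for illustration: one layer instead of a deep stack, the output of each pass taken to be its final hidden state, and hypothetical names (`init_params`, `gru_step`, `aggregate`).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(hidden, rng):
    """Random parameters for the update (u), reset (r), and candidate (c) paths."""
    p = {}
    for gate in ("u", "r", "c"):
        p["W" + gate] = rng.standard_normal(hidden) * 0.1            # input weights
        p["U" + gate] = rng.standard_normal((hidden, hidden)) * 0.1  # recurrent weights
        p["b" + gate] = np.zeros(hidden)
    return p

def gru_step(x, h, p):
    """One standard GRU update for a scalar input x and hidden vector h."""
    u = sigmoid(p["Wu"] * x + p["Uu"] @ h + p["bu"])             # update gate
    r = sigmoid(p["Wr"] * x + p["Ur"] @ h + p["br"])             # reset gate
    cand = np.tanh(p["Wc"] * x + p["Uc"] @ (r * h) + p["bc"])    # candidate state
    return (1.0 - u) * h + u * cand                              # gated blend of old and new

def aggregate(seq, p, w_fc, b_fc, num_recurrences=2):
    """Run the spatial sequence through the GRU several times and classify."""
    h = np.zeros(p["Uu"].shape[0])
    total = np.zeros_like(h)
    for _ in range(num_recurrences):
        for x in seq:            # walk over the spatial locations in order
            h = gru_step(x, h, p)
        total += h               # output of this pass (assumed: final hidden state)
        # h is carried over, so it initializes the next recurrence
    return w_fc @ total + b_fc   # fully connected layer on the summed outputs

# Toy usage: 12 spatial locations, hidden size 5, 3 classes (arbitrary values).
rng = np.random.default_rng(0)
params = init_params(5, rng)
sequence = rng.standard_normal(12)
w_fc = rng.standard_normal((3, 5))
b_fc = np.zeros(3)
logits = aggregate(sequence, params, w_fc, b_fc, num_recurrences=2)
```

Carrying the hidden state from one pass into the next lets later passes see every location with context from the whole sequence, which is the motivation given for repeating the recurrence.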

IV. EXPERIMENT
The performance of our GRMA-Net is comprehensively evaluated in this section. We perform VHR remote sensing scene classification and attention map visualization experiments on the UCM [32], AID [10], NWPU [67], and Optimal [18] datasets. Our method is compared to several state-of-the-art methods.

A. Datasets
1) UC Merced Land-Use Dataset: The UCM dataset [32] is the most popular dataset in the area of VHR remote sensing scene classification. This dataset consists of 21 land-use classes. Each class contains 100 images of 256 × 256 pixels with an aerial-to-ground spatial resolution of 0.3 m per pixel. The challenge of the UCM dataset lies in its high intraclass variation, low interclass variation, and highly overlapping land-use classes.
2) Aerial Image Dataset: The AID [10] dataset is a large dataset for aerial scene image classification. It contains 30 common scene classes, and each class contains a different number of images, ranging from 220 to 420. The size of each image is 600 × 600 pixels, with aerial-to-ground spatial resolutions ranging approximately from 0.5 to 8 m. The variation across multiscale images and the large number of categories are the two main challenges of this dataset.
3) NWPU-RESISC Dataset: The NWPU dataset [67] is the largest RS dataset. It contains 45 scene classes. Each class contains 700 images with a resolution of 256 × 256. The aerial-to-ground spatial resolution ranges from 0.2 to 30 m. Large image scale, rich spatial resolution variations, high intraclass diversity, and interclass similarity make this dataset really challenging.

4) OPTIMAL-31 Dataset: The OPTIMAL-31 dataset [18] is a small dataset with 31 classes. Each class contains only 60 images with a resolution of 256 × 256. Its small size and multiple classes make end-to-end training difficult.

B. Evaluation Metrics
1) Overall Accuracy: Overall accuracy (OA) represents the ratio of correctly predicted images to all images. In this article, we report the K-fold cross-validation result as the final classification result.
2) Inference Time: Inference time measures the computational efficiency of different algorithms. In this article, we use the inference time per image as the evaluation metric.
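The overall accuracy metric defined above can be computed as in the following minimal sketch. Averaging the per-fold accuracies is an assumption about how the K-fold result is reported, and the function names are hypothetical.

```python
def overall_accuracy(predictions, labels):
    """Ratio of correctly predicted images to all images."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and equally long")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def mean_over_folds(fold_accuracies):
    """Average OA across the K folds (assumed reporting convention)."""
    return sum(fold_accuracies) / len(fold_accuracies)

# Toy usage with made-up class indices.
oa = overall_accuracy([0, 1, 2, 2], [0, 1, 1, 2])
kfold_oa = mean_over_folds([0.5, 1.0])
```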
C. Training Protocol
1) Data Augmentation: All input images with different initial sizes were first resized to a resolution of 256 × 256. Then, we randomly cropped these images into patches of size 224 × 224 and performed random horizontal and vertical flipping and random scaling for data augmentation. Afterward, we used color jitter to enrich image contrast. Finally, to accelerate network convergence, these images were normalized by Z-score to ensure that their values are centered at zero.
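Part of the augmentation pipeline above can be sketched in NumPy: the random 224 × 224 crop, the random flips, and the Z-score normalization. Resizing, random scaling, and color jitter are omitted for brevity. The flip probability of 0.5 and the use of per-image, per-channel statistics for the Z-score are assumptions, since the text does not specify them, and the function names are hypothetical.

```python
import numpy as np

def augment(image, rng, crop=224):
    """Random crop plus random flips for an (H, W, C) image already resized to 256x256."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)      # upper bound is exclusive
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                   # assumed flip probability
        patch = patch[:, ::-1]               # horizontal flip
    if rng.random() < 0.5:
        patch = patch[::-1]                  # vertical flip
    return patch

def zscore(image, mean=None, std=None):
    """Z-score normalization; falls back to per-image channel statistics (assumption)."""
    if mean is None:
        mean = image.mean(axis=(0, 1))
    if std is None:
        std = image.std(axis=(0, 1))
    return (image - mean) / std

# Toy usage on a random image standing in for a resized RS image.
rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))
patch = augment(img, rng)
normalized = zscore(patch)
```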
2) Parameter Setting: We used ResNets (i.e., ResNet18, ResNet50, ResNet101), which were pretrained on the ImageNet [69] dataset, as backbone networks. The parameters of our designed modules were all initialized using the Xavier method [70]. We set the batch size to 64 and the learning rate to 0.001. Our model was trained using the stochastic gradient descent (SGD) optimization algorithm. The L2 weight decay regularization coefficient was set to 0.01, and the momentum was set to 0.9. The learning rate was decayed by a factor of 0.1 if the training loss did not decrease within 30 epochs.
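The learning-rate rule stated above (decay by a factor of 0.1 when the training loss has not decreased within 30 epochs) can be sketched in plain Python. The class name `PlateauDecay` is hypothetical, and reading "within 30 epochs" as a patience counter that resets after each decay is an assumption. The rule is similar in spirit to PyTorch's `ReduceLROnPlateau` scheduler, though the exact semantics may differ.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=0.001, factor=0.1, patience=30):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")   # lowest training loss seen so far
        self.stale = 0             # epochs since the loss last decreased

    def step(self, train_loss):
        if train_loss < self.best:
            self.best = train_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0     # assumed: counter restarts after a decay
        return self.lr

# Toy usage: the loss stalls at 1.0 from the first epoch onward.
scheduler = PlateauDecay()
scheduler.step(1.0)
for _ in range(29):
    lr_before = scheduler.step(1.0)   # 29 stale epochs: no decay yet
lr_after = scheduler.step(1.0)        # 30th stale epoch triggers the decay
```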
3) Implementation Details: We modified the ResNets by adding three attention branches to the corresponding convolutional layers. Given that ResNets are composed of four convolution blocks, we chose the final layers of the conv2-x, conv3-x, and conv4-x blocks as the shallow, middle, and deep layers, respectively. Moreover, the training process has two phases. We first trained the backbone network for 100 epochs on the RS dataset and then performed end-to-end training (including the backbone network, multiple attention modules, and deep GRUs) until convergence. Experimental results show that the network achieves promising performance with this training strategy.
4) Hardware and Software Platforms: All models were implemented in PyTorch [71] on a computer with an Intel i7 7700H @ 2.80 GHz CPU and an Nvidia GeForce1080Ti GPU.

D. Comparison to the State-of-the-Art Methods
To demonstrate the superiority of our method, we compare our GRMA-Net to several state-of-the-art (SOTA) methods on the UCM [32], AID [10], NWPU [67], and OPTIMAL [18] datasets. As summarized in Table I, our GRMA-Net outperforms state-of-the-art methods on the four benchmark datasets, except on the UCM dataset under a training ratio of 80%.
The parameter settings of the five main attention-based compared methods are summarized in Table II. These compared methods are introduced as follows:
1) ARCNet [18]: It is the first work to combine the attention mechanism and RNN. It used VGG-16 as the backbone to extract global features and then optimized these features with an LSTM.
2) MAN [26]: This method used VGG-16 as the backbone to extract multilayer features. Then, it aggregated these features and enhanced them with a channel attention module.
3) CAD [64]: This method used DenseNet121 as the backbone and inserted SENet to adaptively strengthen the weights of the important feature channels.
4) EAM [65]: This method used ResNet101 as the backbone and added CBAM to achieve hybrid attention. In this way, both informative spatial and channel features are enhanced.
5) MIL [66]: This method used VGG16 as the backbone and replaced the max pooling with an attention mechanism, which considered the contribution of each instance to the bag label and achieved better performance.

1) Quantitative Results:
Quantitative results are presented in Table I. Our GRMA-Net achieves the highest OA scores on four datasets (i.e., UCM [32], AID [10], NWPU [67], and OPTIMAL [18]). It is also worth noting that the improvements of OA scores achieved by our GRMA-Net on the AID and NWPU datasets are significant. That is because the spatial resolutions of the AID and NWPU datasets vary significantly. Although previous methods can generate the global representation by cascaded convolutions, they fail to assign discriminative weights to the informative local areas. Our GRMA-Net can capture long-range dependency to better exploit spatial cues over long distances by using the multilevel attention module and deep GRUs. Moreover, our method achieves much better results than existing RNN-based methods [18]. GRMA-Net-ResNet101 achieves an improvement of 7.44%. Our method achieves consistent improvements (1.44%, 0.46%, 1.93%, and 1.26% higher than [26], [64], [65], and [66], respectively) compared to the attention-based methods on AID under a training ratio of 20%. Similar results are observed with the other datasets and training ratios. This demonstrates that the combination of multilevel attention and deep GRUs is effective.
Fig. 7. Visualization of attention maps. We randomly selected ten images from the AID dataset [10]. S-Attention, M-Attention, and D-Attention denote the attention maps from shallow, middle, and deep layers in our GRMA-Net, respectively. S-M-D is the weighted average of all multiscale attention maps.
Fig. 8. Visualization of the attention maps produced by ARCNet [18], MAN [26], and our network. Our GRMA-Net can capture more informative areas and thus achieve higher confidence scores than previous attention-based methods. P means the classification accuracy of this subclass.
2) Qualitative Results: We visualized the attention maps of 10 randomly selected images from the AID dataset in Fig. 7. It shows that shallow, middle, and deep attention maps have different interest regions. Specifically, the shallow, middle, and deep layers focus on local textures, key parts of objects, and central objects, respectively. It is also worth noting that, by comparing Figs. 7 and 2, the GRMA-Net captures more informative areas than the baseline method [9].
As shown in Fig. 8, compared with previous attention-based methods [18], [26], our method produces visualization maps containing more informative areas with higher confidence values. The irrelevant areas are suppressed, while the informative areas are assigned discriminative weights. That is because our GRMA-Net can effectively fuse multiscale informative features and fully exploit the spatial dependency of informative features at different locations. In this way, our GRMA-Net can achieve better performance. Comparative results are shown in Fig. 9. It can be observed that the difference between GRMA-Net and recent attention-based methods is statistically significant.
3) Computational Efficiency: We compared our GRMA-Net to several competitive methods (i.e., ADFF [38], ARCNet [18], MIL [66], BAM [41]) in terms of the number of parameters (#Params) and FLOPs. Our GRMA-Net-ResNet18 achieves the best OA score with a small number of parameters and lower FLOPs. Because the deep GRU module is hard to converge, it takes more time to train the network. The time costs of both the first and second training phases are summarized in Table III. Although the training time of our network is longer than that of previous methods, the test time of our GRMA-Net-ResNet18 is the shortest. That is because we adopt a lightweight RNN structure to capture long-range dependency. Compared to BAM, our network (GRMA-Net-ResNet18) achieves much better performance with a comparable model size.

E. Ablation Study
In this section, we compare our GRMA-Net with several variants to investigate the potential benefits introduced by our network modules and design choices.
1) Different Backbones: Because of the promising performance of ResNets in classification, we adopt three ResNet variants (i.e., ResNet18, ResNet50, ResNet101) as backbone networks in our GRMA-Net. As deeper networks generally achieve better classification accuracy but introduce a higher computational burden, we evaluate the performance of different backbone networks to achieve a good trade-off between computational efficiency and classification accuracy. In this part, we gradually removed the multilevel attention module (MAM) and the deep GRU module (DGM) to evaluate the performance improvements introduced by these modules for the three backbone networks. Experimental results on the AID dataset are summarized in Table IV. GRMA-Net-ResNet101 achieves the best performance. It improves the OA score by 1.61%/0.79% over GRMA-Net-ResNet18 and by 0.76%/0.45% over GRMA-Net-ResNet50 under training ratios of 20%/50%, respectively. This demonstrates that deeper backbones introduce larger classification improvements to GRMA-Net. Moreover, our MAM and DGM also introduce significant improvements on all backbone networks, resulting in improvements of 2.45% and 2.29% in terms of OA scores on GRMA-Net-ResNet101 under training ratios of 20% and 50%, respectively.
TABLE III. COMPARISON TO SOTA METHODS IN TERMS OF PARAMETERS, FLOPS, TEST TIME, AND TRAINING TIME ON THE AID DATASET UNDER TRAINING RATIOS OF 50%. + MEANS TWO-PHASE TRAINING METHOD
TABLE IV. OA VALUES ACHIEVED BY GRMA-NET AND ITS VARIATIONS ON THE AID DATASET UNDER TRAINING RATIOS OF 20% AND 50%
Although deeper networks introduce larger classification performance improvements, they also cause a higher computational burden. We can see from Fig. 10 that as the network goes deeper, the improvements brought by two modules tend to be saturated, but the network parameters and computational cost increase significantly. For example, the improvements of GRMA-Net-ResNet101 over GRMA-Net-ResNet18 are about 1.61% and 0.79% in terms of OA scores under training ratios of 20% and 50%, respectively. But the network parameters and computational cost increase 2.6 times and 3.7 times, respectively. It demonstrates that excessively increasing the depth of the network is not a good choice. GRMA-Net-ResNet18 achieves a better trade-off between classification accuracy and computational efficiency. Therefore, we use it as our basic model in the subsequent ablation study.
2) Multilevel Attention Module (MAM): As the core module of our GRMA-Net, MAM makes our network pay more attention to informative areas at multiple levels. Here, we use attn_S, attn_M, and attn_D to represent the attention modules at different stages and evaluate the effectiveness of MAM by introducing the following five variants.
1) GRMA-Net w/o MAM: We removed the multilevel attention module in this variant to investigate its contribution. Specifically, we gradually replaced the attention modules with a simple channel squeeze operation to keep the dimensions unchanged.
2) GRMA-Net w/o Score F: We mainly investigate the benefit of the score map F. Specifically, we replaced the fused score map with a simple self-scale score map, which means that the global features G are not used to guide the distribution of the multiscale local features L.
3) GRMA-Net w/o Conv_t: To investigate the benefit introduced by the transition convolution Conv_t, we replaced the transition convolution with a constant value (value = 1). This means that the huge magnitude difference between local and global features cannot be adaptively adjusted by Conv_t.
4) GRMA-Net With Channel Attention: We used the channel attention operation of [14] to replace the spatial attention operation in this variant to investigate the effectiveness of channel attention.
5) GRMA-Net With Hybrid Attention: We replaced the spatial attention operation with hybrid attention [14] in this variant to investigate the effectiveness of hybrid attention.

Table V summarizes the comparative results achieved by GRMA-Net and its variants. It can be observed that the OA value of GRMA-Net w/o attn_S&M decreases by 1.34% and 1.38% compared to GRMA-Net on the AID dataset under training ratios of 20% and 50%, respectively. That is because multilayer features contain rich local informative cues. The multilevel attention module helps to enhance the representation of these local features and thus achieves better performance. Moreover, the performance degradation is also significant for GRMA-Net w/o Score F, with decreases of about 1.51% and 1.82%. That is because the global feature G can help the local features L to generate a better distribution, which is important for the fusion of multiscale features.
It is worth noting that GRMA-Net w/o Conv_t suffers decreases of 0.39% and 0.69% on AID compared to GRMA-Net. Without Conv_t, the huge magnitude gap between local and global features hinders GRMA-Net from exploiting their mutual information. In contrast, Conv_t can effectively alleviate this gap and help our network achieve better performance.
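The effect of a magnitude gap on an attention score map can be illustrated with a toy example. The sketch below is not the formulation used in GRMA-Net: it assumes a dot-product compatibility between each local feature vector and a global feature vector followed by a softmax over positions, and it uses a single scalar `scale` as a hypothetical stand-in for the learned transition convolution Conv_t. All feature values are made up for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(local_feats, global_feat, scale):
    """Softmax over dot products between local vectors and the scaled global vector.

    `scale` is a hypothetical scalar stand-in for a learned transition layer.
    """
    scores = [
        sum(l * scale * g for l, g in zip(vec, global_feat))
        for vec in local_feats
    ]
    return softmax(scores)


# Three spatial positions with two channels each (illustrative values).
local_feats = [[0.2, 0.1], [0.3, 0.2], [0.1, 0.4]]
# A global feature whose magnitude is far larger than the local ones.
global_feat = [40.0, 60.0]

# scale = 1.0 mimics replacing the transition layer with the constant 1.
fixed = attention_weights(local_feats, global_feat, scale=1.0)
# A small scale mimics a layer that has learned to close the magnitude gap.
rescaled = attention_weights(local_feats, global_feat, scale=0.02)

print("constant 1:", [round(w, 3) for w in fixed])
print("rescaled  :", [round(w, 3) for w in rescaled])
```

With the constant scale, the softmax saturates and almost all weight falls on one position, so the other local features contribute little. After rescaling, the weights stay graded across positions.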
As summarized in Table V, GRMA-Net with channel attention suffers decreases of 0.26% and 0.48% on AID compared to GRMA-Net. That is because the complex spatial distribution of RS images requires a powerful spatial representation ability. Although channel attention helps to capture informative feature channels, it cannot replace spatial attention. When we replaced the spatial attention with hybrid attention, the new variant introduces minor improvements of 0.39% and 0.23% on the AID dataset compared to GRMA-Net. That is because both informative spatial areas and representative feature channels are enhanced by hybrid attention. In this way, GRMA-Net with hybrid attention achieves better performance. Because the objective of this article is to demonstrate the effectiveness of the proposed combination of the multilevel attention module and deep GRU-based feature aggregation, we keep our network architecture simple and do not use the delicately designed hybrid attention module for this minor performance improvement.
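The difference between the attention types compared above lies in which axis of the feature map is reweighted. The following sketch is a parameter-free illustration under simplifying assumptions, not the learned modules of [14] or of GRMA-Net: weights are derived from plain averages passed through a sigmoid, and the function names and feature values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(feats):
    """One weight per position, computed from the mean over channels."""
    weights = [sigmoid(sum(pos) / len(pos)) for pos in feats]
    weighted = [[w * v for v in pos] for w, pos in zip(weights, feats)]
    return weighted, weights


def channel_attention(feats):
    """One weight per channel, computed from the mean over positions."""
    n_pos, n_ch = len(feats), len(feats[0])
    weights = [sigmoid(sum(pos[k] for pos in feats) / n_pos) for k in range(n_ch)]
    weighted = [[weights[k] * pos[k] for k in range(n_ch)] for pos in feats]
    return weighted, weights


def hybrid_attention(feats):
    """Channel reweighting followed by spatial reweighting."""
    after_channel, _ = channel_attention(feats)
    after_spatial, _ = spatial_attention(after_channel)
    return after_spatial


# A feature map flattened to 4 positions x 3 channels (illustrative values).
feats = [[1.0, 0.0, 2.0],
         [0.0, 3.0, 1.0],
         [2.0, 1.0, 0.0],
         [1.0, 1.0, 1.0]]

_, spatial_w = spatial_attention(feats)
_, channel_w = channel_attention(feats)
hybrid_out = hybrid_attention(feats)
print(len(spatial_w), "spatial weights;", len(channel_w), "channel weights")
```

Spatial attention yields one weight per location, so it can suppress irrelevant regions. Channel attention yields one weight per feature channel and treats all locations alike.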

3) Deep GRU Module (DGM): The deep GRU module is used in our GRMA-Net to capture long-range spatial dependency. Here, we validate the effectiveness of DGM by introducing the following three variants.
1) GRMA-Net With GCN: In this variant, we replaced DGM with a graph convolutional network (GCN) [72] to capture the spatial dependency of features at different locations.
2) GRMA-Net w/o DGM: We removed the DGM in this variant to investigate its contribution to GRMA-Net. Specifically, we replaced the DGM with a fully connected layer to generate the predicted labels.
3) Depth vs Width in DGM: We investigate the two main components (i.e., recurrence number and layer number) in the experiments, where the recurrence number represents the depth of DGM and the layer number represents the width of DGM. The hidden size is fixed to 500.

As summarized in Table VI, both GRMA-Net with GCN and GRMA-Net-ResNet18 achieve obvious improvements in terms of OA scores over GRMA-Net w/o DGM. These spatial re-arrangement operations result in improvements of 0.59% and 0.37% for GRMA-Net-ResNet18 and GRMA-Net with GCN, respectively, in terms of OA values on the AID dataset with a 50% training ratio. That is because the spatial re-arrangement operation can help to capture long-range dependency among multilevel features. Compared with GRMA-Net-ResNet18, GRMA-Net with GCN suffers a decrease of 0.22% in terms of OA scores and increases of 3.43 h and 1.09 ms in terms of training time and test time, respectively. Although GRMA-Net with GCN is hard to converge and needs a longer test time, its comparable OA scores also demonstrate the effectiveness of GCN. The potential of GCN is worth further exploration.
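To make the idea of feeding spatially arranged features into stacked GRU layers concrete, the sketch below implements a textbook GRU cell in plain Python. It is a minimal illustration under stated assumptions, not the DGM itself: biases are omitted, weights are random rather than trained, the row-major ordering in `to_sequence` is an assumed arrangement, and the sizes (a 2 x 2 map, 3 channels, hidden size 4) are toy values rather than the hidden size of 500 used in the experiments.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Standard GRU cell without biases (untrained, illustrative weights)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # W* act on the input, U* act on the previous hidden state.
        self.Wz = rand_matrix(hidden_size, input_size, rng)
        self.Uz = rand_matrix(hidden_size, hidden_size, rng)
        self.Wr = rand_matrix(hidden_size, input_size, rng)
        self.Ur = rand_matrix(hidden_size, hidden_size, rng)
        self.Wh = rand_matrix(hidden_size, input_size, rng)
        self.Uh = rand_matrix(hidden_size, hidden_size, rng)

    def step(self, x, h):
        # Update gate z and reset gate r.
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        # Candidate state computed from the reset-gated previous state.
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, gated))]
        # Convex combination of previous and candidate state (one common convention).
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def to_sequence(feature_map):
    """Flatten an H x W grid of channel vectors in row-major order (assumed)."""
    return [vec for row in feature_map for vec in row]


def run_stacked_gru(sequence, cells):
    """Run the sequence through stacked GRU layers; return the last hidden state."""
    h = []
    for cell in cells:
        h = [0.0] * cell.hidden_size
        outputs = []
        for x in sequence:
            h = cell.step(x, h)
            outputs.append(h)
        sequence = outputs  # the next layer consumes this layer's outputs
    return h


rng = random.Random(0)
channels, hidden = 3, 4
feature_map = [[[rng.random() for _ in range(channels)] for _ in range(2)]
               for _ in range(2)]
cells = [GRUCell(channels, hidden, rng), GRUCell(hidden, hidden, rng)]
sequence = to_sequence(feature_map)
final_state = run_stacked_gru(sequence, cells)
print(len(sequence), len(final_state))  # prints "4 4"
```

Because the hidden state is carried across all positions of the sequence, the final state depends on every location of the feature map. This is the mechanism by which a recurrent module can relate distant informative regions. In a classifier, a final fully connected layer would typically map this state to class scores.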
As summarized in Table VII, GRMA-Net achieves an improvement of 0.59% (97.05% vs 96.46%) in terms of OA scores over GRMA-Net w/o DGM. This is because our DGM can better capture long-range dependency. Moreover, we test the performance of our network with different numbers of GRU layers and recurrences. It can be observed that our network achieves the best performance with three GRU layers and ten iterations. This suggests that excessive recurrence and layer numbers can increase the difficulty of network fitting, leading to degraded performance.

V. CONCLUSION
In this article, we propose GRMA-Net for VHR remote sensing scene classification. By incorporating a multilevel attention module, our GRMA-Net can focus on informative regions at multiple levels to extract discriminative features. Moreover, our GRMA-Net uses GRUs to better exploit the spatial dependency and contextual relationship of features at different locations. Experimental results demonstrate the superiority of our GRMA-Net over state-of-the-art methods on four benchmark datasets.