Scene Classification of Remote Sensing Images Based on Saliency Dual Attention Residual Network

Scene classification of high-resolution Remote Sensing Images (RSI) is one of the basic challenges in RSI interpretation. Existing deep learning-based scene classification methods have achieved impressive performance. However, since RSI commonly contain various types of ground objects and complex backgrounds, most methods cannot focus on the salient features of a scene, which limits classification performance. To address this issue, we propose a novel Saliency Dual Attention Residual Network (SDAResNet) that extracts both cross-channel and spatial saliency information for scene classification of RSI. More specifically, the proposed SDAResNet consists of spatial attention and channel attention, in which spatial attention is embedded in low-level features to emphasize salient location information and suppress background information, while channel attention is integrated into high-level features to extract salient semantic information. Additionally, several image classification tricks are used to further improve classification accuracy. Finally, extensive experiments on two challenging benchmark RSI datasets demonstrate that our method significantly outperforms most state-of-the-art approaches.


I. INTRODUCTION
With the rapid development of remote sensing technology and satellite sensors, a great number of high-resolution RSI have become readily available [1]-[3]. It is highly desirable to interpret high-resolution RSI automatically. Scene classification of RSI, i.e., automatically extracting valuable information from each scene image and categorizing it into different classes based on its semantic content, has become a research hotspot in RSI interpretation [1], [4], [5]. Scene classification of RSI has a wide range of applications, including urban planning, natural disaster detection, land-cover/land-use classification, environmental monitoring and so on [6], [7].
High-resolution RSI are quite different from natural images due to their unique imaging perspective and capture mode, which result in images with various types of ground objects and complex backgrounds. Effective feature representation plays a crucial role in scene classification of RSI. Over the past decades, considerable effort has been made to solve this problem and numerous approaches have been proposed. Existing scene classification methods are usually divided into two categories according to the features used: (a) handcrafted-feature-based methods; and (b) learning-based methods, especially deep learning-based methods [5]. In recent years, with the fast development of Convolutional Neural Networks (CNN), a variety of CNN-based methods have come to dominate the field of scene classification, mainly due to their capacity to learn hierarchical representations that describe image scenes [5], [8], [9]. Most CNN-based methods tend to generate a global representation of an image in which every part contributes equally, and leverage the output of the Fully Connected (FC) layer as the global representation to classify scene images. However, RSI commonly contain various types of ground objects and complex backgrounds, and not all objects are useful for scene classification. Therefore, it is crucial to recognize the critical objects and regions in scenes, focus on them and discard the useless ones when representing scene images [5]; in other words, more attention should be paid to salient object information [10]. Improving feature representation by capturing salient scene information is therefore the main focus of this paper.
To solve the aforementioned problems, the Visual Attention Mechanism (VAM) has been proposed, which not only indicates where to focus but also improves the representation of the regions of interest [11]. Recently, there have been several attempts to integrate the attention mechanism into CNNs to improve performance in many tasks, such as image classification [12]-[14], salient object detection [15], semantic segmentation [16] and so on. Wang et al. [14] propose the Residual Attention Network (RAN), which generates attention-aware features by stacking attention modules embedded in a deep residual network to improve classification performance. Hu et al. [13] introduce the ''Squeeze-and-Excitation'' block, which uses global average-pooled features to explicitly model interdependencies between channels and improve performance. Furthermore, Woo et al. [12] propose an attention module that sequentially infers an attention map along the channel and spatial dimensions for adaptive feature refinement. Zhao and Wu [15] propose a salient object detection method that uses channel-wise attention and spatial attention to obtain salient object locations: spatial attention on low-level feature maps is used to obtain salient location information, while channel-wise attention on high-level feature maps is used to capture salient regions. Fu et al. [16] propose a Dual Attention Network (DANet) that adaptively integrates a position attention module and a channel attention module for semantic segmentation.
All of the above-mentioned works target natural image processing; currently, there have also been some proposals involving VAM for scene classification of RSI [4], [5]. Li et al. [4] propose a method called Deep Discriminative Representation Learning with Attention Map (DDRL-AM), which resorts to the Gradient-weighted Class Activation Mapping (Grad-CAM) method to produce attention maps and extract discriminative features for scene classification of RSI. Guo et al. [5] propose a global-local attention network (GLANet) to capture both global and local information for aerial scene classification. These works mainly apply the attention mechanism to high-level features to enhance feature representation, but give no consideration to filtering the background out of the low-level features, so the representation of salient areas in the high-level features is insufficient. Therefore, mainly inspired by the proposals in [12] and [15], we propose a saliency dual attention residual network named SDAResNet to extract salient object features for scene classification. More specifically, we introduce spatial attention and channel attention into ResNet101 to construct a residual attention network, in which spatial attention is integrated into the low-level features of the shallow layers of ResNet101 to highlight salient regions and suppress background information, while channel attention is embedded into the high-level features of the deep layers of ResNet101 to enhance scene saliency and extract salient semantic information.
In addition, in the past few years, some tricks for improving image classification accuracy have been proposed [17]-[24]. In particular, He et al. [24] summarize tricks for natural image classification with CNNs in detail. Extensive experimental results show that several tricks and their combinations can significantly improve the classification performance on natural images. However, due to the huge differences between RSI and natural images in terms of resolution, imaging perspective, object size and object quantity, the applicability of these tricks to scene classification of RSI needs to be investigated. In this paper, we have carefully selected several simple tricks for practical verification, such as Xavier initialization [17], mixup training [18], learning rate cosine decay [20], learning rate warmup [21], no bias decay [22], etc. Empirical evaluations are adopted to verify which tricks are effective for scene classification of RSI.
We summarize the main contributions of our work as follows: (1) We propose a Saliency Dual Attention Residual Network (SDAResNet) to extract salient scene information and improve the feature representation of RSI. It simultaneously utilizes cross-channel and spatial information to capture discriminative salient scene information and boost scene classification performance.
(2) A spatial attention block embedded in low-level features is proposed to suppress background and highlight salient location information, while a channel attention block integrated in high-level features is used to extract salient semantic information. Together, the two attention blocks focus on the most critical parts of a scene image to improve the salient feature representation of RSI.
(3) Several image classification tricks and their effective combinations are introduced into the proposed SDAResNet to improve the training process for better scene classification performance.
(4) Extensive experimental results show that our proposed SDAResNet outperforms most recent state-of-the-art methods on two benchmark RSI datasets, while several image classification tricks and their effective combinations further improve classification accuracy.
The remainder of this paper is organized as follows. Section II gives a brief review of related works. Section III describes the proposed method in detail. Section IV introduces experimental designs and results. Experimental results are discussed in Section V. Finally, Section VI draws the conclusion and states future work.

II. RELATED WORK
In this section, we review related work on the visual attention mechanism, residual attention networks and image classification tricks.

A. VISUAL ATTENTION MECHANISM
It is well known that the visual attention mechanism (VAM) stems from studies of human vision and plays an important role in human perception [11]. In cognitive science, it is understood that one does not attempt to process a whole scene at once, but selectively focuses on its salient parts because of bottlenecks in information processing [25]; this selectivity is what is called the VAM.
In essence, the attention mechanism learns a weight distribution over different parts of the input, corresponding to different degrees of concentration [12]. The benefits of this property have been proven in many tasks, such as machine translation [25], object recognition [26], [27], semantic segmentation [16], image captioning [28] and image classification [13], [29], [30].
Recently, there have been several attempts to integrate attention processing into CNNs to improve performance on large-scale image classification tasks [12]-[14], [29]. Wang et al. [14] propose the Residual Attention Network, which generates attention-aware features from different modules by stacking attention modules, improving classification performance. Hu et al. [13] introduce the ''Squeeze-and-Excitation'' block, which brings significant performance improvements to existing state-of-the-art CNNs by using global average-pooled features to explicitly model interdependencies between channels. Furthermore, in [12] and [29], researchers propose a simple and effective attention module that sequentially infers an attention map along two separate dimensions, channel and spatial; the attention maps are then multiplied with the input feature map for adaptive feature refinement.
More recently, some research works have attempted to focus the feature extraction process on salient image regions. Zhao and Wu [15] propose a novel salient object detection method named the Pyramid Feature Attention network. A Context-aware Pyramid Feature Extraction (CPFE) module is designed for high-level feature maps; furthermore, channel-wise attention on the CPFE feature maps is adopted to capture salient regions, and spatial attention on low-level feature maps is used to obtain salient location information. Fu et al. [16] propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies for scene segmentation. A position attention module and a channel attention module are introduced to capture global semantic interdependencies.
All the previous works were proposed for natural image processing, where they have shown excellent performance in image classification, detection and so on. Owing to the strong feature representation ability of the VAM, it is also well suited to scene classification of RSI. Currently, there have also been some proposals on scene classification of RSI involving VAM [4], [5], [31], [32]. Li et al. [4] propose a novel method called Deep Discriminative Representation Learning with Attention Map (DDRL-AM), which resorts to the Gradient-weighted Class Activation Mapping (Grad-CAM) method to produce an attention map, then encodes the attention map with a CNN, and finally fuses the encoded attention map with the original image to extract discriminative features for scene classification of RSI. Guo et al. [5] propose a novel global-local attention network (GLANet) to capture both global and local information for aerial scene classification; they replace the FC layers of VGGNet with a global attention (GA) branch and a local attention (LA) branch, one of which learns global information while the other learns local semantic information via VAM. Wang et al. [31] propose a novel attention recurrent convolutional network (ARCNet) for scene classification of RSI. The aforementioned works mainly integrate the attention mechanism into high-level features to enhance feature representation, but give no consideration to filtering the background out of low-level features, so the representation of salient areas in the high-level features is insufficient. To solve these issues, we propose SDAResNet, which uses channel attention and spatial attention to extract salient scene regions and improve classification accuracy.

B. RESIDUAL ATTENTION NETWORK
With the development of deep neural networks, He et al. [33] proposed the residual learning network (ResNet), which greatly increases the attainable depth of neural networks. ResNet consists of multiple building blocks; the structure of a building block is shown in Fig. 1. Fig. 1(a) is mainly used for ResNets with fewer than 50 layers, while Fig. 1(b) is mainly used for ResNets with more than 50 layers. As described in [33], ResNet101 consists of conv1, conv2_x (3 BottleneckBlocks), conv3_x (4 BottleneckBlocks), conv4_x (23 BottleneckBlocks), conv5_x (3 BottleneckBlocks) and an FC layer.
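As a quick sanity check on this composition, the "101" in ResNet101 can be recovered by counting the weighted layers of each stage (each BottleneckBlock contains three convolutions); a minimal sketch:

```python
# Recover the depth of ResNet101 from its stage composition:
# conv1 (1 conv) + 3/4/23/3 BottleneckBlocks (3 convs each) + 1 FC layer.
bottlenecks_per_stage = {"conv2_x": 3, "conv3_x": 4, "conv4_x": 23, "conv5_x": 3}
convs_per_bottleneck = 3  # 1x1 reduce, 3x3, 1x1 expand

depth = 1 + convs_per_bottleneck * sum(bottlenecks_per_stage.values()) + 1
print(depth)  # 101
```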
Recently, some attempts have been made to combine the attention mechanism with deep ResNets to improve classification performance [12], [14], [29]. Wang et al. [14] design the residual attention network (RAN) by stacking multiple attention modules to generate attention-aware features. The attention module, which is constructed using pre-activation residual units, ResNeXt and Inception as the basic units of RAN, is divided into two branches: a mask branch and a trunk branch. In [12] and [29], researchers propose a simple and effective attention module that decomposes the computation of the three-dimensional attention map into two separate pathways along the channel and spatial axes to sequentially infer the attention map, which is then integrated into the residual block to perform adaptive feature refinement. Nevertheless, channel attention and spatial attention are sequentially integrated into every residual block, which not only increases the number of parameters and the computation, but also takes no account of the characteristics of different convolution layers and different attentions. Unlike the above-mentioned methods, instead of sequentially incorporating channel attention and spatial attention into each BasicBlock (BottleneckBlock) of the ResNet, we integrate spatial attention into the conv2_x layer of ResNet to capture salient location information, and channel attention into the conv3_x, conv4_x and conv5_x layers of ResNet to obtain salient semantic information, which not only reduces the number of parameters and the computation, but also strengthens the salient representation to improve scene classification performance.

C. IMAGE CLASSIFICATION TRICKS FOR IMPROVING TRAINING PROCESS
With the introduction of AlexNet [34], deep neural networks, especially CNNs, have become the dominant method for image classification. In just a few years, various new network architectures have emerged, including VGG [35], Inception [36], ResNet [33], DenseNet [37] and NASNet [38]. These new architectures steadily improve the classification accuracy on large-scale natural images. However, these improvements do not come merely from the improved model architectures; the improved training process also plays an important role [24]. Many tricks have been applied to tasks such as natural image classification and detection. He et al. [24] summarize in detail the tricks for natural image classification with CNNs. Empirical evaluation shows that several tricks and their combinations lead to significant accuracy improvements. However, due to the huge differences between RSI and natural images in terms of resolution, imaging perspective, object size and object quantity, tricks applicable to natural image classification may not transfer well to scene classification of RSI. In this paper, we mainly investigate a few simple tricks for improving the training process and verify their effectiveness on scene classification of RSI. The following is a brief review of several such tricks.
It is well known that data augmentation is an effective strategy for increasing model generalization. Zhang et al. [19] propose a data-agnostic and straightforward data augmentation approach, mixup training, which trains on virtual examples constructed as the linear interpolation of two random examples from the training set and of their labels. Mixup can regularize the neural network, increase robustness and stabilize the training process. When mixup training is used, more training epochs are generally required to obtain good results. Zhong et al. [39] propose a new data augmentation method named ''Random Erasing'' (RE), which randomly selects and erases a rectangular region in the input image to obtain training images with occlusion, reducing overfitting and improving the robustness of the model.
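The interpolation described above can be sketched in a few lines; a minimal numpy version, assuming one-hot labels and a Beta(α, α) mixing coefficient as in [19] (the array and parameter names are illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Linearly interpolate two training examples and their one-hot labels."""
    lam = rng.beta(alpha, alpha)       # mixing coefficient in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2    # virtual input image
    y = lam * y1 + (1.0 - lam) * y2    # virtual (soft) label
    return x, y
```

Mixing, say, a "river" scene with a "forest" scene yields a soft label spread over both classes, which is what regularizes the network.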
In the Stochastic Gradient Descent (SGD) algorithm, the learning rate determines the speed at which the parameters move toward their optimal values, and even the final performance of the algorithm. If the learning rate is too large, the optimum is likely to be overshot; if it is too small, optimization may be inefficient and convergence extremely slow. Therefore, learning rate scheduling has been proposed. Generally, at the beginning of training, a higher learning rate is used to achieve fast convergence; as training progresses, the learning rate is gradually reduced, which helps to find the optimal solution. In [20] and [33], the researchers propose the multi-step (MultiStepLr) and cosine (CosLr) learning rate decay strategies respectively; the former adjusts the learning rate at preset step intervals, while the latter adjusts it along a cosine curve. In addition, the parameters of a model are typically initialized randomly at the start of training, so directly using a large initial learning rate can destabilize training. Goyal et al. [21] propose a gradual warmup (Warmup) trick that ramps the learning rate up linearly from 0 to the initial learning rate, avoiding a sudden jump and achieving gradual convergence at the beginning of training. After Warmup, training reverts to the initial learning rate schedule. In this work, we use the first m epochs (m = 5) for warmup with initial learning rate η (η = 0.1); at epoch i (1 ≤ i ≤ m), the learning rate is set to iη/m.
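The warmup-then-cosine schedule just described can be sketched as a single function; a minimal version assuming the m = 5, η = 0.1 settings above (epoch indexing and the zero final rate are illustrative choices):

```python
import math

def lr_at(epoch, total_epochs, base_lr=0.1, warmup_epochs=5):
    """Learning rate with linear warmup followed by cosine decay.

    During the first warmup_epochs (0-indexed epochs), the rate ramps
    linearly from base_lr / warmup_epochs up to base_lr; afterwards it
    follows a half-cosine from base_lr down toward 0 (CosLr).
    """
    if epoch < warmup_epochs:
        return (epoch + 1) * base_lr / warmup_epochs   # i * eta / m
    t = epoch - warmup_epochs
    T = total_epochs - warmup_epochs
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t / T))
```

For example, the first five epochs use rates 0.02, 0.04, 0.06, 0.08, 0.1, after which the cosine decay takes over.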
The weight initialization method has a crucial influence on the convergence speed and performance of a model. Glorot and Bengio [17] propose Xavier initialization (XavierInit) to address the problems of naive random initialization. The main idea is to keep the distributions of the inputs and outputs of each layer as similar as possible, so that the activation values at the next layer do not collapse toward 0. Jia et al. [22] propose the No Bias Decay strategy (NoBias): weight regularization is applied only to the weights of convolution and FC layers, while other parameters, including the biases and the γ and β of BN layers, are left unregularized. The NoBias strategy can alleviate overfitting and boost classification performance. Szegedy et al. [23] propose the Label Smoothing (LS) algorithm, which regularizes the classifier layer by estimating the marginalized effect of label-dropout during training. The cross entropy is computed against a weighted average of the hard target and a uniform distribution over the labels, which effectively improves accuracy. It is in essence a constraint method that reduces overfitting by adding noise to the output label y.
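The label smoothing target described above is simple to write down; a minimal numpy sketch, assuming ε = 0.1 and the common variant that spreads ε uniformly over the K − 1 non-true classes (the function name is illustrative):

```python
import numpy as np

def smooth_labels(y_true, num_classes, eps=0.1):
    """Replace a hard one-hot target with a smoothed distribution:
    1 - eps on the true class, eps / (K - 1) spread over the others."""
    q = np.full(num_classes, eps / (num_classes - 1))
    q[y_true] = 1.0 - eps
    return q
```

The cross entropy is then taken against this soft target q instead of the one-hot vector.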

III. THE PROPOSED METHOD
To extract salient features and boost scene classification performance, a novel SDAResNet is proposed for scene classification of RSI. The framework of SDAResNet is illustrated in Fig. 2(b). ResNet101 [33] is used as the backbone; the original ResNet101 framework is shown in Fig. 2(a). Since convolution operations extract informative features by blending cross-channel and spatial information together [12], and considering the different characteristics of features at different levels [15], a dual attention is adopted to emphasize meaningful features along the two principal dimensions, channel and spatial. Global context-aware information is typically contained in the high-level features of deep layers, which are suitable for locating salient regions correctly. The low-level features in shallow layers contain spatial structural details, which are suitable for detecting locations and suppressing unnecessary background information. Therefore, inspired by the approach adopted in [15], Channel-wise Attention (CA) is embedded into high-level features to generate salient features, while Spatial Attention (SA) is integrated into low-level features to suppress background information and strengthen salient location information.
FIGURE 2. The framework of our proposed SDAResNet and the original ResNet101. Fig. 2(a) shows the original ResNet101 framework. It consists of conv1, conv2_x (3 BottleneckBlocks), conv3_x (4 BottleneckBlocks), conv4_x (23 BottleneckBlocks), conv5_x (3 BottleneckBlocks), an FC layer and a sigmoid layer. Fig. 2(b) represents the framework of the proposed SDAResNet. As in the original ResNet101, the red dotted box corresponds to conv2_x, and the three blue dotted boxes correspond to conv3_x, conv4_x and conv5_x. In the red dotted box, F_l represents a low-level feature map extracted from the conv2_x layer; first, the spatially refined feature F_S is generated by SA, then attention residual learning is performed to extract feature maps with SA. In the blue dotted boxes, F_h represents a high-level feature map extracted from the conv3_x, conv4_x or conv5_x layer; first, the channel-refined feature F_C is obtained by CA, and then attention residual learning is performed to extract the feature map with CA. Finally, the feature vector is extracted from the FC layer. Here ⊕ denotes element-wise addition and ⊙ denotes dot multiplication.
More specifically, SA is integrated with the conv2_x layer of ResNet101, while CA is embedded into the conv3_x, conv4_x and conv5_x layers of ResNet101. Our proposed SDAResNet differs from [12], which directly integrates the combination of channel and spatial attention into every layer; it also differs from the approach proposed in [15], which fuses higher-level features carrying CA with lower-level features carrying SA for saliency detection. As a result, our model uses saliency information efficiently by learning which information to emphasize or suppress. In this section, spatial attention and channel attention are first introduced, then attention residual learning is elaborated.
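The placement just described can be summarized as a simple stage-to-attention map; a sketch, with stage names following the ResNet101 convention used in this paper:

```python
# Which attention block SDAResNet attaches to each ResNet101 stage
# (conv1 and the FC layer are left unchanged).
attention_layout = {
    "conv2_x": "SA",   # low-level features: spatial attention
    "conv3_x": "CA",   # high-level features: channel attention
    "conv4_x": "CA",
    "conv5_x": "CA",
}
```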

A. SPATIAL ATTENTION
In standard CNN structures, each convolutional feature map commonly has three dimensions: channel, width and height. Remote sensing images usually contain a wealth of ground details and complex backgrounds. In general, the saliency map derived from low-level features contains many background details that easily disturb saliency detection [15]. In order to suppress this noise and extract salient features that improve scene classification performance, we need to focus more on the salient foreground regions and suppress the background. Spatial attention serves exactly this purpose: it captures the inter-dependencies along the spatial dimension and then learns the most significant regions of an RSI. Woo et al. [12] apply maximum pooling and average pooling to generate a spatial attention map integrated with all layers. Zhao and Wu [15] propose embedding spatial attention only with the low-level features, because high-level features contain abstract semantics whose spatial information need not be filtered. Inspired by the proposals of [12] and [15], we design the spatial attention block shown in the dotted box of Fig. 3(a). The process is described in detail below.
As shown in Fig. 2 and Fig. 3(a), the spatial attention block integrated with the low-level features of ResNet101 mainly contains two convolution layers and a softmax layer. In order to increase the receptive field and obtain global information without increasing the number of parameters, similar to [15], we apply two convolution layers instead of maximum and average pooling to generate the spatial attention weight matrix. Firstly, the low-level feature map F_l is fed into two consecutive convolution layers f_1 and f_2, and two feature maps are extracted by using the two different orderings of the convolution layers, where the kernel size of f_1 is 1×k and that of f_2 is k×1. Secondly, the two feature maps are concatenated to generate an efficient feature descriptor. Finally, a softmax operation is performed on the concatenated feature descriptor to capture the spatial concerns and generate the spatial attention weight matrix F_SA ∈ R^(W×H). In brief, the spatial attention weight matrix is derived as follows:

C_1 = f_2(f_1(F_l; W); W)    (1)
C_2 = f_1(f_2(F_l; W); W)    (2)
F_SA = σ(concat(C_1, C_2))    (3)

where W refers to the weight parameters of the convolution operations, σ refers to the softmax operation, and F_l ∈ R^(W×H×C) represents the low-level feature from the conv2_x layer of ResNet101; f_1 and f_2 refer to the 1×k and k×1 convolution layers respectively, and we set k = 9 in the experiments. C_1 represents the result of the two convolution operations of the top branch in Fig. 3(a), i.e., first performing the 1×k convolution and subsequently the k×1 convolution on the low-level feature map F_l. Similarly, C_2 represents the result of the two convolution operations of the bottom branch in Fig. 3(a). F_SA refers to the spatial attention weight matrix, computed by performing the softmax operation on the concatenation of C_1 and C_2.
For a detailed understanding of the spatial attention block, according to Eq. 1, Eq. 2 and Eq. 3, the following formula can be derived:

F_SA(i,j) = exp(C_1(i) · C_2(j)) / Σ_{i=1}^{N} exp(C_1(i) · C_2(j))    (4)

where F_SA is a matrix that represents the inter-spatial relationship between every two positions of the low-level feature maps, and N is the number of spatial positions. Specifically, F_SA(i,j) measures the i-th position's impact on the j-th position. If the set of spatial points is defined as R = {(x, y) | x = 1, ..., W; y = 1, ..., H}, then i = (x, y) and j = (x', y') are the i-th and j-th position coordinates, respectively. The probability of the corresponding feature is obtained through the softmax operation: a larger probability value indicates that the feature is more important, and vice versa. The final result F_SA is therefore a weight matrix. The final output F_S is obtained by dot multiplication between F_l and F_SA:

F_S = F_l ⊙ F_SA    (5)

where ⊙ denotes dot multiplication.
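The position-wise weighting can be illustrated with a simplified numpy sketch; here the two convolution branches are replaced by arbitrary illustrative callables, the two branch outputs are fused by addition rather than concatenation, and the softmax is taken over spatial positions rather than the full pairwise matrix, so this is only a sketch of the weighting step, not the paper's exact block:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def spatial_attention(F_l, branch_top, branch_bot):
    """Weight each spatial position of a low-level feature map.

    F_l: array of shape (H, W, C). branch_top / branch_bot stand in
    for the 1xk -> kx1 and kx1 -> 1xk convolution branches; here they
    are any callables mapping (H, W, C) -> (H, W).
    """
    C1 = branch_top(F_l)                        # top-branch response
    C2 = branch_bot(F_l)                        # bottom-branch response
    s = C1 + C2                                 # fused descriptor, (H, W)
    F_SA = softmax(s.ravel()).reshape(s.shape)  # weights over positions
    F_S = F_l * F_SA[..., None]                 # dot multiplication with F_l
    return F_S, F_SA
```

Positions with large branch responses receive large weights, so background positions are suppressed in F_S.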

B. CHANNEL ATTENTION
The high-level features in CNNs usually contain meaningful semantic information, while different channels of the features correspond to different semantics [13]. In a standard CNN framework, since each channel of a feature map is computed by a convolution operation, redundant information among different channels is inevitable. To reduce the inter-channel redundancy of feature maps and focus on the key parts, Woo et al. [12] generate a channel attention map by exploiting the inter-channel relationship to focus on 'what' is meaningful in an image, but they did not take into account the difference between low-level and high-level features.
FIGURE 3. The structure of the proposed attention blocks. As illustrated, SA, integrated with the conv2_x layer, applies two convolution layers (one kernel is 1×k and the other is k×1) to the low-level feature (F_l) to generate the spatial attention feature (F_SA); CA, integrated with the conv3_x, conv4_x or conv5_x layer, utilizes both max-pooling and average-pooling outputs with a shared MLP to generate channel attention features (F_CA). ⊕ denotes element-wise summation, ∼ denotes softmax on each row of a matrix, and ⊙ represents dot multiplication.
In order to enlarge the saliency in high-level features, Zhao and Wu [15] add a channel-wise attention module to high-level features for saliency detection. Moreover, in [12] and [13], the researchers conduct comparative experiments on the use of average and maximum pooling to construct channel attention; the results show that average pooling alone is slightly better than maximum pooling alone, but Woo et al. [12] verify that using average and maximum pooling together yields better performance. This matches the characteristics of the two pooling operations on the high-level feature maps of deep layers: average pooling retains the overall data features and obtains global high-level semantic information that promotes classification performance, while maximum pooling captures the most salient part of the feature. Therefore, the two pooling methods are used together to better improve classification performance. Inspired by the above-mentioned proposals, we generate a channel attention weight matrix only for the high-level features, using maximum and average pooling together with a shared network, which not only increases the saliency of ground objects in RSI but also reduces the computation cost. The channel attention aims to learn the inter-channel relationships of remote sensing image feature maps. The detailed structure of the channel attention block is shown in the dotted box of Fig. 3(b) and described below. As shown in Fig. 2 and Fig. 3(b), a channel attention block integrated with the high-level features of ResNet101 mainly contains two pooling layers and a shared Multi-Layer Perceptron (MLP) with one hidden layer.
Firstly, two different spatial context descriptors of a high-level feature map F_h, i.e., the average-pooled feature F_avg^c and the max-pooled feature F_max^c, are generated by applying average pooling and maximum pooling over the spatial dimensions. Secondly, F_avg^c and F_max^c are separately forwarded to a shared MLP network to generate two output feature vectors. To reduce the parameter overhead, we use only one hidden layer in the shared MLP network and set the hidden activation size to R^(C/r×1×1), where r is the reduction ratio, set to 9 in the experiments. Then, the two output feature vectors are merged by element-wise addition. Finally, a softmax function is used to normalize the result and generate the channel attention weight matrix F_CA ∈ R^(C×1×1). In short, CA is derived as follows:

F_CA = σ(MLP(AvgPool(F_h)) + MLP(MaxPool(F_h))) = σ(W_1 δ(W_0 F_avg^c) + W_1 δ(W_0 F_max^c))    (6)

where F_CA denotes the channel attention weight matrix, σ denotes the softmax function performed on each row, and W_0 ∈ R^(C/r×C) and W_1 ∈ R^(C×C/r) denote the weights of the MLP, which are shared for both inputs; the ReLU activation function δ follows W_0.
For a detailed understanding of the channel attention, we unfold Eq. 6 to derive the following equation:

F_CA(i,j) = exp(M(i,j)) / Σ_{j=1}^{N} exp(M(i,j)),  with M = W_1 δ(W_0 F_avg^c) + W_1 δ(W_0 F_max^c)    (7)

where F_CA(i,j) measures the relationship between the i-th channel and the j-th channel, and N is the number of channels. After the softmax operation, the final output with channel attention, F_C, is obtained by dot multiplication between F_h and F_CA:

F_C = F_h ⊙ F_CA    (8)

where ⊙ denotes dot multiplication.
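The channel re-weighting can likewise be illustrated with a simplified numpy sketch; the shared-MLP weights W0 and W1 are random placeholders here, and the softmax is taken over a single channel vector rather than the pairwise matrix, so this is a sketch of the weighting step rather than the paper's exact block:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def channel_attention(F_h, W0, W1, relu=lambda v: np.maximum(v, 0.0)):
    """Re-weight the channels of a high-level feature map.

    F_h: (H, W, C). W0: (C // r, C) and W1: (C, C // r) are the
    shared-MLP weights (reduction ratio r).
    """
    f_avg = F_h.mean(axis=(0, 1))   # average-pooled descriptor, (C,)
    f_max = F_h.max(axis=(0, 1))    # max-pooled descriptor, (C,)
    F_CA = softmax(W1 @ relu(W0 @ f_avg) + W1 @ relu(W0 @ f_max))
    F_C = F_h * F_CA                # broadcast weights over H and W
    return F_C, F_CA
```

Channels whose pooled descriptors excite the MLP strongly receive larger weights, emphasizing the salient semantics they carry.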

C. ATTENTION RESIDUAL LEARNING
Because of the different characteristics of SA and CA, SA is integrated with the conv2_x layer of ResNet101, while CA is integrated with the conv3_x, conv4_x and conv5_x layers of ResNet101. Similar to the idea of residual learning, the attention mechanism is introduced into residual learning. As Batch Normalization (BN) is a widely used technique to stabilize the training process, BN is added to attention residual learning for fast convergence [40]. In addition, according to the pre-activation idea proposed in [41], BN is placed in front of ReLU, which further stabilizes training and improves classification performance. For an intuitive understanding, the attention residual block of our proposed SDAResNet is plotted in Fig. 4, where the two attentions are integrated into residual learning. The structure of the original residual block is shown in Fig. 4(a), and our proposed attention residual block is shown in Fig. 4(b), in which the attention feature is added to the residual block to perform attention residual learning; at the same time, BN and ReLU are applied sequentially before the attention residual learning. As can be seen from Fig. 4, after the attention feature is computed, the corresponding refined feature is extracted by matrix multiplication and element-wise summation. The process is described in detail below. For a given intermediate feature map $X_l$, BN and ReLU are first applied as pre-activation; then the attention feature matrix is generated by CA or SA, the weighted feature is obtained by matrix multiplication, and the refined attention feature is generated by element-wise summation. The corresponding formulas are derived as follows. After applying pre-activation, the residual learning formula is
$$X_{l+1} = X_l + F\big(f(X_l), W_l\big), \tag{9}$$
where $f$ represents the pre-activation function (BN followed by ReLU), $F(\cdot)$ represents the residual function, and $W_l$ represents the weight matrix.
According to (9) together with (5) and (8), the output of the spatial attention residual block is calculated as
$$X_{l+1} = X_l + F_{SA}\big(f(X_l)\big) \otimes F\big(f(X_l), W_l\big), \tag{10}$$
and the output of the channel attention residual block is calculated as
$$X_{l+1} = X_l + F_{CA}\big(f(X_l)\big) \otimes F\big(f(X_l), W_l\big). \tag{11}$$
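The attention residual learning described above can be sketched in PyTorch as follows. This is a minimal illustration under our own simplifications: a single 3×3 convolution stands in for the real residual branch $F(\cdot)$, and any attention module (SA or CA) can be passed in:

```python
import torch
import torch.nn as nn


class AttentionResidualBlock(nn.Module):
    """Sketch of attention residual learning with pre-activation:
    BN and ReLU are applied first, the residual branch is computed,
    the attention module reweights the residual feature, and the
    identity is added back by element-wise summation."""

    def __init__(self, channels: int, attention: nn.Module):
        super().__init__()
        # Pre-activation: BN placed in front of ReLU.
        self.pre_act = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Stand-in for the residual function F(.) -- a single 3x3 conv.
        self.residual = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.attention = attention

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_x = self.pre_act(x)       # f(X_l): BN + ReLU pre-activation
        res = self.residual(f_x)    # F(f(X_l), W_l)
        att = self.attention(res)   # attention-reweighted residual feature
        return x + att              # element-wise summation with identity
```

Passing an SA module for the conv2_x stage and a CA module for the conv3_x to conv5_x stages would reproduce the placement used in SDAResNet.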

IV. EXPERIMENTS
In this section, extensive experiments are conducted to evaluate the effectiveness of the proposed SDAResNet.
In addition, ablation studies are performed to verify the effectiveness of some tricks and their combinations, which are effectively used in natural image classification and detection.
To thoroughly evaluate the effectiveness of our final model, two challenging remote sensing scene datasets (NWPU-RESISC45 [1] and PatternNet [3]) are first introduced. Then, the details of the experimental settings and evaluation metrics are explained. Finally, the experimental results and analysis are presented to evaluate the effectiveness of the proposed SDAResNet and some tricks.

A. DATASET DESCRIPTION
In order to validate our work, we conduct experiments on two public benchmark remote sensing scene datasets (NWPU-RESISC45 [1] and PatternNet [3]). The NWPU-RESISC45 dataset [1], which was created by Northwestern Polytechnical University from Google Earth imagery, includes 31,500 images covering 45 scene classes, with 700 images in each class. 1 Each image has a size of 256 × 256 pixels in the red-green-blue (RGB) color space, while the spatial resolution varies from about 30 m to 0.2 m per pixel. To the best of our knowledge, NWPU-RESISC45 is currently the most challenging large-scale remote sensing scene dataset.
PatternNet [3] is another large-scale high-resolution RSI dataset, in which the images were collected by Wuhan University from Google Earth or the Google Map API for US cities. 2 There are 38 classes, and each class contains 800 images of size 256 × 256 pixels. Similar to NWPU-RESISC45, PatternNet contains images with varying resolution: the highest spatial resolution is around 0.062 m and the lowest is around 4.693 m.
The NWPU-RESISC45 dataset is randomly split into 10% (or 20%) for training and the remaining 90% (or 80%) for testing. For the PatternNet dataset, the training proportions are set to 20% and 50%, with the rest used for testing. These proportions are selected in accordance with previous studies in the literature [5] in order to facilitate comparison with state-of-the-art approaches.
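The per-class random splits described above can be sketched as follows; the helper name and the (path, label) sample representation are our own assumptions:

```python
import random
from collections import defaultdict


def stratified_split(samples, train_ratio, seed=0):
    """Randomly split (path, label) pairs per class, as done for the
    10%/20% (NWPU-RESISC45) and 20%/50% (PatternNet) training ratios.

    Shuffling happens independently within each class so that every
    class contributes the same proportion of training images.
    """
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)

    rng = random.Random(seed)
    train, test = [], []
    for label, paths in by_class.items():
        rng.shuffle(paths)
        k = int(len(paths) * train_ratio)
        train += [(p, label) for p in paths[:k]]
        test += [(p, label) for p in paths[k:]]
    return train, test
```

Running the split with a different seed for each of the ten repetitions would mirror the repeated-experiment protocol used later for the reported averages.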

B. EXPERIMENTAL SETTING AND EVALUATION METRICS 1) EXPERIMENTAL SETTING
Our proposed method is implemented using the PyTorch framework. The parameters in the training stage are set as follows. The mini-batch size of the SGD optimizer is fixed at 40 to fit the GPU memory, the weight decay is set to 0.00001, and the momentum is set to 0.9. The initial learning rate is fixed at 0.1 to accelerate convergence. The total number of training epochs is set to 220. The parameters of the tricks mentioned in Section 2.3 are as follows: the warmup length is 5 epochs, the step intervals of MultiStepLr are [30, 80, 130, 180], and the parameter α of Mixup is 0.4. All experiments are run on an Ubuntu 16.04 server with an Intel(R) Xeon(R) Gold 6048 CPU @ 2.36 GHz with 8 cores and four TITAN V GPUs.
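The optimizer and learning-rate schedule above (linear warmup for 5 epochs followed by step decay at the MultiStepLr milestones) can be sketched in PyTorch as follows. The decay factor `GAMMA = 0.1` and the stand-in model are our own assumptions, as the text does not state them:

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters from the experimental setting above.
WARMUP_EPOCHS = 5
MILESTONES = [30, 80, 130, 180]
BASE_LR = 0.1
GAMMA = 0.1  # decay factor per milestone -- an assumption, not stated in the text

model = torch.nn.Linear(10, 45)  # stand-in for SDAResNet
optimizer = SGD(model.parameters(), lr=BASE_LR,
                momentum=0.9, weight_decay=1e-5)


def lr_lambda(epoch):
    """Linear warmup for the first WARMUP_EPOCHS, then step decay."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    return GAMMA ** sum(epoch >= m for m in MILESTONES)


scheduler = LambdaLR(optimizer, lr_lambda)
# In the training loop: call scheduler.step() once per epoch.
```

Swapping `lr_lambda` for a cosine decay after the warmup phase would give the CosLr variant evaluated in the ablation study.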

2) EVALUATION METRICS
To evaluate the results of the proposed SDAResNet for scene classification and the effectiveness of some tricks, the overall accuracy and the confusion matrix are adopted as evaluation metrics in the following experiments.
Overall Accuracy (OA): It is defined as the number of correctly classified samples divided by the total number of samples, without taking into account the categories to which they belong. The formula is
$$OA = \frac{T}{T + F},$$
where $T$ represents the number of correctly classified samples and $F$ represents the number of misclassified samples. Confusion Matrix (CM): To show the performance of the algorithm in a visual way, the CM is an informative table used to analyze the errors and confusions between different classes; it is obtained by counting the correctly and incorrectly classified test images of each class and accumulating the results in the table. In this matrix, each row represents an actual category and each column represents a predicted category. Therefore, it is very easy to see whether multiple categories have been confused.
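Both metrics are straightforward to compute; a minimal NumPy sketch (function names are ours):

```python
import numpy as np


def overall_accuracy(y_true, y_pred):
    """OA = T / (T + F): correctly classified samples over all samples,
    regardless of category."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are actual categories, columns are predicted categories;
    entry (t, p) counts test images of class t predicted as class p."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```

The diagonal of the confusion matrix, divided by the per-row sums, gives the per-class accuracies quoted in the analysis below.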
In addition, to obtain reliable results, all experimental results are reported as the average of ten repeated experiments using randomly selected training samples.

C. EXPERIMENTAL RESULTS AND ANALYSIS 1) ABLATION STUDY
In this section, the proposed SDAResNet is used as the baseline to evaluate the effectiveness of some tricks under Training Ratio (TR) of 10% and 20% on NWPU-RESISC45 dataset. In order to analyze the influence of different tricks, experiments with different tricks are conducted and the results are reported in Table 1.
By comparing the performance of the Baseline with each trick under training ratios of 10% and 20%, we find that CosLr improves the accuracy by 2.231% and 2.274%, which may be because the cosine learning rate decay lets the gradient descent approach the optimum more smoothly and thus improves the classification performance. LS improves the accuracy by 0.867% and 0.884%, probably because label smoothing effectively prevents the model from overfitting and improves its generalization ability. Regarding NoBias, leaving some parameters such as biases without regularization during training may effectively prevent overfitting and yield better classification performance, which may explain the accuracy improvement of 0.624% and 0.636%. Warmup improves the accuracy by 0.603% and 0.614%, which might be because gradually ramping up the learning rate at the beginning of training achieves stable convergence and better classification performance. RE improves the accuracy by 0.328% and 0.334%, probably because random erasing augments the training images with occlusions and thus prevents the model from overfitting. Mixup only improves the accuracy by 0.063% and 0.065%; the possible reason is that mixup generally requires more training epochs to achieve better classification performance. Furthermore, the accuracies of MultiStepLr and XavierInit are reduced by about 1%. The XavierInit trick degrades the accuracy because it is used with activations that are not symmetric around 0 (e.g., sigmoid), which yields poor learning dynamics and initial saturation of the top hidden layer. The decrease caused by MultiStepLr may be due to an unreasonable step interval and too few training epochs.
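Two of the tricks evaluated above, label smoothing (LS) and Mixup, can be sketched in PyTorch as follows. These are standard textbook formulations, not the authors' code; the function names are ours, and the Mixup sketch assumes one-hot labels:

```python
import torch
import torch.nn.functional as F


def label_smoothing_ce(logits, targets, eps=0.1):
    """Cross-entropy with smoothed targets: the true class gets
    probability 1 - eps, and the remaining eps is spread uniformly
    over all classes."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # -(1-eps) log p_y  -  (eps / n_classes) * sum_k log p_k
    loss = (1 - eps) * nll - eps * log_probs.mean(dim=-1)
    return loss.mean()


def mixup(x, y_onehot, alpha=0.4):
    """Mixup: convex combination of random image pairs and their labels.

    alpha = 0.4 matches the Mixup parameter used in the experiments.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    mixed_x = lam * x + (1 - lam) * x[perm]
    mixed_y = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return mixed_x, mixed_y
```

With `eps = 0` the smoothed loss reduces to the ordinary cross-entropy, which is a convenient sanity check when wiring the trick into a training loop.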
To assess the effectiveness of trick combinations, we evaluate several combinations of effective tricks under training ratios of 10% and 20% on the NWPU-RESISC45 dataset. We observe that the combination of Baseline, CosLr, Warmup, LS and RE improves the accuracy by 3.48% and 3.21% respectively; the combination of Baseline, Mixup, CosLr, Warmup, LS and RE improves it by 3.58% and 3.32%; and the combination of Baseline, Mixup, CosLr, Warmup, NoBias, LS and RE improves it by 3.75% and 3.71%, with the best accuracies reaching 93.15% and 94.86% respectively. From these results, it can be seen that stacking several effective tricks greatly improves the accuracy, which may be because such combinations make full use of the advantages of each trick. This demonstrates the effectiveness of the tricks and their combinations for scene classification of RSI.

2) COMPARATIVE RESULTS ON NWPU-RESISC45 DATASET
The comparative results on the NWPU-RESISC45 dataset are shown in Table 2, which includes our proposed SDAResNet and several state-of-the-art representative approaches. All of these methods use CNN-based features. Among them, AlexNet, GoogLeNet and VGGNet are CNN-based single-feature methods; LASC-CNN (multiscale) [42], TEX-TS-Net [43], SAL-TS-Net [43], SCCov [45] and DCNNs [44] are CNN-based multiple-feature methods; and ResNet+AM [4], ResNet+AM+CL [4] and GLANet [5] are attention-based feature methods. As can be seen in Table 2, the CNN-based multiple-feature methods outperform the single-feature methods by large margins under training ratios of 10% and 20%, which demonstrates the clear superiority of CNN-based multiple features for scene classification of RSI. In addition, the attention-based methods are superior to the other CNN-based methods; ResNet+AM+CL [4] and GLANet [5] obtain better accuracy than the CNN-based multiple-feature methods because the attention mechanism is utilized. Most importantly, the three combinations of several effective tricks outperform all other compared methods, achieving overall accuracies of 92.88%, 92.98% and 93.15% under the 10% training ratio and 94.36%, 94.43% and 94.86% under the 20% training ratio. Figures 5 and 6 show the confusion matrices generated by the best combination of the proposed SDAResNet and several effective tricks under training ratios of 10% and 20% on the NWPU-RESISC45 dataset, respectively. As shown in Fig. 6, 40 of the 45 categories achieve classification accuracies greater than 90%. It is well known that similar spatial distributions and identical objects exist among ''medium residential'', ''dense residential'' and ''sparse residential''; Fig. 6 shows that our method can accurately classify these scenes with large inter-class similarity (the accuracies for ''dense residential'' and ''sparse residential'' are 92% and 95%).
Furthermore, for ''railway'', ''freeway'', ''runway'' and ''railway station'', although there are large intra-class diversities, the accuracies still reach 93%, 94%, 93% and 89% respectively. All of this indicates that our approach is reasonable. Nevertheless, both ''church'' and ''palace'' remain difficult to recognize (with accuracies of 82% and 80% respectively). As can be seen from the comparison between Fig. 5 and Fig. 6, as the number of training samples decreases, the accuracy of almost all classes decreases correspondingly. On the whole, our proposed SDAResNet with effective tricks achieves better accuracy than the current state-of-the-art methods.

3) COMPARATIVE RESULTS ON PATTERNNET DATASET
Extensive experiments on the PatternNet dataset are conducted with the combination of the proposed SDAResNet and several tricks. As can be seen in Table 3, our results outperform the other compared approaches [5], [46], [47], with overall accuracies of 99.30% and 99.58% under training ratios of 20% and 50%, respectively. As shown in Table 3, LANet [5] and GLANet(SVM) [5] (98.64% and 98.91% under a training ratio of 20%) are superior to all CNN-based single-feature methods, which indicates that the attention mechanism is a good strategy for further improving classification accuracy, but they are still inferior to our SDAResNet. Furthermore, our proposed SDAResNet without any tricks is better than the other methods under a training ratio of 20%, and its result under a training ratio of 50% is comparable to GLANet(SVM).
The confusion matrices generated by the best combination of the proposed SDAResNet and effective tricks under training ratios of 20% and 50% on the PatternNet dataset are shown in Figures 7 and 8, respectively. As shown in Figure 8, 37 of the 38 categories achieve accuracies of over 99%, and 32 scene classes are fully recognized. Among these scenes, ''sparse residential'', the most easily confused scene class, achieves an accuracy of 95%. All of this indicates that our method obtains discriminative feature representations and boosts classification performance very well on the PatternNet dataset.

V. DISCUSSIONS
Compared with other methods, our method achieves competitive results because two important ideas are introduced: (1) in SDAResNet, spatial attention and channel attention are applied to low-level and high-level features respectively to extract saliency scene information; (2) various effective trick combinations are used to assist SDAResNet in further improving the accuracy.
Two variants of the proposed SDAResNet are first investigated: (1) SDAResNet-SA, the variant without spatial attention; and (2) SDAResNet-CA, the variant without channel attention. Besides, the backbone of the proposed SDAResNet, ResNet101, is investigated, as well as the two closest approaches (BAM+ResNet101 and CBAM+ResNet101). The comparisons are presented in Table 4. As can be seen from the comparisons, our proposed SDAResNet exceeds all the other methods. In particular, it surpasses its counterparts BAM+ResNet101 and CBAM+ResNet101, although all three methods involve spatial and channel attention. This shows that applying different attentions to different layers is superior to combining the two attentions in all layers, and further indicates that our proposed SDAResNet is more effective.
To further illustrate the effectiveness of our proposed SDAResNet, the number of network parameters, the training time and the overall accuracy are compared with those of the standard ResNet101 [33], BAM [29] and CBAM [12]; the results are shown in Table 5. Compared with CBAM, under the same number of training epochs (220) our method has fewer parameters, a shorter training time and higher accuracy. Compared with BAM and ResNet101, although our method has more parameters and a longer training time, it achieves higher accuracy. All of this shows that our proposed SDAResNet is effective.
Regarding the natural image classification tricks proposed by predecessors, as can be seen from Tables 1 and 2, CosLr has an obvious effect on performance: it boosts the accuracy to 91.63% and 93.42% under training proportions of 10% and 20% respectively, and the accuracy of 93.42% under the 20% training proportion already surpasses most of the current state-of-the-art methods. The LS, NoBias and Warmup tricks can also greatly boost the classification performance. The three combinations of SDAResNet and several tricks exceed all the compared methods and yield the best results. As mentioned above, some effective tricks and combinations can significantly improve the accuracy.
Of course, a few tricks only slightly improve or even degrade the classification performance. As can be seen from Table 1, Mixup only slightly improves the classification accuracy, which may be because it generally requires more training epochs to achieve better performance. Moreover, MultiStepLr and XavierInit reduce the classification accuracy by about 1%. The XavierInit trick degrades the accuracy because it is used with activations that are not symmetric around 0 (e.g., sigmoid), which yields poor learning dynamics and initial saturation of the top hidden layer. The decrease caused by MultiStepLr may be due to an unreasonable step interval and too few training epochs.

VI. CONCLUSION
In this paper, a novel SDAResNet is proposed for scene classification of RSI. The proposed SDAResNet uses ResNet101 as the backbone and contains SA and CA, in which SA is integrated into the low-level features (the conv2_x layer of ResNet101) to emphasize saliency location information and suppress background information, while CA is embedded in the high-level features (the conv3_x, conv4_x and conv5_x layers of ResNet101) to extract saliency meaningful information. Overall, our proposed SDAResNet makes full use of the attention mechanism to represent the saliency features of remote sensing scenes, leading to substantial improvements in classification accuracy. To further improve scene classification performance, the effectiveness of several tricks is investigated through ablation studies. A comprehensive set of experiments on two challenging large-scale scene datasets (NWPU-RESISC45 and PatternNet) is conducted to demonstrate the effectiveness of our method; the results reveal that the combinations of the proposed SDAResNet and several effective tricks outperform the current state-of-the-art approaches. The final results show that not only can the attention mechanism improve the accuracy of scene classification, but the reasonable use of effective tricks can also greatly enhance performance. In future work, we plan to study the combination of the attention mechanism and transfer learning to improve the feature representation of remote sensing scenes and obtain higher classification performance. We also plan to summarize various image classification tricks and validate them for remote sensing scene classification.