An Attention-Enhanced End-to-End Discriminative Network With Multiscale Feature Learning for Remote Sensing Image Retrieval

The discriminative ability of image features plays a decisive role in content-based remote sensing image retrieval (CBRSIR). However, widely used convolutional neural networks cannot focus on the discriminative features of important scenes, resulting in unsatisfactory retrieval performance in complex contexts. In this article, an attention-enhanced end-to-end discriminative network with multiscale learning is proposed for CBRSIR to solve this issue. First, a multiscale dilated convolution module is embedded into some of ResNet50's residual blocks to increase the receptive field and capture the multiscale features of remote sensing image scenes. Then, a lightweight and efficient triplet attention module is added behind each residual block to capture the salient features of remote sensing images and establish interdimensional dependencies using residual transforms. In addition, end-to-end training is performed using an online label smoothing loss to reduce the intraclass variance of features and enhance interclass differentiability. Experimental results on four publicly available remote sensing image datasets show that our network achieves state-of-the-art or competitive performance, especially on the complex scene dataset UCMD, with an average retrieval precision improvement of 3.23% to 29.35% compared with other recent methods.


I. INTRODUCTION
WITH the development of earth observation technology, the number of remote sensing images (RSIs) has increased exponentially [1]. Therefore, rapidly and accurately searching for scenes of interest in large RSI archives has become one of the key challenges constraining the sharing and effective utilization of RSIs [2], [3].
At present, content-based remote sensing image retrieval (CBRSIR) can mine visually or semantically similar scenes from RSI archives by comparing the similarity of image features [4]. Its retrieval performance depends largely on the effectiveness and accuracy of feature extraction, which should fully characterize the RSI content [5]. Previous feature extraction methods mainly focused on handcrafted low- and mid-level RSI features (e.g., color and texture) [6], [7]. These methods are of low complexity and computationally simple, but they are highly subjective and leave a semantic gap with the real image content [8].
In recent years, convolutional neural networks (CNNs), designed to simulate human cognitive processes, have been used to extract high-level semantic information from RSIs and have achieved encouraging results in CBRSIR [9]. However, CNNs perform poorly in some complex scenes with varying scales [4]. This is mainly because CNNs focus on global features and have difficulty accurately capturing the features of important objects that occupy a relatively small portion of complex scenes [10], [11]. Therefore, attention mechanisms (e.g., DANet and residual attention) that focus on important objects are often used to alleviate this problem and have achieved better retrieval performance [11], [12], [13]. However, these attention modules introduce many extra parameters, which increase the computational complexity. Besides, current attention-based methods take little account of the scale factor, which is also important in complex scenes and affects retrieval accuracy [14].
To address the abovementioned issues, a new attention-enhanced end-to-end discriminative network with multiscale learning is proposed for CBRSIR in this article. We introduce a lightweight and efficient triplet attention module [15] to capture distinguishing features of important objects and use the residual transform to establish interdimensional dependencies. Before the attention module, a multiscale dilated convolution module is used to increase the receptive field and capture multiscale features. In addition, the network replaces the regular cross-entropy loss function with an online label smoothing loss function [16] to reduce intraclass differences and enhance interclass differentiability. In general, our network contains the following three important parts: triplet attention, dilated convolutions with multiple scales, and the online label smoothing loss function, so our network is simply called TDO-Net.
The main contribution and innovation of our work is to redesign the baseline ResNet50 network with three new modules, triplet attention, hybrid dilated convolutions, and the online label smoothing loss function, making the novel network better suited for image retrieval of complex scenes. The experimental results on four public datasets indicate that our TDO-Net achieves better retrieval performance in complex scenes with fewer parameters and floating-point operations (FLOPs) compared with other state-of-the-art networks.
The rest of this article is organized as follows. Section II reviews related work on low- and mid-level features for CBRSIR, CNN-based high-level features for CBRSIR, and attention mechanisms. Section III outlines our TDO-Net, including the framework of TDO-Net, the feature extraction module with multiscale dilated convolution and triplet attention, and the online label smoothing loss function. Experimental results are presented in Section IV. Finally, Section V concludes this article.

II. RELATED WORK
A. Low and Mid-Level Features for CBRSIR
Previous CBRSIR methods mainly relied on handcrafted global image features such as color, edge, and texture [17], [18], [19]. For example, Martins et al. [20] constructed a Gabor texture-based neural classifier for CBRSIR. Yao et al. [21] applied Gabor transformations with different scales and orientations for CBRSIR. Because a single low-level feature has limited ability to describe RSIs, some scholars have tried to combine multiple low-level features to enhance retrieval performance [22]. For example, Maheshwary and Srivastava [22] designed a prototype CBRSIR system based on color moments and the grayscale co-occurrence matrix. These basic low-level features facilitated the development of early RSI retrieval, but the retrieval accuracy for RSIs remained low because these features are easily affected by lighting conditions, occlusion, truncation, and other factors. Therefore, the focus of later research gradually shifted to mid-level features represented by local feature aggregation, such as bag of visual words (BoVW) and vector of locally aggregated descriptors (VLAD) [23], [24]. Compared with low-level features, retrieval performance based on mid-level features has been greatly improved. However, because these methods do not consider the spatial association of local features and still rely on manual feature selection, there is considerable room for improvement in their retrieval performance and applicability [25], [26].

B. CNN-Based High-Level Features for CBRSIR
As the most representative deep learning algorithm, the CNN has achieved excellent performance in the field of computer vision [27]. It uses a multilayer network structure combining convolutional layers and pooling layers to extract features step by step [28]. These extracted high-level semantic features have been widely applied to CBRSIR. For example, Zhou et al. [3] retrained mainstream CNN models on a remote sensing benchmark dataset and showed that their retrieval results were significantly better than those of traditional low/mid-level features. Since there are significant differences between remote sensing images and ordinary natural images in terms of camera angle and scale, some scholars have tried to improve basic CNN models by incorporating the characteristics of RSIs. For example, Fan et al. [29] proposed a new distribution consistency loss function for CBRSIR to solve the problem of nonuniformity in the distribution of RSI sample data. Liu et al. [30] proposed a slice-feature deep hashing algorithm for CBRSIR based on the small interclass distances of RSI features. The abovementioned methods have improved retrieval performance to some extent, but retrieval performance in complex scenes is still poor because complex scenes often contain a large amount of redundant and complex background information [31], while CNNs mainly focus on global features and ignore local features [10].

C. CBRSIR Based on Attention Mechanisms
The visual attention mechanism is a resource allocation mechanism inspired by human perception: it focuses on the important region of the global image and then devotes more attention resources to that region [32]. In other words, the attention mechanism can suppress useless background information and improve visual information processing [33]. Therefore, attention mechanisms are often used to improve the retrieval performance of CNNs in complex scenes. For example, Imbriaco et al. [34] explored different attention mechanisms to select the most relevant features for CBRSIR. Xiong et al. [13] used different attention modules to focus on important features in the spatial and channel dimensions, respectively, for CBRSIR. Wang et al. [35] designed an integrated channel and spatial attention mechanism to extract globally consistent features of foreground objects for CBRSIR. Although these methods bring performance gains in complex scenes, they also introduce many extra parameters that increase the complexity of the network. Besides, these methods combine spatial attention and channel attention in parallel or in series, whereas the two types of attention in the human brain tend to work in tandem [36]. In addition, embedding multiple attention modules tends to reduce the stability of the model. Therefore, we choose triplet attention to focus on important regions and encode both channel and spatial information with negligible computational overhead.

III. PROPOSED METHOD
A. TDO-Net Framework
To solve the abovementioned problems and further improve the retrieval accuracy of complex scenes, we design an attention-enhanced end-to-end discriminative network, TDO-Net. In our TDO-Net, ResNet50 [37], based on a stack of residual structures with powerful feature learning capability and easy optimization, is chosen as the backbone network. The ResNet50 network structure consists of a single convolutional layer, four residual structures, an average pooling layer, and a fully connected layer. In the baseline ResNet50, the fully connected layer tends to produce global features of the image, while the convolutional layers describe local features [8]. Therefore, in our TDO-Net, the third and fourth residual structures are designed as multiscale feature extraction modules, and a triplet attention module is embedded after the last convolutional layer of each residual block. Finally, end-to-end training is performed using the online label smoothing loss. In the testing phase, the output of the last average pooling layer of the feature extraction network is selected as the image feature for CBRSIR. The framework of our TDO-Net is shown in Fig. 1.
As can be seen from Fig. 1, unlike previous networks, our TDO-Net not only focuses on feature extraction of important regions but also considers the scale factors affecting the feature representation. The multiscale feature learning module is composed of hybrid dilated convolutions; that is, dilated convolutions with different dilation rates are assigned within the consecutive residual structures so that the network adaptively captures scene information at different scales. Besides, the online label smoothing loss is used to replace the regular cross-entropy loss function. This is because the online label smoothing loss can constrain the boundaries of different classes, enabling the network to discriminate visually similar but semantically irrelevant images more effectively.

B. Feature Extraction Network

1) Multiscale Feature Learning Module:
In a convolutional operation, the receptive field represents the range of the original image perceived by different neurons inside the CNN. The receptive field of a standard convolution is consistent with its kernel size, while dilated convolution injects "zeros" into the standard convolution kernel, making it possible to increase the receptive field of the convolution with the same parameters and computational effort [38]. Therefore, the output of a dilated convolution contains a larger range of feature information than the original convolution, which is beneficial for obtaining global information of RSIs [39], [40].
The dilation rate of a convolution defines the spacing of the convolution kernel sampling, which determines the size of the receptive field. Fig. 2 shows the receptive field at different dilation rates; the receptive field increases as the dilation rate increases. However, while dilated convolution enlarges the receptive field, the sampling becomes spatially discontinuous, which can cause the loss of local information and the lack of dependency among neighboring pixels, resulting in inconsistent local information. Therefore, for RSIs with diverse scene scales, a network using dilated convolutions with only a fixed dilation rate is highly prone to losing information about small objects.
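The growth of the receptive field with the dilation rate can be checked with a one-line formula: a k × k kernel with dilation rate r covers an effective window of k + (k − 1)(r − 1) pixels. A minimal sketch (the function name is ours):

```python
def effective_kernel(k: int, r: int) -> int:
    """Effective kernel size of a k x k convolution with dilation rate r."""
    return k + (k - 1) * (r - 1)

# A 3x3 kernel with dilation rates 1, 2, and 4 covers 3x3, 5x5, and 9x9 windows.
sizes = [effective_kernel(3, r) for r in (1, 2, 4)]
```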
To address the shortcomings of dilated convolution, a multiscale feature learning module based on hybrid dilated convolution is designed in our TDO-Net, which aims to enable the network to flexibly capture multiscale contextual information of RSIs. Specifically, to ensure the continuity of image information, the dilation rates of the stacked convolution layers are required to share no common divisor greater than 1. After the pooling operation, the dilation rates follow a sawtooth-like rising structure. As shown in Fig. 3, the dilation rates are designed as the rising group (1, 2, 5, 9) within a residual structure of our TDO-Net, which can adaptively extract feature information of different sizes. In this hybrid dilated convolution structure, the convolutions with smaller dilation rates capture nearby feature information, the convolutions with larger dilation rates capture long-distance information, and the topmost convolution layer acquires information from a larger spatial range without causing sampling discontinuity while keeping the receptive field consistent.
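The hybrid dilated convolution stack described above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's exact module: the class name, channel counts, and the choice to preserve spatial size via `padding = dilation` are our assumptions; only the rising dilation rates (1, 2, 5, 9) come from the text.

```python
import torch
import torch.nn as nn

class HybridDilatedConv(nn.Module):
    """Stack of 3x3 convolutions with rising dilation rates (1, 2, 5, 9).

    Padding equals the dilation rate, so spatial size is preserved while the
    receptive field grows without sampling gaps (a sketch of the paper's
    multiscale feature learning module; its exact placement inside the
    residual blocks is our assumption).
    """
    def __init__(self, channels: int, rates=(1, 2, 5, 9)):
        super().__init__()
        self.layers = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])

    def forward(self, x):
        return self.layers(x)

x = torch.randn(1, 64, 56, 56)
y = HybridDilatedConv(64)(x)   # spatial size is unchanged: (1, 64, 56, 56)
```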
2) Attention-Enhanced Module: RSIs contain a large amount of background information, which weakens the high-level feature discriminative ability of CNNs, while attention mechanisms that focus on important regions of images are often used to compensate for this deficiency. However, the attention modules in wide use obtain better performance at the cost of higher model complexity. While these attention-based networks bring performance gains, they introduce many extra parameters that increase the complexity of the network. In addition, the stacked use of multiple attention modules reduces the stability of the network. As an efficient attention module with almost no parameters, triplet attention uses residual transformations to establish interdimensional dependencies, encodes interchannel and spatial information with negligible computational overhead, and models both channel and spatial attention weights. Therefore, triplet attention is chosen as the basic component of the attention-enhanced module in our TDO-Net to highlight features of important regions and suppress cluttered background. Specifically, triplet attention is embedded into each residual block of ResNet50 in our TDO-Net.
The structure of triplet attention is shown in Fig. 4. Given an input feature map X ∈ R^(C×H×W) (H: height, W: width, C: number of channels), the first branch is the spatial attention branch, where the input features are first subjected to channel pooling, i.e., Z-Pool, which reduces the zeroth (channel) dimension of X to two. The pooling process can be expressed as

Z-Pool(X) = [MaxPool_0d(X), AvgPool_0d(X)]

where 0d denotes the zeroth dimension, along which the maximum pooling and average pooling operations are performed, and [·, ·] denotes concatenation. In this way, the input feature map X changes its shape to 2 × H × W after Z-Pool, and the feature depth and computational effort are effectively reduced. After that, the pooled features are passed through convolution and batch normalization layers in turn, and the spatial attention weights are then generated by a sigmoid activation function. The second branch captures the interaction between the channel dimension C and the spatial dimension W: the input features X are first transposed into H × C × W features, followed by pooling along the H dimension, a 7 × 7 convolution, and a sigmoid activation function, and the result is finally transposed back into a C × H × W feature. The third branch captures the interaction between the channel dimension C and the spatial dimension H, and its operation is basically the same as that of the second branch. Finally, the attention-based features are obtained by averaging the features extracted from the three branches.
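The three branches described above can be sketched in PyTorch roughly as follows. This is a simplified reimplementation of triplet attention [15]: the 7 × 7 kernel, batch normalization, and branch averaging follow the description in the text, while the class names and other details are our assumptions.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and average-pooling along the channel dimension,
    reducing C channels to 2 (the Z-Pool operation)."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-Pool -> 7x7 conv -> BN -> sigmoid, producing a one-channel weight map."""
    def __init__(self):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3, bias=False),
                                  nn.BatchNorm2d(1))

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Spatial branch plus two rotated branches capturing C-W and C-H
    interactions; the three outputs are averaged."""
    def __init__(self):
        super().__init__()
        self.hw, self.cw, self.ch = AttentionGate(), AttentionGate(), AttentionGate()

    def forward(self, x):                       # x: (N, C, H, W)
        y1 = self.hw(x)                         # plain spatial attention
        # rotate so W interacts with C, attend, rotate back
        y2 = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # rotate so H interacts with C, attend, rotate back
        y3 = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        return (y1 + y2 + y3) / 3.0

x = torch.randn(2, 64, 32, 32)
out = TripletAttention()(x)    # same shape as the input
```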

C. Online Label Smoothing Loss
Compared with natural images, the complex and diverse backgrounds of RSIs tend to cause greater intraclass variation and higher interclass similarity. This leads to large intraclass gaps and unclear interclass boundaries among the extracted high-level features. To solve these problems, it is necessary to increase interclass separability and intraclass tightness during training, so that similar images are grouped into more compact clusters. However, the widely used cross-entropy loss yields tiny interclass distances and is susceptible to noisy labels [41]. To address this problem, Zhang et al. [16] proposed an online label smoothing loss, which has been shown to be effective in the classification of natural images. Therefore, we introduce it into our CBRSIR network TDO-Net to further enlarge interclass differences and narrow intraclass differences in RSIs.
In the online label smoothing loss, different smoothing weights are maintained for different categories, and the smoothing weight matrix is dynamically updated during training to constrain the feature distances of images from different categories. The specific formulas are as follows [16]:

L_hard = -(1/N) Σ_{i=1}^{N} log p(y_i | x_i)        (1)

L_soft = -(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} S^{t-1}_{y_i,k} log p(k | x_i)        (2)

p(k | x_i) = exp(z_k) / Σ_{j=1}^{K} exp(z_j)        (3)

where L_hard is the cross-entropy loss, x_i denotes the input image, y_i denotes the true category of the input image, k is the predicted category, K is the total number of image categories, p(k | x_i) denotes the probability that the input image x_i is predicted to be category k (computed from the logits z by the softmax in (3)), L_soft is the online label smoothing loss, t is the index of the training iteration, and S^{t-1}_{y_i,k} is the label smoothing weight. At the end of one iteration, the weights are updated according to (4) and (5) to obtain the smoothing threshold S^t_{y_i,k} for the next iteration:

S^t_{y_i,k} ← S^t_{y_i,k} + p(k | x_i), for each correctly classified sample x_i        (4)

S^t_{c,k} ← S^t_{c,k} / n_c        (5)

where n_c is the number of correctly classified samples of category c.
Initially, the parameters of the online label smoothing loss are initialized according to the following equation:

S^0_{c,k} = μ if k = c, and S^0_{c,k} = (1 - μ)/(K - 1) otherwise        (6)

where c is the correct category of the input image and μ is the initial smoothing parameter, generally set between 0.9 and 1.0. Because soft labels alone lack the hard supervision signal, the model is difficult to converge. Therefore, the cross-entropy loss and the online label smoothing loss are used to jointly constrain the training of the model. The total training loss can be expressed as

L = α L_hard + (1 - α) L_soft        (7)

where α is used to balance the cross-entropy loss and the online label smoothing loss.
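Putting the hard loss, soft loss, update rule, and total loss (7) together, a hedged sketch of the online label smoothing loss might look like this. The class and method names are ours, and the per-epoch update schedule follows our reading of [16]; this is an illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class OnlineLabelSmoothing:
    """Sketch of online label smoothing for K classes.

    S holds one soft-label distribution per class; it is initialised like
    ordinary label smoothing and re-estimated every epoch from the model's
    predictions on correctly classified samples.
    """
    def __init__(self, num_classes: int, mu: float = 0.9, alpha: float = 0.4):
        self.K, self.alpha = num_classes, alpha
        # S^0: mu on the correct class, (1 - mu) / (K - 1) elsewhere
        self.S = torch.full((num_classes, num_classes), (1 - mu) / (num_classes - 1))
        self.S.fill_diagonal_(mu)
        self._accum = torch.zeros_like(self.S)
        self._count = torch.zeros(num_classes)

    def __call__(self, logits, targets):
        log_p = F.log_softmax(logits, dim=1)
        hard = F.nll_loss(log_p, targets)                    # cross-entropy term
        soft = -(self.S[targets] * log_p).sum(dim=1).mean()  # soft-label term
        self._accumulate(log_p.exp().detach(), targets)
        return self.alpha * hard + (1 - self.alpha) * soft   # total loss (7)

    def _accumulate(self, probs, targets):
        # accumulate predicted distributions of correctly classified samples
        correct = probs.argmax(dim=1) == targets
        for p, y in zip(probs[correct], targets[correct]):
            self._accum[y] += p
            self._count[y] += 1

    def next_epoch(self):
        # normalise the accumulated predictions into S^t, then reset
        mask = self._count > 0
        self.S[mask] = self._accum[mask] / self._count[mask, None]
        self._accum.zero_()
        self._count.zero_()

torch.manual_seed(0)
ols = OnlineLabelSmoothing(num_classes=5)
loss = ols(torch.randn(8, 5), torch.randint(0, 5, (8,)))
```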

IV. EXPERIMENTS AND RESULTS
To verify the effectiveness of our TDO-Net, three types of experiments are carried out on four RSI benchmark datasets. All experiments are conducted on Ubuntu 20 with an Intel 3.7 GHz i9-10900K processor and an NVIDIA GeForce RTX 3090 graphics card. Our TDO-Net and the other comparative networks are implemented with the PyTorch library in Python. To ensure the fairness of the performance evaluation, all experiments are conducted in the same environment.
Among these four datasets, the UCMD and NWPU datasets were originally designed for RSI scene classification, and some of their images contain a large amount of background information [45]. This redundant background information poses a challenge for accurate and reliable similar image search [3]. The remaining datasets, with different sources but the same classification systems, were designed for CBRSIR and contain less background information. That is, the abovementioned four datasets can be divided into complex scene datasets (UCMD and NWPU) and simple scene datasets (PatternNet and VArcGIS).

2) Implementation Details: In the experiments, the ResNet50 network without pretrained weights was chosen as the baseline model. In the training phase, the number of epochs is 40, the batch size is 32, the optimizer is Adam, the initial learning rate is 3e−4, and the weight decay is 3e−4. In all experiments, the input images are resized to 224 × 224 pixels. To make the model converge more efficiently, we set the weight balancing the cross-entropy loss and the online label smoothing loss to 0.4.
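The training hyperparameters listed above can be wired up as follows. This is a sketch only: the model here is a stand-in linear head (21 is the number of UCMD classes), not the full TDO-Net.

```python
import torch
from torch import nn, optim

# Stand-in classifier head; in the paper this would be the full TDO-Net.
model = nn.Linear(2048, 21)

# Hyperparameters from the implementation details above.
optimizer = optim.Adam(model.parameters(), lr=3e-4, weight_decay=3e-4)
epochs, batch_size, alpha = 40, 32, 0.4   # alpha balances the two losses
```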
For the benchmark datasets, we randomly divide the images of each category into training set and test set in the ratio of 8:2. Besides, the training set is redivided into two parts, with 80% of the images used for training and the remaining 20% for validation.
In the testing phase, the Euclidean distance is used to measure the similarity of features: the closer the distance between the visual features of the query image and another image, the more similar the two images are, and vice versa. Three standard metrics, mean average precision (mAP), average normalized modified retrieval rank (ANMRR), and precision at k (Pk, where k is the number of retrieved images), are used to evaluate performance [46]. In addition, class-level mAP is also used as another evaluation criterion, which can reflect the differences between classes. It is worth noting that a lower ANMRR value indicates better retrieval performance, whereas higher mAP and Pk values do [47].
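The Euclidean-distance ranking and the mAP and Pk metrics can be sketched as follows (a toy illustration with our own helper names; ANMRR is omitted for brevity):

```python
import numpy as np

def retrieve(features, query_idx):
    """Rank all other images by Euclidean distance to the query feature."""
    d = np.linalg.norm(features - features[query_idx], axis=1)
    order = np.argsort(d)
    return order[order != query_idx]          # exclude the query itself

def average_precision(ranked_labels, query_label):
    """AP of one query: mean of the precisions at each relevant rank."""
    rel = (ranked_labels == query_label).astype(float)
    if rel.sum() == 0:
        return 0.0
    precision_at_hits = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_hits * rel).sum() / rel.sum())

def precision_at_k(ranked_labels, query_label, k):
    """Fraction of the top-k retrieved images that match the query label."""
    return float((ranked_labels[:k] == query_label).mean())

# Toy example: 2-D features for two well-separated classes.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 0, 1, 1])
order = retrieve(feats, 0)
ap = average_precision(labels[order], labels[0])
p2 = precision_at_k(labels[order], labels[0], 2)
```

mAP is then the mean of the per-query AP values over all queries (or over the queries of one class, for class-level mAP).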

1) Ablation Experiments:
To verify the effectiveness of our proposed method, we conducted ablation experiments by combining the baseline ResNet50 network with different modules. Note that the abbreviation MFE stands for the multiscale feature extraction module, Triplet for the attention-enhanced module, and OLS for the online label smoothing loss. These ablation experiments are conducted on one complex scene dataset, UCMD, and one simple scene dataset, VArcGIS. Table II shows the retrieval performances of these methods on the two datasets.
It can be seen that the networks combining ResNet50 with one of the MFE, Triplet, or OLS modules outperform the baseline ResNet50 on both datasets. However, on the complex scene dataset UCMD, the mAP of the networks with a single module improves by only 3.16% to 13.13% over the baseline, while our TDO-Net improves the mAP by 24.46%. That is, our TDO-Net still improves the mAP by 11.33% to 21.3% compared with these single-module networks. On the simple scene dataset VArcGIS, our TDO-Net still achieves an improvement of 2.07% to 9.67% over the abovementioned networks. These experimental results show that our TDO-Net effectively combines the three modules of triplet attention, hybrid dilated convolutions, and the online label smoothing loss function to significantly improve image retrieval performance under scene complexity.
2) Comparisons With the Baseline ResNet50: To further validate the effectiveness of our TDO-Net, we comprehensively evaluated the performance difference between the TDO-Net and the baseline ResNet50 on four benchmark datasets. Table III shows their retrieval performances on the four datasets.
It can be seen from Table III that our TDO-Net improves the mAP by 24.46% and 33.84% on the complex scene datasets UCMD and NWPU, and by 6.57% and 9.67% on the simple scene datasets PatternNet and VArcGIS. The trends of other evaluation metrics are consistent with the abovementioned changes. The experimental results show that our TDO-Net can improve the performance of CBRSIR, especially for complex scenes.
In addition to these quantitative comparisons, we also perform some visual comparisons. Fig. 9 shows the top five images of the retrieval results on the NWPU dataset. Incorrect and correct results are tagged in red and green, respectively. It can be seen that our TDO-Net returns more correct results. For example, for the retrieval of stadium and sparse building images, the baseline method incorrectly retrieves semantically irrelevant images, such as golf courses and dense houses, while all images retrieved by our network are correct. This is mainly because our network has stronger feature discrimination ability.

To clearly demonstrate the impact on different categories, the per-category retrieval performances on the four datasets are shown in Figs. 10-13. It can be seen that the performance improvement across categories is irregular, but there is an overall trend of improvement. The retrieval improvement of our TDO-Net is significant for categories with complex or redundant backgrounds (e.g., airplane, intersection, cemetery, sparse residential, storage tank, etc.).

3) Comparisons With Other Methods:
To further validate the performance of our TDO-Net, six newly proposed methods, discriminative feature learning (DFL) [13], attention boosted bilinear pooling (ABP) [35], the tree-triplet-classification network (T-T-C) [48], the context attended graph convolutional network (CA-GCN) and context attended Siamese graph convolutional network (CA-SGCN) [10], and deep feature learning with latent relationship embedding [49], are selected as comparison methods. All the abovementioned methods use a pre-trained network (PT_Net) as the backbone, so we also use PT_ResNet50 to train our TDO-Net on the UCMD dataset (PT_TDO-Net). In addition, seven low/mid-level feature-based methods, color moment, color histogram, Gabor texture, GLCM texture, GIST, BoVW, and VLAD, are also selected for comparison. The retrieval performances of the 15 different methods are shown in Table IV.
It can be seen from Table IV that the high-level feature-based methods significantly outperform the low/mid-level feature-based methods in CBRSIR. It can also be seen that our TDO-Net without a pre-trained ResNet50 achieves better retrieval performance than the four pre-trained networks DFL, T-T-C, CA-GCN, and CA-SGCN, with mAP improvements of 1.37%, 13.2%, 18.6%, and 12.15%, respectively. After adding the pretrained ResNet50, our network outperforms all other pre-trained methods, with an mAP improvement of 3.23% to 29.35%. These experimental results demonstrate that our TDO-Net achieves state-of-the-art or competitive results.
It is apparent from this table that, compared with the baseline network PT_ResNet50, our PT_TDO-Net increases the mAP by 17.6% with only a 0.02% increase in parameters and a 1.12% increase in FLOPs. Besides, compared with the multiple-attention-fusion-based methods PT_W-CAN and PT_CEL, our network improves retrieval performance with 78.3% and 58.6% reductions in the number of parameters and 51.9% and 43.3% decreases in FLOPs, respectively. Although the FLOPs of the PT_ABP method are slightly lower than those of our model, our mAP is 5.56% higher. This is because hybrid dilated convolution does not change the number of parameters, while the rotation operation and residual transformation of triplet attention can eliminate part of the convolution computation. The abovementioned experimental results show that our method improves retrieval performance without sacrificing efficiency.

4) Visual Interpretation of the Network:
To better understand the superiority of our TDO-Net, the output feature heatmaps of our TDO-Net are visualized. Fig. 14 shows the feature heatmaps of some example images in the UCMD dataset generated with the Grad-CAM++ [52] tool. It is worth noting that the redder the color, the more sensitive the model is to the pixel values at that location, i.e., the higher the level of attention.
As shown in Fig. 14, the focus of the heatmaps generated by ResNet50 is generally inaccurate or even mislocated. For example, on the image in the fourth column, the heatmap incorrectly covers the background area, while the heatmap of our TDO-Net completely covers the building. In the second row and second column, the heatmap focus of ResNet50 is overly broad, while the heatmap of our TDO-Net better covers the important objects with a higher level of detail. These results indicate that our TDO-Net can better focus on important scenes with higher discriminative power, which is especially important for improving image retrieval performance in complex scenes.

5) Parameter Sensitivity Analysis:
The setting of hyperparameters has a significant impact on retrieval performance. The loss function weight in (7) and the dilation rates of the hybrid dilated convolution are the important hyperparameters of our proposed TDO-Net. Therefore, to verify the effect of these hyperparameters, we conduct additional experiments on the UCMD dataset. In these experiments, to minimize the randomness of the trained model, we repeat each set of experiments 10 times and randomly redivide the training and validation sets for each run.
Table VI presents the experimental results on the sensitivity of the loss function weight. It can be seen from the table that retrieval performance first increases and then decreases as the weight increases, and the best performance is achieved at a weight of 0.4. The lowest retrieval performance occurs when only the cross-entropy loss function is used (α = 1), due to the small distances between feature classes. Therefore, we choose 0.4 as the loss fusion weight in the abovementioned experiments.
To demonstrate the power of hybrid dilated convolution in TDO-Net, we also experimentally analyze the effect of convolution kernels with different dilation rates. Specifically, we compare the dilated convolution setups studied by CEL [4], HDCFE-Net [53], and LDFN [54]. The experimental results for different dilation rates are shown in Table VII. It can be found that dilated convolution outperforms standard convolution. Furthermore, the hybrid dilated convolution with rates (1, 2, 5, 9) used in our TDO-Net is superior to the others.

V. CONCLUSION
In this article, a novel attention-enhanced end-to-end discriminative network with multiscale learning is proposed for CBRSIR. In this network, hybrid dilated convolutions with different dilation rates are employed to replace the regular convolutions in the last two residual blocks to obtain multiscale features of RSI scenes. Besides, a triplet attention module is embedded into each residual block to adaptively learn more discriminative spatial and channel information through cross-dimensional interaction between the spatial and channel dimensions. Finally, the online label smoothing loss is used for end-to-end training to reduce the intraclass variance and enhance the interclass differentiability of features.
Extensive experiments are conducted on four benchmark datasets to evaluate the effectiveness of our network. The experimental results indicate that our network achieves the best retrieval performance, especially in complex scenes, compared to the baseline ResNet50 and other recently proposed networks.
Our future work will concentrate on the effect of noisy samples on attention mechanisms and on multisource RSI retrieval.