Dual-Attention-Driven Multiscale Fusion Object Searching Network for Remote Sensing Imagery

Object search is a challenging yet important task. Many efforts have been made to address this issue, and great progress has been achieved for natural images, yet searching for all objects of specified types in remote sensing images has barely been studied. In this article, we are interested in searching for objects in remote sensing images. Compared with person search in natural scenes, this task is challenging in two respects: first, a remote sensing image usually contains a large number of objects, which makes it difficult to characterize the object features; second, the objects in remote sensing images are dense, which easily yields erroneous localization. To address these issues, we propose a new end-to-end deep learning framework for object search in remote sensing images. First, we propose a multiscale feature aggregation module, which strengthens the representation of low-level features by fusing multilayer features. The fused features with richer details significantly improve the accuracy of object search. Second, we propose a dual-attention object enhancement module to enhance features from the channel and spatial dimensions. The enhanced features significantly improve the localization accuracy for dense objects. Finally, we built two challenging datasets based on remote sensing images, which contain complex changes in space and time. The experiments and comparisons demonstrate the state-of-the-art performance of our method on these challenging datasets.


I. INTRODUCTION
CURRENTLY, searching for designated objects in natural images has become increasingly popular in computer vision [1], [2], [3]. For example, searching for a specific person through video surveillance is a very important computer vision application. There are similar needs in the remote sensing field. Searching for designated objects in remote sensing images taken at different times and locations is of great significance to national defense security and land resource management. However, effective methods for remote sensing object searching are still lacking.
In recent years, many person searching methods have been proposed [1], [3]. For example, Yan et al. [4] proposed a person searching network with contextual information to effectively improve the robustness of search results. Xiao et al. [5] proposed an individual aggregation network and accurately localized persons by learning to minimize variations in internal features. To share the research results of reidentification with person searching, Liu et al. [6] transferred advanced person reidentification knowledge to the person searching model through a teacher-guided disentangling network, which significantly improved the person search performance. Most of the above methods improve the search performance by using a shallow backbone or by preserving shallow information.
As shown in Fig. 1, person searching and remote sensing object searching are similar [7], [8]: both aim to search for a given object in a large number of images. However, there are also some differences between person search in natural images and object search in remote sensing images. Specifically, the remote sensing object searching task often requires locating all objects of the specified types in a gallery containing many similar objects, rather than searching for a single specific object. In addition, the objects in a remote sensing scene are numerous and dense. Person searching methods are therefore prone to localization errors and detection confusion in remote sensing scenes.
In summary, the following problems remain in remote sensing object searching.
1) Images in person search datasets contain only a few persons, whereas remote sensing images contain many objects. Therefore, the existing person searching methods cannot characterize the significant features of objects well, which tends to produce unsatisfactory search performance.
2) Unlike person search in natural images, object search in remote sensing images is challenging because the distance between objects is very small. In this situation, it is difficult to locate all the dense objects by using person search methods, mainly because these methods fail to emphasize the informative features of dense objects.
3) Datasets for object search are scarce in the remote sensing field. Although there are many datasets for object detection or segmentation in remote sensing images, such as DOTA [9], DIOR [10], and SZTAKI-INRIA [11], these datasets cannot be directly applied to the object search problem, since most of the objects in these datasets are not of the same type.
To address these problems, we design a new remote sensing object searching network and propose two modules for it. The contributions of this study can be summarized as follows.
1) A dual-attention-driven multiscale fusion object searching network is proposed for remote sensing images. To the best of our knowledge, this is the first study of the object searching task in the remote sensing community.
2) The proposed searching network comprises two modules: a) the multiscale feature aggregation (MSFA) module enhances the detail representation of features by fusing multilayer features, which improves the performance of object search in remote sensing images; and b) the dual-attention object enhancement (DAOE) module focuses on essential salient objects and suppresses unnecessary ones, which contributes to a more accurate and precise landmark search.
3) To verify the effectiveness of the proposed method, we built two object searching datasets, which help to promote the development of object searching methods in the remote sensing field.
The rest of this article is organized as follows. In Section II, we briefly review related work about several relevant topics. In Section III, we present the dual-attention-driven multiscale fusion object searching network and discuss its architecture and loss function. In Section IV, we present extensive experimental results. Finally, Section V concludes this article.

II. RELATED WORK
The proposed object searching method organically integrates detection and reidentification. Thus, we briefly introduce object detection and reidentification.

A. Object Detection
Most detection algorithms can be divided into two-stage and single-stage methods according to the strategy used to generate proposals [10]. A two-stage detection algorithm first generates a series of candidate boxes as samples and then classifies the samples through a convolutional neural network. Common algorithms include R-CNN [12], Fast R-CNN [13], and Faster R-CNN [14]. The second category is represented by the YOLO series [15], [16], [17], [18] and SSD [19], which transform the localization of the object box into a regression problem. Because candidate boxes are not needed, these methods have a marked advantage in inference time during testing. However, these object detectors are designed for images of natural scenes, whereas the orientation of objects in remote sensing images is typically highly uncertain and object scales vary widely. As a result, these methods do not adapt well to remote sensing images.
For remote sensing image object detection, many scholars have proposed solutions to address these problems [20], [21], [22], [23]. The problem of uncertain object direction can be solved by rotated-box schemes [24], [25], [26]. Cheng et al. [27] proposed a new optimization function by introducing rotation-invariance regularization and Fisher discriminant regularization to CNN features to solve the problem of low detection accuracy caused by object rotation in remote sensing images. Zhou et al. [28] proposed an encoder-encoder structure, where the rotation-sensitive feature maps are used for regression and the rotation-invariant feature maps are used for classification. Chen et al. [29] proposed a new pixel-IoU loss to effectively improve the detection performance. Xie et al. [30] proposed a remote sensing object detector whose accuracy is comparable to that of a two-stage detector and whose speed is comparable to that of a one-stage detector. These rotated object detection methods, however, address only the problem of object rotation, particularly improving the detection accuracy in dense scenes; they do not address the negative impact of common factors such as illumination and weather in remote sensing images. In contrast, the feature pyramid networks (FPNs) proposed by Lin et al. [31] provide a good solution to the problems of scale variation and environmental impact. In some studies, more complex pyramid structures are constructed to integrate multiscale feature layer information [32], [33], [34]. Yang et al. [35] proposed a sampling fusion network that fuses multilayer features with effective anchor sampling, which effectively improves detection accuracy. However, because the number of anchor points is doubled, the efficiency of the model is low. Liu et al. [36] improved the feature representation ability of the backbone by adaptively combining multiscale features, which effectively reduces the interference of the background with the object, but this method has little effect on small objects.
Thus, most of these methods require a deep network structure to extract high-level semantic information, leading to a lack of the low-level information needed for reidentification. The searching task, however, requires both at once: high-level semantic information for detection and shallow detail information for reidentification, two demands that pull in opposite directions and must be reconciled.

B. Reidentification
In recent years, due to the wide application of reidentification tasks in video surveillance and object tracking, many scholars have investigated reidentification in detail [37], [38], [39]. However, their research objects are more focused on pedestrians in natural image scenes. For example, by considering the spatial dependence in both interimages and intraimages, Si et al. [40] constructed a new spatially driven network that achieves good performance on multiple classic key indicators. Huang et al. [41] proposed a new full-scaled deep discriminant learning model, which considered the three concepts of depth, width, and cardinality concurrently. Under the condition of obtaining considerable accuracy, the structural complexity of the model and the difficulty of training were reduced. However, that study lacks background interference, leading to a large gap compared with the real scene, which limits the application scope. Therefore, to explore the real-world applications of pedestrian reidentification, researchers proposed a person searching task that aims to simultaneously locate and identify a person from the raw image [42], [43], [44]. For example, Han et al. [45] designed a trident network by dividing the person searching task into three parts: detection, reidentification, and part classification. Concurrently, the reidentification and part classification network weighted the gradient of backpropagation based on the quality of person detection. However, the network structure of this method is complex, which reduces computational speed. Li and Miao [46] account for the fact that the detection and reidentification in person searching is a gradual process through two subnetworks for sequential processing, and the contextual information is used to enhance reidentification. Although this method improves the searching speed, it fails to unify detection and identification tasks, and the two-step structure is still too complex.
Existing searching methods often perform one-to-one localization and reidentification of pedestrians, where only a single object with a given ID appears in the image to be searched. However, objects with the same ID often appear repeatedly in remote sensing scenes, and remote sensing images also pose new challenges, such as scale changes and weather effects.

A. Method Overview
In this section, the proposed object searching method is introduced in detail. As shown in Fig. 2, the proposed object searching model consists of two components: an MSFA module and a DAOE module. Specifically, we use the MSFA module to strengthen the feature representations by fusing more low-level features, and the extracted features are then fed into the reidentification task and the detection task. In addition, the DAOE module is used to select task-related features and capture more spatial details, which contributes to a more accurate and precise landmark search. The following subsections elaborate on the details.

B. MSFA Module
The FPN structure is widely used to extract multiscale features from images because it can fuse feature maps that carry strong high-level semantic information with feature maps that carry weak semantic information but rich spatial information. Although the FPN structure can fuse different levels of features, this simple merger is suboptimal because low-level and high-level information conflict in object searching. Therefore, inspired by the AFA feature aggregation module [1], we seek a more suitable feature extraction module for remote sensing images.
Thus, we use the MSFA module, as shown in Fig. 2. The primary idea of this module is to fuse high-level semantic information into low-level features, thus obtaining low-level features that suit both reidentification and detection. Specifically, we use the $\{S_2, S_3, S_4\}$ feature maps from the Res-50 backbone, and MSFA outputs $\{C_2\}$. We use only $\{C_2\}$ for reidentification and detection, instead of using the features of each layer as in the original FPN. Although this design affects detection performance, it unifies the reidentification and detection tasks. We will show in Section IV that the proposed method achieves a good tradeoff between the reidentification and detection subtasks.
Because remote sensing scenes cover a broad imaging range, they are likely to contain a large number of dense objects. The reidentification subtask requires more detailed information to identify the objects. In the proposed method, we designed an MSFA module to improve the feature representation ability. First, we use a 3 × 3 deformable Conv to extract features. Its primary function is to reduce the channels of the feature maps and adaptively adjust the receptive field, so that the obtained features pay more attention to the object itself, thereby reducing background interference. Second, a concatenation operation is used to fuse the top-down feature maps, which is an important step in connecting high-level semantic information with low-level semantic information. Finally, we use a 3 × 3 Conv to fuse the concatenated feature maps, thus generating feature maps that contain more detailed information for the reidentification and detection tasks. With these three steps, we obtain fused features that pay more attention to object details.
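The three steps above (channel reduction, top-down concatenation, and fusion) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the deformable 3 × 3 Conv and the final 3 × 3 Conv are both replaced by plain per-pixel linear projections with random weights, nearest-neighbour upsampling is assumed for aligning the maps, and the names `reduce_channels`, `upsample_to`, and `msfa_sketch` are hypothetical.

```python
import numpy as np

def reduce_channels(x, w):
    """Per-pixel linear projection: a stand-in (assumption) for the conv layers.
    x has shape (C_in, H, W); w has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def upsample_to(x, size):
    """Nearest-neighbour upsampling of a (C, h, w) map to (C, H, W)."""
    h, w = x.shape[1:]
    H, W = size
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return x[:, rows][:, :, cols]

def msfa_sketch(s2, s3, s4, out_channels=8, seed=0):
    """Fuse three backbone maps into one C2-like map at the resolution of s2."""
    rng = np.random.default_rng(seed)
    target = s2.shape[1:]
    reduced = []
    for s in (s2, s3, s4):
        # Step 1: reduce channels (random weights, for illustration only).
        w = rng.standard_normal((out_channels, s.shape[0])) * 0.1
        reduced.append(upsample_to(reduce_channels(s, w), target))
    # Step 2: concatenate the aligned maps along the channel axis.
    fused = np.concatenate(reduced, axis=0)
    # Step 3: fuse the concatenated maps into a single output map.
    w_fuse = rng.standard_normal((out_channels, fused.shape[0])) * 0.1
    return reduce_channels(fused, w_fuse)

c2 = msfa_sketch(np.ones((4, 8, 8)), np.ones((8, 4, 4)), np.ones((16, 2, 2)))
print(c2.shape)
```

The point of the sketch is the data flow: only one fused map at the finest resolution is produced, and that single map is what both subtasks would consume.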

C. DAOE Module
After processing by the MSFA module, we obtain a multiscale feature map of the input remote sensing scene, which is used for both the reidentification and detection subtasks. In the object searching task, accurate detection results will markedly improve the searching speed and accuracy. However, the uniqueness of the searching task leads to fewer feature layers for the detection subtask, which affects its accuracy. Therefore, to obtain accurate search results in complex remotely sensed images, we propose new strategies to enhance the accuracy of object feature representation, thus obtaining more precise landmarks. As shown by Yang et al. [47], extracting rich global context information from multiscale maps is conducive to improving the ability to distinguish different elements in the scene. Therefore, we use the DAOE module to optimize feature representations from both the channel and spatial perspectives. As shown in Fig. 2, the DAOE module is composed of two parallel branches: the channel domain (CA attention) and the spatial domain (SP attention).
Fig. 2. Architecture of the proposed method, which includes three primary steps. First, we use the MSFA module to extract features for reidentification and detection. Second, the feature is flattened directly for the reidentification module. Third, the features are used for detection after the DAOE module.
CA attention: After extensive testing, we found that CA attention [48] has better feature optimization capabilities. Thus, CA attention is used as the channel attention module in the proposed model. We first revisit CA attention. Specifically, the global pooling is factorized into a pair of one-dimensional feature encoding operations. Given input $X$, pooling kernels of size $(H, 1)$ and $(1, W)$ are used to encode each channel along the horizontal and vertical coordinates, respectively. Therefore, the output of the $c$th channel at height $h$ can be formulated as
$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i).$$
Similarly, the output of the $c$th channel at width $w$ can be written as
$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w).$$
Then, these two outputs are concatenated, and the 1 × 1 convolutional transformation function $F_1$ is used, yielding
$$f = \delta\left(F_1\left([z^h, z^w]\right)\right)$$
where $[\cdot, \cdot]$ is the concatenation operation along the spatial dimension and $\delta$ is a sigmoid function. The feature $f$ is then split along the spatial dimension into $f^h$ and $f^w$. Another two 1 × 1 convolutional transformations $F_h$ and $F_w$ are used to separately transform $f^h$ and $f^w$ to tensors with the same channel number as the input $X$, yielding
$$g^h = \delta\left(F_h(f^h)\right), \qquad g^w = \delta\left(F_w(f^w)\right)$$
where $\delta$ is the sigmoid function. The outputs $g^h$ and $g^w$ are then expanded and used as attention weights. Finally, the output of the CA attention module $y_{ca}$ can be written, for channel $c$ and position $(i, j)$, as
$$y_{ca}(c, i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j).$$
By encoding global context along the two coordinate directions, the CA attention module also captures some spatial information, but this spatial information is too weak. Thus, we must add spatial information to improve the optimization effect of the feature map.
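The coordinate-wise pooling and reweighting described above can be illustrated with a small NumPy sketch. It is a sketch under assumptions rather than the authors' code: the 1 × 1 convolutions $F_1$, $F_h$, and $F_w$ are modelled as plain matrix products with randomly initialized weights, normalization layers are omitted, and the sigmoid is used for both activations, following the text. The function name `coordinate_attention` and the weight names `w1`, `wh`, `ww` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def coordinate_attention(x, w1, wh, ww):
    """x: (C, H, W); w1: (C_mid, C); wh, ww: (C, C_mid)."""
    C, H, W = x.shape
    z_h = x.mean(axis=2)                     # (C, H): pool each row over width
    z_w = x.mean(axis=1)                     # (C, W): pool each column over height
    z = np.concatenate([z_h, z_w], axis=1)   # (C, H + W): concatenate spatially
    f = sigmoid(w1 @ z)                      # shared transform, stands in for F_1
    f_h, f_w = f[:, :H], f[:, H:]            # split back into the two directions
    g_h = sigmoid(wh @ f_h)                  # (C, H) attention weights over height
    g_w = sigmoid(ww @ f_w)                  # (C, W) attention weights over width
    # Broadcast the two 1-D weight maps over the full (C, H, W) feature map.
    return x * g_h[:, :, None] * g_w[:, None, :]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 6))
y = coordinate_attention(
    x,
    rng.standard_normal((2, 4)),
    rng.standard_normal((4, 2)),
    rng.standard_normal((4, 2)),
)
print(y.shape)
```

Because both weight maps lie in (0, 1), every output value is a damped copy of the corresponding input value; the module can only rescale features, not create new ones.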

SP attention:
We use spatial attention to enhance the optimization effect of the CA attention module on the feature map. To learn the spatial weight relationships effectively, we first generate two feature descriptors of size $(H \times W \times 1)$, one value per spatial position, through average pooling and max pooling operations. Next, these two feature descriptors are concatenated, and the 7 × 7 convolutional transformation function $F_2$ is applied, yielding the spatial attention map
$$\delta\left(F_2\left([\mathrm{Avg}(x(i, j)), \mathrm{Max}(x(i, j))]\right)\right)$$
where $[\cdot, \cdot]$ is the concatenation operation and $\delta$ is the sigmoid function. The output of the SP attention module, $y_{sp}$, is obtained by weighting the input feature map with this spatial attention map. Finally, we combine the maps of the two branches, $y_{ca}$ and $y_{sp}$, to obtain the output of the module. The resulting feature map, which is used for the detection subtask, fuses the strong semantic information of the low-resolution features with the high-resolution features that have weak semantic information but rich spatial information.
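A minimal NumPy sketch of the spatial branch follows. It rests on several assumptions that are not stated in the text: the pooling is taken across channels so that each spatial position gets one average and one maximum value, the 7 × 7 transformation $F_2$ is a zero-padded cross-correlation without bias, and the attention map is applied to the input by elementwise multiplication. The function name `spatial_attention` is hypothetical.

```python
import numpy as np

def spatial_attention(x, kernel):
    """x: (C, H, W); kernel: (2, k, k) with odd k (k = 7 in the text)."""
    avg = x.mean(axis=0)                     # (H, W) average descriptor
    mx = x.max(axis=0)                       # (H, W) max descriptor
    desc = np.stack([avg, mx])               # (2, H, W) concatenated descriptors
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(desc, ((0, 0), (pad, pad), (pad, pad)))
    H, W = avg.shape
    logits = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # k x k window centred on (i, j), summed over both descriptors.
            logits[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    attn = 1.0 / (1.0 + np.exp(-logits))     # sigmoid gives weights in (0, 1)
    return x * attn[None, :, :], attn        # assumed: multiplicative reweighting

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 6, 6))
y_sp, attn = spatial_attention(x, rng.standard_normal((2, 7, 7)) * 0.1)
print(y_sp.shape, attn.shape)
```

The explicit double loop is chosen for readability; a real implementation would use a framework convolution instead.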

D. Loss Function
To train the proposed model, a two-part loss function is used for the reidentification and detection subtasks. For the reidentification loss, the TOIM loss proposed by Yan et al. [1] shows good performance in the reidentification task. It adds a specifically designed triplet loss to the OIM loss. Specifically, the OIM loss stores the feature centers of all the labeled identities in a lookup table, $V \in \mathbb{R}^{D \times L} = \{v_1, \ldots, v_L\}$. A circular queue $U \in \mathbb{R}^{D \times Q} = \{u_1, \ldots, u_Q\}$ containing the features of $Q$ unlabeled identities is maintained. At each iteration, given an input feature $x$ with label $i$, the OIM loss computes the probability of $x$ belonging to the identity $i$, which is calculated as
$$p_i = \frac{\exp\left(v_i^{T} x / \tau\right)}{\sum_{j=1}^{L} \exp\left(v_j^{T} x / \tau\right) + \sum_{k=1}^{Q} \exp\left(u_k^{T} x / \tau\right)}$$
where $\tau$ is a temperature hyperparameter. The objective of the OIM loss is to minimize the expected negative log-likelihood
$$L_{OIM} = -\mathrm{E}_x\left[\log p_i\right].$$
For the specifically designed triplet loss, $S$ vectors are sampled from one object, and $X_m = \{x_{m,1}, \ldots, x_{m,S}, v_m\}$ and $X_n = \{x_{n,1}, \ldots, x_{n,S}, v_n\}$ denote the candidate feature sets for the objects with identity labels $m$ and $n$. Given $X_m$ and $X_n$, positive pairs can be sampled within each set, while negative pairs are sampled between the two sets. The triplet loss $L_{tri}$ is computed over these positive and negative pairs. Finally, the TOIM loss is the summation of these two terms
$$L_{TOIM} = L_{OIM} + L_{tri}.$$
For the detection loss, we used the FCOS loss ($L_{det}$) to train the proposed detection head. The details are as follows:
$$L_{det} = \frac{1}{N_{pos}} \sum_{i} L_{cls}\left(p_i, c_i^{*}\right) + \frac{\lambda}{N_{pos}} \sum_{i} \mathbb{1}_{\{c_i^{*} > 0\}} L_{reg}\left(t_i, t_i^{*}\right)$$
where $L_{cls}$ is the focal loss, $L_{reg}$ is the IoU loss, $N_{pos}$ is the number of positive samples, and $\lambda$, which is set to 1, is the balance weight for $L_{reg}$. Here, $p_i$ and $t_i$ denote the predicted classification score and box regression at location $i$, and $c_i^{*}$ and $t_i^{*}$ denote the corresponding ground-truth class label and regression target. The summation is calculated over all the locations on the feature maps $F_i$. $\mathbb{1}_{\{c_i^{*} > 0\}}$ is the indicator function, which is 1 if $c_i^{*} > 0$ and 0 otherwise. Finally, the total loss is the summation of these three terms
$$L = L_{OIM} + L_{tri} + L_{det}.$$
Using this loss function, we optimize the parameters of the multiscale extraction network on the training data, thus obtaining a multiscale representation of the image scene with a smaller semantic gap.
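As a rough illustration of the reidentification terms, the sketch below computes the OIM log-probability of a feature against the lookup table $V$ and the circular queue $U$, together with a single hinge-style triplet term. Several details are assumptions rather than facts from the text: a softmax over temperature-scaled inner products (temperature `tau`, default 0.1), and the standard margin form `max(0, margin + d_pos - d_neg)` for the triplet term. The function names and the margin value are hypothetical.

```python
import numpy as np

def oim_log_prob(x, label, V, U, tau=0.1):
    """Log-probability that feature x (D,) belongs to labeled identity `label`.
    V: (D, L) lookup table of labeled identities; U: (D, Q) unlabeled queue."""
    logits = np.concatenate([V.T @ x, U.T @ x]) / tau
    logits = logits - logits.max()           # subtract max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return log_probs[label]

def oim_loss(features, labels, V, U, tau=0.1):
    """Mean negative log-likelihood over a batch of labeled features."""
    return -np.mean([oim_log_prob(x, i, V, U, tau) for x, i in zip(features, labels)])

def triplet_term(d_pos, d_neg, margin=0.25):
    """Hinge on one (positive distance, negative distance) pair (assumed form)."""
    return max(0.0, margin + d_pos - d_neg)

V = np.eye(3)                 # three labeled identities in a 3-D feature space
U = np.zeros((3, 2))          # two empty slots in the unlabeled queue
x = np.array([1.0, 0.0, 0.0])
print(oim_log_prob(x, 0, V, U) > oim_log_prob(x, 1, V, U))
```

A feature aligned with its own identity center gets the largest probability, which is what minimizing the negative log-likelihood encourages.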

IV. EXPERIMENTS
In this section, we first introduce the experimental details, including two self-built datasets and some implementation details. Then, ablation experiments are performed for the two modules proposed in this article, and the influence of each module on the final results is analyzed. Finally, the proposed method is compared with several other advanced methods to verify the performance of the proposed algorithm.
A. Experimental Settings
1) Datasets: Remote sensing object searching must find objects of a specific subclass in a gallery that contains a large number of images. However, this task is novel, and there is no public dataset; thus, to verify the effectiveness of the proposed method, we conduct experiments on two self-labeled datasets. The gallery size is set to 50 images (i.e., the specified subclass object is searched for in 50 remote sensing images). The first annotated dataset is a building dataset covering rural to urban areas, in which both the scale span of the objects and the difference between classes are large. The other dataset primarily includes aircraft in various remote sensing scenes, which means that it contains interference from the viewing angle, acquisition time, and artificial facilities.
a) Building dataset: This dataset was collected from Google Earth and includes different places in Fujian Province, China, from typical cities and suburbs to rural areas. It should be noted that we follow the labeling method of [49], which treats irregular buildings and connected concrete floors as a whole. This dataset includes 2180 images and more than 100 000 labeled buildings. Some examples of buildings are shown in Fig. 3(a). In this study, we divided the dataset according to a ratio of 7:1:2. A total of 1526 images were used for training, 218 images were used for validation, and the remaining images were used for testing.
b) Plane dataset: The second dataset contains 500 images of aircraft collected from Google Earth. The contents of the dataset are shown in Fig. 3(b). Fig. 3 shows that the proposed data are challenging, including object scale changes and unfavorable conditions for object searching (e.g., poor lighting and poor weather). For this small dataset, we used 400 images for training, 62 images for testing, and the remaining images for validation.
In particular, the datasets created in this article contain various scale changes and a wide range of environmental impacts, which makes them challenging for remote sensing object searching. More details are shown in Table I.
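The split sizes quoted above can be checked with a few lines of arithmetic. The helper below is hypothetical and assumes that all but the last part are rounded down, with the remainder assigned to the last part.

```python
def split_counts(total, ratios):
    """Integer split of `total` items by `ratios`; the last part takes the remainder."""
    whole = sum(ratios)
    counts = [total * r // whole for r in ratios[:-1]]
    counts.append(total - sum(counts))
    return counts

# Building dataset: 2180 images split 7:1:2 into training, validation, and testing.
print(split_counts(2180, (7, 1, 2)))   # [1526, 218, 436]

# Plane dataset: 500 images, 400 for training and 62 for testing.
print(500 - 400 - 62)                  # 38 images remain for validation
```

Under these assumptions, the building test set holds 436 images and the plane validation set holds 38 images.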
2) Evaluation Metrics: There is a marked difference between the proposed searching task and object detection, so traditional evaluation metrics cannot be fully applied to this task. Thus, we propose a new evaluation metric for searching tasks. Different from the average precision (AP), the searching metric must count the number of false detections and missed detections in the entire gallery corresponding to the query. The $AP_s$ index is defined in terms of $TP_g$, $FP_g$, and $FN_g$, which denote the true positive, false positive, and false negative counts over the entire gallery, respectively. The higher the $AP_s$ value is, the better the searching result.
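The gallery-wide counting that this metric relies on can be sketched as follows. The sketch only accumulates $TP_g$, $FP_g$, and $FN_g$ over all gallery images; it does not reproduce the definition of the search AP itself. The matching rule is an assumption: each prediction is greedily matched to the unmatched ground-truth box with the highest IoU, and counts as a true positive when that IoU reaches the threshold. Boxes are `(x1, y1, x2, y2)` tuples, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def gallery_counts(gallery, thr=0.5):
    """gallery: list of (predicted_boxes, ground_truth_boxes), one pair per image.
    Returns (TP_g, FP_g, FN_g) accumulated over the whole gallery."""
    tp = fp = fn = 0
    for preds, gts in gallery:
        unmatched = list(gts)
        for p in preds:
            best = max(unmatched, key=lambda g: iou(p, g), default=None)
            if best is not None and iou(p, best) >= thr:
                unmatched.remove(best)   # each ground-truth box matches at most once
                tp += 1
            else:
                fp += 1                  # false detection
        fn += len(unmatched)             # missed detections in this image
    return tp, fp, fn

gallery = [
    ([(0, 0, 10, 10), (50, 50, 60, 60)], [(0, 0, 10, 10), (20, 20, 30, 30)]),
    ([], [(0, 0, 5, 5)]),
]
print(gallery_counts(gallery))   # (1, 1, 2)
```

Repeating the count at IoU thresholds from 0.5 to 0.95 would be one plausible way to obtain a 50:95-style summary, but that aggregation is likewise an assumption.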

3) Implementation Details:
The proposed model is implemented using PyTorch and MMDetection on an Nvidia RTX 5000 GPU. We set the batch size to 4 and use an SGD optimizer with a weight decay of 0.0005. The initial learning rate is set to 0.001 and is reduced by a factor of 10 at epochs 20 and 22, with a total of 24 epochs.
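The step schedule described above can be written out explicitly. The sketch below is framework-free and assumes zero-based epoch indices, with the reduction taking effect from the start of the milestone epoch; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(20, 22), gamma=0.1):
    """Learning rate at a given zero-based epoch under a step decay schedule."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

for epoch in (0, 19, 20, 21, 22, 23):
    print(epoch, learning_rate(epoch))
```

Under these assumptions, the rate stays at 0.001 for the first 20 epochs, is about 0.0001 for epochs 20 and 21, and is about 0.00001 for the final two of the 24 epochs.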

1) Ablation Experiments:
To evaluate the performance of the proposed method and measure the contribution of each proposed module, the high-quality AlignPS structure proposed in [1] was used as a baseline, and componentwise experiments were performed on the two datasets. The proposed modules were configured on different branches of the baseline, and their respective contributions to the final results were analyzed by comparing the evaluation results with and without each module, as listed in Table II. For the building dataset, the $AP_s$ 50:95 evaluation index of the basic AlignPS network is 63.71%. $AP_s$ 50:95 increases by 9.6% with the MSFA module compared to the AFA module in AlignPS [1]. When adding the DAOE module, $AP_s$ 50:95 increases by 6.94%, which indicates that each module has a positive impact on the baseline architecture, and the addition of the attention module plays a more effective role in the final result. When the two modules are configured on the baseline network simultaneously, the final $AP_s$ 50:95 increases by 10.87%. Fig. 6(a) shows the application effects of these ablation experiments on the building dataset. In Fig. 4(a), we show the precision-recall curves of the different ablation experiments. As shown in these figures, the full method, which contains both proposed modules, outperforms the other configurations in searching accuracy. Experimental results on the plane dataset are also shown in Table II. Compared with the baseline architecture, the $AP_s$ 50:95 of the proposed method with the MSFA and DAOE modules increases by 4.61% and 5.1%, respectively. The combination of these two modules yields a marked increase in $AP_s$ 50:95 of 6.34%. Fig. 6(b) and Table II show that the final proposed method performs better than the comparison methods and the other ablation configurations. These results demonstrate the effectiveness and necessity of each proposed module.
In addition to the above analysis, we visualized the object feature maps of the baseline and the proposed method during testing. As shown in Fig. 5, compared with the feature maps used for reidentification in the baseline method, those of the proposed method not only cover the object region but also contain more detailed information. In this figure, the proposed method focuses not only on the features of the fuselage but also on the wing and edge parts. The flattened features extracted from these maps are more suitable for reidentification tasks. In addition, for the detection task, the proposed method focuses on the object, and the region of interest can be clearly obtained. In contrast, the feature maps of the baseline method are ambiguous for dense objects, which makes it prone to localization errors.
2) Comparative Experiments: Because the proposed task is novel, comparison with existing methods is difficult. Therefore, to verify the advancement provided by the proposed method, it is compared with several existing pedestrian searching algorithms, including AlignPS [1], AlignPS+ [1], Roi-AlignPS [50], and SeqNet [46]. The quantitative results on the two datasets are shown in Table III. The proposed method achieves the best results on both datasets; although it is not the fastest, it still runs in real time. Both the proposed method and the comparison methods perform better on the building dataset than on the plane dataset. Because aircraft of different models in the plane dataset are more similar, it is difficult to obtain sufficiently descriptive characteristics when the object scale is small. Thus, more erroneous searching results appear on the plane dataset. Roi-AlignPS and SeqNet achieve better performance than AlignPS and AlignPS+ because the ROI module in the two-stage algorithm provides explicit feature expression, which reduces the probability of false object detection for the subsequent detection and reidentification tasks. Deformable convolution provides a larger receptive field for the AlignPS+ algorithm, but this contributes little to the detection and reidentification tasks. Therefore, the AlignPS+ and AlignPS algorithms produce similar results on both datasets. However, all of these methods are designed for natural scenes and do not search well among the numerous and dense objects in remote sensing scenes, so they cannot obtain results comparable to those of the proposed method. In contrast, this article aggregates the object features from top to bottom through the MSFA module to merge deep semantic information with shallow appearance features and improves the performance of the reidentification subtask through the powerful description ability of the fused features.
As shown in Fig. 6, the proposed method can accurately locate the object and reduce misjudgment compared to other methods. In addition, to address the negative impact of dense location on the remote sensing object searching task, we enhance the accuracy of feature expression from both channel and spatial dimensions, thus improving the performance of the detection subtask with dense object locations. To be specific, the proposed method can obtain more accurate bounding boxes in the searching results. Therefore, the proposed method achieves the best performance among all the studied approaches.
A comparison of the proposed method and other methods on both datasets is shown in Fig. 6. A detailed analysis of the results is provided next. For the building dataset, the comparison algorithms and the proposed method both achieve good performance. However, the many irrelevant objects in the dataset still pose challenges to the searching task. Regarding the comparison algorithms, the features obtained only through the backbone network have difficulty distinguishing many similar objects, resulting in a wide range of false detections and missed detections. Therefore, the MSFA module is proposed to improve the ability of feature description by capturing more shallow details. These details greatly enhance the identification ability in the reidentification subtask. Therefore, the proposed method yields the highest search accuracy. The plane searching task requires locating more numerous and denser objects, which makes searching markedly more difficult. Therefore, the comparison algorithms perform poorly in the plane searching task. On the one hand, the lack of sufficient detailed descriptive information leads to missed and false detections. On the other hand, the inaccurate localization of dense objects leads to low accuracy of the bounding boxes. In contrast, the proposed method adapts well to these problems. The proposed MSFA module incorporates more shallow detail information, which can better distinguish the objects and improve detection accuracy. In addition, the proposed DAOE module enhances the focus on the object and yields better landmark results for dense objects. Thus, both the quantitative and qualitative results show that the proposed method outperforms all the comparison algorithms.
TABLE III COMPARISON OF EXPERIMENTAL RESULTS ON THE BUILDING DATASET AND THE PLANE DATASET

V. CONCLUSION
In this article, we proposed a new deep learning framework for object searching in remote sensing images. Two modules were proposed to enhance the representation of effective features at different levels. Specifically, due to the difficulty of distinguishing too many feature descriptions of objects in remote sensing scenes, we proposed an MSFA module with top-down fused feature mapping to enhance the representation of object details. Then, a DAOE module was proposed to enhance object features from channel and spatial dimensions, which greatly improved the accuracy of localization of dense objects. Finally, the proposed method was tested on two challenging self-manually labeled datasets, and experimental results demonstrated the improved performance of the proposed method.