Multi-Size Object Detection in Large Scene Remote Sensing Images Under Dual Attention Mechanism

Remote sensing images of large scenes have complex backgrounds, and the targets vary in type, size, and posture, which makes object detection in such images difficult. To solve this problem, an end-to-end multi-size object detection method based on a dual attention mechanism is proposed in this paper. First, the MobileNets backbone network is used to extract multi-layer features of remote sensing images as the input of MFCA, a multi-size feature concentration attention module. MFCA employs an attention mechanism to suppress noise, enhance effective feature reuse, and improve the adaptability of the network to multi-size target features through multi-layer convolution operations. Then, TSDFF (two-stage deep feature fusion module) deeply fuses the feature maps output by MFCA to maximize the correlation between the feature sets and, in particular, improve the feature expression of small targets. Next, GLCNet (global-local context network) and SSA (significant simple attention module) distinguish the fused features and screen out useful channel information, which makes the detected features more representative. Finally, the loss function is improved to better reflect the difference between the candidate frames and the real frames, enhancing the network’s ability to predict complex samples. The proposed method is compared with other advanced algorithms on the open datasets NWPU VHR-10, DOTA, and RSOD. Experimental results show that the proposed method achieves the best AP (average precision) and mAP (mean average precision), indicating that it can accurately detect multi-type, multi-size, and multi-posture targets with high adaptability.


I. INTRODUCTION
With the development of remote sensing satellites, unmanned aerial vehicles, and other technologies, the amount of available remote sensing image data has exploded. Meanwhile, with the development of Earth observation technology, more and more attention has been paid to object detection in remote sensing images. Multi-size object detection in large-scene remote sensing images aims to automatically, accurately, and efficiently detect targets of interest at different scales and identify the categories and positions of the targets at the same time. It plays a vital role in many practical applications such as military operations, national defense construction, urban planning, and environmental monitoring.
The associate editor coordinating the review of this manuscript and approving it for publication was Khoa Luu.
Because of the special location of remote sensing observation platforms, the imaging characteristics of remote sensing images differ from those of natural scenes captured by digital cameras. Remote sensing images often contain a large number of complex ground background objects, while the types, scales, and postures of the targets to be detected are often uncertain. Object detection in remote sensing images therefore still faces many challenges. First, remote sensing images are mainly captured at a high altitude, so they cover a wide range of ground objects and complex image backgrounds. This may lead to more false-positive targets and increase the false alarm rate. Second, due to the dense distribution and small size of targets, as well as the different types, scales, and postures of the targets to be detected, many positive samples will not be detected, increasing the false-negative rate. Third, the imaging quality of remote sensing images is not as good as that of images captured by digital cameras, and the resolution is low, which further increases the difficulty of object detection. If an existing deep learning detection framework is directly applied to remote sensing images, the ideal detection accuracy cannot be achieved. Considering these characteristics, an object detection algorithm based on the dual attention mechanism of MobileNets is proposed for multi-type, multi-size, multi-posture small object detection in large-scene remote sensing images. The contributions of this paper are summarized as follows:
1. MFCA improves the network's feature expression ability without excessively increasing the number of model parameters. By adding an attention mechanism, every region on the feature map is considered to a different degree.
2. TSDFF is exploited to deeply fuse the feature maps output by MFCA, which maximizes the correlation between feature sets and especially improves the feature expression of small targets.
3. GLCNet and SSA are introduced to distinguish the fused features and screen out useful channel information, which makes the features to be detected more representative.
4. The experimental results indicate that the proposed network architecture significantly improves the object detection AP (average precision) and mAP (mean average precision) on the NWPU VHR-10, DOTA, and RSOD datasets.
This paper is organized as follows. Section II briefly reviews the related works on object detection in remote sensing images using deep learning methods. In Section III, the network model and techniques used in this paper are introduced. In Section IV, the experimental results are analyzed to verify the effectiveness of our method in improving the comprehensive performance of object detection in remote sensing images. Section V summarizes this paper and presents the future work.

II. RELATED WORKS
Remote sensing image object detection is a branch of object detection. In this field, traditional object detection algorithms, such as the circle frequency filtering method [1], the edge extraction method [2], the sparse representation method [3], and the deep Boltzmann machine [4], mainly focus on the use of shallow and middle layer features. These algorithms have poor robustness and a tedious detection process, and the detection results are easily affected by the quality of manually designed features. So, these algorithms are not suitable for multi-size object detection in remote sensing images. At present, deep learning is developing rapidly, and the convolutional neural network (CNN) has become a powerful tool for object detection because of its strong feature extraction ability. The CNN-based method can fit a large amount of complex data. Also, it can automatically learn the most useful features in images and fully extract the image information. Therefore, the deep convolutional neural network has advantages over traditional methods in object detection.
The current mainstream object detection algorithms can be divided into two categories, i.e., two-stage algorithms and one-stage algorithms. In two-stage algorithms, the candidate frames are first generated by region proposal and then regressed and classified. Typical two-stage object detection algorithms include R-CNN [5], Fast R-CNN [6], and Faster R-CNN [7]. Although the detection accuracy is high, these algorithms involve a large number of convolution operations, so the calculation cost is high and the speed cannot meet real-time requirements. One-stage algorithms use the whole picture as the input of the network and directly regress the target frames and the category. Representative one-stage object detection algorithms include SSD [8] and YOLO [9]. Although the detection speed of these algorithms is fast, the detection results are not good because remote sensing images have low resolution and usually contain many very small targets.
To solve the problem that the traditional object detection algorithms cannot handle multi-size small targets, many researchers have put forward improved methods and frameworks. For example, the Perceptual GAN algorithm proposed by Li et al. [10] reduced the representation gap between small targets and large targets and enhanced the feature expression of small targets. Liu and Huang [11] proposed the RFB structure, which reduced the down-sampling rate of the network and increased the receptive field by introducing dilated convolution [12]. Kisantal et al. [13] exploited an oversampling strategy to handle the samples containing small targets, which improved the detection accuracy of small targets. SNIP (Scale Normalization for Image Pyramids) [14] only selected the targets within a certain scale for learning in the training process, which reduced the influence of domain shift. The image pyramid [15] scaled pictures to different degrees and extracted features of different scales from each layer of pictures, which achieved high detection accuracy but slow speed. Trident-Net [16] parallelized three networks with different receptive fields to better cover the multi-size object distribution. The FPN (Feature Pyramid Network) algorithm [17] used the high resolution of low-level features and the rich semantic information of high-level features at the same time. It achieved a good prediction effect by fusing these different layers of features.
To shorten the information path and enhance the feature pyramid with low-level accurate positioning information, PANet [18] created a bottom-up path enhancement based on FPN. ThunderNet [19] simplified the FPN structure and introduced the pooling operation to integrate local and global features to enhance the network feature expression ability. These improved methods have significantly improved the accuracy of small object detection.
For object detection in remote sensing images, R2CNN [20] pooled each text box proposed by RPN (region proposal network) with different pooled sizes (7 × 7, 11 × 3, 3 × 11). Meanwhile, it predicted text/non-text scores, axially aligned boxes, and inclined minimum area boxes simultaneously by using the characteristics of connections. Based on RPN, RoI Transformer [21] converted HRoI (horizontal region of interest) output into RRoI (rotating region of interest). In this way, the number of anchor points was not increased, and accurate RRoI can be obtained. CAD-Net [22] designed and integrated GCNet (global context network) and PLCNet (pyramid local context network) to extract context information at the global scene level and local target level, respectively. SCRDet [23] designed a sampling fusion network, which integrated multi-layer features into effective anchor sampling. Also, it designed a supervised multi-dimensional attention module MDA-Net, which improved the detection sensitivity of small targets; SCRDet++ [24] specified a novel InLD component to approximately decouple the features of different target categories into their respective channels. In this way, the features of objects were enhanced, and the features of background in the spatial domain were weakened. Gliding Vertex [25] proposed that a quadrilateral can be located by learning the offset of four points on a non-rotating rectangle to represent an object. Besides, to overcome the defects of deep learning in satellite image object detection, an improved fine-grained object detection network structure called YOLT was proposed in [26], and a lot of data enhancement was made to solve the problem of detection invariance.

III. OVERVIEW OF OUR METHOD
The end-to-end CNN model proposed in this paper is shown in Figure 1. The network consists of the MobileNets feature extraction backbone [27] and five modules: the multi-size feature concentration attention module MFCA, the two-stage deep feature fusion module TSDFF, the global-local context network GLCNet, the significant simple attention module SSA, and the subnet module. In the network, the features of different scales extracted by MobileNets are input to MFCA, which pays attention to various regions in the feature map of the original CNN to reduce the interference of the background and negative sample information. Especially in the shallow feature maps, MFCA can effectively focus on small target objects. Then, the output of MFCA is deeply fused by TSDFF to maximize the correlation between feature sets. Next, the fused features and two groups of memory features learned by GLCNet are input to SSA together. In SSA, different channels of feature maps are distinguished, and the useful channel information is screened out to make the detected features more representative. Finally, the feature maps of each scale are cascaded with the subnet for multi-branch classification and regression. Overall, our dual attention deep feature fusion network DADFFNet can effectively remove complex background noise, enhance the feature representation at different resolutions, especially for small-sized targets, and greatly improve the detection accuracy of remote sensing images.

A. MULTI-SIZE FEATURE CONCENTRATION ATTENTION MODULE
The visual attention mechanism is a unique vision signal processing mechanism of the human brain. By scanning the global image quickly, the human brain obtains the target areas that need to be focused on, which is commonly called attention focus. Then, more attention resources are put into these areas to obtain more detailed information of the targets while ignoring other useless information. This mechanism is formed by human beings in long-term evolution. It provides a means for human beings to quickly screen out high-value information from a large amount of information by using limited attention resources. The human visual attention mechanism greatly improves the efficiency and accuracy of visual information processing. Similar to the selective visual attention mechanism of human beings [28], the attention mechanism in deep learning selects the information that is more critical to the current task from a large amount of information, so as to maximize the usage of limited computing resources [29].
Following the idea of the attention mechanism, MFCA improves the network's feature expression ability without excessively increasing the number of model parameters. It mainly includes ARD (attention residual denoising) blocks, dilated convolution blocks, up-sampling operations, etc. The specific connection mode is shown in Figure 2.
The module can be added to any convolutional neural network. The backbone network MobileNets consists of five stages, which are denoted as {C1, C2, C3, C4, C5}. Considering the high spatial resolution of the C1 feature map, the number of model parameters, and computational efficiency, only the feature maps from the C2 stage onward are input to MFCA, and they are defined as $F_i$.

1) ATTENTION RESIDUAL DENOISING MODULE
For the feature map $F_i \in \mathbb{R}^{C \times W \times H}$, C, H, and W respectively denote the channel, height, and width of the feature map. First, ARD uses the attention mechanism to extract the focused attention targets and integrates the global spatial information through the GAP (global maximum pooling). Then, it processes the extracted features with the Sigmoid function and transforms them into the non-linear attention space. The output can be expressed as:
where the attention map calculation $\theta(\cdot)$ is achieved through GAP. Note that a separate $\theta(\cdot)$ is implemented to calculate each scale-specific attention map. $\sigma$ is the Sigmoid function, and $S_i$ is the output attention map. The attention map is fused with the output of the original convolution block. The guided output $F \in \mathbb{R}^{C \times W \times H}$ can be expressed as:
where i is the index of the feature map, and ⊗ denotes element-wise multiplication. ARD performs element-wise multiplication when it is designed as a dot product attention ratio; otherwise, it performs a summation. The nonlinearity of the features is increased through the 1 × 1 convolutional layers, and then the attention maps are added to the module with the residual connection. The output of ARD is defined as $Y_i$, which can be expressed as follows:
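The ARD steps described above (global pooling, Sigmoid, element-wise fusion, residual connection) can be illustrated with a minimal pure-Python sketch. This is an assumption-laden illustration rather than the authors' implementation: the function name `ard_block` is hypothetical, the 1 × 1 convolution is treated as an identity for brevity, and because the text expands GAP as global maximum pooling, both average and maximum pooling are offered.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def ard_block(feature, pool="avg"):
    """Hypothetical ARD sketch.

    feature: list of C channels, each an H x W list of lists.
    pool:    "avg" or "max" global pooling (the source is ambiguous).
    """
    out = []
    for channel in feature:
        values = [v for row in channel for v in row]
        # theta(.): global pooling collapses the spatial dimensions.
        pooled = max(values) if pool == "max" else sum(values) / len(values)
        # Attention weight for this channel, in (0, 1).
        s = sigmoid(pooled)
        # Element-wise fusion (s * v) plus the residual connection (+ v).
        # The 1 x 1 convolution is omitted (assumed identity here).
        out.append([[v + s * v for v in row] for row in channel])
    return out
```

With this simplification, a channel whose pooled response is high is amplified by a factor close to 2, while a channel with a strongly negative pooled response is left almost unchanged by the attention branch.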

2) CONVOLUTION OPERATION
As shown in Figure 3, multi-size processing is performed on the output of ARD to generate more detailed attention information at different scales. In the bottom-up processing, each $Y_i$ is processed by a corresponding 3 × 3 dilated convolution that is denoted as $D_i(\cdot)$. The output of $D_i(\cdot)$ is denoted as $X_i$. Except for $Y_2$, the output of the previous level must be added to each of the other layers. This process can be expressed as:
Since each $Y_i$ has a different spatial resolution, a 3 × 3 convolution with a stride of 2 is applied to $X_i$, and the convolution result is then merged with $Y_i$. Next, the top-down processing stage is conducted similarly to the bottom-up stage. Before the addition operation, a deconvolution with a stride of 2 is used to expand the spatial size. This process can be expressed as:
where $D_i(\cdot)$ is also a 3 × 3 dilated convolution, and $X_i$ is the output of the top-down processing stage. The dilated convolution enlarges the receptive field of the feature map and obtains richer context information while preserving the global information. Besides, the bottom-up and top-down connections maintain the flow of attention information between multi-size feature maps. Finally, all the attention weights are generated by 1 × 1 convolution. The attention weight of $A_i$ is denoted by $w_i$. The final output can be expressed as:
As shown in Figure 3, the proposed MFCA can treat different areas differently at each scale. This enhances the network's feature representation ability for certain important areas so that each area on the feature map has a different degree of importance. For example, smaller airplanes obtain stronger responses at the lower network layers, and the captured information has more detail. Meanwhile, the MFCA helps to weaken the information interference of background and negative sample targets, such as the terminal in the second sample image.
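To make the effect of dilation and stride concrete, the following sketch computes the effective kernel size of a dilated convolution and the resulting output size along one spatial axis, using the standard convolution arithmetic. The dilation rate and padding used by MFCA are not stated in the text, so the values in the usage note are assumptions, and the function names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one axis for a (possibly dilated) convolution."""
    k_eff = effective_kernel_size(kernel_size, dilation)
    return (size + 2 * padding - k_eff) // stride + 1
```

For example, a 3 × 3 kernel with an assumed dilation of 2 covers a 5 × 5 area with the same nine weights, and a 3 × 3 convolution with a stride of 2 and a padding of 1 halves a 64-pixel axis to 32, which is how feature maps of adjacent levels can be brought to the same resolution before merging.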

B. TWO-STAGE DEEP FEATURE FUSION
In feature fusion, features are propagated in a top-down manner, and low-level features can be improved by using strong semantic information of high-level features. However, the features at the highest level lose information due to channel reduction. Since the semantic information has certain inconsistencies, directly fusing these features will reduce the ability of multi-size feature representation. Also, this strategy of fusing feature maps into a single vector may lose spatial relationships and details because multiple targets may appear in an image [30].
Information loss can be greatly reduced by fusing the extracted global context features in two different approaches [31]. The TSDFF module uses two different types of feature fusion, as shown in Figure 1. Theoretically, the feature maps of adjacent scales have a greater correlation, so fusing these feature maps may reduce the inconsistency between feature targets. The first type of feature fusion independently upsamples, adds patches, and downsamples adjacent features through the 3 × 3 convolutional layer to achieve the same effect. Then, it splices the three adjacent features in dimensions. As shown in Figure 1, the yellow and blue arrows respectively represent down-sampling and up-sampling, and the green arrow represents the addition of patches. For the convenience of explanation, three adjacent scale feature maps $A_2$, $A_3$, and $A_4$ are taken as examples, and the details of the first fusion process are illustrated in Figure 4. The output after fusion is:
where $R_3$ is the output of $A_3$. $W_2$, $W_3$, and $W_4$ are the parameter-sharing convolution kernels corresponding to the three feature maps $A_2$, $A_3$, and $A_4$. The strides are 2, 1, and 1/2, respectively. $\xi_{2,3}$, $\xi_{3,3}$, and $\xi_{4,3}$ are three spatial weights that respectively represent the importance of $A_2$, $A_3$, and $A_4$ relative to $A_3$. The weight generation process is as follows.
After the uniform scale operation, three 1 × 1 convolution layers are used to generate the weight scalars, and they are denoted as $\gamma_{2,3}$, $\gamma_{3,3}$, and $\gamma_{4,3}$. Taking $\xi_{2,3}$ as an example, $\xi_{2,3}(i, j)$ represents the spatial weight of $A_2$ relative to $A_3$ at point (i, j), which can be expressed as:
From equation (8), it can be seen that the sum of $\xi_{2,3}(i, j)$, $\xi_{3,3}(i, j)$, and $\xi_{4,3}(i, j)$ is 1, and their values are all between 0 and 1.
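The two stated properties of the spatial weights (they sum to 1 and each lies between 0 and 1) are exactly what a softmax over the three weight scalars at a point (i, j) produces. The sketch below assumes that equation (8) is such a softmax; this is an inference from the stated properties, not a formula confirmed by the text, and the function name is hypothetical.

```python
import math


def spatial_weights(gammas):
    """Softmax over the weight scalars at one spatial position.

    gammas: the scalars (gamma for A2, A3, A4) at a point (i, j).
    Returns weights that are each in (0, 1) and sum to 1.
    """
    m = max(gammas)  # subtract the maximum for numerical stability
    exps = [math.exp(g - m) for g in gammas]
    total = sum(exps)
    return [e / total for e in exps]
```

Under this assumption, the fused value at (i, j) is a convex combination of the three rescaled feature maps, so a level with a larger scalar contributes more at that position without any level being switched off entirely.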
The first feature fusion can utilize the semantic information of feature maps of different scales better. It achieves higher performance by increasing the channel and further reduces the interference of background noise at the same time. 1 × 1 convolutional layers are used to reduce the feature channels, where the huge semantic gaps between these features are not considered.
The second feature fusion first uses a parallel strategy to perform an element-wise add operation on the feature maps after the first feature fusion. Then, it combines two adjacent feature vectors into a complex vector. The add operation does not increase the dimensionality of the feature maps but increases the amount of information under each dimension, which obviously increases the perception of contextual information.
TSDFF performs a weighted combination on the foreground discrimination of remote sensing images and maximizes the correlation between the feature sets through the two feature fusions. Meanwhile, TSDFF enhances the semantic information of small targets, maximizes the difference between different classes, and further eliminates the influence of noise and complex background.

C. GLOBAL-LOCAL CONTEXT NETWORK
Considering the correlation between the background and targets in remote sensing images, a global-local context network is designed, which can learn the global scene semantics and use it as a prior to better detect the targets in remote sensing images. GLCNet uses the learned correlation as a specific global-local context to compensate for the missing distinguishable target features. The learned correlation can be expressed as follows:
where $A_i$ represents the feature mapping from MFCA, and $\varphi_G(\cdot)$ is implemented by the CLSTM module [32] to extract global features. $\psi(\cdot)$ represents a pooling operation that compresses the spatial channels of the feature map into a vector, thereby suppressing the scale change problem. There are two sets of CLSTM modules in the network, and the forward $A_i$ and the reverse $A_i$ are respectively input to the modules. The two outputs are input to the two memory modules in SSA.

D. SIGNIFICANT SIMPLE ATTENTION MODULE
Self-attention [33] is significant to various visual tasks. Compared with the convolution operation, self-attention can capture longer-range dependencies, thereby learning features that incorporate global information. However, the self-attention mechanism has several obvious defects. First, the large amount of calculation results in a certain amount of calculation redundancy. Also, the self-attention mechanism only uses the information in its own samples but ignores the potentially meaningful connection between different samples. To alleviate these problems, external attention [34] is exploited to achieve linear complexity by controlling the size of the memory unit. Meanwhile, the useful information of the fused feature map is further screened out so that the features to be detected are more representative. SSA uses four EA modules as attention modules for extracting effective information from the input. As shown in Figure 5, external attention can reduce the time complexity of self-attention through two learnable external memory units. Also, the two external memory units are shared for the entire data, so the correlation between different samples can also be implicitly considered. The two units are linear layers, and they can be directly optimized end to end. In the actual operation process, the outputs provided by GLCNet are taken as the two different memory modules that are called $M_1$ and $M_2$. The former stores the key and the latter stores the value. The calculation is as follows:
where $F_{in}$ and $F_{out}$ respectively represent the input and output feature maps; Norm represents the normalization operation; E represents the transition state after the normalization operation.
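The external attention computation described above can be sketched in pure Python as follows. The sketch assumes the usual two-step form: the transition state E is the normalized product of the input with the key memory, and the output is E multiplied by the value memory. The normalization used here is a plain row-wise softmax chosen for brevity, since the text does not specify Norm, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax (assumed stand-in for the Norm operation)."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def external_attention(f_in, m1, m2):
    """f_in: N x d input features; m1 (keys) and m2 (values): S x d memories."""
    e = softmax_rows(matmul(f_in, transpose(m1)))  # transition state, N x S
    return matmul(e, m2)                           # output features, N x d
```

Because the memory size S is fixed, the cost grows as N · S · d, which is linear in the number of input positions N, in contrast to the N · N attention map of self-attention.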

E. LOSS FUNCTION DESIGN
The subnet structure includes the classification branch and the box branch, and they are respectively responsible for anchor label prediction and location regression. Due to the existence of multi-posture targets in remote sensing images, the existing area-based rotating object detection methods describe the rotating bounding box with five parameters, including center point coordinate, width, height, and rotation angle, and these methods use smooth L1 as the loss function. However, there are two problems in this approach, i.e., the loss discontinuity caused by angle parameters and the influence of different parameter units on network performance. To handle these problems, the 8-parameter version of rotation loss proposed by RSDET [35] is used in this study. It describes the position with four clockwise vertices of the rotation bounding box, suppressing the problem of different parameter units. Figure 6 shows the regression process from the candidate box to the actual position.
The actual regression process consists of four steps: 1) move the four vertices of the prediction frame clockwise; 2) keep the vertex order of the prediction frame unchanged; 3) move the four vertices of the prediction frame counterclockwise; 4) take the minimum value of the above three cases. The loss function used in this process is expressed as follows,
where $x_i$ and $y_i$ respectively represent the coordinate offset of the i-th vertex of the prediction frame and the i-th vertex of the reference frame.
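The four steps above can be sketched as a minimum over three vertex orderings. The code below is an illustrative assumption rather than the exact RSDET formulation: it works on raw vertex coordinates instead of anchor-relative offsets, uses smooth L1 as the per-coordinate penalty, and, because which list shift corresponds to "clockwise" depends on the vertex ordering convention, simply names the two shifts forward and backward. All function names are hypothetical.

```python
def smooth_l1(d, beta=1.0):
    """Smooth L1 penalty for a single coordinate difference."""
    d = abs(d)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


def vertex_loss(pred, ref):
    """Sum of smooth L1 penalties over four matched (x, y) vertices."""
    return sum(smooth_l1(px - rx) + smooth_l1(py - ry)
               for (px, py), (rx, ry) in zip(pred, ref))


def rotation_loss(pred, ref):
    """Minimum loss over the shifted, unchanged, and counter-shifted orders."""
    shifted_forward = pred[1:] + pred[:1]     # vertices moved one step
    shifted_backward = pred[-1:] + pred[:-1]  # vertices moved one step back
    candidates = (shifted_forward, pred, shifted_backward)
    return min(vertex_loss(p, ref) for p in candidates)
```

Taking the minimum means that a prediction describing the correct quadrilateral with its vertex list offset by one position incurs no penalty, which is the discontinuity the 8-parameter formulation is meant to avoid.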
In the proposed algorithm of this paper, due to the addition of the position offset of the anchor box, the corresponding multi-task loss function should be changed during the end-to-end training. In addition to the basic classification loss and regression loss, it is also necessary to learn the position of the anchor. The complete loss function is expressed as follows:
$L = \lambda L_{mr} + L_{cls} + L_{reg}$ (13)
where $L_{cls}$ and $L_{reg}$ represent the classification loss and regression loss, respectively, and $\lambda$ is a constant.

IV. EXPERIMENTS

A. DATASET
Our proposed method is tested on three public datasets, i.e., NWPU VHR-10 [36], DOTA [37], and RSOD [38]. Figure 7 and Figure 8, respectively. The detection objects of these datasets are all man-made objects with obvious edge features and strong internal color consistency (e.g., ships, vehicles, and airplanes), while false objects often do not have these characteristics.

B. EXPERIMENTAL SETTING AND PERFORMANCE EVALUATION INDEX
Our proposed method was tested with PyTorch and TensorFlow 2.0. The test platform was equipped with an Intel Core i7-6700U CPU @ 4.0 GHz, an NVIDIA GeForce RTX 4000, and 8 GB of DDR3 memory, and the operating system was Windows 10 64-bit. As for training parameter settings, the initial learning rate was set to 0.01, and it was decayed to 1/10 of the original value every 50,000 iterations. Stochastic gradient descent (SGD) with momentum was used to optimize the network. The momentum parameter was set to 0.9; the weight decay regularization term was set to 0.0005; the batch size was set to 32; the confidence threshold was set to 0.5, and the dropout was set to 0.5 to prevent over-fitting. The total training iterations for DOTA, NWPU VHR-10, and RSOD were 200,000, 120,000, and 150,000, respectively.
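The learning-rate schedule described above can be written as a step decay. The sketch below reads "1/10 every 50,000 iterations" as a repeated multiplication by 0.1 at each step boundary, which is one plausible interpretation of the text; the function name and parameter names are hypothetical.

```python
def learning_rate(iteration, base_lr=0.01, decay=0.1, step=50_000):
    """Step decay: multiply the rate by `decay` every `step` iterations."""
    return base_lr * decay ** (iteration // step)
```

Under this reading, the 200,000-iteration DOTA run would pass through four rates (0.01, 0.001, 0.0001, and 0.00001), while the 120,000-iteration NWPU VHR-10 run would only reach the third.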
In the experiment, AP and mAP were adopted as evaluation indicators to comprehensively evaluate the network. The ground truth was obtained through manual annotation. TP and FP represent the positive examples that are correctly detected and mistakenly detected, respectively. FN represents the positive examples that are mistakenly detected as negative examples. Recall indicates the proportion of correct detection results in the actual targets, and the calculation is shown in equation (14). Precision indicates the accuracy of the detected results, and the calculation is shown in equation (15). In the evaluation results of deep learning, AP represents the average detection accuracy of a certain class of targets, while mAP represents the average accuracy of all classes of targets [39]. The calculation of these two indicators is shown as follows:
where $P(R)$ represents the precision at the point R on the recall curve; k represents the precision cutoff point; $P(k)$ and $R(k)$ respectively represent the precision and recall range of the k-th point; n represents the number of precision cutoff points; q indicates a certain target category, and Q indicates the total number of target categories.
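The indicators described above can be sketched with their standard definitions, which are assumed here to match equations (14) and (15) and the AP and mAP formulas: recall is TP / (TP + FN), precision is TP / (TP + FP), AP is the sum over the n cutoff points of the precision multiplied by the recall increment, and mAP is the mean of AP over the Q categories. The function names are hypothetical, and benchmark-specific details such as interpolated precision are not modeled.

```python
def precision(tp, fp):
    """Fraction of detections that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of actual targets that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(precisions, recalls):
    """Sum of P(k) times the recall increment at each cutoff point k.

    Both lists are ordered by increasing recall.
    """
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For instance, a category with precision 1.0 up to a recall of 0.5 and precision 0.5 from there to a recall of 1.0 has an AP of 0.75 under this discrete sum.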

C. EXPERIMENTAL RESULTS AND ANALYSIS

1) EXPERIMENTAL RESULTS ON NWPU VHR-10 DATASET
Table 1 lists the test results of CAD-Net [22], R2CNN [20], SCRDet [23], SCRDet++ [24], YOLT [26], Gliding Vertex [25], RoI Transformer [21], and our proposed method on the NWPU VHR-10 dataset. It can be seen from Table 1 that the mAP of our method for detecting the 10 categories of targets in the NWPU VHR-10 dataset is 80.31%, which is 3.99% higher than that of SCRDet++ and is superior to that of other popular methods at present. Across categories, the detection of bridges is the worst, which may be caused by the confidence region setting. For other targets, an anchor frame is considered a positive sample if its IOU is greater than 0.7 and a negative sample if its IOU is less than 0.3. However, the bridge target is much larger than other targets, so the IOU requirement should be more relaxed for rectangles with a large aspect ratio. Across methods, R2CNN performs the worst because it does not consider the boundary problem of the rotating coordinate frame, which is very unfavorable for object detection in remote sensing images. SCRDet achieves the highest AP in tennis court detection, and SCRDet++ achieves the highest AP in detecting storage tanks and basketball courts. So, the SCRDet series networks perform well for the detection of these neutral targets. Gliding Vertex achieves the highest AP in ship detection, which may be related to its positioning method. Our proposed method achieves the highest AP in detecting the other categories of targets. It obtains good detection performance whether the target is a small-sized vehicle, a large-sized bridge, or a medium-sized baseball diamond. This shows that our proposed method has an advantage in multi-size object detection. The detection results are shown in Figure 9.
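The IOU-based anchor labeling rule mentioned above (positive above 0.7, negative below 0.3) can be sketched as follows. For simplicity, the sketch uses axis-aligned boxes, whereas the paper works with rotated boxes, and it marks anchors between the two thresholds as ignored, which is common practice but an assumption here. The function names are hypothetical.

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label an anchor by its best IOU with any ground-truth box."""
    best = max((iou(anchor, g) for g in gt_boxes), default=0.0)
    if best > pos_thr:
        return "positive"
    if best < neg_thr:
        return "negative"
    return "ignored"  # assumed handling of the in-between range
```

A long, thin target such as a bridge loses IOU quickly under a small positional or angular error, so relaxing `pos_thr` for such categories, as the text suggests, would let more of its anchors count as positive samples.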

2) EXPERIMENTAL RESULTS ON THE DOTA DATASET
To further evaluate the detection ability of our proposed method for multi-type, multi-size, and multi-posture targets in large-scale databases, experiments are conducted on the DOTA dataset. Our proposed method and other state-of-the-art methods are compared, and the comparison results are listed in Table 2. Our proposed method achieves an average precision of 90.11% without any data augmentation. In terms of precision and recall, our method performs much better than Gliding Vertex [25], RSDET [35], PIOU [40], and SCRDet++ [24]. This is because our proposed method realizes the scale perception of foreground features and the accurate mining of context information by denoising the complex background. This study compares the proposed method with the existing saliency detection methods based on deep learning through the P-R (precision and recall) curve, and the result is shown in Figure 10. It can be observed from the figure that our proposed method obtains the best results. When the recall rate is close to 1, the precision of our method is much higher, indicating that its false alarm rate is much lower than that of the other methods. Also, for our proposed method, the resulting visual attention map of the target in the remote sensing image of a large scene with a complex background is closer to the ground truth. For a more rigorous evaluation, the COCO metrics are adopted to compare our proposed method with CAD-Net [22], R2CNN [20], SCRDet [23], SCRDet++ [24], YOLT [26], Gliding Vertex [25], RoI Transformer [21], RSDET [35], and PIOU [40] on the DOTA dataset. The comparison result is listed in Table 3. $AP_S$, $AP_M$, and $AP_L$ respectively represent the average precision of detecting small, medium, and large targets. $AP_{50}$ and $AP_{75}$ represent the average precision under an IOU of 0.5 and 0.75, respectively.
It can be observed that the AP of our proposed method in 15 categories reaches 61.14%, which is better than the AP of other methods. Also, our proposed method achieves the best results in detecting small and medium-sized targets, with an AP of 48.21% and 74.83% respectively. Besides, better results can be obtained under IoU = 0.75 (1.68% higher than RSDET). This indicates that our method can draw a more accurate boundary box, which helps to identify various targets more accurately in remote sensing images with dense targets. Figure. 11 illustrates some detection results of our proposed method for remote sensing images with dense targets.
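To make the metrics above concrete, the sketch below computes an average precision from ranked detections and assigns a target to the small, medium, or large group. It is a simplified illustration, not the evaluation code used in the paper: it uses all-point interpolation of the precision-recall curve (the official COCO tooling samples 101 recall points and averages over several IoU thresholds), it applies the standard COCO area thresholds of 32² and 96² pixels to box width times height, and all function names are hypothetical.

```python
def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve (all-point interpolation)."""
    if num_gt <= 0:
        return 0.0
    # Rank detections by descending confidence.
    ranked = sorted(zip(scores, is_tp), key=lambda pair: -pair[0])
    recalls, precisions = [], []
    tp = fp = 0
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing from left to right (upper envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


def coco_size_bucket(width, height):
    """Size group by box area, using the standard COCO thresholds."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```

Whether a detection counts as a true positive (`is_tp`) depends on the IoU threshold, so AP_50 and AP_75 would come from calling `average_precision` with matches decided at 0.5 and 0.75, respectively.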

3) EXPERIMENTAL RESULTS ON RSOD DATASET
To further verify the robustness of our proposed method, SCRDet++ [24], RSDET [35], PIOU [40], and our proposed method are used to detect all categories of targets on the RSOD dataset. Figure 12 shows the object detection results for each category. The figure shows that our proposed method performs much better than the other advanced methods in terms of the correct detection ratio. Specifically, 95.43% of impervious surfaces, 96.67% of aircraft, 95.27% of playgrounds, 89.92% of overpasses, and 95.62% of oiltanks are correctly detected. Compared with the other methods, our proposed method achieves the highest correct detection rate in all categories. Besides, taking GFT and RA as examples, the other methods do not perform well on these two targets, which leads to a high false detection rate. The false detection rate of RSDET is as high as 15.19%, whereas our proposed method reduces it to 9.23%, a substantial improvement. Figure 13 shows the detection performance of our proposed method on the RSOD dataset.

4) DETECTION RUNTIME
To compare the detection time of our method with that of other methods, 200, 50, and 150 images were randomly selected from the DOTA, NWPU VHR-10, and RSOD datasets, respectively, for the detection runtime experiment, and the average runtime is listed in Table 4. Because it is based on the single-stage detection algorithm YOLO, YOLT has the shortest detection time and the strongest real-time performance. R2CNN has the longest detection runtime because it adopts multi-size ROI pooling and oblique frame prediction on top of the two-stage detection algorithm Faster R-CNN. Our method has a moderate detection runtime among all methods. This is because our method uses two feature fusions, which improve the detection accuracy but slow down computation.
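A runtime comparison of this kind can be sketched as follows. The source does not describe its timing procedure, so everything here is an assumption: `detect` stands for any detector callable, `sample_images` and `average_runtime` are hypothetical helper names, and wall-clock time is measured on the host.

```python
import random
import time


def sample_images(images, k, seed=None):
    """Randomly select k distinct images (e.g. k=200 for DOTA)."""
    return random.Random(seed).sample(list(images), k)


def average_runtime(detect, images, warmup=1):
    """Mean wall-clock seconds per image for the callable `detect`."""
    if not images:
        raise ValueError("at least one image is required")
    for img in images[:warmup]:
        detect(img)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for img in images:
        detect(img)
    return (time.perf_counter() - start) / len(images)
```

For GPU inference, asynchronous execution would have to be synchronized before the timer is read, otherwise the measured time underestimates the true runtime.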

D. ABLATION EXPERIMENT
In this section, the influence of each module in our proposed method on object detection performance is investigated on the DOTA dataset. The ablation results of adding the modules (namely MFCA, TSEFF, GLCNet, and SSA) to the MobileNets framework are listed in Table 5. The MobileNets backbone network alone achieves an AP of 54.83%. MFCA is conducive to obtaining foreground semantics from the complex backgrounds of large scenes. It consists of bottom-up and top-down subnets that circulate low-level, intermediate-level, and high-level semantic information. It increases the AP of detecting small, medium, and large targets by 1.94%, 5.36%, and 1.56%, respectively. Then, for small targets, TSEFF further improves the AP by 2.09% because it enhances the semantic information of small targets and maximizes the differences between targets of different sizes and categories. Finally, with GLCNet and SSA, the useful information of the fused feature map is further screened out to make the detected features more characteristic. The final AP is 61.14%.
To show the results of the ablation experiment more intuitively, the P-R curves of detecting small, medium, and large targets are compared. Figure 14 shows that the effectiveness of our proposed method in detecting multi-size targets is greatly improved, especially for small targets. When the recall rate is 0.6, the precision of small object detection is about 0.57, which is much higher than that of the backbone network. This improvement indicates that the proposed method can better detect small targets in complex backgrounds, showing that our method is effective for object detection in remote sensing images.
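Reading a precision value at a fixed recall, as done above for a recall of 0.6, can be sketched as below. This is an illustrative helper with a hypothetical name; it uses the common interpolation rule of taking the best precision among all curve points whose recall is at least the target, which may differ from how the values in Figure 14 were read.

```python
def precision_at_recall(recalls, precisions, target):
    """Interpolated precision at a target recall.

    Returns the highest precision among curve points with recall >= target,
    or 0.0 if the curve never reaches the target recall.
    """
    candidates = [p for r, p in zip(recalls, precisions) if r >= target]
    return max(candidates) if candidates else 0.0


# Illustrative curve only; these values are NOT taken from the paper.
example_recalls = [0.2, 0.5, 0.7, 0.9]
example_precisions = [0.9, 0.8, 0.6, 0.4]
example_value = precision_at_recall(example_recalls, example_precisions, 0.6)
```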

V. CONCLUSION AND FUTURE WORK
In this paper, a model is proposed for multi-size object detection in large-scene remote sensing images. The model uses the MobileNets network to extract image features and the MFCA module to attend to different regions of the feature map. Then, the feature maps are deeply fused by TSEFF, and the features are characterized by GLCNet and SSA. The experimental results show that, considering both detection accuracy and detection time, our method is an effective object detection method for remote sensing images. In future work, we will improve the model to realize real-time object detection, especially for remote sensing images of large scenes.