RoI Fusion Strategy With Self-Attention Mechanism for Object Detection in Remote Sensing Images

In remote sensing image (RSI) object detection, the oriented bounding box (OBB) can accurately locate objects with arbitrary orientation and obtain orientation information. Detection based on OBB is still a challenging task. In RSI, the distribution of objects is extremely uneven, which causes aggregation to occur. Some researchers believe that dense distribution is a reason for the difficulty of object detection. However, there are no in-depth experimental studies on this. This paper proposes an OBB-based dense object determination method, which determines the dense objects in datasets by two conditions built from interclass distance, intraclass distance, minimum distance between objects, and minimum edge length of objects. The experimental results on dense and nondense object detection show that dense distribution in RSI does not by itself make objects more difficult to detect. To make full use of the object features, we propose a second-stage detection head named RoIF-Net, in which we extract the region of interest (RoI) from the input image and fuse it with the RoI extracted from feature maps to add detail features, and construct a feature induction module based on the self-attention mechanism to achieve position regression and category classification. This structure can be used in any two-stage network to enhance detection capabilities. Using our method on three credible and challenging datasets, DOTA, DIOR-R, and UCAS-AOD, we obtained 81.80%, 68.49%, and 90.25% mAP, respectively, reaching SOTA for OBB-based detection and demonstrating the effectiveness of our method.

Object detection in RSI can play an important role in many fields, such as military reconnaissance, postdisaster reconstruction, environmental protection, urban and rural planning, economic evaluation, among others. Since RSI are characterized by complex surface environments and large differences in object scales, it is extremely challenging to perform object detection on them.
In RSI object detection datasets, two forms are generally used to label the objects. One is the horizontal bounding box (HBB), which is the smallest horizontal enclosing rectangle that can contain the object [1], [2], [3], [4], and the other is the oriented bounding box (OBB), which is a rectangular box with an angle corresponding to the rotation direction of the object [3], [5], [6], [7]. In generic scenes in natural images, HBBs are more commonly used. However, in RSI, the special bird's-eye view causes the objects to have arbitrary rotation directions. In this case, using the HBB introduces unnecessary background information and makes it difficult to obtain an accurate object pose, so OBBs are usually used to annotate and detect the objects in RSI. The visual annotations of HBBs and OBBs are shown in Fig. 1. At present, there are numerous research works on oriented object detection in RSI based on OBB [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18]. However, the complex background of RSI makes feature recognition more difficult and poses a considerable challenge for object detection, which is still a key problem to be solved for RSI-oriented object detection.
In addition, researchers have pointed out in some works that objects are more difficult to detect when they are densely arranged together [19], [20], [21], [22], [23], [24], [25], [26], [27]. In natural image scenes, dense arrangement can lead to overlap between objects, which results in the absence of object features and greatly affects the detection results. In contrast, overlap rarely occurs in RSI because of the overhead view. No research work has carefully tested whether closely spaced objects in RSI are more difficult to detect. First, there is no clear and reasonable determination of dense objects, and second, there is no comparative analysis of the detection results of dense and nondense objects. Therefore, we believe that whether dense objects are more difficult to detect in remote sensing object detection is an open question.
In order to determine whether dense objects in RSI are more difficult for the network to detect, we design determination conditions for dense objects based on our understanding of dense objects and use them to classify the objects in the datasets into dense and nondense objects. Then we determine whether dense objects in RSI are more difficult to detect based on the detection results for both dense and nondense objects. We design a dense object determination method with two determination conditions, which is based on OBB labeling. When any object in the image is judged, if this object and any other object in the image satisfy both conditions, both objects are judged as dense objects. One of the determination conditions is that the ratio of the interclass distance to the intraclass distance of the two objects is less than a specific threshold, and the other is that the minimum distance between the two objects is less than an expression related to the minimum edge length of that object. After determining whether objects in the datasets are dense using this method, we run detection on them. By analyzing the detection results, we found that dense objects have a higher recall than nondense objects. We conclude that denseness is not a fundamental factor that makes objects difficult to detect in RSI object detection.
When using OBB for object detection, a two-stage network is often used to obtain better detection results, in which the extracted region of interest (RoI) needs to be fed into the second-stage detection head. In order to extract RoI containing rich detail features and make full use of them, we propose RoIF-Net. The classical RoI extractions, such as RoI pooling [28] and RoI align [29], are performed on feature maps output from the backbone network, which are rich in high-level semantic features but, after passing through layer after layer of convolutions, lack detail information such as low-level spatial features. Other improved RoI extraction methods [49], [50] expand the RoI range without taking this issue into account. We propose the RoI fusion module to enrich the low-level spatial features by adding RoI extraction on the original images. Noting that the original image has a higher resolution than the feature map, we perform RoI extraction on the original image by resampling it to a 28 × 28 patch, generating a 7 × 7 patch by channel rearrangement, expanding it to the same dimension as the feature map, and finally adding it to the RoI extracted from the feature map to obtain the final RoI. The RoI with more detail information is then processed by the second stage. The simple fully connected layer structure in most methods [10], [34], [42], [51] cannot fully utilize the more complex feature information. In order to fully exploit the feature information of the fused RoI, we construct a feature induction module to discriminate and generalize the features of the RoI, using the self-attention mechanism of Transformer [30] and combining convolutional and fully connected layers to enhance the discriminative ability of the network for complex features.
The aforementioned is our design of RoIF-Net, a new second-stage detection structure, which is used to fully utilize the advantages of the two-stage detection network and improve the classification and positioning accuracy of the second stage detection head.
The main contributions of this article are as follows.
1) The dense object determination method is proposed, which defines dense objects by interclass distance, intraclass distance, minimum distance, and minimum edge length. By this method, the objects are classified into dense and nondense objects. Further, our experimental results show that these defined dense objects are not relatively more difficult to be detected in RSI.
2) The second-stage detection head RoIF-Net consisting of RoI fusion module and feature induction module is proposed, which increases the detail information of RoI by extracting RoI from the original image and improves the feature discrimination and generalization ability through the self-attention mechanism, and this structure can be applied to any two-stage detection network to improve the detection accuracy.
3) The proposed method achieves SOTA for rotated object detection on three strongly credible and highly challenging RSI object detection datasets: DOTA, DIOR-R, and UCAS-AOD with 81.80%, 68.49%, and 90.25% mAP, respectively.
The rest of this article is structured as follows. Section II covers the recent work on RSI object detection. Section III presents a detailed introduction of the proposed dense object definition method and RoIF-Net. Section IV is about the demonstration and analysis of the experimental results. Section V presents the conclusion and outlook of this article.

A. Dense Object Detection
Some objects in the image are densely packed together, and many works mention that these objects are difficult to detect [19], [20], [21], [22], [23], [24], [25], [26], [27]. In natural images, commodity detection on supermarket shelves [21], [22], face and pedestrian detection in crowded scenes [23], [24], etc., may be performed on a large number of dense objects. The distribution of objects in RSI is extremely uneven, and many scenes such as airports, parking lots, and ship ports have a high number of dense objects clustered together, as shown in Fig. 2. Yingxue et al. [20] select only aircraft, vehicles, and ships in the DOTA [3] dataset for detection, which have a higher probability of dense alignment. Shu et al. [19] obtain the object center point before generating an accurate object bounding box in order to detect dense buildings. Ming et al. [27] proposed a coordinate attention module to deal with the severe performance degradation caused by minor position deviations in dense small object detection. Li et al. [26] enhance the shallow feature information of small and dense objects by skip-connecting manually extracted shallow features to the deep network after processing. Deng and Yang [25] proposed a multistep sampling strategy to improve the probability of dense objects being sampled during the training process. However, these works give no clear determination of dense objects and no separate analysis of detection results for dense and nondense objects, only qualitative statements that dense objects are relatively difficult to detect. Whether denseness is a factor causing the difficulty of object detection remains to be confirmed. The study of dense objects contributes to the further development of RSI object detection techniques. In this article, a method to determine dense objects is proposed. The objects in the dataset are divided into dense objects and nondense objects. According to their detection results in the experiments, whether they are difficult to detect is analyzed.

B. RSI Object Detection Based on Deep Learning
The annotation method commonly used in object detection is the minimum external HBB. In natural image object detection datasets, horizontal boxes are almost always used to annotate objects. The remote sensing object detection datasets with high credibility, such as NWPU VHR-10 [1], RSOD [2], DOTA [3], and DIOR [4], have horizontal box annotation. The network detects the object by generating a HBB, the position of which is used to locate the object. The position of the rectangular box can be represented by four parameters, which are typically the coordinates of the center point, the length and the width of the rectangular box. It is also necessary to classify the objects in the detection box. Therefore, the final result of the detection network generally consists of a regression part and a classification part. In natural image object detection, classical networks, such as SSD [31], RCNN series [28], [29], [32], [33], [34], YOLO series [35], [36], [37], [38], and RetinaNet [39], are all HBB-based object detection networks. These networks can be directly applied to RSI object detection, but the effect needs to be improved.
RSI are captured by aerial imaging devices from a top-down view, where the object is on the earth's surface, and has an arbitrary rotation angle in this view. If an HBB is used to locate an object, the box will contain a large amount of background when the object direction deviates from the horizontal or vertical angle and will contain other objects when the objects are densely distributed, and the above-mentioned phenomenon is more obvious when the object aspect ratio is large. The OBB with angle can solve the above problem, and the object can be included to the maximum extent when the rectangular box direction is the same as the object direction. In RSI object detection, more and more datasets use OBB to label objects, such as DOTA [3], UCAS-AOD [5], HRSC2016 [6], and DIOR-R [7]. Determining an OBB is simply a matter of adding the angle to the HBB, so adding the angle information prediction branch to a network structure based on HBB detection can achieve OBB object detection. In order to achieve more accurate orientation detection, some detectors improve the network structure specifically for angle prediction [40], [41], [42], [43], [44], [45], [46], [47], [48]. DCL [44] and CSL [45] convert the angle detection from a regression problem to a classification problem. Oriented RCNN [42] determines OBB by the minimum external HBB and the distance from the vertex of the OBB to the midpoint of the edge of that HBB. CenterMap [41] determines the OBB by generating a foreground region heat map that has high heat value in the center region of the object and low heat value in the edge region. GWD [46] and KLD [47] learn OBB with angular information by regression loss based on Gaussian model.
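As a small illustration of the point that an OBB is an HBB-style box plus an angle, the following sketch converts a five-parameter box (cx, cy, w, h, θ) into its four corners and derives the minimum enclosing HBB. The function names and the angle convention (radians, measured from the x-axis) are assumptions made for this example only; datasets and detectors differ in how they define the angle.

```python
import numpy as np

def obb_to_corners(cx, cy, w, h, theta):
    """Return the 4 corners (4x2 array) of an OBB.

    Assumed convention: theta is in radians and rotates the box about its center.
    """
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    half = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                     [w / 2, h / 2], [-w / 2, h / 2]])
    return half @ rot.T + np.array([cx, cy])

def corners_to_hbb(corners):
    """Minimum enclosing horizontal box (x_min, y_min, x_max, y_max)."""
    x_min, y_min = corners.min(axis=0)
    x_max, y_max = corners.max(axis=0)
    return x_min, y_min, x_max, y_max

def polygon_area(corners):
    """Shoelace formula for a simple polygon given ordered vertices."""
    x, y = corners[:, 0], corners[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# An elongated object (40 x 10) rotated by 45 degrees.
corners = obb_to_corners(50.0, 50.0, 40.0, 10.0, np.pi / 4)
x_min, y_min, x_max, y_max = corners_to_hbb(corners)
obb_area = polygon_area(corners)
hbb_area = (x_max - x_min) * (y_max - y_min)
```

For this elongated, rotated box the enclosing HBB covers roughly three times the area of the OBB, which is the extra background the paragraph above refers to.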

C. Two-Stage Detection Network
The two-stage network is an important object detection strategy based on deep learning. In the two-stage detection network, the first stage performs preliminary localization and classification of the object, then the RoI at the corresponding location is extracted on the feature map based on the initially obtained detection box, and finally, the second stage performs more accurate position regression and classification of the RoI. Two-stage detection network is proposed for the first time in RCNN [32], in which selective search is used to extract RoI. In fast RCNN [28], RoI pooling is proposed to provide uniform size input for the second stage. Then in faster RCNN [33], region proposal network is used instead of selective search for the first-stage detection, and the most commonly used two-stage detection network is built. In RSI object detection, two-stage detection networks are also widely adopted and continuously improved for their high accuracy [10], [42], [49], [50], [51]. Li et al. [49] and Gong et al. [50] obtain object context information by extracting and fusing a wider range of RoI. RoI Transformer [51] adds angle prediction for RoI based on HBB in the second stage, and then extracts RoI based on OBB for final detection. Oriented RCNN [42] directly extracts RoI based on the OBB in the second stage according to the angle prediction in the first stage.
Based on the two-stage detection network, a number of other variant forms have been further developed, which can be categorized as two-stage detection networks in a broad sense. Cai and Vasconcelos [34] proposed Cascade RCNN to discriminate positive samples by incremental intersection-over-union (IoU) thresholds in multiple detection stages, and multistage detection networks were thus generated and developed. The key for the two-stage detector to be able to achieve the second detection is to align the features according to the detection box obtained from the first-stage detection before performing the second detection, which is the function achieved by RoI extraction. In deformable convolution [52], [53], the shape of the convolution kernel is not fixed and can vary. Inspired by this, the refinement detector was proposed. In the refinement detector, the shape of the deformable convolution kernel is set according to the shape of the detection box obtained in the first stage, and feature alignment is achieved by convolution using such a kernel. S 2 A-NET [54] and R 3 DET [55] use this method, avoiding the RoI extraction step, and the detector is implemented by a fully convolutional network.
In the second-stage detection network proposed in this article, RoI extraction on the original image is added in RoI fusion module to enrich the feature information, especially the low-level features with detail information. A feature induction module is designed to discriminate and generalize the features in RoI using a self-attention mechanism, which complements the detail features added in RoI and enhances the network's discrimination of confusable features. The increased capability of the second-stage detection network allows for a higher level of detection accuracy across the detector.

III. PROPOSED METHOD
In this section, we describe the proposed dense object determination method and the second-stage detection head in detail. First, dense object determination method is detailed in Section III-A. Next, the overall network structure of two-stage detector is introduced in Section III-B. Finally, the designed RoIF-Net is introduced in Section III-C.

A. Determination Method for Dense Object
In previous research work, there is no specific definition of dense objects to determine dense objects, let alone to analyze the detection results of dense objects. In order to be able to analyze the detection effect of the network on dense objects, we designed the method for determining dense objects. According to this method, the objects in the dataset can be divided into dense and nondense objects so that the detection effect can be analyzed in the network test using evaluation metrics for both dense and nondense objects. Because the object detection dataset uses rectangular boxes to label the objects, we need to use the object location information from the rectangular box annotations to identify dense objects in the dataset. If the HBB annotations are used to determine dense objects, it will result in a situation where the rectangular boxes are dense or even overlapping while the objects in the boxes are still far away from each other. This is because the HBB does not contain any object pose information, and the box may contain a large amount of background in addition to the object, and the area of the rectangular box cannot be approximated as the object area.
Therefore, we use OBB annotation to determine the dense object. The OBB has the angular information of the object, which can closely contain the object compared to the HBB, which contains less background, and the area of the OBB can be approximated as the area of the object. In the determination method we designed, the OBB represents the object and is used for judgment. When considering how to perform dense object determination, if only the minimum distance between objects is used to determine, it cannot represent the complex position relationship between objects and the judgment result is not satisfactory. If the number of objects in a region is counted to determine the dense area, it is not possible to quantify the relationship between an object and its surrounding objects. In order to quantify whether an object is dense or not and to take into account the position relationship of all pixels between objects as much as possible, we designed two conditions for the determination. If an object and any other object in the same image satisfy these two conditions, the object is considered as dense.
The first determination condition is based on the ratio of the interobject distance to the intraobject distance. In pattern discriminant analysis, for a pattern class $\{a_i\}$, $i = 1, 2, \ldots, K_a$, the intraclass distance is given by (1); the smaller the intraclass distance is, the higher the degree of aggregation of this pattern class. If there is another pattern class $\{b_i\}$, $i = 1, 2, \ldots, K_b$, its interclass distance with the previous pattern class is given by (2), which can be used as a measure of the separability of these two pattern classes: a larger interclass distance indicates better separability. We consider each object in the image as a pattern class, and each pixel within the object as a sample of this pattern class. We calculate the intraclass distance of each object and the interclass distance of every two objects, and use the ratio of the interclass distance to the intraclass distance to determine the denseness: the smaller the interclass distance relative to the intraclass distance, the denser the two objects are. Since we can only use the rectangular box annotation in the dataset to identify the object, we treat the OBB as the object when calculating the intraclass and interclass distances. The distance between sample points within one object class and the distance between sample points belonging to different object classes are shown in Fig. 3. In the determination process, for any object in the image, its intraclass distance is calculated as $D_w$. If there is another object in the image whose interclass distance to it is $D_b$, and the ratio of $D_b$ to $D_w$ is less than the threshold $T$ we set, then both objects meet the first determination condition for dense objects. The first determination condition of our design can be expressed by
$$\frac{D_b}{D_w} < T. \tag{3}$$
In this formula, the larger the threshold $T$ is, the larger the ratio of interclass distance to intraclass distance that still satisfies the condition, and the less dense the selected objects are. Conversely, the smaller the threshold $T$, the higher the denseness.
The determination of dense objects is relatively subjective, and the threshold T can be set freely according to the required degree of denseness.
The second criterion is related to the minimum distance between objects and the minimum side length of objects. The minimum distance is the closest distance between two objects. We think that the minimum distance needs to be limited when determining dense objects. Considering that the larger the object size is, the larger its feature scale is, the minimum distance restriction in the determination condition should be relaxed for larger objects. We use the minimum side length of the object as the factor limiting the minimum distance. In addition, we consider that the minimum distance limit should not increase in equal proportion to the object size, because the number of pixels increases with the object size, and the more pixels between objects, the less dense they are. So, we square the minimum side length of the object to reduce the rate of increase of the minimum distance limit. We still treat the OBB as the object when calculating the minimum distance and the minimum edge length. The minimum side length of an object and the minimum distance between two objects are shown in Fig. 3. In the determination process, for an object in the image whose minimum edge length is l, if there exists another object in the image at minimum distance d from it, and the two objects satisfy (4), then both objects satisfy the second determination condition.
In this formula, similar to T in (3), a is used as a moderator to adjust the severity of this condition to meet different subjective needs. The smaller a is, the stricter the minimum distance restriction and the more dense the object is. An object in an image is considered as a dense object when it satisfies both of the determination conditions we designed. The first condition, (3), determines whether an object is dense or not by comparing the dispersion of the object with another object and its own dispersion, and this form of determination considers the denseness from the totality of the two object areas. The second determination condition, (4), ensures that the closest points between two objects can be within a threshold value. The combination of these two conditions considers the denseness both from the whole object area and from a single pixel point, which can determine dense objects in a more reasonable way. The severity of the determination conditions can be adjusted according to different subjective requirements. We divide the objects in the DOTA [3], DIOR-R [7], and UCAS-AOD [5] datasets into dense and nondense objects according to the designed determination method with different values of T and different values of a. The results are given in Table I. As can be seen from the table, the number of dense objects in the datasets increases with increasing threshold T and moderator a. This is due to the fact that the larger the threshold T or the moderator a is, the more lenient the determination conditions are, as described earlier. Furthermore, in Tables II and III, we give the division between the DOTA and the DIOR datasets for each category of objects when the threshold T is 7.75 and the moderator a is 5. The results of dividing dense and nondense objects in some images with three different thresholds T and three different moderators a are shown in Figs. 4 and 5. 
The comparison of the results visually demonstrates that the dense object determination condition is more relaxed when T or a is larger. And it can be seen that the determination method we designed can ideally distinguish dense objects from nondense objects. The effectiveness of this method is proved.
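The two-condition test described above can be sketched in a few lines. This is only an illustrative approximation, not the authors' implementation: the exact forms of (1), (2), and (4) are not reproduced in the text, so the sketch assumes (a) intraclass and interclass distances are mean pairwise Euclidean distances between grid points sampled inside each OBB, and (b) the second condition has the sublinear form d < a·sqrt(l). All function names are hypothetical. The default values T = 7.75 and a = 5 are the ones used for Tables II and III.

```python
import numpy as np

def sample_points(corners, n=8):
    """n x n grid of points inside an OBB given as a (4, 2) array of ordered corners."""
    p0, p1, _, p3 = corners
    t = (np.arange(n) + 0.5) / n
    u, v = np.meshgrid(t, t)
    return p0 + u.reshape(-1, 1) * (p1 - p0) + v.reshape(-1, 1) * (p3 - p0)

def mean_pairwise_distance(a, b):
    """Assumed distance measure: mean Euclidean distance over all point pairs."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1).mean()

def point_segment_distance(p, s0, s1):
    d = s1 - s0
    t = np.clip(np.dot(p - s0, d) / np.dot(d, d), 0.0, 1.0)
    return np.linalg.norm(p - (s0 + t * d))

def min_distance(c1, c2):
    """Closest distance between two non-overlapping boxes (vertex-to-edge search)."""
    best = np.inf
    for a, b in ((c1, c2), (c2, c1)):
        for p in a:
            for i in range(4):
                best = min(best, point_segment_distance(p, b[i], b[(i + 1) % 4]))
    return best

def min_edge_length(corners):
    return min(np.linalg.norm(corners[1] - corners[0]),
               np.linalg.norm(corners[3] - corners[0]))

def is_dense_pair(c1, c2, T=7.75, a=5.0):
    """True if object c1 and object c2 satisfy both (assumed) conditions."""
    pts1, pts2 = sample_points(c1), sample_points(c2)
    d_w = mean_pairwise_distance(pts1, pts1)   # intraclass distance of c1 (grid approximation)
    d_b = mean_pairwise_distance(pts1, pts2)   # interclass distance between c1 and c2
    cond1 = d_b / d_w < T                      # condition (3)
    cond2 = min_distance(c1, c2) < a * np.sqrt(min_edge_length(c1))  # assumed form of (4)
    return bool(cond1 and cond2)

# Two car-sized boxes 4 pixels apart, and one far away.
box_a = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 20.0], [0.0, 20.0]])
box_near = box_a + np.array([14.0, 0.0])
box_far = box_a + np.array([200.0, 0.0])
near_is_dense = is_dense_pair(box_a, box_near)
far_is_dense = is_dense_pair(box_a, box_far)
```

Under these assumptions the neighboring box is flagged as dense and the distant one is not. Raising T or a relaxes the test, which matches the trend reported in Table I.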

B. Two-Stage Detection Network as Baseline
In this article, we propose the second-stage detection structure RoIF-Net, which is part of a two-stage detection network. The two-stage detection network detects the object in the image twice, and the first detection generates relatively coarse localization and classification results, followed by further localization correction and accurate classification in the second-stage detection network based on the first generated coarse results. The overall structure of the classical two-stage detection network is shown in Fig. 6. The image to be detected is fed into the network as an input, and first, the backbone network is used for feature extraction, then the extracted feature map is fused in the neck network to generate a multiscale feature map [56], next the first-stage detection head is used to detect on the multiscale feature map to obtain preliminary results. The detection results of the first stage generally include the regression and classification results of the object boxes. During the training process, the generated detection boxes need to be matched with the annotated real objects, and the position regression losses are calculated based on the mutually matched detection boxes and real boxes. These detection boxes that can match to the real boxes are classified as foreground and those that are not matched are classified as background, and the classification loss is obtained according to the foreground and background, and the network parameters are updated by these losses so that the network gradually learns how to detect the objects. RoI extraction is performed based on the object box obtained from the first-stage detection, and the extracted RoI is fed into the second-stage detection head for adjustment, which includes position regression adjustment and accurate category classification. 
During the training process, the regression loss is calculated again in the same way, and the classification loss is calculated according to the object class, from which the network parameters are then updated. The detection result after the second-stage detection head adjustment is the final result of the whole two-stage detector. In a two-stage detection network, the loss function generally consists of two major parts, one for the losses generated by the first-stage detection head and the other for the losses generated by the second-stage detection head. As mentioned above, the loss generated at each stage contains a regression loss and a classification loss; the network learns the object location and size through the regression loss and the object class through the classification loss. The overall loss function therefore consists of two parts, which are the first-stage loss and the second-stage loss. In the losses of these two stages,

C. Structure of RoIF-Net
The second stage outputs the final detection results, which plays an important role in the two-stage detection network and is a key factor for the two-stage detection network to obtain high accuracy. Noting the importance of the second stage, in order to give full play to its role in the overall detection network, we designed the second-stage network structure RoIF-Net, as shown in Fig. 7. RoIF-Net is divided into two parts. One is the RoI fusion module, which simultaneously performs RoI extraction on the feature map and the original image and fuses them together. The other is the feature induction module based on the self-attention mechanism, which is able to discriminate and generalize the features and generates the final adjustment results.
Fig. 6. Overall framework of a two-stage detection network. The classical two-stage detection network consists of the backbone, the neck, the first-stage detection head, and the second-stage detection head. The image to be detected is input into the backbone, and the second-stage detection head generates the final results.
In the RoI fusion module, we extract RoI not only on the feature map but also on the original image in order to obtain more detail features. The feature map obtained after backbone extraction has sufficient high-level features; however, the most original detail information disappears after passing through the deep network, and supplementing this detail information helps the second-stage detection network localize and classify objects more accurately. In this structure, according to the detection boxes generated by the first stage, RoI extraction is performed both on the feature map obtained from the backbone and on the original image that is the input to the whole network. When extracting RoI on the feature map, as in the classical two-stage detection network Faster RCNN [33], we resample the corresponding range into a 7 × 7 patch, regardless of the detection box size. When extracting RoI on the original image, we resample the range corresponding to the detection box into a 28 × 28 patch in order to avoid losing too much information, since the original image is much larger than the feature map (at least four times in general). Next, in order to fuse the two RoIs, we invert the subpixel convolution strategy [57], reducing the spatial size of the RoI on the original image by expanding the number of channels without losing information, then expanding the channel dimension to that of the RoI on the feature map with a 1 × 1 convolution kernel, and finally fusing the two extracted RoIs by summation.
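The fusion step can be sketched with plain NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the image RoI is assumed to have 3 channels, the feature RoI 256 channels, the 1 × 1 convolution weights are random stand-ins for learned parameters, and the RoI resampling itself (for example, rotated RoI align) is omitted. The helper names are hypothetical.

```python
import numpy as np

def space_to_depth(x, r):
    """Inverse of the sub-pixel (pixel shuffle) rearrangement.

    (C, H*r, W*r) -> (C*r*r, H, W); every input value is kept, only moved to a channel.
    """
    c, hr, wr = x.shape
    h, w = hr // r, wr // r
    x = x.reshape(c, h, r, w, r)
    x = x.transpose(0, 2, 4, 1, 3)          # (c, r, r, h, w)
    return x.reshape(c * r * r, h, w)

def conv1x1(x, weight, bias):
    """1 x 1 convolution as a per-pixel linear map; weight has shape (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', weight, x) + bias[:, None, None]

rng = np.random.default_rng(0)
image_roi = rng.random((3, 28, 28))         # RoI resampled from the original image (assumed RGB)
feature_roi = rng.random((256, 7, 7))       # RoI resampled from the feature map
weight = rng.normal(scale=0.02, size=(256, 48))  # stand-in for learned 1x1 conv weights
bias = np.zeros(256)

rearranged = space_to_depth(image_roi, 4)   # (48, 7, 7): 28x28 spatial grid folded into channels
fused_roi = feature_roi + conv1x1(rearranged, weight, bias)  # element-wise sum, (256, 7, 7)
```

Because the rearrangement only moves pixels into channels, the 28 × 28 patch is reduced to 7 × 7 without interpolation or pooling, which is what allows the fusion to preserve detail.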
We construct the feature induction module using the transformer structure based on the self-attention mechanism [30], as shown in Fig. 8. This part makes full use of the RoI with fused detail features to enhance the discrimination of confusable features and perform feature generalization, which improves the classification and localization accuracy of the second-stage detection network. In this structure, we first expand the RoI obtained in the previous structure into 49 256-dimensional feature vectors and use them as 49 tokens of the transformer. Three feature matrices $Q$, $K$, and $V$ are generated from the input tokens, in which each 256-dimensional feature vector is decomposed into four 64-dimensional feature vectors. $Q$ and $K^T$ are multiplied to get the self-attention weight matrix. Since there is a positional relationship between feature points in an image, it is important to add positional information to the feature points used as token inputs. We use the positional encoding matrix in Swin Transformer [58], which contains the relative positional information between every two feature points and helps the network to judge the spatial location of features. The self-attention output is obtained by adding the position matrix to the weight matrix, applying softmax, and then multiplying the result by the $V$ matrix. The four 64-dimensional vectors are concatenated back into a 256-dimensional feature vector, and then the flattened spatial dimension is restored to obtain the RoI after the self-attention mechanism. Then, using the residual mechanism [59], it is summed with the original RoI. Finally, the regression and classification results are obtained after passing through two convolution layers and two fully connected layers.
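The attention step described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions rather than the authors' implementation: the projection matrices and the relative position bias table are random stand-ins for learned parameters, the 1/sqrt(d) scaling is the standard Transformer choice and is not stated in the text, and the final convolution and fully connected layers are omitted. Function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_position_index(size):
    """(N, N) index into a bias table of length (2*size-1)**2, in the style of Swin Transformer."""
    coords = np.stack(np.meshgrid(np.arange(size), np.arange(size), indexing='ij')).reshape(2, -1)
    rel = coords[:, :, None] - coords[:, None, :] + (size - 1)   # shift offsets to start at 0
    return rel[0] * (2 * size - 1) + rel[1]

def roi_self_attention(roi, wq, wk, wv, bias_table, heads=4):
    """Multi-head self-attention over a square RoI of shape (C, H, H), with a residual connection."""
    c, h, w = roi.shape
    n, d = h * w, c // heads
    tokens = roi.reshape(c, n).T                         # (49, 256): one token per spatial position

    def split(x):
        return x.reshape(n, heads, d).transpose(1, 0, 2)  # (heads, 49, 64)

    q, k, v = split(tokens @ wq), split(tokens @ wk), split(tokens @ wv)
    attn = q @ k.transpose(0, 2, 1) / np.sqrt(d)         # (heads, 49, 49); scaling is an assumption
    attn = attn + bias_table[:, relative_position_index(h)]  # add relative position bias
    out = softmax(attn) @ v                              # (heads, 49, 64)
    out = out.transpose(1, 0, 2).reshape(n, c)           # merge heads back to 256 dimensions
    return roi + out.T.reshape(c, h, w)                  # restore spatial layout, add residual

rng = np.random.default_rng(0)
roi = rng.random((256, 7, 7))
wq, wk, wv = (rng.normal(scale=0.02, size=(256, 256)) for _ in range(3))
bias_table = rng.normal(scale=0.02, size=(4, 13 * 13))   # one table per head
attended = roi_self_attention(roi, wq, wk, wv, bias_table)
```

The output keeps the 256 × 7 × 7 shape of the fused RoI, so it can be passed directly to the subsequent convolution and fully connected layers.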
RoIF-Net adds detail features by fusing the RoI extracted from the original image to provide more information for the final regression and classification, uses a self-attention mechanism to enhance the discrimination of confusable features, and finally performs feature generalization to achieve high-quality adjustment of the detection box. Since the second stage is relatively independent in the detector, RoIF-Net can theoretically be applied to any two-stage detection network. Simply replacing the second-stage detection head with RoIF-Net can improve the detection performance of the original two-stage detector.

IV. EXPERIMENTS AND ANALYSIS
A. Datasets
1) DOTA-v1.0: DOTA-v1.0 [3] is a large aerial remote sensing dataset, which contains 2806 aerial images collected from Google Earth, satellite JL-1, etc. It has 188 282 ground objects annotated on it for object detection tasks, some of which are arranged very densely. These objects cover 15 common categories, namely plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC). There are two types of annotation in this dataset: HBB and OBB. In this article, we use OBB annotation for dense object determination and all experiments. The whole dataset has been divided into three parts by the publisher: the training set, the validation set, and the test set, and their ratio is 3:1:2.
The images in the DOTA dataset vary in size, with a large gap between the minimum size of 800 × 800 pixels and the maximum size of 4000 × 4000 pixels. To avoid the loss of image information caused by resizing, we cropped the original images into a series of 1024 × 1024 patches as input. The experiments performed on this dataset in this article all use the multiscale data augmentation method, using three scale factors (0.5, 1.0, 1.5) to resize the original image, and the crop step is 512. If the instances are segmented at the time of cropping, we decide whether to use them or not according to the method in [3]. In the test, we map the detection results to the original size image before evaluation.
3) UCAS-AOD: UCAS-AOD [5] is a publicly available high-definition aerial photography dataset for object detection, in which images are captured in selected regions of the world using Google Earth. The dataset contains 1510 images of approximately 1300 × 700 in size, which are labeled with two types of objects: car and plane. The labels are in the form of HBB and OBB, and in this article, we use OBB annotations for our experiments. The images are randomly divided into a training set, a validation set, and a test set in the ratio of 5:2:3.
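The DOTA cropping scheme described above (1024 × 1024 patches, a step of 512, and scale factors 0.5, 1.0, and 1.5) can be sketched as follows. The handling of image borders, where the last window is shifted back so that it ends at the image edge, is an assumption, and the function names patch_origins and crop_boxes are hypothetical.

```python
def patch_origins(length, patch=1024, step=512):
    """Start offsets of sliding windows along one axis of size `length`."""
    if length <= patch:
        return [0]                            # image smaller than a patch: one window
    starts = list(range(0, length - patch + 1, step))
    if starts[-1] + patch < length:           # assumption: shift a final window to the edge
        starts.append(length - patch)
    return starts

def crop_boxes(width, height, scales=(0.5, 1.0, 1.5), patch=1024, step=512):
    """List (scale, x0, y0, x1, y1) crop windows for every rescaled copy of the image."""
    boxes = []
    for s in scales:
        w, h = round(width * s), round(height * s)
        for y in patch_origins(h, patch, step):
            for x in patch_origins(w, patch, step):
                boxes.append((s, x, y, min(x + patch, w), min(y + patch, h)))
    return boxes

origins_4000 = patch_origins(4000)
boxes_800 = crop_boxes(800, 800)
```

For a 4000-pixel axis this yields seven overlapping windows, and adjacent windows share at least half a patch, so an object cut by one window is usually whole in a neighboring one.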

B. Implementation Details
The backbone network used in the following experiments is first pretrained on ImageNet [60], and the network is initialized using the parameters obtained from the pretraining. In the training phase, two Nvidia RTX3090 GPUs are used to perform the experiments, and the batch size of a single GPU is set to 2, for a total of 4. When ResNet50 [59] is used as the backbone, the SGD optimizer is used to perform gradient updates of the model parameters, where the initial learning rate is set to 0.005, the learning rate is reduced to 1/10 of its current value at each decay, and the momentum and weight decay are set to 0.9 and 0.0001, respectively. When ConvNeXT [61] or Swin [58] is used as the backbone network, the AdamW [62], [63] optimizer is used to perform gradient updates, where the initial learning rate is set to 0.0001, the learning rate is reduced to 1/10 of its current value at each decay, and the weight decay is set to 0.05. When the DOTA dataset is used for experiments, the network is trained for 12 epochs, and the learning rate is decayed after the 8th and 11th epochs. In the training process, we use data augmentation strategies, such as random flipping, random rotation, and multiscale scaling, to increase the variety of the training data. When using DIOR-R, the network is again trained for 12 epochs, the learning rate is again decayed after the 8th and 11th epochs, and data augmentation strategies such as random flipping and random rotation are used. When using UCAS-AOD, the network is trained for 36 epochs, and the learning rate is decayed after the 24th and 33rd epochs. In the testing phase, we use a single Nvidia RTX3090 GPU for inference. We keep the bounding boxes with confidence scores greater than 0.05 and set the IoU threshold of NMS to 0.1. At the same time, considering that an image contains a limited number of objects, we set the maximum number of objects in each image to 2000.
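The step decay schedule described above can be written as a small helper. This is a sketch under the assumption that epochs are counted from 1 and that "after the 8th epoch" means the reduced rate first applies in epoch 9; the function name step_lr is hypothetical.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate used during `epoch` (1-indexed) with decay after each milestone."""
    decays = sum(1 for m in milestones if epoch > m)   # milestones already passed
    return base_lr * gamma ** decays

# DOTA / DIOR-R with ResNet50 and SGD: 12 epochs, decay after epochs 8 and 11
dota_schedule = [step_lr(e, 0.005, (8, 11)) for e in range(1, 13)]

# UCAS-AOD with AdamW backbones: 36 epochs, decay after epochs 24 and 33
ucas_schedule = [step_lr(e, 0.0001, (24, 33)) for e in range(1, 37)]
```

Under these assumptions the DOTA schedule stays at 0.005 for epochs 1 to 8, drops to 0.0005 for epochs 9 to 11, and ends at 0.00005 in epoch 12.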

C. Evaluation Metrics
When evaluating the effectiveness of detection networks, a uniform set of criteria is needed. Average precision (AP) is the most authoritative evaluation metric in object detection, and the calculation of this value is related to two basic and credible evaluation metrics: precision and recall. To judge that the network correctly detects an object, two conditions need to be satisfied. The first condition is that the IoU between the detection box and the ground truth box is greater than 0.5, and the second is that the network classifies the object in the detection box correctly. On this basis, the precision and recall are determined by the formulas
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP is the number of objects determined by the network to be positive samples and detected correctly, FP is the number of objects determined by the network to be positive samples but detected incorrectly, and FN is the number of ground truth objects determined by the network to be negative samples. When evaluating the detection results, as the confidence score threshold decreases, more and more objects are detected, and the recall becomes higher. Generally speaking, the precision decreases as the recall increases. AP is the integral of the precision over the recall from 0 to 1, which can be expressed as
$$AP = \int_{0}^{1} P(R)\,dR$$
where P(R) denotes the precision at recall R. The values at 11 recall rates (0, 0.1, …, 1) are generally used for the calculation. AP provides a valid evaluation of the detection results for one class of objects. When evaluating the detection results for multiple classes of objects, mAP, which is the average of the AP values over all classes, can be used.
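The precision, recall, and 11-point AP defined above can be computed as in the following sketch. It assumes that detections have already been matched to ground truth (IoU greater than 0.5 and correct class) and sorted by descending confidence, and it takes the maximum precision at or beyond each recall level, which is the PASCAL VOC 2007 convention rather than something stated in this article. The function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def ap_11_point(is_tp, num_gt):
    """11-point AP for one class.

    is_tp: booleans for detections sorted by descending confidence.
    num_gt: number of ground truth objects of this class.
    """
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(is_tp, start=1):
        tp += hit
        precisions.append(tp / rank)     # precision after the top `rank` detections
        recalls.append(tp / num_gt)      # recall after the top `rank` detections
    total = 0.0
    for k in range(11):                  # recall levels 0, 0.1, ..., 1
        level = k / 10
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(reachable, default=0.0)
    return total / 11

pr = precision_recall(tp=8, fp=2, fn=2)
ap = ap_11_point([True, True, False, True], num_gt=4)
```

In the second call, three of four ground truth objects are found, so the recall levels above 0.75 contribute zero precision and the AP is 7.5/11. The mAP is then the mean of such per-class values.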
In this article, we evaluate the detection results of dense and nondense objects separately. The determination of density involves objects other than the one to be determined. When determining dense objects on the detection results, the other objects are not necessarily ground truth objects, so the determination results are not credible. Calculating the precision of dense and nondense objects separately requires the number of dense and nondense objects in the detection results. Since the dense objects determined on the detection results are not credible, the precision in this case does not provide a valid assessment of the detection results. In contrast, calculating the recall separately requires only the number of dense and nondense objects in the ground truth. The determination of dense or nondense objects on the ground truth is credible, so the recall can still effectively evaluate the detection results. For these reasons, we evaluate the detection results of dense and nondense objects using only the recall, without using the precision or the AP, which depends on the precision.
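The recall-only evaluation described above can be sketched as follows, assuming each ground truth object already carries a dense or nondense label from the determination method and a flag saying whether some detection matched it. Both input lists and the name recall_by_density are hypothetical.

```python
def recall_by_density(gt_is_dense, gt_is_matched):
    """Recall computed separately over dense and nondense ground truth objects."""
    counts = {"dense": [0, 0], "nondense": [0, 0]}   # [matched, total] per group
    for dense, matched in zip(gt_is_dense, gt_is_matched):
        key = "dense" if dense else "nondense"
        counts[key][0] += int(matched)
        counts[key][1] += 1
    # An empty group has no defined recall, so report None for it.
    return {k: (m / t if t else None) for k, (m, t) in counts.items()}

recalls = recall_by_density(
    gt_is_dense=[True, True, False, False, False],
    gt_is_matched=[True, True, True, False, False],
)
```

Only ground truth labels enter the grouping, so the density of a detection never has to be judged, which is the reason the text gives for avoiding precision here.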

D. Analysis of Dense Object Detection Results
We classify the objects in the DIOR-R [7] dataset into dense and nondense objects according to the determination method proposed in Section III-A with three thresholds T representing different densities: 5.25, 7.75, and 10.25. RoI Transformer [51] is used to detect both groups, and the recall of each group is calculated separately. The results of the experiment are given in Table IV. The recall rate of dense objects is 8.20% higher than that of nondense objects when T is 5.25, 6.58% higher when T is 7.75, and 4.76% higher when T is 10.25. Similarly, we performed the same experiments using three different values of the modulator a: 3, 5, and 7, and the results are given in Table V. The recall rate of dense objects is 5.60% higher than that of nondense objects when a is 3, 6.58% higher when a is 5, and 8.93% higher when a is 7. It can be seen that the recall rate of dense objects is higher under both relatively strict and relatively lenient determination conditions. In addition, for the dense and nondense objects discriminated in the DOTA [3], DIOR-R [7], and UCAS-AOD [5] datasets under the relatively moderate T value of 7.75 and a value of 5, we use two different detection networks, Faster RCNN [33] and RoI Transformer, to detect them and calculate their recall separately. In this experiment, we use the training set for training and the validation set for testing on the DOTA dataset, and the trainval set for training and the test set for testing on the DIOR-R and UCAS-AOD datasets. An imbalance of training samples between dense and nondense objects would bias what the network learns and affect the judgment of the results, so before training we balance the training samples to make the number of dense and nondense objects the same. The results of the experiment are given in Table VI. On the DOTA dataset, the recall of dense objects obtained by Faster RCNN is 3.91% higher than that of nondense objects, and the recall of dense objects obtained by RoI Transformer is 5.21% higher than that of nondense objects.
On the DIOR-R dataset, the recall of dense objects obtained by Faster RCNN is 6.14% higher than that of nondense objects, and the recall of dense objects obtained by RoI Transformer is 6.58% higher than that of nondense objects. On the UCAS-AOD dataset, the recall of dense objects obtained by Faster RCNN is 3.78% higher than that of nondense objects, and the recall of dense objects obtained by RoI Transformer is 2.68% higher than that of nondense objects. This experiment shows that the overall recall of dense objects is somewhat higher than that of nondense objects when tested on different datasets using different networks. This result differs from our intuitive understanding and from other works that describe dense objects as harder to detect. Those works only state qualitatively that dense objects are more difficult to detect, without relevant experimental proof, while our results rely on experiments and are relatively more credible. In Fig. 9, we display the feature maps extracted by the backbone and sent to the detection head as heat maps. To show the characteristics of the feature maps more comprehensively, we average the values over all feature channels and convert them into a heat map. From the circles marked in the figure, we can see that the areas with dense objects have higher values, while the areas with sparse objects have lower values, indicating that the network has a higher response to areas with dense objects. The network achieves the object detection task based on the recognition of various different features. The features in a region with a large number of objects are richer and denser, so the network has a high response to this region. In addition, in a natural image taken from a front or side view, multiple objects may be at different depths, resulting in mutual occlusion that causes the loss of object features.
This phenomenon is more likely to occur in areas with dense targets, which has a great adverse impact on the detection of dense objects. In RSI, by contrast, the overhead view means that the imaged targets are objects on the ground surface that rarely occlude each other, so the object features are basically complete, with few missing cases. From the above-mentioned analysis, it can be concluded that densely distributed objects in RSI are not more difficult to detect than nondense objects.

E. Ablation Experiments of RoIF-Net
In this section, we perform ablation experiments to verify the effectiveness of the proposed second-stage detection head RoIF-Net. The experiments are all trained on the training set of the DOTA [3] dataset and tested on the test set. We use mAP as a criterion to evaluate the performance of the method.

1) Ablation Experiments of RoI Fusion Module and Feature Induction Module: The overall structure of RoIF-Net is divided into two parts: the RoI fusion module and the feature induction module. In the RoI fusion module, the RoIs are extracted from the original image and the feature maps, and they are summed and fused together to increase the detail features. The feature induction module uses a self-attention mechanism to discriminate and generalize the features to produce the final regression and classification results. In this section, we present and analyze the results of the ablation experiments of these two modules. In the experiment, RoI Transformer [51] is used as the baseline and ConvNeXT [61] as the backbone. We design four groups of experiments, in which the first group uses the traditional second-stage detection structure of the original method, the second group uses only the RoI fusion module, the third group uses only the feature induction module, and the fourth group uses both modules. The network in each group is trained and tested separately. The test results are given in Table VII. It can be seen from the table that the RoI fusion module does not improve mAP when used alone and even causes a 0.05% drop, the feature induction module gives a slight 0.19% improvement in mAP when used alone, and only when the two modules are used together does mAP improve significantly, by 0.64%. When the RoI fusion module is used alone, the added detail features from the original image are not further extracted and generalized, so they hardly help the final regression and classification of the second-stage network. The feature induction module mainly consists of a self-attention mechanism, which works poorly on the rich high-level semantic features extracted from the feature map but works better on the detail features extracted from the original image. The two modules complement each other and must be used together to fully utilize the capabilities of RoIF-Net.
2) Ablation Experiments of RoIF-Net in Different Two-Stage Detectors: As described in Section III-C, the RoIF-Net we designed can be placed in an arbitrary two-stage detector. In this section, we use different two-stage detection networks as baselines and change the second stage to RoIF-Net for ablation experiments. In this experiment, we use three two-stage detection networks, i.e., Faster RCNN [33], Oriented RCNN [42], and RoI Transformer [51], with different backbones. The experimental results are given in Table VIII, from which it can be seen that the mAP of the networks improved by 0.65%, 0.41%, 0.56%, and 0.64%, respectively, after using the RoIF-Net we designed. This result indicates that RoIF-Net improves the accuracy of the second-stage detection head across different two-stage detection networks and therefore generalizes well.

3) Analysis of the Computational Complexity of RoIF-Net:
We use Faster RCNN as the baseline for experiments and analyze the computational complexity of the proposed method using the number of model parameters, floating point operations (FLOPs), and frames per second (FPS). The experimental results are given in Table IX, from which it can be seen that using our proposed second-stage detection structure RoIF-Net on the basis of Faster RCNN, the number of model parameters and FLOPs increase by 0.86M and 27.04G, respectively, and the FPS has a reduction of 2.46, while the mAP improves by 0.65%. This is due to the addition of RoI extraction and convolution operations in the RoI fusion module and self-attention and convolution operations in the feature induction module. These two modules improve the detection effect, but reduce the detection efficiency.

F. Comparison With Advanced Methods
In this section, we compare the method proposed in this article with other classical and advanced methods on the internationally credible and challenging public datasets DOTA [3], DIOR-R [7], and UCAS-AOD [5]. In the experiments of this section, our method uses the two-stage detection network RoI Transformer [51] as baseline and replaces the second-stage detection head with RoIF-Net. The datasets are described in Section IV-A, and the experimental parameters are set in Section IV-B.
1) Comparison Results on DOTA: On the DOTA dataset, we compared our method with a variety of advanced and classical methods, and the results are given in Table X. As can be seen from the table, our proposed RoIF-Net achieves 81.80% mAP when using ConvNeXT [61] as the backbone network, which outperforms all other results in the table and reaches the current SOTA level. Out of all 15 detection categories, we have the best or the second best results in 7 categories, which demonstrates the advantage of our method in OBB object detection. In addition, when using Swin [58] as the backbone, it achieves 81.20% mAP, which is still better than the other methods and is the second best result in the table. Some visual detection results on the DOTA dataset are shown in Fig. 10. It can be seen from the figure that, although the background in the images is complex, the size difference between objects is large, and the objects have arbitrary orientations, each type of object can still be detected well.
2) Comparison Results on DIOR-R: We also compare with several classical and advanced methods on the DIOR-R dataset, and the results are given in Table XI. The DIOR-R dataset has 20 categories and is relatively more challenging. As can be seen from the table, our proposed method RoIF-Net achieves 65.12% mAP when using ResNet50 [59] as the backbone, which is better than all other methods and reaches the SOTA. In addition, it achieves 68.49% mAP when using ConvNeXT [61] as the backbone, a result that is significantly ahead of all other results in the table. Among all 20 detection categories, we have the best results in 11 categories. In some categories, such as APL, ESA, GTF, STA, and TC, we have a large AP lead compared to other methods. Some visual detection results of our method on this dataset are shown in Fig. 11. It can be seen from the figure that, although the categories are diverse and the detection is difficult, our method produces few errors in the detection of objects with arbitrary orientations.

3) Comparison Results on UCAS-AOD:
Similarly, on the test set of UCAS-AOD, we compared our method with other methods, and the results are given in Table XII. This dataset has only two types of objects, so the detection difficulty is relatively low. As shown in the table, our proposed RoIF-Net achieves an mAP of 90.25%, which is better than the other methods. Of the two detection categories, our method is the best for plane and the second best for car. Some of the visualized detection results on this dataset are shown in Fig. 12. As can be seen from the figure, good results are obtained for the detection of cars and planes with arbitrary orientations in different scenarios.

V. CONCLUSION AND DISCUSSION
In this article, we design a dense object determination method based on OBB annotation, according to which the objects in the datasets are classified as dense and nondense objects. Their detection results show that dense objects in RSI are not more difficult to detect than nondense objects. Our work still has certain limitations: the determination method of dense objects can be further optimized and improved, and the effect of dense distribution on object feature recognition under different environmental conditions can be studied in more depth. The important contribution of this work is to provide an idea for quantifying the denseness of objects, which hopefully will help to enrich and deepen the study of dense objects in future work. We propose RoIF-Net to improve the detection effectiveness of OBB-based two-stage networks. It adds detail information by fusing the RoIs extracted from the original image and the feature maps, and constructs a feature induction module to realize the final position regression and category classification. We demonstrate the effectiveness of our proposed method through extensive experiments on the DOTA, DIOR-R, and UCAS-AOD datasets, and the OBB detection results achieve SOTA on these datasets. However, this method is only applicable to two-stage detection methods and increases the computational complexity, which causes a decrease in detection efficiency. In future work, how to efficiently utilize detail features so that the network avoids background influence when identifying object features is of great research value.