Context-Based Oriented Object Detector for Small Objects in Remote Sensing Imagery

Object detection in remote sensing imagery is a challenging task in the field of computer vision and has high research value. To improve the classification and localization accuracy of object detection, we propose a new multi-scale oriented object detector suited to small objects. Firstly, a feature fusion network based on information balance (IBFF) is proposed to reduce the reuse of features from different layers of the backbone network and the interference of redundant information, on the premise that the output features carry sufficient information and retain enough shallow detail. Secondly, to efficiently utilize deep and shallow features, enhance important features, and reduce background noise interference, different attention-based context feature fusion (DACFF) modules are designed according to the characteristics of the different feature fusion stages. Finally, an improved oriented bounding box regression strategy is proposed to obtain the oriented bounding box in a simpler and more effective way. The proposed method was evaluated on two public remote sensing datasets, DOTA and HRSC2016, achieving mAP values of 80.96% and 95.01%, respectively, which verifies the effectiveness of the proposed algorithm.

[...] remote sensing imagery, the object detection algorithms in this field still face great challenges.

(The associate editor coordinating the review of this manuscript and approving it for publication was Hossein Rahmani.)

Remote sensing datasets mainly have the following characteristics: (1) [...] bounding box cannot tightly surround the object. (2) Small and dense objects. The numerous small objects in remote sensing imagery account for a large proportion of all objects and are densely distributed. These objects interfere with each other, making it very easy to miss detections. (3) Objects with large size changes. The sizes of objects across categories, or within the same category under different resolutions, vary greatly, resulting in large scale changes and great detection difficulty. Because of the above characteristics, object detection in remote sensing images is difficult. Therefore, appropriate methods should be selected according to the characteristics of remote sensing datasets, and traditional object detection methods should be improved before being applied to object detection in remote sensing imagery.

Earlier optical object detection algorithms in remote sensing imagery were based on handcrafted features. Although these algorithms benefit from strong interpretability, they [...] and the geometric details of the feature map will be [...]

4) To make the prediction box achieve oriented enclosure, the strategy of oriented bounding box regression is improved, and the oriented bounding box is obtained with a simpler and more effective strategy that brings an obvious improvement on slender objects.

The rest of this paper is organized as follows. The relevant work is outlined in Section II. Section III describes the proposed algorithm in detail, and Section IV verifies the effectiveness of our proposed algorithm by comparing it with other methods.
Section V provides a summary.

You Only Look Once [38] and Single Shot MultiBox Detector [5] are classic one-stage algorithms. Although one-stage algorithms have advantages in speed, their accuracy is lower than that of two-stage algorithms. The region-based detection algorithms [3], [4] generate a group of more accurate proposals in the first stage and send them to the RCNN network; in the second stage, the algorithm carries out classification and regression, which greatly improves the accuracy of object detection. Due to the particularity of objects in remote sensing imagery, directly applying these general algorithms does not yield good results; thus, these algorithms need to be adjusted and improved. The RoI Transformer [14] converts the horizontal region of interest output by the RPN into a rotated region of interest. This strategy does not need to increase the number of anchors and can obtain an accurate rotated region of interest. Although the detection accuracy of this method is improved, it still leaves much room for improvement, especially for small objects and very large objects. SCRDet [31] designs a sampling fusion network that integrates multi-layer features with effective anchor sampling to improve the detection sensitivity of small objects. The attention mechanism has a certain effect on improving object detection, and adding it to feature extraction also works well for the detection of remote sensing images. CAD-Net [26] uses spatial attention to learn object collaboration and integrates global context information into object detection. While improving detection accuracy, improving efficiency is also very important. R³Det [36] uses horizontal anchors in the first stage to obtain faster speeds and more proposals.
In the refinement stage, this algorithm uses refined rotated anchors to adapt to dense scenes. It is found that the rotation feature plays an important role in detection and classification. ReDet [24] [...] but it also has important reference value for other object detection. In the first stage of R²CNN [51], the region proposal network still uses the horizontal bounding box to extract the region of interest; in the second stage, the oriented bounding box is regressed based on the horizontal candidate region to reduce memory consumption. The RoI Transformer [14] then inserts an RRoI learner between the RPN and RCNN to convert the horizontal region generated by the RPN into a rotated region, ensuring high efficiency and low complexity. Although the RRoI learner can capture rotational features, it cannot give the generated feature map rotational invariance. Therefore, the authors in [24] used ReDet to add a rotation-equivariant network to the detector to extract rotation-equivariant features, thus accurately predicting the direction and reducing the size of the model. To avoid confusion about the sequential label points for oriented objects, the authors in [27] located the quadrilateral by learning the offsets of four points on a non-rotated rectangle and used the quadrilateral to determine the position of the object.

The overall framework of the network is shown in Fig. 1. Firstly, the image is input into the backbone network (ResNet-101) for feature extraction. Then, the features of the three layers from the conv3, conv4, and conv5 modules are extracted. We next send these three features into the IBFF network for cross-layer fusion to obtain three scale feature maps with richer positioning data and semantic information. Then, we use the RPN network to obtain the regions of interest and expand the threshold of NMS to reduce the risk of missing dense objects [20]. Finally, according to the classification and regression results, the final oriented bounding box is generated using the improved oriented bounding box algorithm.
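The NMS step mentioned above can be sketched as follows. This is a minimal greedy NMS over axis-aligned boxes, not the paper's implementation; it shows why a looser IoU threshold retains more overlapping candidates in dense scenes:

```python
import numpy as np

def nms(boxes, scores, iou_thresh):
    """Greedy NMS over axis-aligned boxes given as [x1, y1, x2, y2]."""
    order = scores.argsort()[::-1]        # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # IoU of the top-scoring box with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]   # discard boxes overlapping too much
    return keep

boxes = np.array([[0, 0, 10, 10], [4, 0, 14, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores, 0.3))  # [0, 2]: the strict threshold suppresses the overlapping box
print(nms(boxes, scores, 0.6))  # [0, 1, 2]: the looser threshold keeps it
```

For densely packed objects, neighboring true detections overlap each other, so raising the threshold (keeping more of them) reduces missed detections at the cost of more duplicates.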

There are multi-scale objects in the imagery, which makes the information content of each pixel on the imagery significantly different. This situation means that the optimal network depths for dealing with small and large objects differ. As the number of network layers increases, the edge and other detail information of the detected object is gradually lost, and semantic information is also gradually lost once the network depth exceeds a certain threshold. Compared with large objects, the loss of detail information of small objects is more severe as model depth increases, and the optimal level for extracting their semantic information is also relatively shallow. Therefore, previous work shows that fusing features from different layers can perform better in classification and regression. In order to fully retain the detail information of small objects and extract rich semantic information, a feature fusion network structure is designed, as shown in Fig. 1. In this way, the information of different layers is used more evenly in the process of feature fusion, so that each object retains as much semantic information and detail information as possible, which is conducive to the detection of small objects. [...] relatively balanced information, without excessive redundancy and with less interference noise, which is conducive to improving the performance of object detection.

The attention mechanism emphasizes or selects the important information contained in the processing object by redistributing the weight parameters. This mechanism also suppresses some irrelevant detail information and focuses attention on useful information. The shallow feature map has higher resolution and contains more geometric detail information, which is beneficial to small object detection.
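As a rough illustration of the information-balance idea behind IBFF (the paper's exact wiring is not reproduced here), one can rescale each level to a common resolution, average the levels so every layer contributes equally, and redistribute the balanced map back to each level. The sketch below assumes all levels have already been projected to a common channel count, as FPN-style 1×1 convolutions would do:

```python
import torch
import torch.nn.functional as F

def balanced_fuse(feats):
    """Illustrative balance-style fusion over a list of feature maps
    (finest to coarsest), all with the same channel count."""
    ref = feats[len(feats) // 2].shape[-2:]           # middle level as reference size
    resized = [F.interpolate(f, size=ref, mode='nearest') for f in feats]
    balanced = sum(resized) / len(resized)            # equal contribution per layer
    # resize the balanced map back and strengthen each original level with it
    return [f + F.interpolate(balanced, size=f.shape[-2:], mode='nearest')
            for f in feats]
```

Averaging, rather than accumulating features level by level, is what keeps shallow detail and deep semantics equally represented in every output map.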
The deep feature map [...]

M_s(F) = σ(f^{7×7}([F^s_avg; F^s_max]))

where f^{7×7} represents the convolution operation with a filter size of 7 × 7, and F^s_avg ∈ R^{1×H×W} and F^s_max ∈ R^{1×H×W} represent the results obtained after average pooling and maximum pooling along the channel direction, respectively. After concatenating them, an effective feature descriptor is generated. After applying the convolution and the sigmoid function, the feature descriptor becomes a spatial attention map that emphasizes or suppresses location information.

The idea of this method is intuitively depicted in Fig. 4. [...] The loss is as follows:

L = λ₁ L_h + λ₂ L_α + λ₃ L_r

where L_h is the loss of horizontal bounding box regression, which is the same as that in [4]; L_α is the regression loss of the length ratios (α₁, α₂, α₃, α₄); L_r is the regression loss of the obliquity factor r; and λ₁, λ₂, and λ₃ are hyperparameters that balance the importance of each loss term.
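The spatial attention map described above can be sketched in a few lines. This is the standard CBAM-style formulation that matches the description; the full DACFF modules, which vary per fusion stage, are not reproduced:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise average and max pooling, concatenation,
    a 7x7 convolution, then a sigmoid producing a 1xHxW map that emphasizes
    or suppresses locations."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                     # x: (N, C, H, W)
        f_avg = x.mean(dim=1, keepdim=True)   # F^s_avg: (N, 1, H, W)
        f_max = x.amax(dim=1, keepdim=True)   # F^s_max: (N, 1, H, W)
        m = torch.sigmoid(self.conv(torch.cat([f_avg, f_max], dim=1)))
        return x * m                          # reweight every spatial location
```

The output keeps the input shape; only the per-location weighting changes.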

In this section, we demonstrate the effectiveness of our method on DOTA and HRSC2016. Then, we compare our method with the state-of-the-art methods. [...] As shown in Fig. 5, statistics are provided on the sizes of objects in the training set of DOTA. This dataset presents significant challenges. As shown in Fig. 5(a), most objects in the dataset are small- and medium-size objects, with small objects accounting for 62.01% and medium objects accounting for 31.06%. The dataset also contains many slender objects. The model is trained for 180,000 iterations, in which the learning rate starts at 0.0005 and decreases to 0.0001 and 0.00001 at 120,000 and 150,000 iterations, respectively.
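The DOTA step schedule above amounts to a simple piecewise-constant rule (assuming each drop takes effect exactly at the stated iteration):

```python
def lr_at(iteration):
    """Learning-rate step schedule used for DOTA training:
    5e-4, dropped to 1e-4 at 120,000 and to 1e-5 at 150,000 iterations."""
    if iteration < 120_000:
        return 5e-4
    if iteration < 150_000:
        return 1e-4
    return 1e-5
```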

HRSC2016 Dataset: HRSC2016 is a challenging ship detection dataset with arbitrary orientations. The dataset is obtained from Google Earth, and all images are taken at six famous ports. The size of each image in the dataset ranges between 300 × 300 and 1500 × 900, with a total of 2,976 instances. As shown in Fig. 5(b), the dataset contains a large number of slender objects with large aspect ratios. The training set and test set include 617 and 444 images, respectively. For experiments on HRSC2016, we scale the images to 512 × 512 for training and testing. Here, the model is trained for 120,000 iterations, in which the learning rate starts at 0.0005 and decreases to 0.0001 and 0.00001 at 80,000 and 100,000 iterations, respectively.

B. EXPERIMENTAL SETUP AND PARAMETER EVALUATION
The experiments are conducted under the PyTorch deep learning framework using a GeForce RTX 2080Ti GPU, with stochastic gradient descent (SGD) used to train the network. Here, the momentum parameter is 0.949, and the initial learning rate is 0.0005. In the training phase, hard negative mining is used, and random rotation, random clipping, and random [...] are applied for data augmentation. The evaluation metric is the mAP, the mean of the AP in all categories. The calculation formula is as follows:

mAP = (1/N) Σ_{i=1}^{N} AP_i

where N is the number of categories and AP_i is the average precision of category i.
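As a concrete reference, a common VOC-style computation of per-class AP (area under the interpolated precision–recall curve) and its mean over categories; the paper does not state its exact interpolation scheme, so this is a sketch:

```python
import numpy as np

def voc_ap(recall, precision):
    """All-point interpolated AP: area under the precision-recall curve
    after making precision monotonically non-increasing."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):   # precision envelope
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]    # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_ap(per_class_ap):
    """mAP: arithmetic mean of per-class AP values (fractions in [0, 1])."""
    return sum(per_class_ap) / len(per_class_ap)

print(voc_ap([0.5, 1.0], [1.0, 0.5]))  # 0.75
```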

Results on the DOTA Dataset: As shown in Table 1, compared with the state-of-the-art methods, our method achieves the highest AP in five categories. For example, the AP of bridges is 63.94%, the AP of planes is 91.62%, the AP of ships is 89.03%, the AP of large vehicles is 88.69%, and the AP of small vehicles is 80.61%. These results show that our proposed method has good detection performance for small, dense, and slender objects. The experimental results in Fig. 6 further qualitatively demonstrate that the proposed model offers good detection performance for such objects, highlighting its effectiveness.

However, a small-scale physical object does not correspond directly to a small-scale detection object. In short, an object may still occupy a relatively large pixel region, and thus obtain a more complete information expression, in the dataset image. In response to this problem, we conduct a performance analysis; the results are shown in Table 2. The AP results using our method are 79.17%, 85.89%, and 72.14% for small, medium, and large objects, respectively. It can be clearly seen that with our method, small-scale objects show significant improvements in detection performance, which further verifies the effectiveness of the proposed strategy.

Results on the HRSC2016 Dataset: The objects in DOTA include a large number of aircraft, storage tanks, and baseball diamonds, as well as other objects with small aspect ratios. These types of objects can achieve good bounding box effects even with a certain angle deviation, because they approach a square bounding box. Thus, it is difficult to effectively [...] As shown in Table 3, our method has an advantage when using the strategy based on an oriented bounding box, and the mAP is 95.01%.

In order to further verify the effectiveness of the proposed algorithm, we also compare the inference speed of the networks on the dataset. When the input image size is 512 × 800, the inference speed of the model reaches 15.1 FPS. Compared with other two-stage models, our model's inference is faster, which verifies that the proposed algorithm also offers a certain improvement in detection efficiency.

As shown in Table 4, we design the following ablation experiments:

(1) Instead of using a feature fusion network, the output features of modules conv3, conv4, and conv5 in Fig. 1 are [...] the proposed IBFF network is adopted, but the proposed DACFF module is not used; (5) IBFF and DACFF are used at the same time.

The experimental results are shown in Table 4.

The feature fusion network proposed in this paper offers good performance and has the highest mAP, 78.99%. Moreover, the AP values of most categories are improved. Specifically, the AP values of small objects are significantly improved, while those of large objects are improved less. These results show that the proposed feature fusion network based on information balance can effectively improve the detection precision of small objects.

We compare the parameters, GFLOPs, and FPS of the different feature fusion networks; the experimental results are shown in Table 5. Compared with the common FPN+PAN feature fusion networks, the proposed feature fusion network has higher detection efficiency. With an input image size of 1024 × 1024, the FPS of the IBFF method reaches 13.5, which is higher than that of the FPN+PAN method. Compared with FPN+PAN, IBFF also has fewer parameters, only 66.1 M. This verifies that the proposed algorithm offers a certain improvement in detection efficiency. [...] (2) the multi-scale structure is not adopted, but the DACFF module is adopted; (3) the multi-scale structure is adopted, but not the DACFF module; (4) the multi-scale structure and DACFF module are adopted at the same time.

The experimental results are shown in Table 6. Whether or not the network uses a multi-scale scheme, the feature fusion module we proposed can effectively improve the mAP of the detection network. The mAP of the multi-scale scheme with the DACFF module is the highest, 4.22% higher than that of the scheme without either the multi-scale structure or the DACFF module. The experimental results show that the mAP improvement achieved by our approach is not due to the multi-scale structure alone. Although the multi-scale structure can improve the mAP to a certain extent, adding the proposed feature fusion module plays a large role. When the multi-scale structure is not adopted, the mAP when adding the three proposed feature fusion modules is 1.82% higher than when not adding them. When using the multi-scale structure, the mAP when adding the three proposed feature fusion modules is 1.97% higher than when not adding them. As shown in Table 4, after adding the three proposed feature fusion modules, the detection results of each class are improved to varying degrees. The experimental results show that the proposed feature fusion module can effectively improve the detection precision of the network.

To further demonstrate the effectiveness of the proposed feature fusion module, we next compare the intermediate heatmaps with and without the DACFF module. Fig. 8 shows the comparison of heatmaps at different levels of the network before and after the introduction of the DACFF module. The first row shows the results when the module is not introduced, while the second row shows the results after its introduction. The first to third columns present the visualization results of F_c1, F_c2, and F_c3, respectively. As shown in the figure, as the level deepens, the heatmap increasingly highlights the detected objects. We believe that this result is due to the deepening of the level [...]

The basic mechanism of most object detection algorithms is to cover the object area with a bounding box, capture the image within the bounding box, and send it to the classifier to judge the object's attributes. Ideally, the bounding box is tight around the object. However, that is often not the case. If the proportion of the object covered by the bounding box is small, or the background accounts for a large proportion of the bounding box, accurate classification will be difficult. Therefore, it is important to improve the IoU of the algorithm for object detection. Fig. 10 shows that our improved approach provides varying degrees of improvement for different categories of objects. The improvement is obvious for slender objects, such as bridge, harbor, ground track field, ship, large vehicle, and small vehicle, as represented by solid lines (the other objects are represented by dotted lines in the figure). However, for objects with an aspect ratio close to 1, the improvement is small.
Experimental results show that our improved oriented bounding box scheme has better detection performance, especially for slender objects.
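The sensitivity of slender objects to localization error can be seen even with plain axis-aligned IoU; the boxes below are illustrative numbers only. For the same one-pixel shift, a square box keeps most of its IoU, while a slender box of equal area loses most of it:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # intersection height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

square = (0, 0, 10, 10)    # aspect ratio 1,  area 100
slender = (0, 0, 50, 2)    # aspect ratio 25, area 100
print(iou(square, (1, 0, 11, 10)))   # ~0.818: the square tolerates the shift
print(iou(slender, (0, 1, 50, 3)))   # ~0.333: the slender box does not
```

This is why improvements to the regression strategy show up most clearly for bridges, harbors, ships, and vehicles.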

This paper proposes an oriented object detector based on context information. When fusing feature maps of different scales, the proposed IBFF feature fusion network emphasizes the extraction and retention of shallow feature information, which enhances object edges and improves localization during detection, especially for small objects. In addition, the feature fusion network adopts attention-based context feature fusion modules that efficiently extract the important information contained in the deep and shallow feature maps according to the characteristics of different layers, reduce noise interference, and further improve the detection performance of the model. We also improve the oriented bounding box regression strategy to further improve positioning accuracy. Our model is evaluated on two public remote sensing imagery datasets, DOTA and HRSC2016, and the mAP values are 80.96% and 95.01%, respectively. The experimental results show the effectiveness of the proposed algorithm.