Learning Calibrated-Guidance for Object Detection in Aerial Images

Object detection is one of the most fundamental yet challenging research topics in computer vision. Recently, research on this topic in aerial images has made tremendous progress. However, complex backgrounds and poor imaging quality remain obvious problems in aerial object detection. Most state-of-the-art approaches develop elaborate attention mechanisms for space-time feature calibration at a high computational cost, while surprisingly ignoring the importance of channel-wise feature calibration. In this work, we propose a simple yet effective Calibrated-Guidance (CG) scheme to enhance channel communications in a feature transformer fashion, which adaptively determines the calibration weights for each channel based on the global feature affinity correlations. Specifically, for a given set of feature maps, CG first computes the feature similarity between each channel and the remaining channels as the intermediary calibration guidance. Then, it re-represents each channel by aggregating all the channels, weighted by the guidance. Our CG is a general module that can be plugged into any deep neural network; we name the resulting network CG-Net. To demonstrate its effectiveness and efficiency, extensive experiments are carried out on both oriented and horizontal object detection tasks in aerial images. Experimental results on two challenging benchmarks (DOTA and HRSC2016) demonstrate that CG-Net achieves new state-of-the-art accuracy with a fair computational overhead. The source code is available at https://github.com/WeiZongqi/CG-Net


I. INTRODUCTION
Object detection in aerial images is one of the most fundamental yet challenging research tasks, which aims to assign a bounding box with a unique semantic category label to each surface object in the given aerial images [1]-[5]. This task is critical for a wide range of downstream tasks, e.g., land resource management, ecological monitoring, and land ecosystem evaluation [6], [7]. Thanks to the recent promising development of deep Convolutional Neural Networks (CNNs) in image processing, object detection in aerial images has also made tremendous progress. The state-of-the-art approaches are usually based on a one-stage detector (e.g., RetinaNet [8] and YOLO [9]) or a two-stage detector (e.g., Fast/Faster R-CNN [10], [11]) with a CNN as the backbone.
Compared to objects in general natural scenes, objects in aerial images usually have smaller sizes, higher density, larger variation in size, worse imaging quality, and more complex backgrounds [14], [15]. Therefore, it is difficult to directly achieve satisfying recognition performance in aerial images using the existing natural scene object detectors. To this end, state-of-the-art methods focus on developing effective head networks [1], adaptive dense anchor generators [2], and labeling strategies [3], [5]. Besides, effective feature learning strategies play a crucial role, because they can provide generalized features that improve model performance. Accordingly, a large number of feature calibration methods based on attention mechanisms have been proposed to improve the rough feature representations in CNNs [4], [6], [16]-[19]. Conceptually, these attention-based methods can be divided into two categories: (I) the spatial-attention-based ones, and (II) the channel-attention-based ones. For the first category (e.g., spatial attention module [17], [19], [20], recurrent attention structure [6], self-attention mechanism [21], and non-local operation [22]), as shown in Figure 1 (a), a global context mapping for each feature position is obtained by computing the similarities between the feature at that position and the features at all the remaining positions [23], [24]. Through such an operation, each pixel can obtain the long-range dependence information of the input image. For the second category (e.g., channel attention module [20], channel-wise attention [17], [25], and the squeeze-and-excitation block [4], [16], [26]), as shown in Figure 1 (b), each channel obtains a weight that reflects its own importance in object detection, and the weight is then integrated into the model by channel re-weighting.

Figure 2. [...] where CG is deployed on both the intra-layer feature maps and the feature pyramid (i.e., the standard feature pyramid network [12]). In comparison, feature maps with the CG module have a stronger representation ability. After that, we use a task-specific head network for both oriented and horizontal object detection tasks in aerial images. ResNet [13] is used as the backbone network.
Despite the success of the existing attention-based methods in calibrating features for object detection, we argue that most of them do not calibrate features sufficiently across channels. That is to say, they cannot introduce channel communications to capture the dependencies between channel feature maps, which have empirically shown their benefits in a wide range of visual recognition tasks [27]-[32]. Although the existing channel-attention-based methods enable different channels to obtain different weights, modules that operate on each channel feature map separately (e.g., global average/max pooling) cannot guarantee sufficient communication among all the channels. From this point of view, these methods are still local.
To address these problems, including objects of different sizes and complex backgrounds in aerial images as well as the limitations of existing attention-based methods in calibrating features, we propose a simple yet effective Calibrated-Guidance (CG) scheme to enhance channel communications in a feature transformer fashion, which adaptively determines the calibration weights for each feature channel based on the global feature affinity-pairs. CG is an active feature communication mechanism, as illustrated in Figure 1 (c), which explicitly introduces feature dependencies in a channel-wise manner. Specifically, CG is applied to the pyramid features, both within and between the layers of the pyramid, where the pyramid layer features are also regarded as "channels" of the overall pyramid features. CG consists of two steps. First, feature similarities (via the dot product operation) between each channel and the remaining channels are computed as the intermediary calibration guidance. Then, we represent each channel by aggregating all the channels, weighted by the guidance. The weighted feature maps have the same spatial size as the input feature maps, but contain richer information about long-range channel dependencies. To handle the typical problems of aerial images within and between pyramid layers, we propose Base CG and Rearranged Pyramid CG, which calibrate features locally and globally, respectively.
CG is a general unit that can be plugged into any deep neural network. We name a CNN model equipped with the proposed CG module CG-Net. The overall architecture is shown in Figure 2. To demonstrate its effectiveness and efficiency, we conduct extensive experiments on both oriented and horizontal object detection tasks. Experimental results on the challenging benchmarks DOTA [14] and HRSC2016 [33] for oriented object detection show that our proposed CG-Net brings substantial improvements over the baseline methods and achieves state-of-the-art accuracy (i.e., 77.89% and 90.58% mAP, respectively) with a fair computational overhead. Besides, experimental results on DOTA [14] for horizontal object detection also validate the flexibility and effectiveness of the proposed CG-Net, which achieves new state-of-the-art performance with 78.26% mAP.
In summary, our main contributions are two-fold:
• A simple yet effective CG scheme is proposed to enhance channel communications in a feature transformer fashion; it is applied within and between feature pyramid layers to enhance the pyramid representation.
• We propose CG-Net, which achieves state-of-the-art oriented and horizontal object detection performance on two challenging benchmarks for aerial images, DOTA and HRSC2016.

II. RELATED WORK
A. Object Detection in Aerial Images.
The purpose of object detection in aerial images is to locate objects of interest on the ground with bounding boxes and recognize their categories [15], [34]. Each bounding box contains not only the object coordinate information, but also the category information. Object detection in aerial images can be divided into horizontal and oriented approaches. Horizontal object detection aims to detect objects with horizontal bounding boxes [8], [9], [11], [35]. Being observed from an overhead perspective, the objects in aerial images present more diversified orientations. Oriented object detection [1]-[5], [36]-[44] is an extension of horizontal object detection that outlines objects more accurately, especially those with large aspect ratios.
Building on horizontal object detection, learning rotated boxes is an important part of oriented object detection, and many methods address how to rotate boxes. CSL [3] designs a detection framework that transforms angular prediction from a regression task into a classification task. Gliding Vertex [38] glides each vertex of the horizontal bounding box along its corresponding side, regressing four length ratios that characterize the relative gliding offsets, to accurately describe a multi-oriented object. DAL [2] proposes a dynamic anchor learning method, which utilizes a newly defined matching degree to comprehensively evaluate the localization potential of the anchors. RoI Trans [1] proposes a RoI Transformer to address the mismatches between the Regions of Interest (RoIs) and objects during training. CFC-Net [40] proposes a Critical Feature Capturing Network that addresses the problem of discriminative features in object detection by refining preset anchors, building powerful feature representations, and optimizing label assignment. R-RPN [44] overcomes the limitation of RoI pooling when extracting ship features with various aspect ratios. For fast and accurate oriented object detection, R3Det [42] and O2-DNet [43] explore one-stage models with RetinaNet and anchor-free structures. Based on R3Det, R3Det-DCL [5] designs Densely Coded Labels (DCL) for angle classification, which replace the Sparsely Coded Label (SCL) used in earlier classification-based detectors and speed up training about threefold, while also bringing notable improvements in detection accuracy. Moreover, for oriented object detection, SCRDet [4] combines pixel and channel attention networks for small and cluttered objects. DEA [45] leverages a sample discriminator to realize interactive sample screening between an anchor-based unit and an anchor-free unit, generating eligible samples for detection in aerial images.
In terms of the form of the bounding boxes, oriented object detection is more suitable for aerial object detection, because its boxes carry the orientation information of objects and are therefore more accurate. In this work, we consider both oriented and horizontal aerial object detection tasks and develop a pipeline that benefits both of them.

B. Feature Calibration over Images.
The purpose of feature calibration is to refine feature maps using the existing information, so as to further improve their representation ability. Currently, most state-of-the-art methods are designed from the perspective of feature calibration to deal with the challenges of complex backgrounds and noise in object detection [17], [19], [20], [22], [26]. Among those methods, attention-based ones calibrate features from two aspects: spatial attention and channel attention.
Spatial-attention-based mechanisms capture object positions in the spatial dimension. The position attention module [20] and the non-local operation [22] build rich contexts over local features by using a self-attention mechanism. Transformer [46] is the first sequence transduction model built on multi-headed self-attention. DETR [21] is proposed to explore the relationship between objects in the global context; its precision is similar to that of two-stage detectors, but it is weak at detecting small objects and has a high computational overhead [47], [48]. In aerial image analysis, ARCNet [6] utilizes a recurrent attention structure to squeeze high-level semantic features for learning with fewer parameters.
Channel-attention-based mechanisms allocate resources to channels according to their importance. SENet [26] utilizes a squeeze-and-excitation block to implement dynamic channel-wise feature re-calibration. To obtain better feature representations, DANet [20] utilizes a channel attention module to capture contextual relationships based on the self-attention mechanism. For aerial images, a residual-based network combined with channel attention [16] is used to learn the most relevant high-frequency features.
There are also works that combine spatial attention with channel-wise attention, e.g., SCA-CNN [25] and DONet [49]. These methods take advantage of both channel-wise and spatial-wise attention. Besides, to address the considerable interference of complex backgrounds in aerial detection, a multi-scale spatial and channel-wise attention mechanism [17] is proposed to strengthen the object regions. Despite the success of the existing attention-based methods, they are not sufficient for feature calibration across channels. In this work, we propose a simple yet effective CG scheme to enhance channel communications in a feature transformer fashion, which adaptively determines the calibration weights for each channel based on the global feature affinity-pairs.

III. METHODOLOGY
In this section, we show the technical details of our proposed Calibrated-Guidance Network (CG-Net) for object detection in aerial images. Specifically, we first revisit the channel attention mechanism on images in Section III-A. Then, our proposed Calibrated-Guidance (CG) module which can enhance channel communications is described in Section III-B. After that, we introduce how to implement CG on the base CNNs' feature maps (i.e., Base CG) and on an intra-network feature pyramid (i.e., Rearranged Pyramid CG) for object detection in aerial images in Section III-C and Section III-D. Finally, we show the details of the network architecture in Section III-E.

A. Channel Attention Revisited
The Channel-wise Attention (CA) module utilizes the interdependencies between the channels to emphasize the important ones by weighting the similarity matrix. To be specific, CA operates on queries (Q), keys (K) and values (V) among a set of single-scale feature maps $X$, and the improved version $X'$ has the same scale as the original $X$. For a given set of feature maps $X \in \mathbb{R}^{W \times H \times C}$, where $W$, $H$ and $C$ are the width, height and channel dimension, respectively, the CA implementation can be formulated as:

$$
\begin{aligned}
\text{Input:}\quad & q_i = f_q(X_i),\; k_j = f_k(X_j),\; v_j = f_v(X_j) \\
\text{Similarity:}\quad & s_{i,j} = F_{sim}(q_i, k_j) \\
\text{Weight:}\quad & w_{i,j} = F_{nom}(s_{i,j}) \\
\text{Output:}\quad & X'_i = F_{mul}(w_{i,j}, v_j)
\end{aligned}
$$

where $f_q(\cdot)$, $f_k(\cdot)$ and $f_v(\cdot)$ denote the query/key/value channel transformer functions [21], [46]; $X_i$ and $X_j$ denote the $i$-th and $j$-th channel features in $X$; $F_{sim}$ is the dot product similarity function; $F_{nom}$ is the softmax normalization function; $F_{mul}$ denotes matrix dot multiplication; $X'_i$ is the $i$-th channel feature in the transformed feature map $X'$, and the response of the $i$-th channel feature is computed from the $j$-th ones, where $j$ enumerates all possible channels. Although CA enables different channels to obtain different weights, the coarse operation based on the entire channel feature maps (i.e., without the grouped feature representations [27], [31], [32], [46]) cannot give all the channels sufficient communication, whose importance has been empirically shown in a large range of computer vision tasks. As a result, its feature representation ability is limited.
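The four steps of channel attention described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the `(C, H, W)` memory layout, the function names, and the use of identity mappings for the query/key/value transforms are assumptions made only to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis (plays the role of F_nom)."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(X, f_q=None, f_k=None, f_v=None):
    """Channel attention over feature maps X of shape (C, H, W).

    f_q, f_k, f_v stand for the query/key/value transforms; they default
    to the identity here, which is an assumption of this sketch.
    """
    identity = lambda t: t
    f_q = f_q or identity
    f_k = f_k or identity
    f_v = f_v or identity

    C, H, W = X.shape
    q = f_q(X).reshape(C, H * W)   # one flattened query per channel
    k = f_k(X).reshape(C, H * W)   # one flattened key per channel
    v = f_v(X).reshape(C, H * W)   # one flattened value per channel

    s = q @ k.T                    # F_sim: dot-product similarity, shape (C, C)
    w = softmax(s, axis=1)         # F_nom: each row sums to one
    out = w @ v                    # F_mul: channel i aggregates all channels j
    return out.reshape(C, H, W), w
```

Because the similarity matrix has shape `(C, C)`, its size depends only on the number of channels and not on the spatial resolution, which is the property the later complexity comparison relies on.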

B. Calibrated-Guidance (CG)
We propose CG to enhance feature channel communications in a feature transformer fashion, which adaptively determines the calibration weights for the channels based on the global feature affinity-pairs. Its detailed structure is illustrated in Figure 2. CG is inspired by the transformer mechanism; the difference is that we combine the multi-head representations, concatenate the original feature maps with the calibrated features, and then use a convolution layer to produce the enhanced feature maps as output.
We deploy the multi-head architecture to obtain richer channel feature representations. The multi-head structures in ViT [50] and DETR [21] provide more feature selection when extracting features. A multi-head structure complements features by learning different contents in different heads, which is more expressive than a single head. An analysis [51] finds that the important heads in a multi-head structure have one or more specialized and interpretable functions in the model, which indirectly supports adopting the multi-head structure.
First, we divide the query and key into $N$ parts along the channel dimension. Then, we feed each divided feature of shape $(B, C/N, H, W)$ into its own head, where each head is a CG module ($B$ is the batch size). For the $n$-th head, the similarity matrix $s^n$ has shape $(B, C/N, C/N)$ and can be expressed as:

$$
s^n = \begin{bmatrix} w_{1,1} & \cdots & w_{1,C/N} \\ \vdots & \ddots & \vdots \\ w_{C/N,1} & \cdots & w_{C/N,C/N} \end{bmatrix}
$$

where each $w$ denotes a learnable similarity scalar. After that, the outputs of these heads (i.e., the partial results) are concatenated to produce the holistic output feature maps, which have the same shape as the original feature maps. The above process can be formulated as:

$$
\begin{aligned}
\text{Similarity:}\quad & s^n_{i,j} = F_{sim}(q_{i,n}, k_{j,n}) \\
\text{Weight:}\quad & w^n_{i,j} = F_{nom}(s^n_{i,j}) \\
\text{Output:}\quad & X'_i = F_{con}\big(F_{mul}(w^n_{i,j}, v_{j,n})\big)
\end{aligned}
$$

where $s^n_{i,j}$ and $w^n_{i,j}$ denote the $n$-th partial similarity weight of the $i$-th and $j$-th channel features and its normalized value, respectively. The $i$-th channel feature is calculated from the other channel features. $v_{j,n}$ denotes the $j$-th value of the $n$-th head. $F_{con}$ is used for feature concatenation in the channel dimension.

Compared to the previous transformer-based approaches, the multi-head CG has a lower computational complexity, $O(NC^2)$ in both time and space, while the previous ones have a computational complexity of $O(NH^2W^2)$. Compared to CA, our proposed CG applied to pyramid features has the following three advantages: (i) CG is designed to enhance communications within and between feature pyramid layers, while most of the previous methods are used to capture long-range dependencies in space and channel within a single set of features. (ii) CG is based on the multi-head structure, in which each head has its own tendency of feature representation in a different feature space [46], [52]. Hence CG can provide an enhanced feature representation. (iii) CG is designed for object detection in aerial images. By enhancing the feature pyramid representation, CG can alleviate the problems of complex backgrounds and poor imaging quality in aerial images, and thus obtain more accurate proposals in the head network (see Section IV-B).

Experimental results (Section IV-C) show that CG improves on the state of the art for both oriented and horizontal tasks. The two CG implementations, Base CG and Rearranged Pyramid CG, are described as follows.
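As an illustration of the multi-head procedure, the following NumPy sketch splits the channels into `num_heads` groups, computes a `(C/N, C/N)` similarity matrix per head, concatenates the head outputs, and fuses them with the original features. It is a sketch under stated assumptions rather than the released code: identity query/key/value transforms are used, and the fusing convolution is simplified to a 1 × 1 convolution whose weight matrix `fuse_weight` is a hypothetical parameter supplied by the caller.

```python
import numpy as np

def _softmax(x, axis=-1):
    """Numerically stable softmax (plays the role of F_nom)."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cg(X, fuse_weight, num_heads=2):
    """Multi-head calibrated guidance on X of shape (B, C, H, W).

    fuse_weight has shape (C, 2C) and acts as a 1x1 convolution that maps
    the concatenation [original, calibrated] back to C channels
    (a simplification assumed for this sketch).
    """
    B, C, H, W = X.shape
    if C % num_heads != 0:
        raise ValueError("C must be divisible by num_heads")

    # Split channels into heads: (B, N, C/N, H*W).
    heads = X.reshape(B, num_heads, C // num_heads, H * W)

    # Per-head similarity between channels: (B, N, C/N, C/N).
    s = heads @ heads.transpose(0, 1, 3, 2)
    w = _softmax(s, axis=-1)

    # Re-represent every channel as a weighted sum of the channels in its head,
    # then concatenate the heads along the channel dimension (F_con).
    calibrated = (w @ heads).reshape(B, C, H, W)

    # Concatenate original and calibrated features, then fuse to C channels.
    fused = np.concatenate([X, calibrated], axis=1)           # (B, 2C, H, W)
    out = np.einsum("oc,bchw->bohw", fuse_weight, fused)      # (B, C, H, W)
    return out, s
```

With `num_heads = N`, the similarity tensor holds `N * (C/N)^2` entries per sample, independent of `H` and `W`; a spatial self-attention map over the same input would instead hold `(H*W)^2` entries per head.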

C. Base CG
Given an arbitrary aerial image, we can extract a set of feature maps with a fully convolutional network. On these feature maps, CG can be applied directly to enhance channel communications and adaptively determine the calibration weight for each channel. Its detailed architecture within one level of the feature pyramid (i.e., feature maps with the same scale) is illustrated in Figure 2 (b). Since this CG implementation is performed on the basic feature maps, we call it Base CG. Base CG is a general unit that works on the backbone network.
Compared to other existing head-network-based task-specific methods [4], [53], it is more universal and can facilitate a wide range of downstream recognition tasks. Our Base CG improves feature extraction, as can be seen from the ablation experiments in Section IV-B.

D. Rearranged Pyramid CG
Feature pyramids have shown their effectiveness in a wide range of computer vision tasks [8], [12], [54]. In this section, we show how to implement our Calibrated-Guidance on a feature pyramid (i.e., the proposed Rearranged Pyramid CG (RP-CG)). Compared to the existing feature calibration methods on the in-network feature pyramid [55]-[57], our RP-CG has lower computational complexity and fewer model parameters (details are shown in Section IV-A). The RP-CG module works on a feature pyramid extracted by the feature pyramid network [12], and its architecture is illustrated in Figure 2 (c).
From the perspective of the levels inside the feature pyramid, each level can be seen as a local feature, i.e., it captures only part of the features of the input image. In order to emphasize the most suitable feature in the channel dimension of the feature pyramid, combining global and local information is crucial in feature extraction. In our work, RP-CG focuses on weighting different features among the pyramid levels $X_{P2-P6}$, following [12], [54]. As illustrated in Figure 2 (c), we apply CG across the 5 levels of the feature pyramid so that the levels fully communicate their information. In our implementation, we first reduce the channel dimension and interpolate the pyramid features $X_{P2-P6}$ to the same scale (that of the largest level, P2), and then concatenate them as $\hat{X}_{P2-P6}$, which is expressed as:

$$
\hat{X}_{P2-P6} = F_{intp}(X_{P2-P6})
$$

where $F_{intp}$ is a channel dimension reduction and scale interpolation function. The shape of the output feature $\hat{X}_{P2-P6}$ is $(B, 5, H_{p2}, W_{p2})$. Then, as in Base CG, RP-CG produces the output $X'_i$ from the inputs $q_i$, $k_j$ and $v_j$ by learning the weight between the query and the key. The interaction is formulated as:

$$
\begin{aligned}
\text{Input:}\quad & X_{P2-P6} \\
\text{Interpolation:}\quad & \hat{X}_{P2-P6} = F_{intp}(X_{P2-P6}) \\
\text{Extraction:}\quad & q_i = f_q(\hat{X}_i),\; k_j = f_k(\hat{X}_j),\; v_j = f_v(\hat{X}_j) \\
\text{Similarity:}\quad & s_{i,j} = F_{sim}(q_i, k_j) \\
\text{Weight:}\quad & w_{i,j} = F_{nom}(s_{i,j}) \\
\text{Output:}\quad & X'_i = F_{mul}(w_{i,j}, v_j)
\end{aligned}
$$

where $X'_i$ is the $i$-th level feature in the transformed feature map $X^{rpcg}_{P2-P6}$, which has shape $(B, 5, H_{p2}, W_{p2})$. $X^{rpcg}_{P2-P6}$ realizes global channel communication among the pyramid features, but it still has to be fed back to the individual pyramid levels in a suitable way.
In addition, many methods have verified the effectiveness of combining global and local information in visual recognition, and our method is global in essence. Combining our RP-CG with an existing local channel attention method is therefore a natural choice. In this work, the classical channel attention [26] is chosen. Based on this, the output of our proposed Rearranged Pyramid Calibrated-Guidance module can be expressed as:

$$
\text{Output:}\quad X^{final}_{P2-P6} = F_{conv}\big(X_{P2-P6} \oplus \tilde{X}_{P2-P6}\big)
$$

The output for $X_{P2-P6}$ is divided into 5 parts (P2 to P6). $X^{rpcg}_{P2-P6}$ is the overall feature obtained after weighting the rescaled pyramid features. We use $F_{mean}$ to derive the weighting parameter that distinguishes the features of different scales: it takes the mean value as the weighting parameter for each pyramid level, which is then resized to the scale of the original level feature. $\otimes$ is matrix cross multiplication, and $\oplus$ is channel concatenation. $\tilde{X}_{P2-P6}$ is the calibrated feature, with the same size as the original feature pyramid. We obtain the final output $X^{final}_{P2-P6}$ from the convolution $F_{conv}$, which reduces the channel dimension to its original size.
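The RP-CG data flow can be sketched in NumPy as follows. Several details are assumptions of this sketch and are not confirmed by the text: each level is reduced to a single channel by averaging, rescaling uses nearest-neighbour repetition with integer factors, the query/key/value transforms are identities, the level weight is a per-level scalar broadcast over the original level, and the final concatenation and convolution are omitted.

```python
import numpy as np

def _softmax(x, axis=-1):
    """Numerically stable softmax (plays the role of F_nom)."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rp_cg(levels):
    """Rearranged-pyramid calibration over a list of levels (P2 first).

    Each level has shape (B, C, H_l, W_l); the P2 size must be an integer
    multiple of every other level size (assumed for nearest upsampling).
    Returns the re-weighted levels and the per-level weights.
    """
    B, _, H2, W2 = levels[0].shape
    maps = []
    for x in levels:
        reduced = x.mean(axis=1)                      # channel reduction: (B, H_l, W_l)
        fh, fw = H2 // x.shape[2], W2 // x.shape[3]   # integer upsampling factors
        maps.append(reduced.repeat(fh, axis=1).repeat(fw, axis=2))
    stacked = np.stack(maps, axis=1)                  # (B, L, H2, W2), L = number of levels

    # Treat every pyramid level as one "channel" and let the levels communicate.
    flat = stacked.reshape(B, len(levels), H2 * W2)
    s = flat @ flat.transpose(0, 2, 1)                # (B, L, L) level-to-level similarity
    w = _softmax(s, axis=-1)
    x_rpcg = (w @ flat).reshape(stacked.shape)        # (B, L, H2, W2)

    # F_mean (assumed form): one scalar weight per level, broadcast to that level.
    level_weight = x_rpcg.mean(axis=(2, 3))           # (B, L)
    calibrated = [x * level_weight[:, i, None, None, None]
                  for i, x in enumerate(levels)]
    return calibrated, level_weight
```

The similarity matrix here is only `L × L` (5 × 5 for P2 to P6), which is why this step adds little computation compared with attention over spatial positions.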

E. Network Architecture
CG can help the model learn richer communication information between feature channels, so it is suitable for object detection in aerial images. In this paper, we build a Calibrated-Guidance network (CG-Net) for both oriented and horizontal object detection tasks in aerial images. The overall architecture is illustrated in Figure 2. CG-Net uses our proposed Base CG (Figure 2 (b)) and RP-CG (Figure 2 (c)) to transform the pyramid features. Specifically, we deploy ResNet [13] as the backbone following [1], pre-trained on ImageNet [58]. Then, we produce a feature pyramid with the feature pyramid network [12]. For this feature pyramid, we first apply Base CG to the feature maps of each pyramid level. After that, we deploy RP-CG to produce a new feature pyramid that realizes global and local communication in the feature pyramid. Then, we concatenate the original feature maps with the calibrated ones in the channel dimension and reduce the concatenated feature maps to 256 channels with a 3 × 3 convolution. Finally, we use the head network from the RoI Transformer [1] for oriented object detection and a standard Faster R-CNN [11] for horizontal object detection.

IV. EXPERIMENTS
To demonstrate the effectiveness and efficiency of our proposed method, experiments are carried out on both oriented and horizontal object detection tasks in aerial images. In what follows, we first describe the experimental settings, including datasets, image size, baseline model, hyperparameters, implementation details and evaluation metrics, in Section IV-A. Then we present ablation results, including quantitative and qualitative experimental results, in Section IV-B. Finally, we compare our results with state-of-the-art methods in Section IV-C.

A. Experimental Setup
In our work, two challenging datasets are selected for the experiments: A Large-Scale Dataset for Object Detection in Aerial Images (DOTA) [14] and High Resolution Ship Collections 2016 (HRSC2016) [33]. DOTA is used for both oriented and horizontal object detection. HRSC2016 is used only for oriented object detection.
• DOTA [14]: images range in size from about 800 × 800 to 4,000 × 4,000 pixels and contain objects exhibiting various scales, orientations, and shapes. For the dataset split, we follow the setting of [4], [14], and randomly select 1/2 of the original images as the training set, 1/3 as the testing set, and 1/6 as the validation set.
• HRSC2016 [33] is a ship detection dataset of aerial images with challenging problems such as arbitrary orientations and large aspect ratios. HRSC2016 contains 20 ship categories with various appearances in 1061 images, collected from 6 harbors via Google Earth. Images range in size from about 300 × 300 to 1500 × 900 pixels. For the dataset split, we follow the setting of [33]: the ratio of the training, validation, and test sets is 5 : 2 : 5, corresponding to 436, 181, and 444 images, respectively.
Because image sizes are inconsistent in the experimental datasets, and taking into account training efficiency and effect for DOTA and HRSC2016, we follow the benchmark setting of [1] and generate a list of 1,024 × 1,024 patches from the original images with a stride of 824 for the training, validation and test sets.
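The patch generation step can be sketched in plain Python. A patch size of 1,024 and a stride of 824 imply an overlap of 200 pixels between neighbouring patches. How the last patch in each row or column is placed is not stated in the text; shifting it back so that it ends at the image border is an assumption of this sketch, and the function names are hypothetical.

```python
def patch_origins(length, patch=1024, stride=824):
    """Start offsets of patches covering [0, length) along one axis."""
    if length <= patch:
        # A single patch covers the axis (it may need padding when length < patch).
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] + patch < length:
        # Assumed border handling: shift the final patch back to end at the border.
        origins.append(length - patch)
    return origins

def tile(width, height, patch=1024, stride=824):
    """Patch boxes (x0, y0, x1, y1) for an image of the given size."""
    return [
        (x, y, x + patch, y + patch)
        for y in patch_origins(height, patch, stride)
        for x in patch_origins(width, patch, stride)
    ]
```

For a 4,000 × 4,000 image this yields five offsets per axis (0, 824, 1648, 2472, 2976) and therefore 25 patches, while an 800 × 800 image yields a single patch.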
Our baseline model is Faster R-CNN [11], the standard two-stage detector in object detection, with ResNet-101 as the backbone. We adopt FPN [12] as the neck network to construct a feature pyramid with predefined anchors on pyramid levels P2 to P6. In oriented object detection, we utilize the RoI Transformer [1] as the rotated head network, which transforms horizontal proposals into rotated ones. For a fair comparison, all parameter and experimental settings are kept strictly consistent with those reported in [1], [14], [33]. The entire network is trained end-to-end without any extra rotation setting.
Although experience shows that tuning hyperparameters can further improve model performance, keeping them consistent with prior work is necessary for a fair comparison. In this paper, following [1], [2], for DOTA and HRSC2016, the anchor size is set to $\{8^2\}$ with aspect ratios {1/2, 1, 2} and anchor strides {4, 8, 16, 32, 64} for the respective pyramid levels in horizontal anchors. To compare fairly and verify the effectiveness of the proposed method, we conduct ablation studies on DOTA, and we avoid combining any other data augmentation or bells-and-whistles training strategy. When comparing with SOTA methods on DOTA and HRSC2016, like [1], [2], [4], we only add an augmentation with random rotation by angles of (0, 90, 180, 270). For the multi-head structure, N is a hyperparameter that sets the number of heads, i.e., the number of groups into which the channels are divided, in Base CG. Dividing the features provides more feature selection for model learning, but a large N weakens the communication among channels. Following the parameter setting of previous work [54] and our own parameter adjustment, we set N to 2 in our final network.
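To make the anchor configuration concrete, the sketch below enumerates the anchor shapes it implies. It assumes the convention used by common detection toolkits, where the anchor side length on a level equals the scale (8) times the stride of that level and the aspect ratio is height divided by width; the text does not spell out this convention, and the function name is hypothetical.

```python
import math

def base_anchor_shapes(strides=(4, 8, 16, 32, 64), scale=8, ratios=(0.5, 1.0, 2.0)):
    """Anchor (width, height) pairs per pyramid stride.

    Assumed convention: side = scale * stride, ratio = height / width,
    and every ratio keeps the anchor area equal to side ** 2.
    """
    shapes = {}
    for stride in strides:
        side = scale * stride
        shapes[stride] = [
            (side / math.sqrt(r), side * math.sqrt(r)) for r in ratios
        ]
    return shapes
```

Under this convention the square anchors on P2 to P6 have side lengths 32, 64, 128, 256 and 512 pixels, matching the usual FPN anchor areas.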
In our work, we use the SGD optimizer with an initial learning rate of 0.005, a weight decay of 0.0001 and a momentum of 0.9. The number of training iterations is set to 80k for DOTA and 20k for HRSC2016, following [14], [33]. At test time, we do not use any test-time augmentation (TTA), such as multi-scale input. All experiments are conducted on two RTX 2080 Ti GPUs.
For evaluation, the results are obtained from the official DOTA evaluation server by submitting prediction files. The Average Precision of each category and the mean Average Precision (mAP) over all categories are used to evaluate the model and analyze the result distribution, following [14]. In addition, GFLOPs, FPS and the number of model parameters (#Params) are reported to evaluate the computational complexity and runtime efficiency of the model.

B. Ablation Study
Based on DOTA [14], we carry out an ablation study for oriented object detection in aerial images, which aims to: (1) verify the efficiency and effectiveness of different backbone networks combined with our proposed methods; (2) verify the effectiveness of the two proposed units on base CNN feature maps (i.e., Base CG) and on a feature pyramid (i.e., RP-CG); (3) compare different attention structures with our proposed methods; (4) explore the improvements of the RPN input for aerial object detection; (5) reveal mismatching error rates on different scales; and (6) show some visual comparisons. The details are as follows.

(1) Different backbones
In Table I, we report the results of different backbone networks on the test set of DOTA, including ResNet-50, ResNet-101, and ResNet-152. We compare GFLOPs, FPS, #Params and mAP, as well as the improvements obtained by adding our module. We can observe that adding our units to the backbone increases mAP by 0.95%, 1.24%, and 0.75%, respectively. Besides, #Params, GFLOPs and FPS are reported to compare model efficiency. Using Base CG and RP-CG increases the computational cost: on these three backbones, it adds on average 1.80 M #Params and around 155 GFLOPs, and reduces the speed by around 5-10 FPS. Considering both the mean Average Precision and the computational complexity, ResNet-101 is selected as our backbone network in the experiments.

(2) The proposed units
In Table III, we show the performance of our proposed units, individually and combined, on ResNet-101. We can observe that Base CG and RP-CG bring 0.58% and 0.46% improvements in bounding box mAP, respectively. The corresponding per-category mAP radar chart for oriented object detection is given in Figure 3 to show the trend of the performance change. Combining Base CG and RP-CG (i.e., our proposed CG-Net) increases mAP by up to 1.24%, with large improvements in some categories, such as BD (Baseball diamond) by 5.05%, SBF (Soccer-ball field) by 3.14%, and RA (Roundabout) by 2.77%. These results indicate that the feature representation capability is further improved by Base CG and RP-CG. As for model efficiency, we can observe that Base CG and RP-CG add 0.59 M and 0.61 M #Params with 51.53 and 51.89 GFLOPs, respectively. When the two units are deployed together, the increment is 1.79 M #Params and 154.95 GFLOPs. Our proposed CG is based on self-attention and calculates the similarity matrix between features, so GFLOPs increase from 289.26 to 444.21. In Table II, we compare the results with and without the multi-head structure in our CG module and find that adding it reduces #Params by 0.36 M and increases mAP by 0.68%.

(3) Different attention comparison
In Table III, we also compare different attention mechanisms, including Non-local [22] in the spatial dimension and the Squeeze-and-Excitation (SE) block from SENet [26] in the channel dimension. In more detail, we apply Non-local and SE blocks at different levels of the feature pyramid. We can observe that the Non-local and SE blocks bring improvements of 0.27% and 0.13%, respectively, in bounding box mAP, and 0.43% mAP when combined. When these attention modules are applied directly to the feature pyramid levels, the improvements in mAP are smaller than those of our proposed CG module, and the Non-local structure also has higher computational complexity and more parameters. From the table, we can observe that Non-local adds 2.18 M parameters with 237.15 GFLOPs. The SE block changes #Params and GFLOPs very little, but its improvement is very limited compared with the mAP gain from CG. When the two blocks are deployed together, the mAP increment is 0.43%, which is less than that of Base CG (0.58%) and RP-CG (0.46%).

(4) Improving RPN input for aerial object detection
CG-Net is particularly useful for addressing the problems of complex background and worse imaging quality. Because aerial images are taken overhead from high altitudes, they contain complex geological structures and objects of varied sizes and categories, and therefore have a more complex background. In aerial object detection, worse imaging quality is detrimental to learning object features and directly affects model training. Therefore, we implement CG on pyramid features with Base CG and Rearrange Pyramid CG. In pyramid features, the size of the proposals from the Region Proposal Network (RPN) [11] depends on the maximum response layer. Whether object proposals are selected accurately therefore affects how difficult it is for the ROI module to learn the detection boxes, which in turn requires more accurate pyramid features. CG-Net helps the model learn richer communication information within and between the layers of the pyramid features.
To sum up, applying the Calibrated-Guidance operation to the pyramid features before they are fed into the region proposal network is essential.

(5) Mismatching error rates on different scales
We define mismatching error rates on different scales in the feature pyramid, i.e., the rate at which the level selected for an object is not consistent with its ground-truth level. It can be seen from Figure 4 that the mismatching error rate of each layer in the feature pyramid is reduced after deploying our proposed method (i.e., the joint implementation of Base CG and RP-CG). Compared with the low-level features of the pyramid, which are more suitable for small objects, the reduction in error rate at the high levels is more pronounced. For example, the error rate is reduced by 0.1%, 0.2%, 0.1%, 0.7%, and 1.2% from level P2 to P6. Therefore, the effectiveness of our method is further confirmed.
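The mismatching error rate defined above can be illustrated with a short sketch. The text does not state how the rate is normalised, so this example assumes that objects are grouped by their ground-truth level and that the rate is the fraction of objects in each group whose selected level differs. The function name `mismatch_rates` and the toy level assignments are hypothetical.

```python
from collections import Counter


def mismatch_rates(selected_levels, gt_levels):
    """Fraction of objects, per ground-truth level, assigned to another level.

    Assumption: rates are grouped and normalised by ground-truth level.
    """
    if len(selected_levels) != len(gt_levels):
        raise ValueError("both sequences must have one entry per object")
    totals = Counter(gt_levels)
    # Count objects whose selected level disagrees with the ground truth.
    wrong = Counter(
        gt for sel, gt in zip(selected_levels, gt_levels) if sel != gt
    )
    return {level: wrong[level] / totals[level] for level in sorted(totals)}


# Toy example: six objects with pyramid levels given as integers (2 = P2, ...).
selected = [2, 2, 3, 4, 4, 5]
ground_truth = [2, 3, 3, 4, 5, 5]
rates = mismatch_rates(selected, ground_truth)
print(rates)  # {2: 0.0, 3: 0.5, 4: 0.0, 5: 0.5}
```

Comparing the dictionaries produced with and without a given module would yield per-level reductions of the kind reported for P2 to P6.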

(6) Visualized samples
From the results of the ablation experiments, Table IV, and Figure 6, complex background and worse imaging quality can be seen as obvious problems for categories such as Baseball diamond (BD), Ground track field (GTF), Plane (PL), and Roundabout (RA). Specifically, when detection boxes are required to cover whole objects, the box boundaries may be somewhat ambiguous, as for the Roundabout class in Figure 5 (left, row 2); this problem stems from the complex background and from labels that completely cover the object in aerial data. In Figure 5 (left, row 3), worse imaging quality leads to some additional false detection boxes in local areas.

C. Peer Comparisons
On DOTA. The experimental results on the test set of DOTA are shown in Table IV. The each-category mAP radar charts for oriented and horizontal object detection are given in Figure 7 and Figure 8, respectively, to show the trend of the performance change. CG-Net achieves the best score among all compared methods, on both oriented object detection (77.89% mAP) and horizontal object detection (78.26% mAP). Across the 15 categories, CG-Net ranks first in 6 categories for oriented object detection and in 10 categories for horizontal object detection. It is worth noting that CG-Net uses a weaker backbone network to surpass the state-of-the-art by 0.52% mAP on the oriented object detection task (ResNet-101 vs ResNet-152) and brings a 2.91% mAP increment for horizontal object detection with the same backbone. Compared with the approach that uses the same backbone network (i.e., SCRDet [4] with ResNet-101), our model improves mAP by 5.82%, which is a remarkable margin at current performance levels. Compared with horizontal boxes, rotated boxes include less background and clutter when mAP is calculated, so the improvement from our method on the rotated-box task is limited. Horizontal boxes contain more background, and the features processed by our CG suppress the background and highlight the foreground features of the object, so the mAP gain on the horizontal-box task is larger. Visualization results on the test set of DOTA are shown in Figure 6. We can clearly observe that our model achieves accurate recognition results.

Table IV. Comparisons on DOTA [14] for both oriented and horizontal object detection in aerial images. By "Ours" we mean implementing Base CG and RP-CG on the baseline model at the same time. "R-" in the Backbone column denotes ResNet [13], "D-" denotes DarkNet [9], and "H-" denotes the Hourglass network [66].
On HRSC2016. From Table V, the comparison with peer work on the test set of HRSC2016 [33] shows that our CG-Net surpasses the state-of-the-art methods with 90.58% mAP, an increase of 1.12% mAP over the previous best model (R3Det-DCL [5]). Compared with existing strategies that use a large number of anchors and aspect ratios, our CG-Net uses only the original anchor setting with aspect ratios {1/2, 1, 2} when training the network. It is therefore worth noting that using the preset anchors to select or strengthen high-quality features is reasonable and necessary in terms of both efficiency and effectiveness. In addition, we believe that our model can achieve further gains in recognition performance with more complex aspect ratios.
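As an illustration of the {1/2, 1, 2} anchor setting mentioned above, the following sketch generates anchor shapes for one base size. It assumes the common convention in which the ratio is height divided by width and the anchor area is kept fixed; the text does not specify the convention or the base sizes used, so `anchor_shapes` and `base_size` are hypothetical.

```python
import math


def anchor_shapes(base_size, ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) pairs with area base_size**2.

    Assumption: each ratio is height / width, and area is preserved.
    """
    shapes = []
    for r in ratios:
        w = base_size / math.sqrt(r)
        h = base_size * math.sqrt(r)
        shapes.append((w, h))
    return shapes


# Three anchor shapes for a hypothetical base size of 32 pixels.
shapes = anchor_shapes(32)
```

With only three ratios per location, the number of anchors stays small, which is the efficiency argument made in the paragraph above.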

V. CONCLUSION AND FUTURE WORK
Complex background and worse imaging quality are obvious problems in aerial object detection. Most approaches tend to develop elaborate attention mechanisms for the space-time feature calibrations with arduous computational complexity. We have proposed a CG operation to enhance channel communications, which can determine the calibration weights for each channel. We implemented CG on a standard object detection backbone network with a feature pyramid network and conducted extensive experiments on both oriented and horizontal object detection in aerial images. Experimental results on the challenging benchmarks indicate that the proposed CG-Net achieves state-of-the-art accuracy with a fair computational overhead. The each-category mAP radar charts for oriented and horizontal object detection show the robust trend of its performance. CG-Net surpasses the state-of-the-art for oriented object detection with a weaker backbone network (ResNet-101 vs ResNet-152) and for horizontal object detection with the same backbone. In future work, we will explore applying CG-Net to a broader range of natural scenes. Exploring how to use CG-Net in other visual tasks, such as semantic segmentation and object re-identification, is also an important direction.