Fine-Grained Object Detection in Remote Sensing Images via Adaptive Label Assignment and Refined-Balanced Feature Pyramid Network

Object detection in high-resolution remote sensing images remains a challenging task due to the unique viewing perspective, complex backgrounds, arbitrary object orientations, etc. For fine-grained object detection in high-resolution remote sensing images, the problem of high inter-class similarity is even more severe, making it difficult for a detector to recognize the correct classes. In this article, we propose the refined and balanced feature pyramid network (RB-FPN) and the center-scale aware (CSA) label assignment strategy to address the problems of fine-grained object detection in remote sensing images. RB-FPN fuses features from different layers and suppresses background information while focusing on regions that may contain objects, providing high-quality semantic information for fine-grained object detection. Intersection over Union (IoU) is usually applied to select positive candidate samples for training. However, IoU is sensitive to the angle variation of oriented objects with large aspect ratios, and a fixed IoU threshold leaves narrow oriented objects without enough positive samples to participate in training. To solve this problem, we propose the CSA label assignment strategy, which adaptively adjusts the IoU threshold according to the statistical characteristics of oriented objects. Experiments on the FAIR1M dataset demonstrate the superiority of the proposed approach. Moreover, the proposed method was applied to fine-grained object detection in high-resolution optical images in the 2021 Gaofen Challenge. Our team ranked sixth and was awarded as a winning team in the final.


I. INTRODUCTION
Object detection in high-resolution remote sensing images aims to accurately locate and identify objects of interest. Automated analysis and understanding of remote sensing images have become critically important in many real-world applications, such as town planning, strategic deployment in the military field, and Earth observation [1], [2], [3], [4]. Thus, object detection in remote sensing images has very broad application prospects.
Fig. 1. Illustration of fine-grained object detection: objects have high intra-class variation and low inter-class variation, which makes fine-grained object detection an even more challenging task.
In recent years, with the development of convolutional neural networks (CNN), the field of computer vision has grown considerably due to the powerful feature extraction capability of CNN. Various vision-based tasks including classification, object detection, and semantic segmentation have been able to achieve superior performance. A number of CNN-based object detectors [5], [6], [7], [8] have made significant progress and achieved excellent performance on MS COCO dataset [9] and PASCAL VOC dataset [10]. However, most of the existing techniques tend to suffer from dramatic performance degradation when applied to remote sensing images, mainly due to the difference between remote sensing images and natural scene images. Objects in remote sensing images are usually densely distributed, appear in arbitrary orientations, and have large scale variations, which make object detection an even more challenging task. As shown in Fig. 1, for fine-grained object detection, the high intra-class variation and low inter-class variation lead to limitations in the performance of various detectors.
To solve these issues, a number of approaches [11], [12], [13], [14], [15], [16], [17], [18], [19] have been developed. The feature pyramid network (FPN) [20] provides an effective solution to the problem of large scale variation in images. The hierarchical structure of the FPN makes the feature maps at different levels contain feature information at different scales. In the FPN, information can be exchanged between different layers, which effectively improves the accuracy of multiscale object detection. However, when fusing feature maps, the FPN dilutes the semantic information between the feature maps of nonadjacent levels. Furthermore, objects in remote sensing images usually suffer from a large amount of noise, which greatly affects the performance of the detector. For fine-grained object detection, we need to feed high-quality feature maps with richer semantic information into the detector.
Meanwhile, most existing object detectors employ Intersection over Union (IoU) as the matching metric to select high-quality samples for classification and localization. The performance of the detector is greatly affected if the label assignment strategy is not appropriate. A fixed IoU threshold is used in RetinaNet [21], while ATSS [22] sets the IoU threshold dynamically by automatically selecting positive and negative training samples according to the statistical characteristics of the data. However, these label assignment strategies have drawbacks when applied directly to remote sensing images, as they do not make full use of the statistical characteristics of oriented objects. Moreover, there are a large number of narrow objects with arbitrary orientations in remote sensing images. IoU is extremely sensitive to angle changes for narrow oriented objects: a small angular deviation leads to a dramatic drop in IoU. A fixed IoU threshold therefore leaves narrow oriented objects without sufficient positive samples, which limits the performance of the detector.
To tackle the above issues, we propose the refined and balanced feature pyramid network (RB-FPN) and the center-scale aware (CSA) label assignment strategy. RB-FPN closes the semantic gap between different layers of the FPN and forces each layer of the network to learn the features of objects at different resolutions. Moreover, RB-FPN eliminates complex background information, enhances the semantic representation of the feature maps, and enlarges the variance between different features. The obtained high-quality feature maps are more effective for the recognition of fine-grained objects. We then propose a CSA label assignment strategy that automatically selects positive samples according to the statistical characteristics of oriented objects. The CSA label assignment strategy selects more high-quality positive samples during training. On the other hand, dynamically adjusting the IoU threshold according to the statistical characteristics of the ground truth (GT) boxes enhances the robustness of the detector. To summarize, the main contributions of this article are as follows.
1) A refined and balanced feature pyramid network is proposed to reduce the semantic gap of the FPN and suppress background information while focusing on regions that potentially contain objects. The obtained high-quality feature maps enable efficient fine-grained object detection.
2) A novel center-scale aware label assignment strategy is proposed to dynamically adjust the IoU threshold based on the IoU distribution around the GT box and its aspect ratio.
3) Comprehensive experiments are conducted on the FAIR1M dataset of Gaofen Challenge to demonstrate the efficacy as well as the superiority of the proposed methods.

II. RELATED WORK

A. Generic Object Detection
With the advancement of deep learning techniques, object detection has achieved great progress owing to the powerful representative ability of deep convolutional neural networks. Most existing detectors can be divided into two types: 1) two-stage methods and 2) one-stage methods. The two-stage detector has a coarse-to-fine structure. In the first stage, a region proposal network (RPN) is used to generate regions of interest (RoIs) that potentially contain objects. In the second stage, category prediction and location regression are performed on the selected RoIs. Representative two-stage detectors are the pioneering RCNN family [5], [23], [24].
The simple architecture of one-stage detectors allows for tradeoffs between accuracy and speed and is more suitable for real-time detection tasks. One-stage detectors get rid of the complex regional proposal stage and predict the object instance categories and their locations directly from densely predesigned candidate boxes. One-stage detectors are popularized by SSD [7], YOLO family [6], [25], [26], [27], and RetinaNet [21].
FPN and other similar top-down structures [28], [29], [30], [31] are proposed to solve the problem of scale variations of objects. FPN takes advantage of the pyramid shape of convolution features and combines them in various resolutions to construct a feature pyramid with rich semantic information to recognize objects at different scales. PAFPN [32] adds a bottom-up fusion path to the FPN, fully exploiting the shallow features of the network. Liu et al. [33] proposed a data-driven strategy for pyramidal feature fusion method, which learns the way to spatially filter conflictive information to suppress the inconsistency.
Many recent works have refined the process of label assignment to further improve detection performance. ATSS [22] automatically selects positive and negative training samples based on the statistical characteristics of the objects. Kim et al. [34] assume that the distribution of the joint loss for positive and negative samples follows a Gaussian distribution; hence, they use a Gaussian mixture model to fit the distribution of training samples and then use the center of the positive-sample distribution as the positive/negative division boundary. AutoAssign [35] tackles label assignment in a fully data-driven manner by automatically determining positives/negatives in both the spatial and scale dimensions. OTA [36] views label assignment as an optimal transportation problem, in which the number of anchor boxes assigned to each GT is dynamically calculated according to a global anchor box regression state.

B. Oriented Object Detection in Remote Sensing Images
Oriented object detection has attracted plenty of interest, especially in remote sensing images. Oriented object detectors locate and classify objects with oriented bounding boxes, which provide more accurate orientation information of objects. Yang et al. [14] built an oriented object detection method on the generic object detection framework of Faster R-CNN. Xu et al. [37] proposed Gliding Vertex, which learns four vertex-gliding offsets on the regression branch to achieve oriented object detection. Wei et al. [38] proposed a one-stage, anchor-free, and NMS-free model (O2-DNet) that detects oriented objects by predicting a pair of midlines inside each object. ReDet [39] encodes rotation equivariance and rotation invariance in image features to increase the accuracy of oriented object detection. Ming et al. [40] designed a new label assignment strategy for one-stage oriented object detection based on RetinaNet [21], which assigns positive or negative anchors dynamically through a new matching strategy. Zhang et al. [41] proposed an aspect-ratio-guided label assignment to adjust the IoU threshold, together with an aspect-ratio-guided IoU loss that automatically adjusts the weights of the angle regression.
In recent years, an increasing number of works have focused on fine-grained object detection in remote sensing images. Sun et al. [42] proposed a cascaded hierarchical object detection network (CHODNet). CHODNet consists of four stages: 1) a feature refinement network, 2) a region proposal network, 3) a proposal refinement network, and 4) a fine-grained detection network. CHODNet learns external and internal representations independently from the dataset using a cascaded hierarchical structure. Zhang et al. [43] proposed a multiscale semantic segmentation feature fusion module, which merges the semantic features with the original features layer by layer to distinguish the foreground from the cluttered background. R2IPoints [44] employs a set of category-aware points to encode the spatial and semantic information of arbitrarily oriented objects.

III. PROPOSED METHOD
Oriented R-CNN (ORCNN) [45] is a superior two-stage oriented object detector. Our method is based on ORCNN and consists of the backbone network, RB-FPN, the CSA label assignment strategy, the oriented RPN, and the R-CNN detection head. The proposed framework is illustrated in Fig. 2. RB-FPN provides higher quality feature maps for fine-grained object detection by eliminating background information and balancing the feature maps in the FPN. The CSA label assignment strategy is designed to select potentially high-quality samples based on the statistical characteristics of the GT boxes. Overall, the model predicts the location and fine-grained category of objects in remote sensing images more efficiently. More details are discussed in the following subsections.

A. Refined and Balanced Feature Pyramid Network
Deep features in backbones carry more semantic information, while shallow low-level features are more descriptive in terms of detailed information. The top-down hierarchical structure of the FPN allows the feature maps of different layers to exchange information. However, this sequential fusion makes the fused features focus more on adjacent resolutions; the semantic information contained in nonadjacent levels is diluted at each fusion step during the information flow.
Thus, it is crucial to utilize the features at different levels.
Besides, the complex background information in remote sensing images usually affects the performance of the detector. It is important to effectively eliminate the background information to provide a higher quality feature map for the subsequent tasks.
The proposed refined and balanced feature pyramid network (RB-FPN) makes the information of the different levels of feature maps more balanced, enriching the semantic information in the feature maps. First, the feature maps of the different layers are resized to the same resolution, and the resized feature maps are then summed pixelwise. Second, the integrated feature map is refined by the proposed deformable key-query-position (DKQP) attention module. Finally, the refined feature map is added back to the original feature maps in the FPN by upsampling and downsampling, respectively.
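The three steps above (resize and integrate, refine, strengthen) can be sketched as follows. This is a simplified NumPy illustration in which all helper names are ours: nearest-neighbor rescaling stands in for the interpolation and pooling used in practice, and the DKQP refinement is a pluggable function that defaults to the identity.

```python
import numpy as np

def resize_nn(f, size):
    """Nearest-neighbor resize of a square (C, H, W) map to (C, size, size)."""
    c, h, w = f.shape
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return f[:, ys][:, :, xs]

def rb_fpn_balance(pyramid, refine=lambda m: m, mid_size=None):
    """Sketch of the balance step of RB-FPN (names and details are ours).

    pyramid: list of (C, H_i, W_i) FPN levels, highest resolution first.
    refine:  refinement applied to the integrated map -- the DKQP attention
             module in the paper, an identity placeholder here.
    """
    if mid_size is None:
        mid_size = pyramid[len(pyramid) // 2].shape[1]
    # 1) resize every level to one resolution and average them pixelwise
    integrated = np.mean([resize_nn(f, mid_size) for f in pyramid], axis=0)
    # 2) refine the integrated map (DKQP attention in the paper)
    refined = refine(integrated)
    # 3) add the refined map back to each original level at its own scale
    return [f + resize_nn(refined, f.shape[1]) for f in pyramid]
```

Because each level receives the same refined summary, the semantic content of nonadjacent levels is no longer diluted by pairwise top-down fusion alone.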
Self-attention mechanisms have been widely used in the field of computer vision and perform very well. In determining the attention weight assigned to a key for a given query, several properties of the input are usually considered. One is the content of the query; for self-attention, the query content can be the feature at a query pixel in an image. The second is the content of the key, where the key may be a pixel within the local neighborhood of the query. The third is the relative position of the query and key. Based on these input properties, Dai et al. [46] argue that the attention weights can be expressed as a sum of four terms (E1, E2, E3, E4): the query and key content (E1), the query content and relative position (E2), the key content only (E3), and the relative position only (E4).
In our DKQP attention module, we focus only on the E2 and E3 attention terms, since the performance gain provided by the other terms is insignificant once the computational overhead is taken into account [47]. In addition, deformable convolution [48] efficiently exploits sparse local locations and captures high-quality features, and it is designed for capturing regions of interest. Inspired by these properties, we use deformable convolutions and learnable vectors in the self-attention module to focus on regions that may contain objects, thereby obtaining high-quality feature information. The proposed DKQP attention focuses more on potential object regions to enhance the semantic information of the feature map.
In deformable convolution, for each position p_i in the output feature map, the output y(p_i) is defined as

y(p_i) = Σ_{p_n ∈ R} w(p_n) · x(p_i + p_n + Δp_n)    (1)

where w(p_n) is the weight for position p_n, x(·) is the input feature, p_n enumerates all positions in the grid R, and Δp_n is the learned offset of the convolution sampling location. As illustrated in Fig. 3, the key content attention corresponds to the term E3. The deformable convolution and another learnable vector are combined to obtain the term kqp_2, which can be formulated as

kqp_2 = l_m^T x_q    (2)

where l_m is a learnable vector and x_q is the reshaped output of the deformable convolution. The generalized attention formulation is

y_q = Σ_{m=1}^{M} W_m [ Σ_{k ∈ Ω_q} A_m^{deform}(q, k, z_q, x_k) · W'_m x_k ]    (3)

Here, A_m^{deform}(q, k, z_q, x_k) denotes the attention weight in the mth attention head, computed from the kqp_2 and E3 terms; z_q is the query content at index q, and x_k is the key content at index k. W_m and W'_m are learnable weight matrices, and Ω_q specifies the supporting key region for the query. In DKQP attention, we use the deformable convolution and the learnable vector in place of the query content and relative position. The feature capture capability of the deformable convolution allows the model to focus more on the RoIs, while the learnable vector captures the global positional bias between the key and the deformable convolution elements. Moreover, DKQP attention incurs a lower computational overhead: since a sparse set of key elements is sampled for each query, the complexity is linear in the number of query elements.
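A much-simplified sketch of the key-only weighting idea: when the attention score of a key depends only on the key side, the softmax weights can be computed once and shared by every query, which is what makes the complexity linear. Here two plain linear scores stand in for the deformable-convolution kqp_2 term and the key-content E3 term; all names and shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def key_only_attention(feat, l, u, Wv):
    """Key-only attention whose weights are shared across all queries.

    feat: (N, C) flattened feature map, used both as the stand-in for the
          reshaped deformable-convolution output x_q and the key content x_k.
    l, u: (C,) learnable vectors; l plays the role of the kqp_2 vector and
          u scores the key content (the E3 term).
    Wv:   (C, C) value projection.
    """
    # Scores depend only on the key index, not on the query ...
    scores = feat @ l + feat @ u               # (N,)
    weights = softmax(scores)                  # shared by every query
    # ... so one aggregated context vector serves all N query positions.
    context = (weights[:, None] * (feat @ Wv)).sum(axis=0)   # (C,)
    return feat + context[None, :]             # residual refinement, O(N·C)
```

Because the aggregation is identical for every query position, the refinement reduces to one weighted sum plus a broadcast, instead of the O(N²) pairwise interaction of full self-attention.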
RB-FPN reduces the semantic gap between the different scale feature layers of the traditional FPN while forcing each layer of the network to learn the features of objects at different resolutions. In the refinement stage, our DKQP attention module suppresses complex background information while focusing on regions that potentially contain objects.

B. Center-Scale Aware Label Assignment
In our model, the oriented RPN uses six parameters (x, y, w, h, Δw, Δh) to denote an oriented proposal. The oriented proposal needs to be projected into an oriented bounding box, as shown in Fig. 4. During the projection, there is misalignment in the region represented by the box, but the center position does not change, so the center position of the prediction box is particularly significant. If the center distance between an anchor and the GT box is relatively large when selecting positive samples, the quality of the learned samples will be inferior; samples located near the center of the GT box are more representative. Moreover, there are a large number of narrow oriented objects in remote sensing images. As shown in Fig. 5, IoU is very sensitive to angle deviation for narrow oriented objects: for objects with large aspect ratios, a small angular deviation causes a sharp drop in IoU. An anchor at such a position may still be a potentially high-quality sample that we would like to select, but it is filtered out because its IoU with the GT box is less than a fixed threshold.
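This angle sensitivity is easy to reproduce numerically. The following self-contained sketch (helper names are ours) computes the exact IoU of two rotated rectangles via Sutherland-Hodgman polygon clipping and compares a 10:1 box with a square under the same 5-degree deviation:

```python
import math

def rect_corners(cx, cy, w, h, angle_deg):
    """Counterclockwise corners of a rotated rectangle."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    pts = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in pts]

def clip_poly(poly, a, b):
    """Keep the part of `poly` left of the directed edge a->b
    (one step of Sutherland-Hodgman clipping)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    def side(p):
        return dx * (p[1] - a[1]) - dy * (p[0] - a[0])
    out = []
    for i in range(len(poly)):
        p, q = poly[i], poly[(i + 1) % len(poly)]
        sp, sq = side(p), side(q)
        if sp >= 0:
            out.append(p)
        if (sp >= 0) != (sq >= 0):          # edge crosses the clip line
            t = sp / (sp - sq)
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out

def poly_area(poly):
    """Shoelace area of a simple polygon."""
    n = len(poly)
    return abs(sum(poly[i][0] * poly[(i + 1) % n][1]
                   - poly[(i + 1) % n][0] * poly[i][1] for i in range(n))) / 2

def rotated_iou(box1, box2):
    """Exact IoU of two rotated boxes given as (cx, cy, w, h, angle_deg)."""
    p1, p2 = rect_corners(*box1), rect_corners(*box2)
    inter = p1
    for i in range(4):                       # clip p1 by each edge of p2
        inter = clip_poly(inter, p2[i], p2[(i + 1) % 4])
        if len(inter) < 3:
            return 0.0
    ai = poly_area(inter)
    return ai / (poly_area(p1) + poly_area(p2) - ai)

# A 5-degree deviation: mild for a square, severe for a 10:1 box.
iou_narrow = rotated_iou((0, 0, 10, 1, 0), (0, 0, 10, 1, 5))
iou_square = rotated_iou((0, 0, 3, 3, 0), (0, 0, 3, 3, 5))
```

For the same angular error, the narrow box loses several times more IoU than the square, which is exactly the case a fixed IoU threshold handles poorly.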
To solve the above problems, we propose the CSA label assignment strategy, which adaptively adjusts the IoU threshold according to the statistical characteristics of oriented objects. The procedure is summarized in Algorithm 1. For each GT box g on the image, we first find its candidate positive samples: on each pyramid level, we select the k (k ∈ {1, 2, ..., 15}, default k = 9) anchor boxes whose centers are closest to the center of g based on Euclidean distance. We then compute the IoU between these candidates and g as D_g, whose mean is denoted IoU_m, and the aspect ratio of g as r = w/h, where w and h are the width and height of the GT box. The aspect ratio is mapped to a constant value greater than or equal to 1 by

g(r) = max(r, 1/r)    (4)

Then, f(r) is defined as a monotonically decreasing function of g(r) in (5), so that GT boxes with larger aspect ratios obtain lower values and enough potential samples are mined. With the mapping function and the computed average IoU, the final adaptive IoU threshold is

T_IoU = IoU_m · f(r)    (6)

Finally, we select the candidates whose IoU with g is greater than or equal to T_IoU, and whose centers lie inside g, as positive samples.

Algorithm 1: Center-Scale Aware Label Assignment.
Input: P, the number of feature pyramid levels; G, the set of ground truth boxes on the image; A, the set of all anchor boxes; A_i, the set of anchor boxes on the ith pyramid level; f(r), the function that maps the aspect ratio; k, a hyperparameter with a default value of 9.
Output: S_p, the set of positive samples; S_n, the set of negative samples.
1: for each ground truth g ∈ G do
2:   build an empty set for candidate positive samples of g: C_g ← ∅;
3:   for each level i ∈ [1, P] do
4:     S_i ← select the k anchors from A_i whose centers are closest to the center of g based on Euclidean distance;
5:     C_g ← C_g ∪ S_i;
6:   end for
7:   compute IoU between C_g and g: D_g = IoU(C_g, g);
8:   compute the mean of D_g: IoU_m = Mean(D_g);
9:   r ← the aspect ratio of g;
10:  compute the IoU threshold: T_IoU = IoU_m · f(r);
11:  for each candidate c ∈ C_g do
12:    if IoU(c, g) ≥ T_IoU and the center of c lies in g then
13:      S_p ← S_p ∪ {c};
14:    end if
15:  end for
16: end for
17: S_n ← A − S_p;
18: return S_p, S_n

The proposed CSA label assignment dynamically adjusts the IoU threshold according to the statistical characteristics of the GT boxes. With this strategy, oriented objects with large aspect ratios receive smaller thresholds, ensuring that potential samples are selected. On the other hand, the number of positive samples changes dynamically with the statistics of the GT boxes, which helps to avoid the training loss being dominated by massive negatives. It is worth mentioning that the proposed label assignment strategy is used only during training and incurs no computational load at the inference stage.
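Algorithm 1 can be sketched as follows for a single GT box. Note that the paper's exact mapping function f is not reproduced here: the placeholder f(g) = g^(-1/2) is only one possible monotonically decreasing choice, and the per-level top-k selection is collapsed into a single pooled top-k for brevity.

```python
import numpy as np

def csa_assign(anchors, gt, ious, k=9, f=lambda g: g ** -0.5):
    """Sketch of center-scale aware (CSA) assignment for one GT box.

    anchors: (N, 4) anchors as (cx, cy, w, h), all pyramid levels pooled
             together (the paper picks k per level; we pool for brevity).
    gt:      (cx, cy, w, h) of one ground-truth box (angle omitted here).
    ious:    (N,) precomputed IoUs between each anchor and the GT box.
    f:       aspect-ratio mapping; g**-0.5 is only a placeholder that is
             monotonically decreasing, as the text requires.
    Returns indices of the anchors selected as positive samples.
    """
    cx, cy, w, h = gt
    # 1) candidates: the k anchors whose centers are closest to the GT center
    dist = np.hypot(anchors[:, 0] - cx, anchors[:, 1] - cy)
    cand = np.argsort(dist)[:k]
    # 2) mean IoU of the candidates
    iou_m = ious[cand].mean()
    # 3) aspect ratio mapped to a value >= 1, then to an adaptive threshold
    g = max(w / h, h / w)
    t_iou = iou_m * f(g)
    # 4) keep candidates above the threshold whose center lies inside the GT
    inside = (np.abs(anchors[cand, 0] - cx) <= w / 2) & \
             (np.abs(anchors[cand, 1] - cy) <= h / 2)
    return cand[(ious[cand] >= t_iou) & inside]
```

With the adaptive threshold, a narrow GT box keeps candidates that a fixed mean-IoU threshold would discard, since f(g) < 1 lowers the bar as the aspect ratio grows.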

C. Training
Our method consists of an oriented RPN and an R-CNN detection head. It is a two-stage detector: the first stage generates high-quality oriented proposals in a nearly cost-free manner, and the second stage is the R-CNN detection head for proposal classification and regression. Next, we describe the representations and loss functions of the oriented RPN and the R-CNN detection head in detail. The oriented RPN uses six parameters (x, y, w, h, Δw, Δh) to denote an oriented proposal. For bounding box regression, we adopt the affine transformation, which is formulated as

u_x = (x − x_a)/w_a,  u_y = (y − y_a)/h_a
u_w = log(w/w_a),  u_h = log(h/h_a)
u_Δw = Δw/w,  u_Δh = Δh/h    (7)

where (x, y), w, and h are the center coordinate, width, and height of the external rectangle, respectively. Specifically, x_a, x, and x* represent values related to the anchors, the predicted boxes, and the GT boxes, and the same holds for y_a, y, and y*. Δw and Δh are the offsets of two adjacent vertices of the oriented box relative to the midpoints of the top and right sides of the external rectangle, and Δw* and Δh* are the corresponding offsets of the GT box. The loss function used to train the oriented RPN is

L_RPN = (1/N) Σ_i L_cls(p_i, p*_i) + λ_1 (1/N) Σ_i p*_i L_reg(u_i, u*_i)    (8)

Here, i is the index over anchors, p*_i is the GT label of the ith anchor, and p_i is the output of the classification branch of the oriented RPN. u*_i is the supervision offset of the GT box relative to the ith anchor, and u_i denotes the output of the regression branch of the oriented RPN. L_cls is the cross-entropy loss, L_reg is the smooth L1 loss, and λ_1 is a balance parameter (1 by default).
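The encoding can be sketched as follows. The normalizations follow the usual Faster R-CNN conventions combined with midpoint offsets normalized by the box size, in the spirit of Oriented R-CNN; the paper's exact formulas may differ, so treat this as an illustrative assumption.

```python
import math

def encode_rpn_targets(anchor, gt):
    """Hypothetical sketch of the oriented-RPN regression targets.

    anchor: (xa, ya, wa, ha) horizontal anchor.
    gt:     (x, y, w, h, dw, dh) -- external rectangle of the oriented box
            plus the midpoint offsets (Δw, Δh in the text).
    """
    xa, ya, wa, ha = anchor
    x, y, w, h, dw, dh = gt
    return (
        (x - xa) / wa,        # ux: center shift, normalized by anchor width
        (y - ya) / ha,        # uy: center shift, normalized by anchor height
        math.log(w / wa),     # uw: log-scale size ratio
        math.log(h / ha),     # uh
        dw / w,               # uΔw: midpoint offset normalized by box width
        dh / h,               # uΔh: midpoint offset normalized by box height
    )

def decode_rpn_targets(anchor, u):
    """Inverse of encode_rpn_targets: recover the oriented proposal."""
    xa, ya, wa, ha = anchor
    ux, uy, uw, uh, udw, udh = u
    w, h = wa * math.exp(uw), ha * math.exp(uh)
    return (xa + ux * wa, ya + uy * ha, w, h, udw * w, udh * h)
```

The decode function is what the RPN applies to its raw regression outputs at inference time; encoding and decoding are exact inverses of each other.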
The R-CNN detection head uses five parameters (x, y, w, h, θ) to represent an oriented bounding box. The bounding box regression can be described by

v_x = (x − x_p)/w_p,  v_y = (y − y_p)/h_p
v_w = log(w/w_p),  v_h = log(h/h_p)
v_θ = θ* − θ    (9)

Here, (x, y), w, and h are the center coordinate, width, and height of the external rectangle, respectively. Specifically, x_p, x, and x* represent values related to the proposals, the predicted box, and the GT box, and the same holds for y_p, y, and y*; θ and θ* denote the angles of the proposal box and the GT box, respectively.
The loss function of the R-CNN detection head is

L_head = (1/N) Σ_i L_cls(p_i, p*_i) + λ_2 (1/N) Σ_i p*_i L_reg(v_i, v*_i)    (10)

where L_cls is the cross-entropy loss, L_reg is the smooth L1 loss, and λ_2 is a balance parameter (1 by default). i is the index over proposals, v*_i is the supervision offset of the GT box relative to the ith proposal, and v_i denotes the output of the regression branch of the R-CNN detection head. p*_i is the GT label of the ith proposal, and p_i is the output of the classification branch of the R-CNN detection head.

IV. EXPERIMENTS

A. Dataset
Based on the 2021 Gaofen Challenge, we conduct experiments on the FAIR1M dataset [42]. FAIR1M is a large-scale dataset for fine-grained object detection in remote sensing images. Images in the FAIR1M dataset have a spatial resolution ranging from 0.3 to 0.8 m. The dataset contains more than 40 000 remote sensing images with 1 million instances collected from Gaofen satellites and the Google Earth platform. Each image ranges in size from 1000 × 1000 to 10000 × 10000 pixels and contains objects exhibiting a wide variety of scales, orientations, and shapes. All images are annotated with oriented bounding boxes over 5 categories and 37 subcategories. The airplane types include Boeing 737 (737), Boeing 777 (777), etc.

B. Implement Details
We choose ResNet-50 [49] with FPN as the backbone network for the ablation experiments, and the hyperparameters of these models are set to default values if not specified otherwise. We conduct the experiments on a server with four RTX 3090 GPUs using a total batch size of eight (two images per GPU) for training, and a single RTX 3090 GPU for inference. The experimental results are produced on the mmdetection platform. The stochastic gradient descent (SGD) optimizer is used in training. The initial learning rate is set to 0.01 with a warm-up of 500 iterations, and the learning rate is decreased by a factor of 0.1 at each decay step. The momentum and weight decay are set to 0.9 and 0.0001, respectively. We train the models for 12 epochs on the FAIR1M dataset. The experimental environment is Ubuntu 18.04, PyTorch 1.7.0, and CUDA 11.0.
For the FAIR1M dataset, we select 16 488 images as the training set and 8137 images as the testing set. The test results are submitted to the ISPRS benchmark online validation platform. We first convert the annotations to the DOTA dataset [51] format. Then, we crop the original images into 800 × 800 patches with a 200-pixel overlap between two adjacent patches. For multiscale training and testing, we first resize the original images to three scales (0.5, 1.0, and 1.5) and then crop them into 800 × 800 patches with the same 200-pixel overlap. We also apply random flipping and random rotation augmentation during training. At the testing stage, we apply the same data augmentation to the images to ensure consistency between training and testing.
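The patch-cropping scheme can be sketched as follows (parameter names are ours; border windows that would overrun the image are snapped back so the right and bottom edges are still covered):

```python
def tile_coords(img_w, img_h, patch=800, overlap=200):
    """Top-left corners of the crop windows covering an image.

    A sketch of the cropping scheme described above. Images smaller than
    `patch` get a single window and would need padding in practice.
    """
    stride = patch - overlap
    def starts(size):
        xs = list(range(0, max(size - patch, 0) + 1, stride))
        if xs[-1] + patch < size:       # cover the far border exactly
            xs.append(size - patch)
        return xs
    return [(x, y) for y in starts(img_h) for x in starts(img_w)]
```

For a 1000 × 1000 image this yields windows at (0, 0), (200, 0), (0, 200), and (200, 200); detections from overlapping windows are later merged back into full-image coordinates.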

C. Evaluation Metric
In the task of object detection, each image may contain objects of multiple categories. Therefore, a measure of detector performance is needed to validate both the localization and classification capabilities. The average precision (AP) and mean average precision (mAP) are the most commonly used evaluation metrics. AP is determined by recall and precision, where recall refers to the ability of the model to find all objects and precision refers to the ability of the model to correctly identify the detected objects. Each category uses a PR curve (P refers to precision, R refers to recall) to calculate the AP. There are currently two versions of the evaluation metric: the PASCAL VOC2007 metric and the PASCAL VOC2012 metric. We evaluate the models on the testing set in terms of the PASCAL VOC07 metric.
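The VOC07 variant interpolates precision at 11 evenly spaced recall thresholds; a minimal sketch for a single category:

```python
def voc07_ap(recalls, precisions):
    """PASCAL VOC2007 11-point interpolated average precision.

    recalls, precisions: parallel sequences tracing the PR curve.
    AP is the mean, over recall thresholds 0.0, 0.1, ..., 1.0, of the
    highest precision achieved at any recall >= the threshold.
    """
    ap = 0.0
    for t in [i / 10 for i in range(11)]:
        ps = [p for r, p in zip(recalls, precisions) if r >= t]
        ap += (max(ps) if ps else 0.0) / 11.0
    return ap
```

mAP is then simply the mean of the per-category APs over all 37 subcategories.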

D. Comparisons With State of the Art
We compare the proposed approach with other state-of-the-art methods on the FAIR1M dataset. To ensure the independence of training and evaluation, the results of these models are submitted to the online validation site of the ISPRS benchmark to evaluate performance. Note that all methods adopt ResNet-50 as the backbone network, except for our state-of-the-art experiments, which adopt the Swin Transformer (Swin-T) backbone. As shown in Table I, our method obtains 42.62% mAP, which outperforms the baseline model by 2.4% mAP. With limited data augmentation (i.e., multiscale data and random rotation), our approach reaches 45.18% mAP. With the backbone replaced by Swin-T [52], our method achieves 47.58% mAP, surpassing almost all recent state-of-the-art detectors. Swin-T has a powerful feature extraction capability and focuses on the global information in the feature map, which effectively distinguishes the features between different categories, so it achieves higher performance in some categories. We visualize some detection results in Fig. 6. It can also be observed that the proposed method accurately detects densely arranged objects. The proposed RB-FPN provides high-quality feature maps that help identify categories effectively, and the proposed CSA provides more high-quality samples that help learn accurate bounding boxes.
TABLE I. Comparison of the proposed approach with the state-of-the-art approaches on the FAIR1M dataset (ISPRS benchmark online validation).

E. Ablation Studies
Effectiveness of RB-FPN: We conduct ablation experiments on the online validation of the ISPRS benchmark to evaluate the effectiveness of the proposed RB-FPN. We use RetinaNet with an orientation prediction in the regression branch as the baseline. Table III shows that the FLOPs increase by only 0.3 G while the performance improves by 2.02% mAP, demonstrating a large gain for few additional parameters and little computational load. As shown in Fig. 7, the feature maps of each layer in the FPN contain a large amount of noisy information and do not effectively distinguish between background and foreground. The integrated feature map refined by the DKQP attention module has a very high response on the objects, which effectively eliminates the background information and enhances the semantic information. RB-FPN balances the semantic information of each layer in the FPN and focuses on regions that may contain objects, thus extracting features that are more beneficial to the detector.

Effectiveness of CSA:
We also perform ablation experiments on the online validation of the ISPRS benchmark to evaluate the effectiveness of the proposed CSA. CA (center aware) considers only the prior information of the center distance between the anchor and the GT box to adaptively adjust the IoU threshold, while CSA introduces the prior information of the aspect ratio of the GT box on top of CA. Experimental results for the different label assignment strategies are shown in Table IV. The CA label assignment strategy achieves 41.46% mAP, about 1.2% mAP higher than the baseline method. With CSA label assignment, our method obtains 42.23% mAP, which brings about a 2.01% mAP gain over the baseline. Compared with the baseline, these two label assignment strategies yield significant improvements in performance, which also proves that the center distance and the aspect ratio of the GT box are important information for label assignment. In addition, we compare our CSA scheme with the Max-IoU scheme during training. As shown in Table III, the CSA label assignment strategy is not only effective in terms of detection accuracy but also efficient in both speed and parameters. As shown in Fig. 8, the visualization results show that the baseline method tends to generate false negatives or cannot accurately detect oriented objects [see Fig. 8(1)], while our approach performs better on those oriented objects. We argue that adopting CSA label assignment makes it easier for the network to select enough positive samples, thus making the detector more robust. The experimental results show that our CSA label assignment effectively compensates for potential samples, and the detector performance improves thanks to the center-distance- and aspect-ratio-guided IoU thresholds.

V. CONCLUSION
In this article, we propose a refined and balanced feature pyramid network (RB-FPN), which aims to eliminate complex background information and enhance the semantic feature information in the FPN. Specifically, RB-FPN focuses more on potential object regions to enhance the semantic information and balances the feature maps in the FPN to provide purer information for the subsequent classification task. The proposed CSA label assignment strategy fully utilizes the statistical characteristics of oriented objects; it dynamically adjusts the IoU threshold during training, which alleviates the angle sensitivity of IoU for narrow oriented objects. When performing sample selection, the CSA label assignment strategy allows narrow oriented objects to retain more potential samples and prevents high-quality samples from being filtered out. Moreover, a comprehensive and extensive evaluation on the FAIR1M dataset indicates that our approach yields consistent and substantial gains over the baseline approach.
Junjie Song received the B.S. degree in electrical