Orientation-First Strategy With Angle Attention Module for Rotated Object Detection in Remote Sensing Images

Recently, object detection in remote sensing images (RSIs) has received extensive attention and made significant progress. Nonetheless, the arbitrary orientations of objects in RSIs make their detection a challenging task. Most existing detection methods struggle to extract the orientation features of objects because conventional convolutions lack directionality. In addition, the boundary discontinuity in angle regression affects the detection of object orientations. In response to these problems, this article proposes an orientation-first refinement detector (OFRDet), which is based on a strategy that enables the detector to predict the angle of an object before its other attributes and to preset oriented anchors accordingly. In OFRDet, we propose an angle encoding regression module (AERM) and an angle channel attention module (ACAM). AERM transforms angle detection into multiparameter regression, which eliminates boundary discontinuities. ACAM uses convolution kernels with different angles to extract directional features purposefully according to the preset oriented anchors. After these two modules, more accurate bounding boxes are generated and sent to the refined stage to obtain the final detection results. We evaluate our method and demonstrate its effectiveness through experiments on two challenging and credible datasets, DOTA and HRSC2016. OFRDet achieves competitive results of 79.56% and 96.29% mAP on the two datasets, respectively.


Yuxi Zhang, Yongcheng Wang, Ning Zhang, Zheng Li, Zhikang Zhao, Yunxiao Gao, Dongdong Xu, and Guangli Ben
Index Terms-Angle channel attention, angle encoding, remote sensing images, rotated object detection.

I. INTRODUCTION
OBJECT detection is a computer vision technique that requires locating and identifying specific objects in an image. Detection in remote sensing images (RSIs) is more challenging because RSIs are larger in scale and more complex in content than ordinary natural images [1]. In addition, objects in RSIs are unevenly distributed and generally small. With the continuous development of deep learning, neural networks have become widely used in image processing, and object detection based on convolutional neural networks (CNNs) has made great progress. Numerous CNN-based object detection methods aimed at addressing the abovementioned challenges in RSIs have been proposed in recent years [2], [3], [4], [5], [6], [7], [8]. These methods have achieved good results and addressed some of the challenges to a certain extent.
Neural-network-based object detection methods locate objects with the smallest rectangular boxes that can contain them. Generally, horizontal bounding boxes (HBBs) represent objects in natural images well, but not ground objects in RSIs, because ground objects have arbitrary orientations in the overhead view used in RSIs. As illustrated in Fig. 1(a), HBBs representing rotated objects may contain a lot of undesirable content, such as a large amount of background for narrow objects with large aspect ratios and parts of other objects for densely distributed objects. For better localization of rotated objects, oriented bounding boxes (OBBs) are widely used in RSI object detection [9], [10], [11], [12], [13], [14], [15]. As can be seen in Fig. 1(b), OBBs enclose the objects more tightly and have almost none of the problems described above for horizontal boxes. Defining an oriented box requires an angle value in addition to the position and side lengths of the box. There are various ways to represent the angle of an object; the most common is to use the angle between one side of the bounding box (e.g., the long side) and the x-axis of the image. Rotated object detection is typically achieved by adding an angle prediction module to the structure of an HBB object detector. Normally, angle prediction can be implemented by adding a channel to the location regression module.
Although the accuracy and efficiency of detection keep improving as many methods with different network structures for rotated object detection are proposed, several nonnegligible problems in most rotated object detectors have not been fully solved. Three such challenges are listed as follows.
1) In the anchor-based detectors, a large number of anchors with different angles are preset in the network in order to make the anchors match the rotated objects as much as possible, which causes a serious redundancy of the anchors and greatly increases the computational complexity.
2) The angle values are discontinuous at the upper and lower boundaries of the angle range used in common rotated object detectors. Angle values near the two boundaries represent similar directions and have similar visual features in the image, yet they are numerically far apart, with a jump between them, as sketched in Fig. 2. This makes the angle learning of anchors in the network somewhat confusing.
3) The network structure has yet to be improved for orientation-sensitive feature extraction, a capability that is key in rotated object detectors. In addition, the convolutional kernel in a traditional CNN is generally horizontal and square, which also adversely affects the extraction of orientation-sensitive features.
To deal with the abovementioned problems of rotated object detection, we propose an orientation-first refinement detector based on an orientation-first strategy in this article. The orientation-first strategy instructs the network to predict the orientation of the object first, and then to preset high-quality anchors based on the angle value. In this way, a large amount of redundancy in the initial anchors is avoided and the accuracy of the network for detecting rotated objects can be improved. An angle encoding regression module (AERM) is proposed in which the angle values are encoded as multiple parameters and the network predicts the object angle by learning multiple parameter values. The upper and lower boundaries of the angle values in this representation, such as −90° and 90°, correspond to the same encoded values, which solves the problem of discontinuity at the boundaries of the angle values. An angle channel attention module (ACAM) that uses the encoding parameters from the abovementioned angle representation method is also constructed in our network architecture. We use convolutional kernels with different angles in this module to extract features, and then utilize the abovementioned encoding parameters as weights to fuse multiple feature maps to generate a new feature map.
The main contributions of this article can be summarized as follows.
1) An orientation-first strategy for rotated object detection is proposed. By predicting the object angle first, this strategy avoids a large amount of redundancy in the preset anchors, and the resulting high-quality anchors improve the detection of rotated objects.
2) We propose a new angle representation method that encodes the angle values into multiple parameters. This method solves the problem of the discontinuous angle boundary and improves the learning ability of the network for object orientation.
3) An orientation feature extraction module based on multiangle channel attention, which fuses feature maps generated by different convolution kernels, is introduced to extract orientation-sensitive features more effectively, enabling the detector to detect rotated objects more accurately.
The rotated object detection framework proposed in this article achieves 79.56% and 96.29% mAP on two challenging datasets, DOTA and HRSC2016, respectively.

II. RELATED WORKS
In recent years, object detection based on deep learning has made great progress. Rectangular boxes are able to locate objects accurately and can be easily defined using a few parameters. While HBBs achieve excellent performance in most cases, the increased difficulty of object detection in specific environments has drawn research attention to OBBs with arbitrary angles.

A. Object Detection Based on Deep Learning
The network architectures for deep-learning-based object detection can be broadly classified into two categories, namely single-stage detectors and two-stage detectors, which differ in whether proposed regions are extracted. A two-stage detector first extracts proposed regions and then detects each of them separately, while a single-stage detector detects all objects in an image end-to-end. Generally speaking, the two-stage detector has higher accuracy, but its efficiency is reduced by the extraction of proposed regions, whereas the single-stage detector has higher efficiency and lower accuracy. The R-CNN series [16], [17], [18] are representative two-stage detectors, and plenty of improved algorithms based on Faster-RCNN [18] have emerged in recent years. SSD [19] and the YOLO series [20], [21], [22] are classic single-stage detection algorithms, among which YOLOv3 [22] has achieved strong results in various practical scenarios. In order to improve the accuracy of the single-stage detector, He et al. [23] proposed RetinaNet, which uses focal loss to deal with the imbalance between positive and negative samples during training. This imbalance is considered an important reason why the accuracy of the single-stage detector is inferior to that of the two-stage detector. The most important role of the region feature extraction methods in the two-stage detector, such as RoI Pooling [17] and RoI Align [24], is to generate refined feature maps corresponding to the proposed regions derived from the first stage, namely feature alignment. Inspired by this, Zhang et al. [25] proposed using ordinary convolution operations to achieve feature alignment in a single-stage detector, an approach with considerable potential. With the introduction of deformable convolution [26], [27], this approach to feature alignment has been adopted by a variety of single-stage detectors under the name alignment convolution [28], [29], [30], [31], [32], [33]. The accuracy of single-stage detectors has gradually improved with the use of the abovementioned methods.
Since anchors were introduced in Faster-RCNN, anchor-based algorithms have been extensively used in object detection because of their high accuracy. This type of algorithm has been fully developed and has achieved great success in recent years, and it is still the mainstream method in the field of object detection. Anchors are rectangular boxes preset on the feature maps before the network detects objects. During training, the anchors are matched with the most similar bounding boxes of the ground-truth objects, and then corrected by the network to bring them as close as possible to the matched boxes. One problem of the anchor-based approach is that the anchors cannot match all the objects, especially small objects with very few pixels. In order to match as many objects as possible, a lot of unnecessary anchors have to be preset, which wastes computational resources. In addition to the anchor-based approaches, anchor-free models that do not require preset anchors have been proposed and have become popular [34], [35], [36], [37]. Law and Deng [34] proposed CornerNet based on corner point detection, and Duan et al. [36] proposed CenterNet based on center point detection. This type of method locates objects by detecting key points, and then predicts other information at the positions of these key points. Subsequently, the application of anchor-free methods to object detection in RSIs has been extensively studied [38], [39], [40], [41], [42], [43].

B. Rotated Object Detection in RSIs
Unlike natural images, where objects are mostly arranged vertically under the effect of gravity, the objects on the ground may have arbitrary orientations in the overhead view of RSIs. If HBBs are used to represent these objects, the rectangular box will contain a lot of irrelevant content, especially when the aspect ratios of objects are large. In addition, the huge scale differences in RSIs, the complex and diverse earth surface, and the uneven object distribution make detection more challenging. The abovementioned problem of HBBs can seriously affect the accuracy of object detection in RSIs. Therefore, more and more object detection methods use OBBs to locate objects in RSIs. In general, a horizontal box can be converted to an oriented box by simply adding one or a few parameters to represent the orientation. This convenient conversion allows most classical object detection algorithms to detect rotated objects with minor modifications. At the same time, new methods and network structures have been proposed in order to further improve the accuracy.
Angle detection is an important part of a rotated object detector, and it raises a difficult problem, i.e., the discontinuity of the angle boundary. This problem arises from the contradiction between the continuity of directions and the discontinuity of angle values. The DCL [44] and CSL [45] methods proposed by Yang et al. deal with this problem by converting angle detection from a regression problem to a classification problem; however, the number of classes limits the angle detection accuracy. The sliding vertex method proposed by Xu et al. [46] determines the oriented box by the minimum enclosing horizontal box and the distances from the vertices of the oriented box to the vertices of that horizontal box. Song et al. [47] designed a detector that first extracts proposals containing rotated objects and then predicts the endpoints of objects, avoiding the regression of angles. The AProNet [48] proposed by Zheng et al. determines the oriented box by the center point, the length and width of the object, and its projected lengths in the horizontal and vertical directions. Besides the problem of angular boundary discontinuity, there is another problem in anchor-based rotated object detection, namely the difficulty of matching anchors with rotated objects. In [49], [50], [51], and [52], anchors with different angles are added at the same position in order to match as many rotated objects as possible. This method further increases the redundancy of the anchors, which greatly wastes computing resources. Zhong et al. [53] proposed an anchor matching method that matches a horizontal anchor to a horizontal box obtained by decoupling the oriented box. Another solution is used in [54], [55], and [56]: the horizontal anchor is still preset, but it is matched with the smallest enclosing HBBs of the ground-truth objects to increase the matching rate, and the anchor then learns the oriented box in the subsequent network.
In this article, an angle representation method based on multiparameter regression is proposed, which has a periodicity consistent with the direction of the object and fundamentally solves the problem of discontinuous angle boundaries. In this method, the angle is detected before other information, and anchors with that angle are then preset so that the objects can be matched more accurately with a small number of anchors.

III. PROPOSED METHOD
In this section, we detail the OFRDet based on the orientation-first strategy proposed in this article. First, each component of OFRDet and the overall implementation process are described in Section III-A. The overall network structure of the detector is shown in Fig. 3. Then, the baseline we adopted is introduced in Section III-B. Next, the designed AERM and ACAM are introduced in Sections III-C and III-D, respectively. Finally, the loss function designed for the overall network is given in Section III-E.
Fig. 3. Overall network structure of OFRDet. OFRDet is a refinement detector with ResNet50-FPN as backbone to generate multiscale features. On each scale of the feature map we apply a detection head with orientation-first strategy to predict objects. AERM first predicts the orientation information to preset the oriented anchors. ACAM takes the outputs of AERM as attention maps and extracts directional features in multiple angle channels. The refined stage consists of AlignConv, ORConv, classification branch, and regression branch.

A. Overall Design Structure of OFRDet
OFRDet is a refinement detector proposed in this article that can detect rotated objects in RSIs. In order to obtain the feature information of objects with large scale differences in RSIs and achieve correct detection, OFRDet uses ResNet50 [57] and a feature pyramid network (FPN) [58] as the backbone to generate multiscale feature maps, and sets detection heads on the feature maps of multiple scales. In each detection head we adopt the orientation-first strategy, which makes the head detect the direction information before all other information about the objects. The strategy enables the network to preset anchor boxes with angles to better match objects, and to extract directional features based on the initial angle information to better regress bounding boxes. We design AERM to implement the priority detection of the angle; it predicts the angle values of the objects at the positions of all feature points within the object bounding boxes and uses the multiparameter angle encoding method to deal with the discontinuity of angle boundaries. Then, the encoded values obtained at each feature point are decoded into an angle value, and an oriented anchor is preset at each feature point according to that angle value. The angle of the anchor is similar to the angle of the object to which the feature point belongs, so the anchor can better match the object and be further adjusted. There are two detection stages in the detection head, the coarse stage and the refined stage, which detect objects by adjusting the anchor boxes. In order to better distinguish and extract features in different directions, ACAM is designed in the coarse stage to fuse feature maps of multiple angle channels into a new orientation-sensitive feature map, with an attention mechanism assigning them different weights during fusion. Subsequently, convolution is performed on the new feature map to complete the detection at this stage. The refined stage adjusts the detection results of the coarse stage through a series of convolution operations to obtain the final detection results.

B. Refined Detector Based on RetinaNet as Baseline
We add the refined stage to RetinaNet and use it as our baseline. ResNet50 and FPN are used as the backbone of the network to extract features. The residual module in ResNet mitigates the vanishing gradient problem in deep networks, so it can better extract deep features containing rich semantic information. FPN fuses deep and shallow features to generate multiscale feature maps, so that objects with different scales can be detected on the feature maps with the corresponding scales. Both the object classification branch and the bounding box regression branch of the detection head consist of ordinary fully convolutional networks. In addition, focal loss is used as the classification loss to address the imbalance between positive and negative samples during training.
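As a minimal sketch of how focal loss down-weights easy samples, the following computes the binary focal loss for a single prediction next to plain cross-entropy. The function names are illustrative, and the values α = 0.25 and γ = 2 are the commonly used defaults from the original focal loss work, assumed here because this article does not state its settings.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) with label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    alpha and gamma are assumed defaults, not values taken from this article.
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # The factor (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

if __name__ == "__main__":
    for p, y in [(0.01, 0), (0.9, 0)]:          # easy negative, hard negative
        print(p, y, cross_entropy(p, y), focal_loss(p, y))
```

An easy negative (p = 0.01) contributes almost nothing under focal loss, while a hard negative (p = 0.9) keeps most of its cross-entropy loss, which is why the abundant easy background samples no longer dominate training.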
In our baseline, the regression branch of the detection head adds a channel to predict the angle θ. The OBB is represented by five parameters (x, y, w, h, θ), where (x, y) is the position of the center point of the bounding box in the image, w is the length of the long side, h is the length of the short side, and θ ∈ [−π/2, π/2) represents the angle from the positive x-axis to the direction of the long side w.
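The five-parameter representation can be made concrete by converting it to corner points. The sketch below is illustrative only: the function name is hypothetical, and it assumes θ is in radians and that the rotation is applied directly in image coordinates.

```python
import math

def obb_to_corners(x, y, w, h, theta):
    """Return the four corners of an OBB given as (x, y, w, h, theta).

    theta is the angle (radians) from the positive x-axis to the long side w.
    """
    c, s = math.cos(theta), math.sin(theta)
    # Half-extent vectors along the long side (u) and the short side (v).
    ux, uy = 0.5 * w * c, 0.5 * w * s
    vx, vy = -0.5 * h * s, 0.5 * h * c
    return [
        (x - ux - vx, y - uy - vy),
        (x + ux - vx, y + uy - vy),
        (x + ux + vx, y + uy + vy),
        (x - ux + vx, y - uy + vy),
    ]

if __name__ == "__main__":
    print(obb_to_corners(0.0, 0.0, 4.0, 2.0, 0.0))
    # Angles at the two ends of the range describe the same box.
    print(obb_to_corners(0.0, 0.0, 4.0, 2.0, -math.pi / 2))
    print(obb_to_corners(0.0, 0.0, 4.0, 2.0, math.pi / 2))
```

The last two calls return the same set of corners in a different order, which illustrates why −π/2 and π/2 describe the same direction even though the two angle values are numerically far apart.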
In the refined detector, the refined stage adjusts the bounding boxes according to the results of the coarse stage, so features must be aligned according to the coarse-stage results before detection in the refined stage. We use AlignConv [33] to perform feature alignment. AlignConv calculates the offsets from the anchor adjustments in the coarse stage, applies the offsets to the convolution kernel, and uses deformable convolution to perform the convolution operation on the feature map, thereby achieving feature alignment. Furthermore, ORConv [59] is used after feature alignment. ORConv captures features in different directions by rotating the same convolution kernel N times (we set N to 8) and performing a convolution with each rotated copy, with 1/N of the original number of channels in each direction so that the total number of channels in the feature map is unchanged.

C. Angle Encoding Regression Module
We propose a multiparameter encoding and decoding method that encodes an angle value into multiple regressable parameters (here, we use four parameters to introduce the method; in the following, unless otherwise stated, four parameters are used as the example). The whole angular range T is divided into four intervals bounded by θ_1, θ_2, θ_3, θ_4, and θ_5. In particular, θ_5 and θ_1 are the upper and lower limits of the angular range, which differ by T and represent the same direction. In direction detection, θ and θ+nT (n is an arbitrary integer) represent the same direction due to periodicity. Considering this periodicity and the subsequent decoding operations, the four parameters x_θ1, x_θ2, x_θ3, and x_θ4 correspond to θ_1+nT, θ_2+nT, θ_3+nT, and θ_4+nT, respectively. When encoding an angle value θ, the interval in which the angle lies is determined first, i.e., θ ∈ [θ_a, θ_b), where (a, b) ∈ {(1, 2), (2, 3), (3, 4), (4, 5)}. The values of the two corresponding parameters x_θa and x_θb are then determined according to the difference between θ and the two bounding angles θ_a and θ_b of that interval (for the interval (4, 5), x_θ5 is identified with x_θ1 because θ_5 and θ_1 represent the same direction). The larger the angle difference, the smaller the corresponding parameter, and the sum of the two parameters is 1. The parameters x_θa and x_θb are given by
$$x_{\theta_a} = \frac{\theta_b - \theta}{\theta_b - \theta_a}, \qquad x_{\theta_b} = \frac{\theta - \theta_a}{\theta_b - \theta_a}. \tag{1}$$
Finally, the other two parameters are set to 0, and the encoding of the angle θ is completed. A coding example is shown in Fig. 4: the interval [−π/2, π/2) with the range T of π is divided into four parts, bounded by −π/2, −π/4, 0, π/4, and π/2. The angle θ to be encoded lies in [−π/2, −π/4), and its differences from the boundary angles −π/2 and −π/4 are α and β, respectively. Its encoding result is
$$E(\theta) = \left(\frac{\beta}{\alpha+\beta},\ \frac{\alpha}{\alpha+\beta},\ 0,\ 0\right). \tag{2}$$
Based on the abovementioned encoding principle, it can be inferred that the upper and lower limits of the angle range, θ_1 and θ_5, have the same encoding value, and the encoding values of angles slightly larger than θ_1 and slightly smaller than θ_5 are close to each other and continuous at θ_1 and θ_5. From the above, it can be seen that the encoding value has the same continuity and periodicity as the direction to be detected, so the problem of discontinuity of the angle boundary is solved.
Fig. 4. Example of four-parameter angle coding method. The range of angle detection is [−π/2, π/2), which is divided into four intervals, bounded by −π/2, −π/4, 0, π/4, and π/2. The angle θ is encoded as E(θ) consisting of four parameters x_θ1, x_θ2, x_θ3, and x_θ4, each corresponding to −π/2+nπ, −π/4+nπ, 0+nπ, and π/4+nπ.
Algorithm 1: Angle decoding with four parameters.
Input: Parameters x_1, x_2, x_3, x_4 are acquired from the regression module; θ_1, θ_2, θ_3, θ_4, θ_5 are the angle interval bounds; T is the whole angular range of detection.
Output: θ is the decoded angle value.
The multiparameter angle encoding method makes angle prediction a multiparameter regression problem. As shown in Fig. 5(a), we designed the AERM to predict angle values. The module consists of three convolutional layers, with an activation layer following each of the first two. The input is a 256-channel feature map, and the output is a four-channel angle encoding map that can be divided into four maps X_1, X_2, X_3, and X_4. Each of the four maps corresponds to an encoding parameter. Since the sum of the four target encoded values is 1, the network performs a Softmax operation on these four channels for more efficient regression and decoding. The target encoding maps X_θ1, X_θ2, X_θ3, and X_θ4 are obtained from the angle map through four-parameter encoding. The value of each point on the angle map is the angle value of the object to which the point belongs (if a point does not belong to any object, no loss is calculated at that point). After the angles are predicted in this module, the angle values that need to be adjusted in the subsequent coarse and refined stages are no longer randomly distributed over the entire detection range but are clustered around 0°, and the angle discontinuity problem basically disappears. Therefore, the encoded values obtained by this module are decoded into an angle value and provided to the coarse stage for further adjustment.
It follows from the abovementioned encoding method that θ_b is larger than θ_a by T/4; then, according to (1), θ can be obtained by
$$\theta = x_{\theta_a}\theta_a + x_{\theta_b}\theta_b = \theta_a + \frac{T}{4}\,x_{\theta_b}. \tag{3}$$
Therefore, in order to decode the angle θ from the encoded values, it is necessary to determine which interval the angle θ lies in. Ideally, only one or two of the encoded values are nonzero, and it is easy to find the corresponding angle interval. In practice, however, all four values are nonzero because of the deviation of the network regression results. We therefore design the decoding process in Algorithm 1. Each pair of adjacent encoded values is summed in turn, and the pair with the largest sum is taken as the ideal nonzero pair, from which the corresponding angle interval is obtained. To make the decoding more robust, four boundary angle values are obtained by expanding outward from this interval to the two adjacent intervals, and they are then weighted and summed using their corresponding encoding parameters. In this way, all four encoded values obtained by the network regression participate in decoding the angle.
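The encoding and decoding steps described above can be sketched as follows. This is an illustrative reading of the prose, not the article's Algorithm 1: it assumes linear interpolation between the two interval bounds for encoding, treats the pair (x_4, x_1) as the pair belonging to the last interval, and normalizes by the sum of the parameters in case they do not sum exactly to 1. All names are hypothetical.

```python
import math

K = 4                     # number of encoding parameters (four in the text)
LOWER = -math.pi / 2      # theta_1, lower limit of the angular range
PERIOD = math.pi          # T, the whole angular range
STEP = PERIOD / K         # width of one interval, T/4

def encode_angle(theta):
    """Encode theta into K parameters, nonzero only at the two enclosing bounds."""
    q = ((theta - LOWER) % PERIOD) / STEP   # position measured in interval widths
    i = math.floor(q)
    frac = q - i                            # relative distance from the lower bound
    code = [0.0] * K
    code[i % K] += 1.0 - frac               # closer bound gets the larger value
    code[(i + 1) % K] += frac               # theta_5 wraps around to theta_1
    return code

def decode_angle(code):
    """Decode K (possibly noisy) parameters back to an angle in the range."""
    # Pick the adjacent pair (cyclically) with the largest sum.
    a = max(range(K), key=lambda i: code[i] + code[(i + 1) % K])
    theta_a = LOWER + a * STEP
    # Weighted sum over the interval bounds and one neighbor on each side,
    # unwrapped around theta_a so that the four angles are contiguous.
    theta = sum(code[(a + off) % K] * (theta_a + off * STEP) for off in (-1, 0, 1, 2))
    theta /= sum(code)
    return (theta - LOWER) % PERIOD + LOWER

if __name__ == "__main__":
    eps = 0.01
    print(encode_angle(-math.pi / 2 + eps))  # just above the lower boundary
    print(encode_angle(math.pi / 2 - eps))   # just below the upper boundary
    print(decode_angle([0.05, 0.6, 0.3, 0.05]))
```

The two printed encodings differ only slightly although the raw angles differ by almost π, which is the continuity property the encoding is meant to provide. The last call shows that a noisy prediction with four nonzero values still decodes to an angle inside the interval selected by the largest adjacent pair.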

D. Angle Channel Attention Module
In the coarse stage, we designed the ACAM to extract orientation-sensitive features; its structure is shown in Fig. 5(b). In this module, we design an attention mechanism based on the angle channel; other attention mechanisms have also attracted particular interest in the field of remote sensing in recent years [60], [61]. The input of this module is the feature map extracted by the backbone network, and the output is a new feature map with the same shape as the input. In ACAM, the four angle channels perform convolution operations using rotated kernels with angles θ_1, θ_2, θ_3, and θ_4 to extract features in the corresponding directions, generating four feature maps, respectively. In oriented object detection, objects with arbitrary directions have various directional features, so the degree of attention paid to the four angle channels should differ from object to object. For this reason, the angle channel attention mechanism is proposed, which assigns different weights to the four channels at each feature point and obtains the final feature map as the weighted sum of the feature maps generated by the four angle channels. The output feature map f_o(x) is this weighted sum, in which X_1, X_2, X_3, and X_4 are the weight maps corresponding to the four angle channels; they are concatenated to form an attention map with the shape h × w × 4. The angle of the object has already been predicted in AERM, so the four weight values at each feature point can be determined by the angle of the object to which the point belongs. The closer the angle of the convolution kernel used by a channel is to the angle of the object, the greater the weight of that channel, and the sum of the four weights is 1. Conveniently, the principle for setting the weight values is the same as the principle for setting the encoding parameters in AERM. Therefore, provided that the angles of the convolution kernels used by the four channels are the same as the first four boundary angles of the intervals in AERM, the four weights correspond one-to-one with the four encoding parameters. In this case, as shown in Fig. 5, the output of AERM can be directly used as the attention map.
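The weighted fusion described above can be written compactly as follows. This is a sketch under stated assumptions: the symbols F_{θ_i}(x) for the four angle-channel feature maps are introduced here for illustration, and the product is taken elementwise at each feature point and broadcast over the feature channels.

```latex
f_o(x) = \sum_{i=1}^{4} X_i \odot F_{\theta_i}(x),
\qquad
\sum_{i=1}^{4} X_i(p) = 1 \ \text{ at every feature point } p .
```

Because the weights at each point sum to 1, the output is a convex combination of the four directional responses, dominated by the channels whose kernel angles are closest to the predicted object angle.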
As illustrated in Fig. 6(a), the rotated convolution kernel is obtained by rotating the regular convolution kernel around its center point by a certain angle. The offsets from the horizontal regular kernel to the rotated kernel can be calculated from the size of the kernel and the rotation angle. As shown in Fig. 6(b), convolution on the feature map using this rotated kernel can extract features in a certain direction effectively, and rotated convolutions with different angles in different angle channels perform feature extraction in different directions. In addition, according to the angle of the preset anchors, the features in certain directions can be extracted more purposefully, and the generated feature maps are beneficial to the detection of rotated objects.
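A minimal sketch of the offset computation follows, assuming an odd kernel size, rotation about the kernel center, and a counterclockwise rotation in (x, y) coordinates. The function name is hypothetical. In practice such offsets would typically be passed to a deformable convolution, which samples the resulting fractional positions by bilinear interpolation.

```python
import math

def rotated_kernel_offsets(kernel_size, angle):
    """Offsets (dx, dy) that move each regular sampling point to its rotated position.

    Points are listed row by row, from top-left to bottom-right of the kernel.
    """
    r = kernel_size // 2
    c, s = math.cos(angle), math.sin(angle)
    offsets = []
    for j in range(-r, r + 1):          # row (y) position relative to the center
        for i in range(-r, r + 1):      # column (x) position relative to the center
            rx = i * c - j * s          # rotated x
            ry = i * s + j * c          # rotated y
            offsets.append((rx - i, ry - j))
    return offsets

if __name__ == "__main__":
    for off in rotated_kernel_offsets(3, math.pi / 4):
        print(off)
```

For a 3 × 3 kernel the center point always has zero offset, and an angle of 0 reproduces the regular kernel.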

E. Loss Function
The loss of the whole network consists of three components: the angle encoding regression loss, the coarse stage detection loss, and the refined stage detection loss. The loss function is defined in (5). In the first term of (5), the angle encoding regression loss, λ_1 is the balance coefficient, N_E is the total number of encoded values that have a target, m indexes the feature points that have target encoded values, n indexes the encoding parameters on a feature point, L_r is the regression loss, for which the smooth L1 loss is used, p^E_mn is the encoded value predicted by the network, and p*_mn is the target encoded value. The detection loss consists of the object classification loss and the bounding box regression loss, where the bounding box regression loss is computed from positive samples only. In the second and third terms of (5), λ_2 and λ_3 are the balance coefficients, N_C and N_R are the numbers of positive samples in the coarse and refined stages, respectively, i indexes the samples, L_c is the classification loss, for which focal loss is used, c^C_i and c^R_i are the category predictions for sample i in the two stages, l*_i is its ground-truth label, [l*_i ≥ 1] is the Iverson bracket indicator function, i.e., its value is 1 when i is a positive sample and 0 otherwise, x^C_i and x^R_i are the location predictions for sample i in the two stages, and g*_i is the corresponding ground truth.
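One form of (5) that is consistent with the symbol definitions above is sketched below. It is a reconstruction under assumptions rather than the article's exact equation: the grouping of the classification and regression terms inside each stage, and the absence of an extra weight between them, are assumed. The indicator is written with ≥ on the assumption that the background label is 0.

```latex
L = \frac{\lambda_1}{N_E} \sum_{m} \sum_{n} L_r\left(p^{E}_{mn},\, p^{*}_{mn}\right)
  + \frac{\lambda_2}{N_C} \sum_{i} \Big( L_c\left(c^{C}_{i},\, l^{*}_{i}\right)
      + \left[\, l^{*}_{i} \ge 1 \,\right] L_r\left(x^{C}_{i},\, g^{*}_{i}\right) \Big)
  + \frac{\lambda_3}{N_R} \sum_{i} \Big( L_c\left(c^{R}_{i},\, l^{*}_{i}\right)
      + \left[\, l^{*}_{i} \ge 1 \,\right] L_r\left(x^{R}_{i},\, g^{*}_{i}\right) \Big)
```

The three terms correspond, in order, to the angle encoding regression loss, the coarse stage detection loss, and the refined stage detection loss.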

IV. EXPERIMENTS AND ANALYSIS
A. Data Sets
1) DOTA-v1.0 [1]: This is a large-scale aerial remote sensing dataset for object detection. It contains 2806 aerial images collected from sources such as Google Earth and the JL-1 satellite, with 188282 ground objects in them. All the objects are grouped into 15 common categories: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC). Instances in this dataset are annotated with both HBBs and OBBs, and we use the OBB annotations for our experiments. The entire dataset is randomly divided into three parts: 1/2 is used as the training set, 1/6 as the validation set, and 1/3 as the test set.
The image sizes in DOTA vary widely, from 800 × 800 to 4000 × 4000 pixels. We crop the original images into a series of 1024 × 1024 patches with a stride of 824. In the experiments with multiscale data augmentation, the original images are resized at three scales (0.5, 1.0, and 1.5) and the cropping stride is changed to 512. If instances are truncated during cropping, we decide whether to keep them according to the method in [1]. In testing, the cropped images are fed into the network for detection, and the results are merged back onto the original-size image.
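The cropping scheme above amounts to a sliding window along each image axis. The sketch below computes the window origins along one axis; the helper name `patch_origins` and the handling of the last window (shifted back so that it ends at the image border) are assumptions, since the text only gives the patch size and the stride.

```python
def patch_origins(length, patch=1024, stride=824):
    """Start coordinates of sliding windows along one image axis.
    A stride of 824 with 1024-pixel patches gives a 200-pixel overlap."""
    if length <= patch:
        return [0]  # the image fits in a single patch along this axis
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] + patch < length:
        # Assumed border handling: add one last window aligned with the border.
        starts.append(length - patch)
    return starts

# Window origins along one side of a 4000-pixel image.
origins_4000 = patch_origins(4000)  # [0, 824, 1648, 2472, 2976]
```

Applying the same function to the width and the height and taking all pairs of origins gives the full grid of patches; with the multiscale setting, the stride argument would be 512 instead.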
2) HRSC2016 [62]: This is a high-resolution image dataset for ship detection that contains arbitrarily oriented ships on the open sea or near the coast. The images in the dataset are collected from Google Earth with resolutions ranging from 2 to 0.4 m and image sizes ranging from 300 × 300 to 1500 × 900 pixels. There are 1061 images in the dataset, including 436 images in the training set, 181 images in the validation set, and 444 images in the test set. We use the OBB annotations in the dataset for our experiments. All the images are resized to the range (512, 800) without changing their aspect ratio, i.e., each image has a short side of 512 pixels and a long side of up to 800 pixels.
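The resizing rule can be sketched as a single scale factor. The sketch assumes the common keep-ratio convention in which the scale is the largest value that keeps the short side at or below 512 and the long side at or below 800; the function name `keep_ratio_size` and the rounding are assumptions. Under this convention, an elongated image whose long side reaches the 800-pixel cap ends up with a short side below 512.

```python
def keep_ratio_size(width, height, short=512, long=800):
    """Target size that preserves the aspect ratio while keeping the
    short side <= `short` and the long side <= `long` (assumed convention)."""
    scale = min(short / min(width, height), long / max(width, height))
    return round(width * scale), round(height * scale)

# The smallest and largest image sizes reported for HRSC2016.
small = keep_ratio_size(300, 300)    # (512, 512)
large = keep_ratio_size(1500, 900)   # (800, 480)
```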

B. Implementation Details
This article uses ResNet50 FPN as the backbone network in the following experiments. The ResNet50 is initialized with parameters pretrained on ImageNet. Among the pyramidal feature maps produced by FPN, (P3, P4, P5, P6, P7) are selected to preset anchors of different scales. An anchor box with an aspect ratio of 1:1 is set on each feature point; its side length is four times the stride of the feature map (i.e., 32, 64, 128, 256, and 512), and its angle is determined by the network prediction. In the loss function, the balance parameter of the angle encoding regression loss is set to 0.1, and the other balance parameters are set to 1. The hyperparameters α and γ in focal loss are set to 0.25 and 2.0, respectively. For the matching strategy, the Intersection over Union (IoU) thresholds for foreground and background are set to 0.5 and 0.4, respectively, in both the coarse stage and the refined stage. In the training phase, a single NVIDIA 3080Ti GPU is used with the batch size set to 4. The SGD optimizer is used to update the parameters of the model; the initial learning rate is set to 0.005, the learning rate is reduced to 1/10 of its previous value each time it decays, and the momentum and weight decay are set to 0.9 and 0.0001, respectively. The network is trained for 18 epochs on the DOTA dataset and for 36 epochs on the HRSC2016 dataset.
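The anchor sizes and the matching thresholds above can be written down compactly. In this sketch the strides (8 to 128) follow from the stated side lengths being four times the stride; the names `anchor_side` and `assign_label`, the label values, and whether the thresholds are inclusive are assumptions.

```python
# Strides of P3 to P7 implied by side lengths 32..512 = 4 x stride.
STRIDES = {"P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}

def anchor_side(stride, scale=4):
    """Side length of the single square (1:1) anchor on a pyramid level."""
    return scale * stride

def assign_label(max_iou, fg_thr=0.5, bg_thr=0.4):
    """Label an anchor from its best IoU with any ground-truth box.
    Returns 1 (foreground), 0 (background), or -1 (ignored in training).
    Treating the thresholds as >= and < is an assumption."""
    if max_iou >= fg_thr:
        return 1
    if max_iou < bg_thr:
        return 0
    return -1

anchor_sides = [anchor_side(s) for s in STRIDES.values()]  # [32, 64, 128, 256, 512]
```

Anchors whose best IoU falls between the two thresholds are neither positive nor negative, which is the usual reason for using two thresholds rather than one.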
To prevent overfitting, we use horizontal flipping to increase the diversity of the training data, and when the data augmentation strategy is employed, we additionally use random rotation with zero padding and multiscale data augmentation. In the testing phase, we also use a single 3080Ti GPU for inference. We keep bounding boxes with classification scores greater than 0.05 and set the IoU threshold in rotated nonmaximum suppression to 0.1. In addition, considering that an image contains a limited number of objects, we set the upper limit of the number of detected objects in each image to 2000.
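The test-time filtering described above can be sketched as score thresholding followed by greedy nonmaximum suppression and a cap on the number of detections. The function names are hypothetical, the IoU function is passed in as a parameter, and `axis_aligned_iou` is only a stand-in used to exercise the logic: an actual rotated NMS would use a rotated-box IoU instead. Applying the suppression over all boxes at once rather than per class, and applying the cap after suppression, are simplifying assumptions.

```python
def axis_aligned_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes; a stand-in for rotated IoU."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def postprocess(boxes, scores, iou_fn, score_thr=0.05, iou_thr=0.1, max_det=2000):
    """Return indices of kept boxes, highest score first."""
    # Keep only boxes whose score exceeds the threshold, sorted by score.
    order = sorted((i for i, s in enumerate(scores) if s > score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it overlaps no already-kept box above iou_thr.
        if all(iou_fn(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
            if len(keep) == max_det:
                break
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.01]
kept = postprocess(boxes, scores, axis_aligned_iou)  # [0, 2]
```

With an IoU threshold as low as 0.1, almost any overlap with a higher-scoring box suppresses a detection, which suits oriented boxes that enclose objects tightly.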

C. Ablation Studies
We conduct ablation experiments on the DOTA dataset to verify the effectiveness of our method, using mAP as the criterion for evaluating performance. To compare the best results achieved by the various architectures, all ablation experiments below are performed using the data augmentation strategy described in Section IV-B.
1) Baseline: As a classical object detection network, RetinaNet fits the detection tasks in most scenarios and achieves good results. In our baseline, the refined stage is added to RetinaNet to pursue better results. We add an angle prediction channel to the regression branch of the detection head so that it can be used to detect OBBs. The training and test parameters of each part of the baseline are exactly the same as those of the network structures in the ablation experiments that follow. In the refined stage, we use AlignConv to realign the feature map. When an oriented box is used to locate an object, the object generally occupies a larger proportion of the box than when a horizontal box is used. Therefore, feature alignment within the oriented boxes plays an important role and significantly improves the feature representation of the object in the box. The detection results show that the RetinaNet-based network with the added refined stage performs well in rotated object detection. As shown in the first row of Table I, the mAP of the baseline network is 77.65% for the 15 types of objects on the DOTA dataset.
2) Effectiveness of ACAM in the Orientation-First Strategy: To evaluate the effectiveness of ACAM in OFRDet, we add this module to the coarse stage of the baseline method. To ensure that the attention mechanism still receives effective angle channel weights while keeping AERM from confounding the results, we replace the AERM that implements the prior angle detection with the angle detection method of the baseline, i.e., conventional angle value regression. As shown in the second row of Table I, with these settings the mAP of the detector over the 15 categories of rotated objects in the DOTA dataset is 78.67%, which is 1.02% higher than that of the baseline method. This improvement arises because ACAM extracts directional features more purposefully under the guidance of the angles predicted in advance, and the oriented anchors preset by the orientation-first strategy can be adjusted more easily to the ground-truth bounding boxes. In addition, the detection results for the individual categories are presented in Table I, where the APs of categories such as GTF, SV, SBF, and HC increase significantly. These objects have distinct features in different directions, which indicates that ACAM is more effective in extracting directional features.
3) Effectiveness of AERM: The complete OFRDet is used in this ablation experiment to evaluate the effectiveness of AERM. Compared with the previous experiment on the effectiveness of ACAM, OFRDet only changes the part that implements the prior angle detection to AERM, and the other parts remain unchanged. As shown in the third row of Table I, the mAP obtained by OFRDet is 79.56%, which is about 0.89% higher than the previous result without AERM and about 1.91% higher than the baseline method, a significant improvement. There are two main reasons for this improvement: first, the angle representation based on multiparameter encoding is easier for the network to learn when performing orientation-first detection; second, the multiparameter encoding fundamentally eliminates the boundary discontinuity problem of angle regression. The qualitative detection results of the two networks for rotated objects with angle values close to the upper and lower limits are shown in Fig. 7. It can be seen that OFRDet with AERM performs significantly better, and the bounding boxes in its detection results have smaller errors and more suitable orientations, which confirms that AERM effectively handles the boundary discontinuity problem of angle regression.
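The boundary discontinuity mentioned above can be made concrete with a small numerical example. The sketch does not implement AERM; it only contrasts the raw difference between two angle values with their actual orientation difference, assuming for illustration that box orientation is periodic with a period of 180 degrees. The function names and the period are assumptions.

```python
def raw_angle_diff(a_deg, b_deg):
    """Difference seen by a plain regression loss on the angle value."""
    return abs(a_deg - b_deg)

def orientation_diff(a_deg, b_deg, period=180.0):
    """Smallest difference between two orientations under an assumed period."""
    d = abs(a_deg - b_deg) % period
    return min(d, period - d)

# Two boxes near opposite ends of an assumed angle range of [-90, 90).
raw = raw_angle_diff(89.0, -90.0)      # 179.0
true = orientation_diff(89.0, -90.0)   # 1.0
```

Two nearly identical orientations therefore produce a large regression error when their angle values sit on opposite sides of the range boundary, which is the situation shown for objects with angles close to the upper and lower limits.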

4) Setting of the Number of Angle Channels and Encoding Parameters: In OFRDet, as described in Section III-D, the multiple parameters obtained by AERM to encode the angle values can be used directly as weights for the angle channels in ACAM. There is a one-to-one correspondence between angle channels and encoding parameters, so their numbers are the same. In the abovementioned ablation experiments, the number of angle channels and angle encoding parameters was set to 4. The influence of this number on the detection results is explored in the following experiments. In addition to the four-channel four-parameter setting, three-channel three-parameter and six-channel six-parameter settings are also tested. The experimental results are shown in Table II, where the mAPs under the three settings are 79.22%, 79.56%, and 79.18%, respectively, with 79.56% obtained by the four-channel four-parameter setting. These results show that the effect of the number of channels and parameters is relatively small compared with the improvement brought by ACAM and AERM. Moreover, the four-channel four-parameter setting detects better than the three-channel three-parameter and six-channel six-parameter settings, which shows that a moderate angle interval maximizes the ability of ACAM to extract directional features. If the angle interval is too large, the extracted directional features are incomplete, and if it is too small, the extracted directional features are redundant.

D. Comparisons With the State-of-the-Art
In this section, we compare the proposed OFRDet with other state-of-the-art detection methods on two datasets, namely DOTA and HRSC2016. The datasets and the experimental details are described in Sections IV-A and IV-B, respectively.

1) Complexity and Speed Comparison: We compare our method with other methods in terms of speed and complexity, and the comparison results are shown in Table III. The comparative experiments are carried out on the DOTA dataset, and the cropped image patches of size 1024 × 1024 are detected.
We measure the speed and complexity of the detector by the number of frames per second (FPS) and the number of model parameters, respectively. The FPS shown here is the average FPS obtained after detecting the entire validation set of 5297 split images. For fairness, inference for all the methods is run with a batch size of 1 on a single RTX 3080Ti. The detection speed of our method is 13.8 FPS and the model size is 38.21M. It can be seen from Table III that both the detection speed and the model size of our method are at the middle level among the compared methods.
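The average FPS described above is the number of processed images divided by the total inference time. A minimal timing sketch follows; `detect` is a hypothetical callable standing in for one forward pass of a detector on a single image (batch size of 1), and the dummy detector below exists only so that the sketch runs.

```python
import time

def average_fps(detect, images):
    """Average frames per second over a set of images, one image per call."""
    start = time.perf_counter()
    for image in images:
        detect(image)  # hypothetical single-image inference
    elapsed = time.perf_counter() - start
    return len(images) / elapsed

def dummy_detect(image):
    """Placeholder detector that only waits for about one millisecond."""
    time.sleep(0.001)

fps = average_fps(dummy_detect, list(range(5)))
```

When the detector runs on a GPU, asynchronous execution usually has to be synchronized before the clock is read, otherwise the measured time can be too short.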
2) Results on DOTA: On the DOTA dataset, we compare our method with a variety of advanced or classical methods in the single-scale and multiscale settings, and the results are shown in Table IV. Among these methods, FR-O and RetinaNet-O are implemented by adding angle prediction channels to the bounding box regression branch of the classical computer vision algorithms Faster-RCNN [18] and RetinaNet [23], respectively. The other methods are specifically proposed to detect rotated objects in remote sensing images. CAD-Net [10] learns global and local contextual information of objects by computing their correlations with the global scene and local adjacent features. DAL [13] is a dynamic anchor learning method that uses a new matching mechanism to evaluate anchors and assign them more efficient labels. S²A-Net [14] uses a new alignment convolution, which can adaptively align convolution features according to anchors. FoRDet [15] leverages the information of foreground regions from the perspectives of feature and optimization. Different from the compared methods, our method proposes a new multiparameter angle encoding and an angle channel attention mechanism to enhance the angle regression and directional feature extraction of the network, so as to improve the detection of rotated objects. Our proposed OFRDet achieves an mAP of 74.19% on the single-scale dataset and 79.56% on the multiscale dataset. We achieve the best results in 7 of the 15 categories among the compared methods, and it is worth noting that our results have a large lead in the detection of GTF, SBF, and HC. The directionality of these classes of objects is obvious, indicating that our detector has a strong ability in direction detection. The qualitative detection results produced by our method on some images of the DOTA dataset are shown in Fig. 8.
Although we preset only one anchor on each feature point, we finally obtain accurate detection boxes that closely surround the objects, even for objects that differ greatly in scale or are densely arranged, which shows the effectiveness of the preset rotated anchor. For objects with arbitrary orientations in complex environments, our method can assign the bounding boxes a suitable orientation with fewer errors to complete the detection.
3) Results on HRSC2016: OFRDet is compared with various methods on the HRSC2016 dataset, and the comparison results are shown in Table V. Among these methods, R2CNN [67], Rotated RPN [68], and SBD [69] were proposed in the field of computer vision to detect slanted text. The other methods were proposed to detect rotated objects in RSIs. It is worth noting that we use the PASCAL VOC2012 metric to calculate mAP for this dataset. Qualitative detection results are shown in Fig. 9, from which it can be seen that OFRDet consistently produces a suitable OBB that tightly encloses ships with arbitrary orientations, even though some ships differ greatly in scale or are densely arranged. Across different environments such as harbor, coast, and sea, the method completes the detection with high quality.
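For reference, the PASCAL VOC2012 metric mentioned above computes the AP of a class as the area under the precision-recall curve after the precision has been made monotonically nonincreasing, in contrast to the 11-point interpolation of VOC2007. The sketch below assumes that the recall and precision lists are already computed from detections sorted by descending score; the function name `voc12_ap` is hypothetical.

```python
def voc12_ap(recall, precision):
    """Area under the precision-recall curve with the precision envelope.
    recall must be nondecreasing; both lists have the same length."""
    mrec = [0.0] + list(recall) + [1.0]
    mpre = [0.0] + list(precision) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum the rectangle areas where the recall changes.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]
               for i in range(len(mrec) - 1)
               if mrec[i + 1] != mrec[i])

# Two detections: the first is correct, the second recovers the remaining recall.
ap = voc12_ap([0.5, 1.0], [1.0, 0.5])  # 0.75
```

The mAP is then the mean of the per-class AP values; for a single-class dataset it coincides with the AP of that class.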

E. Limitations of the Method
Although OFRDet has obtained competitive experimental results, it still has limitations in object detection that cannot be ignored. As shown in Fig. 10(a), OFRDet correctly detects large vehicles and small vehicles in most cases; however, when the visual features of the two are very similar, the detected objects can be misclassified. In addition, as can be seen from Fig. 10(b), the feature information of extremely small objects is insufficient, so these objects blend into the background and fail to be detected. The abovementioned problems are extremely challenging in RSI detection and test the feature extraction and discrimination ability of the detection network, which requires further specialized research.

V. CONCLUSION
This article provides a strategy for the detection of rotated objects in RSIs, namely orientation-first strategy, and OFRDet is proposed based on this strategy. In OFRDet, the ACAM is proposed to extract the orientation features of objects more accurately, thereby improving the regression accuracy of OBB. The AERM is proposed to solve the problem of discontinuous boundary in angle prediction, so as to obtain more accurate angle information of objects. We demonstrate the effectiveness of our proposed method on the DOTA and HRSC2016 datasets through extensive experiments.