Arbitrary-Oriented Ellipse Detector for Ship Detection in Remote Sensing Images

In arbitrary-oriented object detection, describing a rotated bounding box by an angle gives rise to the angular periodicity problem: the same object can have several different numerical representations, which makes rotated bounding box regression uncertain. To eliminate the angular periodicity problem, in this article, we propose a novel and simple ellipse parameters representation method for arbitrary-oriented objects, which hides the angle of the object in the focal vector of an ellipse to avoid direct angle prediction. Moreover, the proposed representation gives each arbitrary-oriented object only one numerical representation, which helps alleviate the uncertainty of bounding box regression. To adapt to the proposed ellipse parameters representation method, we adopt 2-D Gaussian distribution label assignment for coarse sample selection, and then use the Kullback–Leibler divergence loss and SimOTA to refine the coarse samples and obtain the best positive samples. We extend YOLOX with medium parameters into an oriented ship detector based on the proposed ellipse parameters representation method, and conduct experiments on HRSC2016, RSDD-SAR, and RHRSID to demonstrate its effectiveness. The experimental results show that the proposed representation method achieves impressive results compared with state-of-the-art arbitrary-oriented object detection methods.

orientation. Unlike generic object detection, which introduces too much background information, arbitrary-oriented object detection can obtain more accurate object information. Owing to the rapid development of deep learning, arbitrary-oriented object detection has also made great progress [5], [6], [7], [8] in recent years. Among these methods, the five-parameter regression-based methods are the mainstream; they predict an extra angle parameter in addition to the original horizontal bounding box parameters. However, predicting the angle of an object directly leads to the issues of boundary discontinuity and angular periodicity [5], [9], [10], [11].
Hiding the angle in a vector is one way to solve the problems caused by direct angle prediction. Yi et al. [12] described an oriented object via the vectors from the center point to the four sides, which provides a new idea for arbitrary-oriented object detection and achieves competitive detection performance. However, there are two disadvantages: 1) to handle corner cases, the oriented bounding box is grouped into two categories, a) horizontal bounding box and b) rotational bounding box, which increases the complexity of processing oriented bounding boxes; and 2) there are at least ten parameters to predict, which is too many compared with five-parameter regression-based methods. In [13], He et al. also adopted the idea of hiding the angle in a vector. However, in [13], the order of the four vertices is based on the values of x_i, i = 1, 2, 3, 4, i.e., (x_1 ≤ x_2 < x_3 ≤ x_4). The obvious problem is that the order of the four vertices cannot be determined from the values of x_i alone, because x_2 can also be equal to x_3 in practice. When x_2 = x_3, the order of x_2 and x_3 cannot be determined, which leads to ambiguity in the oriented bounding box representation.
In order to solve the problems of [12] and [13] and avoid the problems caused by predicting the angle directly, in this article, we propose a new oriented bounding box representation method for arbitrary-oriented object detection by hiding the angle in the focal vector of an ellipse. Specifically, we predict the center point (x, y), the absolute value (|u|, |v|) of the focal vector (u, v), the difference (m) between the long-axis length and the length of the focal vector, and a binary value (α) that indicates whether the orientation of the object belongs to the first and third quadrants or to the second and fourth quadrants. The definition of the binary value (α) eliminates the corner cases in [13], which simplifies the judgment of the oriented bounding box. We predict the sign (α) by regarding it as a classification task; therefore, there are seven parameters in total to describe the oriented bounding box, which is fewer than the parameters of BBAVectors [12]. Moreover, according to the proposed method, we give a simpler and complete method to determine the long axis and short axis instead of the incomplete constraint on four vertices in [13]. The proposed ellipse parameters representation method makes an object have only one numerical representation, which can reduce the uncertainty of the regression of the predicted parameters in the training phase. Finally, we summarize our contributions as follows.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
1) We propose a new arbitrary-oriented object representation method to avoid the problems caused by predicting the angle directly, which hides the angle in a vector and makes the object have only one numerical representation.
2) We determine the long axis and short axis of an oriented object by comparing the lengths of the two side vectors of the oriented bounding box, which describes an oriented object without ambiguity instead of depending on the order of the four vertices as in [13].
3) We employ a three-stage sample selection strategy to assign positive samples to ground truths. First, we adopt 2-D Gaussian distribution label assignment [14] to generate the coarse positive samples. Second, we calculate the Kullback-Leibler divergence (KLD) loss [15] between the coarse positive samples and the ground truths. Finally, according to the obtained KLD losses, we use SimOTA [16] to choose the optimal positive samples for each ground truth.
4) We extend YOLOX [16] to detect oriented objects. To distinguish it from YOLOX, we name it the arbitrary-oriented ellipse detector (AEDet), which is a single-stage and anchor-free detector. Our AEDet achieves state-of-the-art results on the HRSC2016 [17], RSDD-SAR [18], and RHRSID [19] datasets.
The rest of this article is organized as follows. The related works are introduced in Section II. A detailed description of the proposed method is presented in Section III. The results and analysis are presented in Section IV. Finally, Section V concludes this article.

A. Angle Prediction-Based Methods
The angle prediction-based methods are implemented by adding an extra angle parameter to general vanilla detectors [3], [20], [21], [22]. To resolve the inconsistency between the classification and localization tasks in object detection, Han et al. [23] proposed S2ANet, in which the detection head is divided into two separate parts: one predicts the object category, and the other predicts the location and angle of the object. To handle the angles of objects, DRBox [24] is a detector built on a newly defined rotatable bounding box (RBox). DRBox achieves the rotation-invariant property to better learn the correct angle information of objects. Ming et al. [25] proposed a critical feature capturing network, which can extract more discriminative features, generate more precise anchors, and make more reasonable label assignments. The periodicity and discontinuity of orientation angle regression lead to abnormal loss in detection networks during training; therefore, Yang et al. [5], [9] treated the regression of the angle as classification to overcome the periodicity and discontinuity of direct angle regression. The authors of [10], [15], and [26] optimized arbitrary-oriented object detection networks by designing better loss functions, which improves the detection performance efficiently.

B. Keypoints Regression-Based Methods
To address the imbalance between positive and negative samples, Yi et al. [12] extended the keypoint-based method [22] to suit the arbitrary-oriented object detection task. Specifically, to obtain bounding boxes with arbitrary angles, Yi et al. first detected the center of the object and then regressed the boundary-aware vectors of the oriented bounding box. All the oriented objects are within Cartesian coordinates. Li et al. [27] adopted an adaptive points representation to extract the geometric information of object instances, which is used for aerial arbitrary-oriented object detection. Fu et al. [7] constructed a novel point-based arbitrary-oriented object detector capable of employing spatial information explicitly, which uses a point-based representation to describe an oriented object and predicts point locations via a fully convolutional network. Other keypoint-based arbitrary-oriented object detection methods are described in [28], [29], and [30], and they all achieve excellent detection performance.

C. Other Regression-Based Methods
To better represent an arbitrary-oriented object, Xu et al. [6] obtained the circumscribed horizontal bounding box from the four vertices of the arbitrary-oriented object, and then used the offsets of the four vertices on each corresponding side of the horizontal bounding box to describe the object. Xie et al. [31] proposed a midpoint offset representation to describe oriented objects, which regresses each oriented proposal by predicting its external rectangle and inferring its midpoint offsets. In [32], Wu et al. proposed an arbitrary-oriented object detector, ProjBB, which employs a six-parameter method to represent the oriented bounding box. He et al. [13] described an oriented object via the center point, the offsets of the major semiaxis, the length of the minor semiaxis, and an orientation label. In [33], Zhou et al. adopted polar coordinates to represent the oriented object for the first time, and the proposed polar remote sensing object detector reaches impressive detection accuracy. The authors of [34] and [35] also presented arbitrary-oriented object detection methods based on the polar coordinate system.

A. Ellipse Representation of Objects
The ellipse representation of an object is shown in Fig. 1, where the object angle is decomposed into the two components (u, v) of the focal vector. Each component in (u, v) can be positive or negative, so we separate the orientation of the object from the focal vector and predict the absolute value (|u|, |v|) of the focal vector (u, v). The separated orientation of an object can only be in the first and third quadrants, in the second and fourth quadrants, or coincident with a coordinate axis. When an object is in the first and third quadrants, as shown in Fig. 1(a), the horizontal component (u) and the vertical component (v) have the same sign. When an object is in the second and fourth quadrants, as shown in Fig. 1(b), one of the horizontal component (u) and the vertical component (v) is negative and the other is positive. When the separated orientation of an object coincides with a coordinate axis, one of |u| and |v| is zero. The separated orientation α = 1 means that the orientation of the object belongs to the first and third quadrants; α = 0 means that the orientation of the object belongs to the second and fourth quadrants. For convenience, α = 1 is also used when the orientation of the object coincides with a coordinate axis. To represent an oriented object, the center point (x, y) is necessary. In addition, we only predict the parameters related to the long axis; therefore, the difference (m) between the length of the long axis and the length of the focal vector is used as one of the key prediction parameters. Finally, we predict (x, y, |u|, |v|, m, α) to determine the location of an object.

B. Conversion Between Rectangle and Ellipse Representations
In this section, we introduce how to transform the four vertices (x_i, y_i) of an oriented bounding box, where i = 1, 2, 3, 4, into the parameters (x, y, |u|, |v|, m, α). It should be noted that if the rectangle is represented as (x, y, w, h, θ), it first needs to be converted to the four vertices (x_i, y_i). In our method, we need to determine the long axis and short axis without ambiguity. Therefore, we first obtain the center point C(x, y) and the vectors $\overrightarrow{CD}$ and $\overrightarrow{CE}$ from the center point to two adjacent vertices, as shown in Fig. 2.
We can see from Fig. 2 that once the three points C, D, and E are known, we can obtain the two side vectors of the rectangle. After we obtain the two side vectors, we compare their lengths and select the longer one as the long axis of the ellipse to calculate the ellipse parameters of the object. Our method is complete for determining the long axis and short axis of the ellipse, unlike the incomplete constraints in [13]. The detailed algorithm for converting four vertices to ellipse parameters is shown in Algorithm 1.
We can see from Algorithm 1 that when the orientation of the object coincides with a coordinate axis, α = 1. In addition, the method for determining the long axis and short axis is complete, and no constraints are left out.
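Since Algorithm 1 is not reproduced in this text, the following Python function is only a minimal sketch of the conversion described above, not the authors' exact algorithm. It assumes that the vertices are ordered around the box, that the ellipse semi-axes equal half of the rectangle's side lengths, that "long-axis length" means the semi-major axis a (so m = a - c with c the focal length), and that u·v ≥ 0 corresponds to the first and third quadrants. The function name `vertices_to_ellipse` is hypothetical.

```python
import math

def vertices_to_ellipse(vertices):
    """Map four ordered rectangle vertices to (x, y, |u|, |v|, m, alpha)."""
    # Center point C is the mean of the four vertices.
    cx = sum(p[0] for p in vertices) / 4.0
    cy = sum(p[1] for p in vertices) / 4.0
    # Vectors from C to two adjacent vertices D and E.
    cd = (vertices[0][0] - cx, vertices[0][1] - cy)
    ce = (vertices[1][0] - cx, vertices[1][1] - cy)
    # CD - CE spans the side ED; CD + CE is parallel to the other side
    # and has that side's length, so these are the two side vectors.
    side_1 = (cd[0] - ce[0], cd[1] - ce[1])
    side_2 = (cd[0] + ce[0], cd[1] + ce[1])
    len_1 = math.hypot(side_1[0], side_1[1])
    len_2 = math.hypot(side_2[0], side_2[1])
    # The longer side vector defines the long axis; no vertex ordering
    # by coordinate value is needed.
    if len_1 >= len_2:
        long_vec, long_len, short_len = side_1, len_1, len_2
    else:
        long_vec, long_len, short_len = side_2, len_2, len_1
    a = long_len / 2.0   # assumed semi-major axis
    b = short_len / 2.0  # assumed semi-minor axis
    c = math.sqrt(max(a * a - b * b, 0.0))  # focal length
    if long_len > 0.0:
        u = c * long_vec[0] / long_len
        v = c * long_vec[1] / long_len
    else:
        u = v = 0.0
    # alpha = 1: first/third quadrants or axis-aligned (u * v == 0).
    alpha = 1 if u * v >= 0.0 else 0
    m = a - c
    return cx, cy, abs(u), abs(v), m, alpha
```

Under these assumptions, a 10 × 6 axis-aligned box centered at the origin maps to (0, 0, 4, 0, 1, 1), because a = 5, b = 3, and c = 4.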
According to Algorithm 1, we can also transform the ellipse representation back to the rectangle representation, as shown in Algorithm 2.
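Algorithm 2 is likewise not reproduced here, so the inverse conversion can only be sketched under stated assumptions: the semi-major axis is a = m + c with c = sqrt(|u|² + |v|²), the semi-minor axis is b = sqrt(a² - c²), and α = 0 flips the sign of the vertical focal component. The function name `ellipse_to_vertices` and the vertex order it returns are hypothetical.

```python
import math

def ellipse_to_vertices(x, y, abs_u, abs_v, m, alpha):
    """Map (x, y, |u|, |v|, m, alpha) back to four rectangle vertices."""
    c = math.hypot(abs_u, abs_v)             # focal length
    a = m + c                                # assumed semi-major axis
    b = math.sqrt(max(a * a - c * c, 0.0))   # assumed semi-minor axis
    if c > 0.0:
        # alpha = 1: u and v share a sign; alpha = 0: opposite signs.
        sign = 1.0 if alpha == 1 else -1.0
        dx, dy = abs_u / c, sign * abs_v / c
    else:
        # Square box: the focal vector vanishes and the orientation is
        # not encoded, so an axis-aligned box is assumed here.
        dx, dy = 1.0, 0.0
    px, py = -dy, dx  # unit vector perpendicular to the long axis
    return [
        (x + a * dx + b * px, y + a * dy + b * py),
        (x + a * dx - b * px, y + a * dy - b * py),
        (x - a * dx - b * px, y - a * dy - b * py),
        (x - a * dx + b * px, y - a * dy + b * py),
    ]
```

For example, `ellipse_to_vertices(0, 0, 4, 0, 1, 1)` gives the corners of a 10 × 6 axis-aligned box centered at the origin, since a = 5 and b = 3.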

C. Object Detector With Ellipse Parameters
We apply the ellipse parameters representation method to the single-stage and anchor-free object detection network YOLOX to construct an arbitrary-oriented object detector, AEDet. We give an overview of our method with ellipse parameters prediction in Fig. 3. The feature pyramid network used in YOLOX is from [36]; we denote it as PAFPN.
We can see from Fig. 3 that the "multitask subnets" have four output branches. "N" represents the number of categories. "O" is the constant 2 and predicts the orientation of the object (α_1, α_2), where α_1 means that the long axis of the object is within the first and third quadrants and α_2 means that the long axis of the object is within the second and fourth quadrants. "E" is the constant 5 and predicts the center point (x, y), the absolute values (|u|, |v|) of the focal vector components, and the difference (m) between the long-axis length and the norm of the focal vector. "C" is the constant 1 and represents the confidence of the object. The smooth L1 loss function is used for the location regression of the object. Focal loss is used for the classification loss, confidence loss, and orientation loss.
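To make the output layout concrete, the following sketch splits one per-location prediction vector into the four parts named above. The channel order (N, O, E, C), the helper name `split_head_output`, and the rule of picking the larger of the two orientation scores are assumptions made for illustration; the source does not specify them.

```python
def split_head_output(vector, num_classes):
    """Split one per-location prediction into N, O, E, and C parts."""
    n, o, e, c = num_classes, 2, 5, 1
    if len(vector) != n + o + e + c:
        raise ValueError("unexpected prediction length")
    class_scores = vector[:n]                      # "N": category scores
    alpha_scores = vector[n:n + o]                 # "O": (alpha_1, alpha_2)
    x, y, abs_u, abs_v, m = vector[n + o:n + o + e]  # "E": ellipse params
    confidence = vector[n + o + e]                 # "C": objectness
    # Assumed decoding: alpha = 1 when the first/third-quadrant score
    # is at least as large as the second/fourth-quadrant score.
    alpha = 1 if alpha_scores[0] >= alpha_scores[1] else 0
    return {
        "class_scores": class_scores,
        "ellipse": (x, y, abs_u, abs_v, m, alpha),
        "confidence": confidence,
    }
```

For a single-category ship detector (N = 1), each location would therefore carry 1 + 2 + 5 + 1 = 9 values under this layout.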

D. Positive and Negative Samples Selection
Because the proposed AEDet is constructed by extending YOLOX for arbitrary-oriented object detection, we need to redesign the way positive and negative samples are chosen. According to the proposed ellipse parameters representation method, we adopt a three-stage strategy of 2-D Gaussian distribution label assignment [14] + KLD loss [15] + SimOTA [16] to determine the negative and positive samples. First, we borrow the idea of 2-D Gaussian distribution label assignment, with the difference that we select all the positions in the ellipse area as the coarse positive samples instead of generating the Gaussian candidate region according to a threshold. Second, we calculate the KLD loss between the coarse positive samples and the corresponding ground truths. Finally, according to the obtained KLD losses, we use SimOTA to dynamically assign the best positive samples to the corresponding ground truth.
Fig. 3. Architecture of the proposed AEDet, where "N" for prediction of category, "O" for prediction of orientations, "E" for prediction of location, and "C" for prediction of confidence.
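The first stage, selecting every position inside the ellipse, can be sketched with the focal property of an ellipse: a point lies inside when the sum of its distances to the two foci is at most 2a. The sketch below assumes a = m + c (semi-major axis) and recovers the signed focal vector from α as in the representation above; the function name `coarse_positive_mask` and the list-of-points interface are hypothetical, and the later KLD and SimOTA stages are not shown.

```python
import math

def coarse_positive_mask(points, x, y, abs_u, abs_v, m, alpha):
    """Return True for each grid point that lies inside the ellipse."""
    # Recover the signed focal vector: alpha = 0 means u and v differ in sign.
    u = abs_u
    v = abs_v if alpha == 1 else -abs_v
    c = math.hypot(u, v)
    a = m + c  # assumed semi-major axis
    f1 = (x + u, y + v)
    f2 = (x - u, y - v)
    mask = []
    for px, py in points:
        d1 = math.hypot(px - f1[0], py - f1[1])
        d2 = math.hypot(px - f2[0], py - f2[1])
        # Inside or on the ellipse: focal distance sum does not exceed 2a.
        mask.append(d1 + d2 <= 2.0 * a)
    return mask
```

One convenient aspect of this test is that it works directly on the predicted parameters, without first converting the ellipse back to a rotated rectangle.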

E. Bounding Box Regression
For a ground truth (x, y, |u|, |v|, m, α), we denote its corresponding positive sample as (x_a, y_a) and the predicted box as (x*, y*, |u*|, |v*|, m*, α*). In general, the prediction parameters are nonzero, so the more intuitive and simple bounding box regression is given by (3) to (12), where "eps" is a very small number. We can see from Algorithm 1 that one or both of |u| and |v| in the ground truth (x, y, |u|, |v|, m, α) can be zero, which makes it difficult to detect objects with horizontal bounding boxes in the training phase. In essence, the training loss converges slowly because the loss values of the predicted variables differ greatly in the regression loss function. To solve this problem, we modify the bounding box regression to (13) to (22).

A. Datasets
HRSC2016 Dataset: HRSC2016 includes 1061 images, of which 436 form the training set, 181 the validation set, and 444 the test set. In the training phase, we combine the training set and validation set for training.
RSDD-SAR Dataset: The RSDD-SAR dataset consists of 7000 images of size 512 × 512, with a total of 10,263 ship instances. The training set contains 5000 images, and the test set contains 2000 images. In the test set, 159 images belong to the inshore set and 1841 images belong to the offshore set.
RHRSID Dataset: The rotation annotations of RHRSID are obtained by taking the minimum circumscribed rectangle of the instance segmentation annotations. There are a total of 5604 SAR images in the HRSID dataset. Of the 1962 images in the test set, 369 belong to the inshore set and 1593 belong to the offshore set.

B. Experimental Settings
Because the proposed AEDet is built on YOLOX with medium parameters, most of the hyperparameters are the same as those of YOLOX. In the training phase, we disable multiscale training; the learning rate per image is 0.0003125, and the beta value in the SmoothL1 loss is 0.2. The number of epochs and the batch size are 36 and 16, respectively, for all three datasets. In the testing phase, the confidence threshold is 0.05 and the NMS threshold is 0.1. To make AEDet training converge quickly, we use the pretrained weights provided by the authors of YOLOX. We use two NVIDIA TITAN Xp GPUs in all experiments.
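For reference, the settings listed above can be collected into a small configuration sketch. The dictionary keys are hypothetical names chosen for illustration and are not YOLOX option names; the only derived quantity is the total learning rate, which assumes the per-image rate is scaled linearly by the batch size.

```python
# Hypothetical key names; the values are the ones stated in the text.
settings = {
    "lr_per_image": 0.0003125,
    "batch_size": 16,
    "epochs": 36,
    "smooth_l1_beta": 0.2,
    "multiscale_training": False,
    "test_conf_threshold": 0.05,
    "test_nms_threshold": 0.1,
    "num_gpus": 2,
}

# Assumed linear scaling rule: total rate = per-image rate * batch size.
total_lr = settings["lr_per_image"] * settings["batch_size"]
images_per_gpu = settings["batch_size"] // settings["num_gpus"]
print(f"total learning rate: {total_lr:.4f}, images per GPU: {images_per_gpu}")
```

Under this assumption, the total learning rate is 0.005 and each of the two GPUs processes 8 images per step.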

C. Experimental Results on HRSC2016 Dataset
We evaluate our AEDet with two different versions (VOC2007 and VOC2012 metrics) of the VOC evaluation metric [37]. Table I lists the results of AEDet and some other algorithms, where the proposed AEDet achieves the best results of 90.45% and 96.9% in the VOC2007 and VOC2012 mAP metrics, respectively. The experimental results show that our method achieves competitive results and outperforms most state-of-the-art algorithms, which demonstrates the effectiveness of the proposed ellipse parameters representation method.
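The two VOC metrics differ in how average precision is computed from the precision-recall curve: the VOC2007 metric averages the interpolated precision at 11 recall thresholds, whereas the VOC2012 metric integrates the area under the monotonically decreasing precision envelope. The following is a minimal sketch of both computations, not the evaluation code used in the experiments; the function name `voc_ap` and its list-based interface are assumptions.

```python
def voc_ap(recall, precision, use_07_metric=False):
    """Average precision from matched recall/precision lists."""
    if use_07_metric:
        # VOC2007: mean of the maximum precision at recall >= t
        # for t in {0.0, 0.1, ..., 1.0}.
        ap = 0.0
        for i in range(11):
            t = i / 10.0
            p = max((p_ for r_, p_ in zip(recall, precision) if r_ >= t),
                    default=0.0)
            ap += p / 11.0
        return ap
    # VOC2012: area under the precision envelope.
    mrec = [0.0] + list(recall) + [1.0]
    mpre = [0.0] + list(precision) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum the rectangle areas between consecutive recall values.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]
               for i in range(len(mrec) - 1))
```

Because the two metrics sample the curve differently, the same detector generally receives different scores under each, which is why both values are reported for HRSC2016.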
We also show some visual results in Fig. 4. It should be noted that the objects in the HRSC2016 dataset have large aspect ratios, which is unfavorable for object detectors that use angle regression, because a slight angle change causes a large change in the regression loss. However, Table I and Fig. 4 show that the localization of the bounding boxes is fairly accurate, especially for objects with large aspect ratios, which means that our ellipse parameters representation method can effectively address the shortcomings of direct angle regression methods by avoiding direct angle prediction.

D. Experimental Results on RSDD-SAR Dataset
The RSDD-SAR dataset is divided into training and test sets, and the test set is further divided into an inshore set and an offshore set. Therefore, we report experimental results on the test set, inshore set, and offshore set in the evaluation phase. We use the code from MMRotate [45] to generate the results of the comparative methods. As shown in Table II, our AEDet obtains the best results on the test set, inshore set, and offshore set under the VOC2007 mAP metric. In the inshore scene in particular, we obtain 77.8% mAP, which exceeds the second-best result by 2.2%. The experimental results of AEDet on RSDD-SAR show competitive performance compared with other algorithms, which highlights the superiority of our ellipse parameters representation method.
We show some examples of detection results in Fig. 5. In the complex inshore regions, AEDet has very few false detections and missed detections, which means that the proposed AEDet can distinguish ship objects from the complicated background well. In the simple offshore regions, AEDet unsurprisingly performs well. In addition, it is worth noting that the objects in the RSDD-SAR dataset are relatively small compared with those in the HRSC2016 dataset, which means that our ellipse parameters representation method can describe objects of different scales efficiently.

E. Experimental Results on RHRSID Dataset
We obtain the comparative experimental results on the RHRSID dataset by using the code from MMRotate [45]. The results on the test set, inshore set, and offshore set are shown in Table III, where AEDet achieves the best mAP of 88.2% on the test set, outperforming the other comparison methods. On the inshore set, we achieve the best mAP of 76.5%, which is far higher than the second-best result. On the offshore set, we also obtain the best result. The results on the inshore sets show that AEDet is highly advantageous in detecting inshore objects; this conclusion is consistent with the results on RSDD-SAR. We also show some detection examples of AEDet in Fig. 6, in which the proposed AEDet shows a remarkable capability for detecting inshore ship objects.

F. Effect of Bounding Box Regression
As described in the previous section, when the orientation of the object coincides with a coordinate axis, one of |u| and |v| in the ground truth (x, y, |u|, |v|, m, α) will be zero, which, according to theoretical analysis, makes formulas (3) to (12) and (13) to (22) perform differently. In this section, we verify the theoretical analysis by conducting experiments on two datasets. One is RHRSID. The other is the STORAGETANK dataset, which is constructed by choosing the storage tanks from the RDIOR [47] dataset; some example images are shown in Fig. 7. We can see that the bounding boxes of storage tanks are horizontal, which makes the dataset suitable for verifying the effects of the different ways of bounding box regression.
In the experimental settings of the STORAGETANK dataset, the number of epochs is set to 36, the β in smooth L1 is set to 0.2, and the image size is 800 × 800. The other hyperparameters are [...] Table IV, where AEDet_EPS represents the bounding box regression of (3) to (12), and AEDet represents the bounding box regression of (13) to (22). From Table IV, it is obvious that the proposed AEDet improves greatly on the two datasets compared with AEDet_EPS, which demonstrates that the bounding box regression of (13) to (22) is much better than that of (3) to (12). The results in Table IV verify the theoretical analysis of the different ways of bounding box regression. We visualize the results of AEDet on the STORAGETANK dataset in Fig. 8, which shows the proposed AEDet can handle well the

G. Comparison of Related Work Results
To facilitate comparison with BBAVectors [12] and RIEDet [13], we conduct experiments on HRSC2016. As shown in Table V, AEDet obtains the best result when only the bounding box representation method is replaced, which verifies the effectiveness of the proposed oriented representation method.

H. Comparison With Angle-Based Method
To highlight the superiority of the proposed ellipse parameters representation method over directly predicting the angle without any constraints, we modify the proposed AEDet by replacing only its object representation (x, y, |u|, |v|, m, α) with the object representation (x, y, w, h, θ). For convenience, we name the new detector with the object representation (x, y, w, h, θ) RODet. We conduct experiments on the HRSC2016 dataset, and the results are shown in Table VI. There, SmoothL1 as the loss function of RODet means that the object representation vector (x, y, w, h, θ) is taken as a whole as the input of the SmoothL1 function. GIoU + SmoothL1 means that the vector (x, y, w, h) is taken as a whole as the input of the GIoU function, and the angle (θ) alone is the input of SmoothL1. The KLD loss in RODet is used to test whether there is a problem with our implementation of the object representation (x, y, w, h, θ). We show some visual results of RODet for the different loss functions in Fig. 9.
We can see from Table VI and Fig. 9 that RODet with SmoothL1 and with GIoU + SmoothL1 has poor detection results, whereas RODet with the KLD loss performs well. This indicates that our implementation of RODet with the object representation (x, y, w, h, θ) is correct, and that the unsatisfactory detection performance of RODet with SmoothL1 and GIoU + SmoothL1 is caused by the angle periodicity. The visual results in Fig. 9 show that RODet with SmoothL1 and GIoU + SmoothL1 can accurately predict (x, y, w, h) of the object but cannot accurately predict the angle (θ). This phenomenon indicates that predicting the angle of an object directly without any constraints worsens the detection performance. On the other hand, the proposed AEDet with the ellipse parameters representation method achieves an excellent result of 90.45% without any constraints, which highlights the superiority of the ellipse parameters representation method in this article.
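The effect of angle periodicity on a SmoothL1 angle loss can be illustrated numerically. The sketch below is a toy example under stated assumptions, not the training code of RODet or AEDet: angles are in radians with a period of π, β = 0.2 is taken from the experimental settings, the focal length c = 4 is an arbitrary illustrative value, and the helper names are hypothetical.

```python
import math

def smooth_l1(diff, beta=0.2):
    """Smooth L1 penalty for a single scalar difference."""
    d = abs(diff)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def wrapped_angle_gap(t1, t2, period=math.pi):
    """Smallest angular gap between two orientations with the given period."""
    d = (t1 - t2) % period
    return min(d, period - d)

def focal_targets(c, theta):
    """(|u|, |v|, alpha) for a focal vector of length c at angle theta."""
    u, v = c * math.cos(theta), c * math.sin(theta)
    return abs(u), abs(v), 1 if u * v >= 0.0 else 0

theta_gt = math.radians(89.0)     # ground-truth long-axis angle
theta_pred = math.radians(-89.0)  # prediction only 2 degrees away

naive_loss = smooth_l1(theta_pred - theta_gt)
geometric_loss = smooth_l1(wrapped_angle_gap(theta_pred, theta_gt))
gt_targets = focal_targets(4.0, theta_gt)
pred_targets = focal_targets(4.0, theta_pred)
print(naive_loss, geometric_loss, gt_targets, pred_targets)
```

In this example the two boxes are only 2° apart, yet the naive angle difference is 178° and produces a loss roughly three orders of magnitude larger than the geometric gap warrants. In the ellipse parameters, the regression targets |u| and |v| of the two boxes coincide and only the binary label α differs.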

V. CONCLUSION
In this article, to solve the problem of abnormal bounding box regression loss for objects with multiple numerical representations caused by angle periodicity, we propose an ellipse parameters representation method to describe arbitrary-oriented objects, which has only one numerical representation for an object and avoids predicting angles directly. We then propose AEDet by applying the proposed ellipse parameters representation method to YOLOX with medium parameters. Based on the proposed ellipse parameters representation method, we propose a corresponding positive sample selection method and a way of bounding box regression. We conduct all experiments on three arbitrary-oriented ship detection datasets: 1) HRSC2016, 2) RSDD-SAR, and 3) RHRSID. The proposed AEDet achieves 90.45% mAP on HRSC2016, 90.10% mAP on RSDD-SAR, and 88.2% mAP on RHRSID. We achieve better detection results than related methods, which shows the effectiveness and superiority of AEDet and our ellipse parameters representation method.