Dual-Resolution and Deformable Multihead Network for Oriented Object Detection in Remote Sensing Images

Compared with general object detection, the scale variations, arbitrary orientations, and complex backgrounds of objects in remote sensing images make it more challenging to detect oriented objects. It is especially difficult to accurately detect the boundaries of oriented objects with large aspect ratios. Many methods show excellent performance on oriented object detection, most of which are anchor-based algorithms. To mitigate the performance gap between anchor-free algorithms and anchor-based algorithms, this article proposes an anchor-free algorithm called dual-resolution and deformable multihead network (DDMNet) for oriented object detection. Specifically, a dual-resolution network with bilateral fusion is adopted to extract high-resolution feature maps that contain both spatial details and multiscale contextual information. Then, deformable convolution is incorporated into the network to alleviate the misalignment problem of oriented object detection, and a dilated feature fusion module is applied to the deformable feature maps to expand their receptive fields. Finally, box boundary-aware vectors instead of the angle are leveraged to represent the oriented bounding box, and a multihead network is designed to obtain robust predictions. DDMNet is a single-stage oriented object detection method without anchors and exhibits promising performance on challenging public benchmarks, obtaining 90.49%, 93.25%, and 78.66% mean average precision on the HRSC2016, FGSD2021, and DOTA datasets, respectively. In particular, DDMNet achieves 79.86% at mAP75 and 53.85% at mAP85 on the HRSC2016 dataset, outperforming the current state-of-the-art methods.


I. INTRODUCTION
The purpose of object detection in remote sensing images is to obtain information about objects' locations and categories. Thanks to the rapid development of aerospace sensors, it has become possible to detect and recognize typical objects using high-resolution remote sensing images [1]. Object detection in remote sensing images is a fundamental technique for geospatial intelligence acquisition [2], [3], environmental monitoring [3], urban planning [2], precision agriculture [2], geological survey [4], natural hazards [4], and humanitarian aid [5]. However, due to scale variations, arbitrary orientations, complex backgrounds, and the long-tail distribution of categories, it is still a challenging task to detect objects in remote sensing images accurately and quickly [2], [6]. For a long time, the horizontal bounding box (HBB) has been used to represent the boundary of an object when detecting objects in remote sensing images [7], [8], [9]. However, aerial images are taken from a bird's-eye view, and the HBB has limited adaptability to the diversity of objects' orientations in such images. In particular, for objects with large aspect ratios (such as bridges and ships, as shown in Fig. 1), the HBB cannot effectively describe the accurate boundary, and the oriented bounding box (OBB) is the better choice. Benefiting from the great progress of deep convolutional neural networks (CNNs), various architectures have been proposed that achieve remarkable performance in image classification [10], [11], semantic segmentation [12], and object detection [13]. The excellent capability of CNNs for feature representation makes it possible to tackle the more challenging task of oriented object detection (OOD) using OBBs [1], [14]. Currently, CNN-based methods for OOD can mainly be categorized into two-stage algorithms [15], [16] and single-stage algorithms [6], [17], [18], [19]. In addition, single-stage algorithms can be divided into anchor-based algorithms [17] and anchor-free algorithms [19] according to whether anchors are used or not.
The most famous two-stage algorithm for object detection is Faster RCNN [13], which adopts a region proposal network (RPN) to generate regions of interest (RoIs) and then refines these regions. A large number of published studies (e.g., RRPN [20], R2CNN [21], RR-CNN [22], and RoI Transformer [15]) modify Faster RCNN's architecture for OOD and demonstrate outstanding performance. Since the slow detection speed of Faster RCNN makes it difficult to achieve real-time object detection, it has become more popular to adapt single-stage anchor-based algorithms (e.g., RetinaNet [23], the YOLO series [24], [25], [26]) for OOD, and the accuracy of single-stage algorithms (e.g., DCL [27], S 2 A-Net [28]) is already close to that of two-stage algorithms. However, both Faster RCNN and RetinaNet need anchors for region proposals. Different datasets or tasks require different anchors, which introduces more hyperparameters. A large number of anchor boxes that do not contain objects exist as negative samples during training, while only a few anchor boxes that contain objects are positive samples. This causes a serious class imbalance problem and extra computation during training. Moreover, abundant anchor boxes also reduce the efficiency of prediction. To address the above problems caused by anchors, anchor-free algorithms (e.g., FCOS [29], CenterNet [30]) have emerged. At present, studies using anchor-free algorithms [31], [32] for OOD have been proposed, but their performance is still far from satisfactory compared with anchor-based methods, especially the two-stage algorithms.
To mitigate the performance gap between anchor-free algorithms and anchor-based algorithms for OOD in remote sensing images, the dual-resolution and deformable multihead network (DDMNet) is proposed. To exploit rich multiscale contextual information and spatial details, a dual-resolution network with the bilateral fusion method [33] is adopted to extract high-resolution feature maps. These feature maps not only contain contextual information at different scales but also maintain continuity in spatial details between adjacent levels of feature maps, which facilitates the detection of multiscale objects. Moreover, we incorporate deformable convolution [34], [35] into the network and design a dilated feature fusion module, which is used to adjust the feature maps' receptive fields adaptively according to the objects' orientations. Hence, the misalignment problem of feature maps at different scales is alleviated and multiscale deformable information is effectively aggregated. Following Yi et al. [31], box boundary-aware vectors (BBAVectors), which are defined by the midpoints of the OBB's sides, are leveraged to represent oriented objects. BBAVectors can represent any OBB without using angle information, which alleviates the loss discontinuity problem that arises when directly predicting the angles of OBBs. To achieve robust and high-quality prediction, we develop a multihead network based on both the fused feature maps and each level of feature maps, so that different levels of feature maps can be fully utilized. In summary, the contributions of this article are as follows.
1) A single-stage anchor-free framework, DDMNet, based on different levels of high-resolution feature maps for oriented object detection, is proposed. These high-resolution feature maps can take advantage of continuous spatial details and multiscale contextual information.

2) In DDMNet, the deformable convolution is incorporated to align different levels of feature maps, which can adjust the receptive fields flexibly according to objects' orientations. Besides, a dilated feature fusion module is designed to make full use of the deformable feature maps and further expand the receptive fields of the deformable convolution.
3) The experimental results on the HRSC2016 [14] and FGSD2021 [36] datasets indicate that the proposed DDMNet can obtain higher-quality ship detections than existing detectors. Besides, the results on the DOTA dataset [1] show that our method not only achieves promising performance compared with other anchor-free algorithms but is also superior to many two-stage and single-stage anchor-based algorithms.
II. RELATED WORKS
CNN has achieved great success in various computer vision tasks because of its outstanding performance in feature extraction and representation. In recent years, a series of OOD methods using CNN have also been proposed and achieved excellent performance. In this section, we briefly review some state-of-the-art methods for OOD, including two-stage methods, single-stage anchor-based methods, and single-stage anchor-free methods.

A. Two-Stage Methods for OOD
Generally, most two-stage methods for OOD are developed on Faster RCNN. Faster RCNN was originally designed to detect objects using HBBs instead of OBBs, so it is necessary to modify its output. A popular practice is to introduce angle prediction coupled with the center point, width, and height of the OBB. Ma et al. [20] and Liu et al. [22] designed rotated region of interest (RRoI) pooling layers that generate rotated proposals with different angles in Faster RCNN. RoI Transformer [15] adopted an RRoI learner that transforms horizontal RoIs into RRoIs. Yang et al. [37] developed a sampling fusion network and a multidimensional attention network for feature fusion and noise suppression. Qin et al. [38] used an arbitrary-oriented region proposal network to generate rotated proposals of different scales, which were then fed into a multihead network for refinement. CAD-Net [39] leveraged a spatial-and-scale-aware attention module to guide the network to focus on regions with rich information. ReDet [40] proposed a rotation-invariant RoI alignment method that achieves rotation invariance in the spatial dimension. In addition to introducing the angle for OOD, Wang et al. [41] defined a center-probability map that helps eliminate the ambiguities of background pixels. Song et al. [42] introduced point-guided keypoint estimation, and Fu et al. [43] designed a point-based representation that avoids discontinuous regression. Moreover, Xu et al. [16] and Xie et al. [44] employed midpoint offsets to represent oriented objects.
These two-stage algorithms show excellent performance for OOD. However, most of them cannot get rid of the RPN and anchors. As a result, their inference speed is slow and it is difficult for them to achieve real-time detection.

B. Single-Stage Anchor-Based Methods for OOD
The most significant advantage of single-stage algorithms for object detection is their inference speed, so single-stage anchor-based methods for OOD have become more popular. Taking SCRDet [37] one step further, SCRDet++ [45] developed instance-level feature denoising and rotation loss smoothing on RetinaNet to address the periodicity of the angle and the exchangeability of edges. R 3 Det [46] designed a feature refinement module and a SkewIoU loss to improve detection performance. Qian et al. [47] proposed a modulated loss function to solve the problem of inconsistent parameter regression. The circular smooth label (CSL) [48] recast angle prediction from a regression problem into a classification problem. To obtain better anchor boxes, CFC-Net [3] and DAL [49] adopted dynamic anchors to alleviate the divergence between classification and regression on the same feature maps. RIDet [50] proposed a representation-invariance loss to optimize bounding box regression for oriented objects. Han et al. [28] designed a feature alignment module to refine the anchors. Zhu et al. [18] and Chen et al. [6] introduced feature alignment modules into RetinaNet to alleviate feature misalignment problems. To accelerate inference and reduce computational cost, Huang et al. [51] devoted efforts to lightweight OOD and proposed LO-Det, which achieves real-time detection. According to current studies, single-stage algorithms with anchors not only have fast inference speed but have also achieved promising performance that is close to two-stage algorithms.

C. Single-Stage Anchor-Free Methods for OOD
To get rid of anchors, anchor-free algorithms (e.g., FCOS [29], CenterNet [30]) provide new ideas for OOD. Compared with anchor-based algorithms, anchor-free algorithms avoid manual parameter settings and reduce computational cost. Moreover, anchor-free algorithms are easier to deploy for inference and friendlier to industrial applications. IENet [52] employed FCOS to regress an HBB first and then predicted the offset between the OBB and the HBB, but its performance is significantly inferior to two-stage algorithms like RoI Transformer. DARDet [53] introduced an alignment convolution module to refine the OBB. O 2 -DNet [32] adopted the center point and a pair of middle lines to represent the OBB. Similar to O 2 -DNet, Xiao et al. [54] proposed to learn the axis, which is the line connecting the head and tail of the OBB. To address the problem of angle periodicity, CHPDet [36] leveraged the OBB with a head point to detect ships in remote sensing images. PolarDet [55] proposed polar coordinates to represent arbitrary-oriented objects. CGP Box [56] adopted guide points to locate the target, and Guan et al. [57] introduced a regression strategy using the dual angle and short side of an OBB based on CenterNet. Huang et al. [19] developed a novel object-adaptation label assignment strategy that is flexible to fit the object's size and direction. EFN [58] employed an ellipse field network that integrates semantic segmentation and object detection. Yi et al. [31] defined the BBAVectors to represent the OBB; BBAVectors can predict both HBBs and OBBs by introducing a parameter that classifies the bounding box type, which alleviates the problem of angle periodicity. Single-stage anchor-free algorithms have made tremendous progress for OOD in recent years and demonstrate promising results, but there is still a big performance gap compared with some two-stage algorithms. From the above considerations, we follow the idea of BBAVectors and develop the DDMNet to mitigate this performance gap. Compared with BBAVectors and other anchor-free algorithms, DDMNet achieves much better performance and faster inference speed.

III. PROPOSED METHOD
In this section, we present our proposed DDMNet in detail. First, we give an overview of DDMNet in Section III-A. Then, we present the critical modules in Sections III-B, III-C, and III-D.

A. Overview
The network architecture of DDMNet proposed in this article is shown in Fig. 2. First, DDRNet [33], which adopts bilateral fusion on dual-resolution feature maps, is used as the backbone for feature extraction. DDRNet can obtain a series of high-resolution feature maps whose size is 1/8 of the input image. Three levels of feature maps are extracted from three stages of the backbone, denoted as F 1 , F 2 , and F 3 . Then, the feature maps are aligned by deformable convolutional layers [35] to accommodate the shapes of objects with diverse orientations. To make full use of the feature maps, a dilated feature fusion module is developed to aggregate the aligned feature maps (F 1D , F 2D , and F 3D ) and exploit multiscale contextual information. Finally, four detection heads are designed to predict the representation of the OBB on the three levels of aligned feature maps and the fused feature maps. The deformable convolutional layer is also introduced into each detection head, which further boosts the network's capability to capture the spatial and contextual information of oriented objects. The output of each detection head consists of a heatmap P , an offset O , box parameters B , and an orientation A (details are presented in Section III-C). At the training stage, the losses of the four heads are aggregated to optimize the network's parameters. In the prediction phase, rotated nonmaximum suppression is used to obtain the best results from the four heads.
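To make the head outputs concrete, the following is a minimal PyTorch sketch of one detection head. The DetectionHead name, the branch structure (a 3 × 3 convolution followed by a 1 × 1 convolution), and the intermediate width are our illustrative assumptions rather than the authors' exact design; only the four output maps and their channel counts follow the text.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of one DDMNet detection head: maps a feature map to the four
    outputs described above (heatmap P, offset O, box parameters B,
    orientation A). The branch layout is assumed, not taken from the paper."""
    def __init__(self, in_channels: int, num_classes: int, mid_channels: int = 256):
        super().__init__()
        def branch(out_channels: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_channels, mid_channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(mid_channels, out_channels, 1),
            )
        self.heatmap = branch(num_classes)  # P: per-class center probabilities
        self.offset = branch(2)             # O: sub-pixel center deviation
        self.box = branch(10)               # B: vectors t, r, b, l plus w, h
        self.orientation = branch(1)        # A: OBB-vs-HBB classification

    def forward(self, x: torch.Tensor) -> dict:
        return {
            "P": torch.sigmoid(self.heatmap(x)),
            "O": self.offset(x),
            "B": self.box(x),
            "A": torch.sigmoid(self.orientation(x)),
        }
```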

B. Extraction of High-Resolution Feature Maps
In the task of OOD, a challenging problem is that scale variations make it difficult to effectively detect objects of different sizes. To gather multiscale information, a feature pyramid network or various feature fusion methods are usually employed. According to the different resolutions of the output feature maps, a feature extraction network such as ResNet [9] can be divided into five stages, yielding five different levels of feature maps (corresponding to 1/2, 1/4, 1/8, 1/16, and 1/32 resolutions of the input image). Note that there is a significant scale gap between different levels of feature maps. When different levels of feature maps are fused, a misalignment problem arises because adjacent levels of feature maps are not continuous in scale. To obtain feature maps that contain multiscale information while maintaining the continuity of scales, DDRNet [33] is used as the backbone network for feature extraction.
DDRNet is a real-time, high-performance semantic segmentation network that fuses feature maps of different resolutions. Fig. 2 shows the main structure of DDRNet, which can be divided into six stages: Stage-1, Stage-2, Stage-3, Stage-4, Stage-5_1, and Stage-5_2. Each stage consists of several residual blocks (the residual block is the same as that of ResNet). The sizes of the feature maps from Stage-2 and Stage-3 are 1/4 and 1/8 of the input size, respectively. The remaining stages consist of a high-resolution detail branch and a low-resolution semantic branch. Feature maps from the two branches are fused with the bilateral fusion (BF) method and the deep aggregation pyramid pooling module (DAPPM). As shown in Fig. 3, BF includes a high-to-low module and a low-to-high module. The high-to-low module integrates the high-resolution feature maps into the low-resolution feature maps: it first leverages convolutional layers to downsample the high-resolution feature maps and then adds them to the low-resolution feature maps. The low-to-high module integrates the low-resolution feature maps into the high-resolution feature maps: it first upsamples the low-resolution feature maps and then adds them to the high-resolution feature maps. As shown in Fig. 4, DAPPM adopts a series of convolutional layers with different kernels, and the multiscale feature maps extracted by the different convolutional layers are aggregated in a hierarchical residual-like style. DAPPM is located at the end of the network and works on the high-level feature maps (whose size is 1/64 of the input image), which provides rich context and hardly affects the inference speed. We obtain three levels of high-resolution feature maps from Stage-4, Stage-5_1, and Stage-5_2 of DDRNet39, denoted as F 1 , F 2 , and F 3 . The channel numbers of feature maps F 1 , F 2 , and F 3 are 128, 128, and 256, respectively.
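As a rough illustration of one BF step, the sketch below fuses a detail-branch map with a semantic-branch map that is assumed to be one octave coarser. The channel settings, the strided 3 × 3 downsampling path, and the bilinear upsampling are our assumptions in the spirit of DDRNet [33], not its exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilateralFusion(nn.Module):
    """Sketch of one bilateral fusion step between the 1/8-resolution detail
    branch (`high`) and a 2x-coarser semantic branch (`low`)."""
    def __init__(self, high_channels: int, low_channels: int):
        super().__init__()
        # high-to-low: strided conv downsamples the detail branch before adding
        self.down = nn.Sequential(
            nn.Conv2d(high_channels, low_channels, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(low_channels),
        )
        # low-to-high: 1x1 conv matches channels before upsampling and adding
        self.compress = nn.Sequential(
            nn.Conv2d(low_channels, high_channels, 1, bias=False),
            nn.BatchNorm2d(high_channels),
        )

    def forward(self, high: torch.Tensor, low: torch.Tensor):
        fused_high = high + F.interpolate(
            self.compress(low), size=high.shape[-2:], mode="bilinear", align_corners=False)
        fused_low = low + self.down(high)
        return F.relu(fused_high), F.relu(fused_low)
```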

C. Deformable Feature Fusion and Multihead for Prediction
1) Deformable Convolution: Another challenge of OOD is feature misalignment, since the OBB of an object with a large aspect ratio is more sensitive to its orientation. Moreover, the feature maps used for classification may not be suitable for the regression of objects' sizes. To improve the accuracy of the OBB, it is crucial to obtain feature maps that can adjust their receptive fields adaptively according to the objects' orientations and shapes. Kernels of the standard convolutional layer have fixed shapes (e.g., 3 × 3, 5 × 5, 7 × 7, etc.), and their regular grid sampling can hardly deal with the complex deformations of objects in images. When the orientations and shapes of objects change, the standard convolution cannot adaptively adjust its receptive fields to cover the objects, which significantly affects the performance of OOD. Therefore, deformable convolution is incorporated to improve the robustness of the feature maps to oriented objects' various orientations.
A convolution operation is the sum of element-wise multiplications between the convolution kernel and the feature map. Take a regular 3 × 3 convolution kernel $w$ as an example: it contains nine sampling locations, and the sampling region can be represented by $R$. The output $y$ of the convolution on the input $x$ is calculated by

$$y(p_0) = \sum_{p_s \in R} w(p_s) \cdot x(p_0 + p_s)$$

where $R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$ and $p_s$ enumerates the elements in $R$. The core idea of deformable convolution is to learn an offset for each sampling point, which allows the sampling point to gather information from other locations. The deformable convolution [34] can be formulated as follows:

$$y(p_0) = \sum_{p_s \in R} w(p_s) \cdot x(p_0 + p_s + \Delta p_s)$$

where $\Delta p_s$ is the offset at location $p_s$. Deformable convolution changes the positions of the sampling points by learning offsets so that their receptive fields cover the objects more flexibly. To better highlight helpful sampling points and suppress invalid ones, a modulation mechanism is introduced that learns a different weight for each sampling point. The modulated deformable convolution [35] can then be expressed as follows:

$$y(p_0) = \sum_{p_s \in R} w(p_s) \cdot x(p_0 + p_s + \Delta p_s) \cdot m_s$$

where $m_s$ denotes the modulation weight at location $p_s$ and lies in the range [0, 1]. $m_s$ is also trainable and automatically adjusts the influence of each sampling point. We conduct the deformable convolution on the three levels of feature maps to obtain deformable features:

$$F_{iD} = \mathrm{DBA}_{3\times3}(F_i), \quad i \in \{1, 2, 3\}$$

where $\mathrm{DBA}_{3\times3}(\cdot)$ represents the deformable convolutional block, which includes a 3 × 3 deformable convolutional layer, a batch normalization layer, and a ReLU layer. The deformable convolution does not change the channel numbers of F 1 , F 2 , and F 3 .
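For reference, the following is a hedged PyTorch sketch of the DBA block built on torchvision's DeformConv2d. Predicting the 18 offset channels and 9 modulation channels with a plain 3 × 3 convolution is the common recipe for modulated deformable convolution [35]; the authors' exact configuration is not specified here.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DBA(nn.Module):
    """Sketch of DBA_3x3: a 3x3 modulated deformable convolution followed by
    batch normalization and ReLU, keeping the channel count unchanged."""
    def __init__(self, channels: int):
        super().__init__()
        # 3x3 kernel -> 9 sampling points: 18 offset channels (x and y per
        # point) plus 9 modulation channels m_s.
        self.offset_mask = nn.Conv2d(channels, 27, kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        om = self.offset_mask(x)
        offset = om[:, :18]               # learned offsets (Delta p_s)
        mask = torch.sigmoid(om[:, 18:])  # m_s constrained to [0, 1]
        return self.relu(self.bn(self.deform(x, offset, mask)))

# F_iD = DBA(channels)(F_i) leaves the channel numbers of F_1, F_2, F_3 unchanged.
```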
2) Dilated Feature Fusion Module: It is observed in Fig. 2 that the semantic scale of the feature maps from the detail branch is always 8×, while the semantic scales of the feature maps from the semantic branch are 8×, 16×, and 32×, respectively. We want more diverse semantic scales to deal with objects of various sizes, and feature fusion can provide them. In addition, a deformable convolution has only nine sampling points, so the receptive field of a single convolution cannot effectively cover an object; it is necessary to establish relationships between the receptive fields of multiple deformable convolutions on a larger scale. To effectively expand the receptive fields of the deformable feature maps and make full use of the different levels of feature maps, inspired by the dilated convolution used in semantic segmentation tasks [59], we develop the dilated feature fusion module (DFFM) for the deformable feature maps $F_{1D}$, $F_{2D}$, and $F_{3D}$. The architecture of DFFM is shown in Fig. 5. To be specific, the three levels of feature maps are fused by concatenation, and the concatenated feature maps are fed into four branches, each of which consists of a dilated convolutional layer with a different dilation rate. The outputs of the four branches are then aggregated again. The different dilation rates enlarge the receptive fields of the deformable feature maps and aggregate multiscale contextual information. Mathematically, the above process can be expressed as follows:

$$F_C = \mathrm{Cat}(F_{1D}, F_{2D}, F_{3D}), \quad F_{4D} = \mathrm{Cat}\big(\mathrm{DC}_{k_1}(F_C), \mathrm{DC}_{k_2}(F_C), \mathrm{DC}_{k_3}(F_C), \mathrm{DC}_{k_4}(F_C)\big)$$

where $\mathrm{Cat}(\cdot)$ represents the channel-wise concatenation operation and $\mathrm{DC}_k(\cdot)$ represents the dilated convolutional block, which includes a 3 × 3 dilated convolutional layer with dilation rate $k$, a batch normalization layer, and a ReLU layer. The number of convolution kernels in each dilated convolutional layer is 64, so the channel number of the fused feature maps $F_{4D}$ is 256.
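A compact sketch of DFFM under these definitions is given below. The four dilation rates are not stated in the text above, so (1, 2, 4, 8) is our assumption, while the 64 kernels per branch and the 512 → 256 channel arithmetic follow the description.

```python
import torch
import torch.nn as nn

class DFFM(nn.Module):
    """Sketch of the dilated feature fusion module: concatenate the three
    aligned maps, run four parallel dilated 3x3 branches (64 kernels each),
    and concatenate the branch outputs into F_4D (4 x 64 = 256 channels)."""
    def __init__(self, in_channels: int = 512, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, 64, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])

    def forward(self, f1d, f2d, f3d):
        fc = torch.cat([f1d, f2d, f3d], dim=1)                    # Cat(F_1D, F_2D, F_3D)
        return torch.cat([b(fc) for b in self.branches], dim=1)  # F_4D

# With F_1D, F_2D (128 channels each) and F_3D (256 channels), in_channels = 512.
```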

3) Multihead Detection:
We adopt the BBAVectors to represent the OBB of an oriented object. As shown in Fig. 6, the BBAVectors consist of $t$, $r$, $b$, and $l$, which are defined according to the midpoints of the OBB's sides. A Cartesian coordinate system is established with the center point $c = (c_x, c_y)$ as the origin, and the four corners of the OBB can then be represented as follows:

$$tl = t + l,\quad tr = t + r,\quad bl = b + l,\quad br = b + r$$

where $tl$, $tr$, $bl$, and $br$ denote the top-left, top-right, bottom-left, and bottom-right corner points of the object's true bounding box. And as observed in Fig. 6(c) and (d), when the bounding box of the object is approximately an HBB, $tl$, $tr$, $bl$, and $br$ can be represented in a simpler way as follows:

$$tl = \left(-\tfrac{w}{2}, -\tfrac{h}{2}\right),\quad tr = \left(\tfrac{w}{2}, -\tfrac{h}{2}\right),\quad bl = \left(-\tfrac{w}{2}, \tfrac{h}{2}\right),\quad br = \left(\tfrac{w}{2}, \tfrac{h}{2}\right)$$

where $w$ and $h$ are the width and height of the HBB, respectively.
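The corner computation is simple enough to spell out. The NumPy sketch below assumes image coordinates with y increasing downward; the function name is ours.

```python
import numpy as np

def obb_corners(center, t, r, b, l):
    """Corners of an OBB from its center and the BBAVectors, where t, r, b, l
    are 2-D offsets from the center to the side midpoints (image coordinates,
    y pointing down)."""
    c = np.asarray(center, dtype=np.float64)
    t, r, b, l = (np.asarray(v, dtype=np.float64) for v in (t, r, b, l))
    tl = c + t + l  # top-left
    tr = c + t + r  # top-right
    bl = c + b + l  # bottom-left
    br = c + b + r  # bottom-right
    return tl, tr, bl, br

# Axis-aligned check: center (10, 10) with t=(0,-2), r=(3,0), b=(0,2), l=(-3,0)
# gives corners (7, 8), (13, 8), (7, 12), (13, 12), i.e., a 6 x 4 HBB.
```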
Each detection head predicts a heatmap $P \in \mathbb{R}^{\frac{W}{s} \times \frac{H}{s} \times K}$, where $W$ and $H$ are the width and height of the input image, $K$ is the number of object categories, and $s = 8$ is the output stride. The pixel value of the heatmap represents the predicted category and probability at the corresponding position in the image. Since the heatmap's size is 1/8 of the input image's size, the positions of center points in the heatmap are all integers, and it is not accurate to calculate the OBB's center point using the heatmap alone. Therefore, the offset $O$ is introduced, which denotes the deviation between the OBB's center in the input image and the center calculated from the heatmap. The box parameters $B$ have 10 channels, which consist of the 4 vectors ($t$, $r$, $b$, and $l$) and 2 external size parameters ($w$ and $h$).
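To make the decoding concrete, the sketch below recovers center points from one head's heatmap and offset, assuming a batch size of 1. The 3 × 3 max-pool peak extraction is the usual CenterNet-style trick; the paper's final step additionally merges the four heads with rotated nonmaximum suppression.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap: torch.Tensor, offset: torch.Tensor,
                   stride: int = 8, k: int = 100):
    """heatmap: (1, K, H/s, W/s) after sigmoid; offset: (1, 2, H/s, W/s).
    Returns top-k classes, scores, and center coordinates in the input image."""
    # keep only local maxima of the heatmap (CenterNet-style pseudo-NMS)
    keep = (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1)).float()
    scores, idx = torch.topk((keep * heatmap).flatten(), k)
    num_classes, h, w = heatmap.shape[1:]
    cls = torch.div(idx, h * w, rounding_mode="floor")
    rem = idx % (h * w)
    ys = torch.div(rem, w, rounding_mode="floor")
    xs = rem % w
    # add the predicted sub-pixel deviation, then map back to image coordinates
    cx = (xs.float() + offset[0, 0, ys, xs]) * stride
    cy = (ys.float() + offset[0, 1, ys, xs]) * stride
    return cls, scores, cx, cy
```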

D. Loss Functions
We adopt the same loss functions as CenterNet and BBAVectors, which consist of heatmap loss, box bounding loss, offset loss, and orientation loss.
The heatmap loss $L_P$ is a variant focal loss defined as follows:

$$L_P = -\frac{1}{N}\sum_{x,y,c}\begin{cases}(1-\hat{p}_{xyc})^{\gamma}\log(\hat{p}_{xyc}), & p_{xyc}=1\\(1-p_{xyc})^{\beta}\,\hat{p}_{xyc}^{\gamma}\log(1-\hat{p}_{xyc}), & \text{otherwise}\end{cases}$$

where $\gamma$ and $\beta$ are hyperparameters of the focal loss [23]; we set $\gamma = 2$ and $\beta = 4$ following CenterNet. $N$ is the number of objects, the ground truth $p_{xyc} = 1$ corresponds to an object's center, $p_{xyc} = 0$ is the background, and $\hat{p}_{xyc}$ is the prediction. Since the number of objects in an image is too few to train the model, the points near the objects' centers are also regarded as positive samples in the heatmap loss by splatting each center with a two-dimensional Gaussian function:

$$p_{xyc} = \exp\left(-\frac{(p_x - \tilde{c}_x)^2 + (p_y - \tilde{c}_y)^2}{2\sigma^2}\right)$$

where $(p_x, p_y)$ is the position of a point near the object's center in the heatmap, $(\tilde{c}_x, \tilde{c}_y) = (\lfloor c_x/s \rfloor, \lfloor c_y/s \rfloor)$ is the downsampled center, $\sigma$ is an object size-adaptive standard deviation, and $s$ is the output stride. We calculate the Gaussian radius and $\sigma$ following CenterNet. To reduce the difficulty of training, we double the Gaussian radius so that more pixels around the object's center point play a role in training.

The box bounding loss $L_B$ adopts the smooth L1 loss function [13] and can be formulated as follows:

$$L_B = \frac{1}{N}\sum_{d \in B}\mathrm{Smooth}_{L1}\big(\hat{d} - d\big)$$

where the ground truth $d = \{r, t, l, b, w, h\}$, $d \in B$, and $\hat{d} \in \hat{B}$ is the prediction. The offset loss $L_O$ is defined as follows:

$$L_O = \frac{1}{N}\sum \mathrm{Smooth}_{L1}\big(\hat{o} - o\big)$$

where the ground truth $o = \big(c_x/s - \lfloor c_x/s \rfloor,\; c_y/s - \lfloor c_y/s \rfloor\big)$ and $\hat{o}$ is the prediction. The orientation loss $L_A$ is a binary cross-entropy loss defined as follows:

$$L_A = -\frac{1}{N}\sum \big[\alpha \log \hat{\alpha} + (1 - \alpha)\log(1 - \hat{\alpha})\big]$$

where the ground truth $\alpha \in A$ and the prediction $\hat{\alpha} \in \hat{A}$. The loss of the $i$th detection head is formulated as follows:

$$L_i = \lambda_1 L_P + \lambda_2 L_B + \lambda_3 L_O + \lambda_4 L_A$$

where $i \in \{1, 2, 3, 4\}$ and $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are 1.0, 0.5, 1.0, and 1.0, respectively. Summing the losses of the four detection heads, the overall loss $L_S$ of the network is formulated as follows:

$$L_S = \sum_{i=1}^{4} L_i$$
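As a reference implementation of the heatmap term, here is a hedged PyTorch sketch of the variant focal loss above; the clamp and the positive-count normalization follow common CenterNet practice.

```python
import torch

def heatmap_loss(pred: torch.Tensor, gt: torch.Tensor,
                 gamma: float = 2.0, beta: float = 4.0, eps: float = 1e-6) -> torch.Tensor:
    """Variant focal loss L_P. pred, gt: (N, K, H, W); gt is the
    Gaussian-splatted ground truth with exact centers equal to 1."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pred = pred.clamp(eps, 1.0 - eps)  # avoid log(0)
    pos_loss = pos * (1.0 - pred) ** gamma * torch.log(pred)
    neg_loss = neg * (1.0 - gt) ** beta * pred ** gamma * torch.log(1.0 - pred)
    num_pos = pos.sum().clamp(min=1.0)  # N in the formula above
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```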

IV. EXPERIMENTS
In this section, we conduct extensive experiments to evaluate the performance of the proposed DDMNet on representative challenging datasets for OOD. We first introduce the three benchmark datasets, the evaluation metrics, and the implementation details. Then, ablation studies are performed to analyze the effectiveness of each key component of DDMNet. Next, we compare the proposed DDMNet with several state-of-the-art methods on the benchmark datasets.
A. Datasets
1) HRSC2016: HRSC2016 [14] is a challenging dataset for ship detection. The images of HRSC2016 were collected from Google Earth with resolutions ranging from 0.4 to 2.0 m. The image sizes range from 300 × 300 pixels to 1500 × 900 pixels. The training, validation, and test sets consist of 436, 181, and 444 images, respectively. We follow the same setting as in [31]: only the training set is used to train the model, and the test set is used for evaluation.
2) FGSD2021: FGSD2021 [36] is a dataset for fine-grained ship detection and recognition. The images of FGSD2021 were also collected from Google Earth and have a fixed ground sample distance of 1 m per pixel. The widths of the images range from 157 to 7789 pixels and the heights range from 224 to 6506 pixels.

B. Evaluation Metrics and Implementation Details
1) Evaluation Metrics:
To evaluate our method on OOD, we adopt average precision (AP), mean average precision (mAP), and frames per second (FPS) as the evaluation metrics. Before calculating AP and mAP, the IoU between the predicted OBB and the ground truth needs to be calculated first:

$$\mathrm{IoU} = \frac{\mathrm{area}(\mathrm{OBB}_{pred} \cap \mathrm{OBB}_{gt})}{\mathrm{area}(\mathrm{OBB}_{pred} \cup \mathrm{OBB}_{gt})}$$

where $\mathrm{OBB}_{pred}$ denotes the predicted OBB, $\mathrm{OBB}_{gt}$ is the ground truth of the oriented object, and $\mathrm{IoU} \in [0, 1]$. The closer the value of IoU is to 1, the more accurate $\mathrm{OBB}_{pred}$ is. In general object detection, an object is considered to be detected correctly when the IoU is greater than 0.5. In detection tasks that require more accurate boundaries, the IoU threshold can be set to 0.75 or 0.85, which is more challenging. Once the IoU threshold is determined, Precision and Recall are computed as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$

where $TP$ represents the number of correctly detected objects, $FP$ represents the number of incorrectly detected objects, and $FN$ represents the number of missed objects. The Precision-Recall (P-R) curve is drawn with Recall as the X-axis and Precision as the Y-axis. AP and mAP are then formulated as follows:

$$\mathrm{AP} = \int_{0}^{1} P(R)\, dR, \quad \mathrm{mAP} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{AP}_k$$

where $K$ is the number of categories. Higher values of AP and mAP indicate that the predicted boundaries are closer to the true boundaries.
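For concreteness, the following NumPy sketch computes AP from a P-R curve using all-point interpolation (the PASCAL VOC2012 style; the VOC2007 metric mentioned in the ablation studies instead averages precision at 11 fixed recall points).

```python
import numpy as np

def voc_ap(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve with all-point interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # make the precision envelope monotonically decreasing
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # integrate precision over the recall steps
    steps = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[steps + 1] - r[steps]) * p[steps + 1]))
```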
2) Implementation Details: We implemented our method in PyTorch, and the training of the model was conducted on an NVIDIA GeForce RTX3090 with 24 GB of memory. To measure the inference speed of the proposed method and make a fair comparison with other methods, an NVIDIA GeForce RTX2080Ti with 11 GB of memory, which is used by most methods, is adopted. In the training stage, the backbone is initialized with pretrained parameters (DDRNet39). The Adam optimizer [60] with a batch size of 8 and a weight decay of 0.0005 is used to update the network's parameters. The initial learning rates are set to 0.0005, 0.0005, and 0.000125 on the HRSC2016, FGSD2021, and DOTA datasets, respectively. We use an exponential decay strategy with a power of 0.98 to update the learning rate. Following Yi et al. [31], data augmentations involving random flipping and cropping are leveraged. The whole training process contains 300, 300, and 150 epochs on the HRSC2016, FGSD2021, and DOTA datasets, respectively.
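The optimizer and schedule described above translate directly into PyTorch; the skeleton below uses the HRSC2016 settings, with a stand-in module in place of the full DDMNet.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the DDMNet module
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=5e-4)
# exponential decay with a power of 0.98, stepped once per epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)

for epoch in range(300):  # 300 epochs on HRSC2016
    # ... one pass over the training set, minimizing the four-head loss L_S ...
    scheduler.step()  # lr <- lr * 0.98
```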

C. Ablation Studies
1) Effectiveness of Critical Modules of DDMNet:
To validate the effectiveness of each key component of DDMNet, including the multihead design, deformable convolution (DCN), and DFFM, we conduct experiments on the HRSC2016 dataset with an input size of 800 × 800 pixels; the AP of different methods at different IoU thresholds is shown in Tables I and II. The Baseline method adopts a single head for prediction, which only employs the feature maps F 3 from DDRNet39's Stage-5_2. In Table I, we can see that the Baseline achieves an AP 50 of 90.15%, outperforming the original method proposed by Yi et al. [31], which also leveraged the BBAVectors for OOD. Furthermore, the architecture designed by Yi et al. [31] has a higher computational cost due to its use of ResNet101, while our method adopts the more lightweight DDRNet39; this verifies the superiority of dual-resolution bilateral fusion. When we change the IoU threshold, we observe that the average precision of the Baseline method drops sharply and its AP 75 is only 67.18%. This means that the predicted OBBs cannot satisfy the demand for high-quality object detection. Compared with the Baseline, which only employs a single head for prediction, the multihead method employing three heads (based on F 1 , F 2 , and F 3 ) improves the AP 75 by about 9.60%. This shows that although the feature maps from different stages have the same spatial resolution, they still have a contextual gap, and it is crucial to leverage multiple levels of feature maps for high-quality OOD. The multihead method can exploit multiscale spatial and contextual information to alleviate the contextual gap between different levels of feature maps. Incorporating DFFM into the multihead method can further boost detection accuracy, especially AP 85 and AP 90 . This indicates that DFFM refines the feature maps by using dilated convolutions with different dilated rates, which benefits the aggregation of different feature maps and enlarges the receptive fields at various scales. Compared with DFFM, DCN obtains more desirable improvements at AP 70 and AP 90 . Because deformable convolution has more effective receptive fields that adjust to the object's shape and orientation adaptively, it is able to extract more discriminative and fine-grained feature representations for oriented objects. The combination of DFFM and DCN obtains better performance than either module alone. Based on these results, we incorporate the deformable convolution into each detection head to form DDMNet, which achieves the best performance under both the PASCAL VOC2007 and PASCAL VOC2012 metrics. In summary, by incorporating multilevel high-resolution feature maps, the multihead design, deformable convolution, and DFFM, our method is able to achieve outstanding performance on OOD.
More experiments are conducted on the FGSD2021 dataset and the results are presented in Tables III and IV. The same conclusions can be drawn from the experimental results on the FGSD2021 dataset as on the HRSC2016 dataset, which validates the effectiveness of our method. Fig. 7 shows the visualization of detection results using the Baseline method and DDMNet on the HRSC2016 dataset. From the visualization results, we can observe that sea waves make the background complicated and it is difficult to differentiate some docked vessels from the background. Many OBBs predicted by the Baseline method have lower scores, some ships are not detected correctly, and some predicted OBBs do not accurately cover the ships. In contrast, DDMNet significantly improves both the scores and the boundaries of the OBBs. This indicates the effectiveness of our proposed DDMNet, which obtains more robust and reliable detections.
2) Ablative Experiment of BF and DAPPM: The BF method and the DAPPM are two important modules in DDRNet [33] for semantic segmentation, which can greatly improve the performance of scene parsing. In this part, we analyze the effect of these two modules on our DDMNet and the results are presented in Tables V and VI. From Table V, we can find that without BF and DAPPM, DDMNet only achieves 90.23% mAP 50 and 68.25% mAP 75 . Benefiting from the fusion of spatial details and contextual information, BF greatly improves the performance of object detection: the mAP 50 is improved from 90.23% to 90.42%, and the mAP 75 is improved from 68.25% to 75.30%. DAPPM is located at the bottom of the network and is used to explore deep semantic information, and with DAPPM, DDMNet achieves the best mAP 50 of 90.49%. Similar conclusions can be drawn from Table VI: on the FGSD2021 dataset, DDMNet also achieves the best performance with BF and DAPPM. However, an unexpected phenomenon in Table VI is that the introduction of BF results in a decrease in mAP 50 . This may be because there are many similar categories in the FGSD2021 dataset and distinguishing them requires more effective semantic information. The use of DAPPM significantly alleviates this problem.
3) Performance Difference at Different Input Sizes: In order to further investigate the sensitivity of the proposed method to input sizes and IoU thresholds, we compare the performance of DDMNet with the Baseline method under different input sizes. Four input sizes are adopted on the HRSC2016 dataset for experiments: 416 × 416, 512 × 512, 608 × 608, and 800 × 800 pixels. Generally, the smaller the input size, the faster the inference speed of the algorithm; at the same time, small objects in the image become more difficult to detect. Table VII shows the comparison results under AP 50 and AP 75 when images of different sizes are input. As can be seen from Table VII, the performance of both DDMNet and the Baseline method drops when the input size gets smaller. But even at an input size of 416 × 416 pixels, the AP 50 and AP 75 of DDMNet are 2.18% (5.30% under the PASCAL VOC2012 metric) and 15.54% (17.58% under the PASCAL VOC2012 metric) higher than those of the Baseline method, respectively. Moreover, the performance of DDMNet at the size of 416 × 416 pixels is still superior to that of the Baseline at the size of 800 × 800 pixels. The performance comparison at different IoU thresholds between DDMNet and the Baseline method is shown in Fig. 8. As can be observed in Fig. 8, at different input sizes and IoU thresholds, the proposed DDMNet achieves much better detection performance than the Baseline method, which demonstrates that DDMNet is more robust to image size and has significant advantages in small object detection.

D. Comparison With State-of-the-Art Methods
1) Results on the HRSC2016 Dataset: On the HRSC2016 dataset, DDMNet achieves 79.86% at AP 75 and 53.85% at AP 85 , which outperforms current representative methods such as R 3 Det, RIDet-O, DAL, and SLA. It is worth noting that R 3 Det, SLA, and DAL are anchor-based methods while our DDMNet is an anchor-free method. We illustrate some visualization results of DDMNet on the HRSC2016 dataset in Fig. 9. It can be seen that the proposed method can effectively detect docked and adjacent ships with various categories, sizes, and complex backgrounds.
2) Results on the FGSD2021 Dataset: The FGSD2021 dataset consists of 20 categories, and some ships are very similar in appearance but belong to different categories. Moreover, the quantity of ships of different categories varies significantly, so there is a significant class-imbalance problem during training. The evaluation results compared with other state-of-the-art methods on the FGSD2021 dataset are reported in Table X. First, we follow [36] and resize the 1024 × 1024 pixel image patches to 512 × 512 pixels at both the training and test stages. The Baseline method achieves 84.82% at mAP 50 and 67.03% at mAP 70 , respectively. DDMNet makes a significant improvement (i.e., 87.29% at mAP 50 and 74.75% at mAP 70 ) compared with the Baseline method. In this setting, DDMNet is slightly inferior to CHPDet, which adopts a head point to represent oriented objects. This is largely due to the resizing operation: the size of the feature maps adopted by CHPDet is 1/4 of the input size, while ours is 1/8 of the input size for faster inference speed. CHPDet achieved a detection speed of 15.4 FPS at its best performance of 89.29% mAP 50 , while DDMNet achieved a detection speed of 20.8 FPS at its best performance of 93.25% mAP 50 . When the original sizes such as 800 × 800 pixels and 1024 × 1024 pixels are used for the experiments instead of resizing, we obtain much better performance compared with the Baseline method and other representative methods. Compared with CHPDet, which achieves outstanding performance to the best of our knowledge, DDMNet improves the mAP 50 from 89.29% to 93.25%. At other metrics such as mAP 60 and mAP 70 , DDMNet is also superior to CHPDet.
To further examine the ability of our method to distinguish ship categories in detail, we report the AP of each category in Table XI. As can be seen in Table XI, DDMNet shows the best accuracy in most categories. Fig. 10 shows the detection results of the proposed DDMNet on the FGSD2021 dataset. The proposed method can effectively identify different categories of ships, and the results on the FGSD2021 dataset demonstrate the effectiveness and superiority of our method for fine-grained ship recognition.
3) Results on the DOTA Dataset: To verify the detection performance of the proposed DDMNet on other categories, relevant experiments are also carried out on the DOTA dataset. We compare DDMNet with some representative two-stage algorithms, single-stage anchor-based algorithms, and single-stage anchor-free detection algorithms, as shown in Table XII. It is apparent from this table that DDMNet achieves 78.66% mAP, which not only outperforms other state-of-the-art anchor-free algorithms but is also superior to many two-stage algorithms and single-stage anchor-based algorithms. DDMNet also obtains the best results in some challenging categories, such as GTF, RA, HA, and HC. It is noted that we implement DDMNet on the more lightweight DDRNet39, while most compared algorithms are implemented on ResNet101. Even so, several categories, such as bridge (BR) and roundabout (RA), still have accuracies below 70%. Bridges have large sizes and large aspect ratios; when processing remote sensing images, we need to cut the images into small slices, which damages the integrity of the bridges. We visualize some detection results on the DOTA test set in Fig. 11, and as observed, DDMNet is capable of detecting oriented objects with dense distributions, large aspect ratios, and large-scale variations. In addition, from Fig. 11, we can also see some objects that are not well detected, such as bridges and helicopters. Because of the similarity in texture and shape between roads and bridges, some roads are misidentified as bridges. Due to the large size of a bridge, there can be multiple detection results for one bridge, and a similar phenomenon occurs in harbor detection. When detecting helicopters, the bounding boxes are not tightly aligned with the objects, largely because helicopters are small, fuzzy, and difficult to distinguish from the background.

V. CONCLUSION
In this article, we proposed a high-performance anchor-free method, DDMNet, for oriented object detection. DDMNet mainly consists of three parts: extraction of high-resolution feature maps, feature alignment using deformable convolution and deformable feature fusion, and multihead prediction. The high-resolution feature maps contain both continuous spatial details and multiscale contextual information. After feature alignment and feature fusion, the model's perception of objects' shapes and orientations is enhanced. Moreover, multihead prediction is adopted to make full use of the different levels of feature maps. These components are tightly coupled and jointly trained to achieve high-quality oriented object detection with a fast inference speed. Extensive experiments on three challenging datasets have shown the effectiveness and superiority of DDMNet. We use the BBAVectors, which consist of four vectors, to represent the OBB in DDMNet, and these four vectors are trained and predicted independently. In fact, they obey additional geometric constraints, such as symmetry and perpendicularity, which could be introduced to improve detection performance. In addition, the deformable convolution incorporated into the framework improves the model's perception of objects' shapes but also increases the computational burden. Although we have devoted effort to improving the performance of anchor-free algorithms for oriented object detection, there is still a performance gap between anchor-free algorithms and some two-stage algorithms. Therefore, there is still much room for improvement in our algorithm in terms of both performance and speed. In future work, we will explore more effective representations of oriented objects and more lightweight architectures for oriented object detection.