A Generating-Anchor Network for Small Ship Detection in SAR Images

Synthetic aperture radar (SAR) ship detection, especially for small ships, faces difficulties such as the dense distribution of ships and interference from land and small islands. To address these issues, many deep learning methods, including anchor-based and anchor-free methods, have been successfully migrated from optical scenes to SAR images. However, when the preset scale of the anchors does not match the ships well, the detection precision is seriously reduced. Because they lack an anchor-based refinement process, anchor-free methods may produce missed detections or false alarms in complex scenarios. In this article, a two-stage ship detection network that can generate anchors is proposed. First, our method generates high-quality anchors with the network itself, which helps the network capture small ships. In addition, the generated anchors are concentrated in the regions of ships, which reduces the number of anchors unrelated to ships. Second, a receptive field enhancement module is inserted into the feature pyramid network. It sets different dilation rates of atrous convolution according to the scale of the feature map, which further enriches the semantic information of the elements in the feature map. Therefore, the network can effectively use the information of a wider region to detect ships. Finally, to verify the effectiveness of our method, extensive experiments are carried out on the SAR ship detection dataset and the high-resolution SAR images dataset. The results show that our method has a stronger ability to detect small ships and achieves better detection performance than some state-of-the-art methods.


I. INTRODUCTION
Owing to the propagation characteristics of electromagnetic waves, weather conditions have less impact on SAR than on optical remote sensing sensors, and SAR can work all day. As a valuable research topic in the field of SAR image processing, ship detection plays an important role in sea surface monitoring and fishery management [1]. Unfortunately, detecting ships in complex environments (such as areas near land and small islands) is still not a completely resolved task for researchers. What is more, the detection of small ships is also a great challenge [2].
The backscatter signal of ships is typically stronger than that of the sea surface, so the ship area is brighter than the surrounding background in SAR images [3]. The constant false alarm rate (CFAR) is generally introduced into detection [3], [4]. For detecting ships, a bilateral CFAR algorithm combined the intensity (i.e., brightness) distribution and the spatial information of the SAR image [5]. The two-parameter CFAR detector with a polarimetric whitening filter was derived under clutter distributions including Wishart, K-Wishart, F-Wishart, etc. [3]. However, a complex background can affect the performance of the CFAR detector [5], [6], e.g., radio frequency interference [7] may cause a mismatch of the clutter model, and the CFAR detector may suffer performance degradation in multitarget scenarios [8].
The texture and contour of SAR images have been extracted from the gray values as another type of feature for ship detection [9]. Gao [10] investigated the effectiveness of multiple features (such as spatial boundary features and the fractal dimension feature), and extracted signal-to-noise-ratio (SNR) features for SAR target detection. Wang et al. [11] obtained the complete structures of the bright area via superpixel segmentation and a Bayesian framework, and then used morphological features to distinguish targets from clutter. The Fisher vector (FV) represents more characteristics of a superpixel than its intensity values, and it contains the zero-order, first-order, and second-order features for ship detection [8]. Subsequently, a classifier completes the detection of ships in the feature space [12]. Widely used classifiers, such as the support vector machine (SVM) and adaptive boosting, achieve accurate detection performance in suitable scenarios. Nevertheless, due to the influence of speckle noise and small islands, false alarms often occur in these traditional methods based on image processing. Whenever an unknown scattering of ships appears or the characteristics of the interference change abruptly, it takes a relatively long time for scholars to design new features manually.
The vigorous development of deep learning (DL) technology has promoted computer vision (CV) to a new stage. The powerful learning ability of neural networks eliminates the need for scholars to design features manually. Influenced by its significant performance in the CV field, the convolutional neural network (CNN) has been introduced to detect targets in remote sensing images [13]. The anchor-based CNNs develop along two paths: 1) the single-stage methods with high operating speed and 2) the two-stage methods with high precision. The representative single-stage algorithms are You Only Look Once (YOLO) [14], RetinaNet [15], and Single Shot Detection (SSD) [16]. To make the YOLOv4 network [17] more suitable for ship detection in SAR images, Gao et al. [18] introduced a scale-equalizing pyramid convolution module and a convolutional block attention module into the network, and modified the head of YOLOv4. Represented by faster R-CNN [19], the two-stage method with the region proposal network (RPN) trades speed for an increase in detection precision. To alleviate the multiscale problem of targets in ship detection, Deng et al. [2] proposed a ship detection network that incorporates a multiscale filter design into the two-stage network structure, with a compact redesigned backbone network to improve the training efficiency. The two-stage network can be regarded as a first stage of rough classification followed by a second stage of fine classification, which can further improve the accuracy of ship detection. Hou et al. [20] used the SSD network as the first stage, then constructed RefineDNet to improve the confidence of potential objects from the first stage. Zhang et al. [21] proposed an SAR ship detector based on faster R-CNN incorporating four balanced strategies and verified the effectiveness of the detector for solving scene imbalance and sample imbalance on multiple public datasets.
Besides the above anchor-based CNNs, anchor-free CNNs such as CenterNet [22] and FCOS [23] have also been introduced to ship detection. CenterNet has been used as a low-computation ship detector in [9]. The detector added the spatial shuffle-group enhance attention module for capturing features accurately under the interference of noise. Hu et al. [24] introduced deformable convolution into FCOS to capture more effective information for ships, and the nonlocal attention mechanism in the network effectively balanced the local information of the feature map. Gao et al. [27] proposed a novel feature aggregation scheme to enhance the representation ability of the features, and the feature reuse strategy of the scheme improved the generalization ability of the model. Fu et al. [26] retained the overall architecture of FCOS and proposed a module for mitigating interference from objects adjacent to the ships. In addition, an intersection over union (IoU) prediction branch was inserted into the head of the network for the bounding box regression of small-scale ships.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Due to their small radar cross-section, small ships appear at small scale and with weak intensity in SAR images, so CNNs easily confuse them with islands and speckle noise. In [27], the inception module was adopted to increase the receptive field so that small ships can be captured more effectively. Cui et al. [28] integrated spatial and channel attention into the feature pyramid network (FPN) structure, which strengthened the important information in the small-scale feature map and improved the detection precision of small targets. To improve the ability of the network to detect small ships, Wang et al. [29] added a nonlocal attention mechanism as a module on the SSD to enrich the semantic information of the feature maps. The coordinate attention module was used to capture horizontal and vertical correlations on feature maps in [30], and these feature maps were then processed by receptive field boosting to effectively reduce false alarms. To improve the positioning accuracy of small ships and reduce false alarms from nonships, Chen et al. [31] derived a shape similarity IoU loss to replace the original loss function of bounding box regression. Su et al. [32] used a multiscale pooling operation to upgrade the location information of small ships in the high-level features. Zeng et al. [33] pioneered the use of low-level features to match the receptive field of small ships, since low-level features contain the regional and texture information needed to capture small ships.
In the abovementioned ship detection networks, the researchers extended DL algorithms developed for close-range optical target detection to ship detection in SAR images. With the advantage of high accuracy, ship detectors based on DL algorithms have become a research hotspot in the field of remote sensing. However, SAR images contain less information than optical images and suffer from interference such as clutter, so the inherent shortcomings of these methods may be further revealed in SAR ship detection. The anchor-based methods regress the bounding box of a target from the designed anchors, but the designed anchors are usually placed uniformly over the entire feature map, resulting in a huge computational cost. Ships are sparsely distributed in SAR images, which means that the proportion of the image occupied by ships, especially small ships, is generally small. Sampling the ocean regions of the feature map with anchors generates a large number of negative samples unrelated to ships, which wastes computing resources. The detection performance of anchor-based methods is also very sensitive to the setting of the anchor hyperparameters. Methods with a fixed scale and aspect ratio of anchors, such as faster R-CNN and RetinaNet, cannot be applied to ship datasets of all resolutions. Methods that spend effort on designing anchors manually, such as YOLOv3 [34], may perform unstably under large scale variation within a class. If the scale of the preset anchors is not small enough, the ability of the network to detect small ships will be severely affected. Although the anchor-free methods avoid the effort of adjusting the hyperparameters and reduce the amount of computation caused by anchors, they lack further refinement based on anchors, resulting in a limited ability to handle complex scenes and cases [35].
When the ships are parked close to the shore or are distributed densely, the performance of the anchor-free method to predict the bounding box will decrease. Furthermore, the SAR images are grayscale images, which lack color information that helps to directly regress the bounding box without setting anchors.
To overcome the obstacles of the above anchor-based and anchor-free methods while obtaining stronger detection capability for small ships, we propose an SAR ship detection network capable of generating anchors. Our method predicts the shape of the anchor and its location on the feature map. The major contributions of our work can be summarized as follows.
1) To reduce false candidates unrelated to ships while preserving the design of anchor for high-accuracy detection performance, we propose a generating anchors module (GAM). The GAM receives the multiscale feature maps from the FPN and predicts the position and shape of the anchor. The anchor generated by the network can more effectively handle ships with various aspect ratios.
2) Feature maps with rich semantic ranges can provide spatial interaction information in the scene. Therefore, we design a receptive fields enhancement module (RFEM) for improving the ability of the network to locate ships. The feature maps with multisize receptive fields from the RFEM are merged along the channel dimension into a new feature map, which is then fed into the FPN.
3) To verify the effectiveness of our method, we conduct extensive experiments on two widely used SAR image ship datasets: SAR ship detection dataset (SSDD) [36] and high-resolution SAR images dataset (HRSID) [37]. Our method attains $AP$ of 64.8 and 66.5 and $AP_{50}$ of 95.7 and 91.1 on SSDD and HRSID, respectively, and achieves better detection performance for small ships than popular detectors.
The rest of this article is organized as follows. Section II presents the details of our method, and Section III analyzes the effectiveness of our method with experimental data. Finally, Section IV concludes this article.

II. PROPOSED METHOD
This section describes the proposed method in detail, and Fig. 1 shows the overall architecture of our proposed ship detection network. An SAR image is input into the trained (converged) network. Features are first extracted by the backbone network, which produces multiple feature maps of different scales. The resolution of these feature maps is gradually reduced, while their semantic content is enriched scale by scale. They are then fed into the RFEM, where the receptive fields of the elements in each feature map are enhanced, helping the GAM to generate anchors that are more similar in shape to ships. The feature maps processed by the RFEM enter the GAM after multiscale feature fusion in the FPN. The subnetwork of location prediction (SNLP) in the GAM selects the locations of the center points where anchors are set on the feature map, and the subnetwork of shape prediction (SNSP) in the GAM generates the height (h) and width (w) of the anchor at each location from the SNLP. Subsequently, the high-quality anchors and the feature maps output by the FPN are fed into the RPN and the head of the network to complete proposal extraction, bounding box refinement, and target classification in turn.

A. Basic Framework
Compared with the single-stage network, the two-stage network has one more RPN. Although the two-stage network has the inherent disadvantage of slow inference speed, the authors in [21] and [28] chose to develop ship detectors based on a two-stage network framework due to its high detection precision. With the improvement of computing power, the time gap between the two-stage networks and the single-stage networks will be further narrowed. Moreover, our method reduces the amount of computation caused by invalid candidates, so it adopts the two-stage network as the basic framework. The two-stage network framework with the FPN inserted is shown in Fig. 2. The feature maps $\{C_2, C_3, C_4, C_5\}$ obtained from the bottom-up path in the backbone fuse with the feature maps $\{A_2, A_3, A_4, A_5\}$ generated by the top-down path in the FPN via lateral connections. The new feature maps $\{P_2, P_3, P_4, P_5\}$ contain the semantic information of the higher layers and retain the location information of targets in the lower layers. Since the number of channels differs among the $C_i$, the FPN first maps every $C_i$ to the same number of channels through a 1×1 convolution. To detect large ships, we downsample $P_5$ to obtain $P_6$, so the downsampling factors of $\{P_2, P_3, P_4, P_5, P_6\}$ relative to the input sample image $I \in \mathbb{R}^{C \times H \times W}$ are $s_i = \{4, 8, 16, 32, 64\}$. Therefore, the number of anchors $N_{\mathrm{anchor}_i}$ set on the feature map $P_i$ is as follows:
$$N_{\mathrm{anchor}_i} = \frac{H}{s_i} \times \frac{W}{s_i} \times N_{\mathrm{size}} \times N_{\mathrm{ratio}} \tag{4}$$
where $N_{\mathrm{size}}$ and $N_{\mathrm{ratio}}$ are the numbers of values of the two preset hyperparameters (the size and the aspect ratio of the anchor), respectively. The RPN judges whether the anchors contain targets, and regresses the anchors containing targets into proposals. Nonmaximum suppression (NMS) is used to filter the proposals and obtain regions of interest (ROIs). Then the network assigns the ROIs to the feature maps $P_i$ according to their scale.
The detection head of the network extracts the features of the corresponding ROIs to complete the classification of targets and the refinement of bounding boxes. In both bounding box regressions of the two-stage network, the network does not directly predict the center coordinates $(x, y)$ and the width and height $(w, h)$ of the box. When the RPN generates proposals, the network outputs the difference between the anchor and the ground truth (GT) box, i.e., the translation amount and the transformation scale $(x_t, y_t, w_t, h_t)$. The parameters $(x, y, w, h)$ of the proposal can be decoded from the parameters of the anchor $(x_a, y_a, w_a, h_a)$ as
$$x = x_a + w_a x_t, \quad y = y_a + h_a y_t \tag{5}$$
$$w = w_a e^{w_t}, \quad h = h_a e^{h_t} \tag{6}$$
The $e$ in (6) is the base of the natural logarithm. When refining the bounding box, the network predicts the difference between the ROI and the GT box, and the decoding method is the same as in (5) and (6).
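The anchor count per pyramid level and the decoding step described above can be illustrated with a minimal sketch. It assumes the standard faster R-CNN box parameterization of [19]; the function names, the rounding with `ceil`, and the example values `n_size=1` and `n_ratio=3` are hypothetical choices made for illustration and are not taken from the experiments.

```python
import math

# Downsampling factors s_i of P2..P6, as stated in the text.
STRIDES = (4, 8, 16, 32, 64)

def preset_anchor_count(height, width, n_size, n_ratio, strides=STRIDES):
    """Number of preset anchors on each pyramid level.

    Every cell of a level carries n_size * n_ratio anchors. Using ceil for
    image sizes that are not divisible by the stride is an assumption.
    """
    counts = []
    for s in strides:
        cells = math.ceil(height / s) * math.ceil(width / s)
        counts.append(cells * n_size * n_ratio)
    return counts

def decode_box(anchor, deltas):
    """Decode (x_t, y_t, w_t, h_t) relative to an anchor (x_a, y_a, w_a, h_a)."""
    x_a, y_a, w_a, h_a = anchor
    x_t, y_t, w_t, h_t = deltas
    x = x_a + w_a * x_t          # shift the center by a fraction of the width
    y = y_a + h_a * y_t          # shift the center by a fraction of the height
    w = w_a * math.exp(w_t)      # exponential keeps the width positive
    h = h_a * math.exp(h_t)      # exponential keeps the height positive
    return x, y, w, h

counts = preset_anchor_count(512, 512, n_size=1, n_ratio=3)
print(counts, sum(counts))
print(decode_box((10.0, 10.0, 4.0, 8.0), (0.5, -0.25, 0.0, 0.0)))
```

Under these assumed values, most preset anchors sit on the highest-resolution level, which is where a sparse ocean scene wastes the most computation.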

B. GAM
In the anchor-based network, the anchors are densely set on the image. In SAR ship detection, most of these anchors fall in the ocean area, which causes the RPN to waste a lot of time judging whether an anchor contains a ship. Ships are long strips and sail in any direction on the ocean, so the aspect ratios of the GT boxes usually differ considerably. Anchors that are preset or clustered from the dataset are not robust enough for ship detection. Inspired by [35], we propose the GAM with supervision of the aspect ratio to generate anchors. The location $(x, y)$ and shape $(w, h)$ of the ship's bounding box on an SAR image $I$ follow a conditional probability density distribution
$$p(x, y, w, h \mid I) = p(x, y \mid I)\, p(w, h \mid x, y, I). \tag{7}$$
The $p(x, y \mid I)$ means that ships appear at specific locations on the image, i.e., the probability of placing an anchor differs from point to point on the feature map. The $p(w, h \mid x, y, I)$ means that the shape of the ship's bounding box is related to the location of the ship on the image, that is, the $(w, h)$ of the anchor at each location is related to that location on the feature map. Based on (7), the GAM structure is shown in Fig. 3. This module contains two branches: the SNLP and the SNSP. In the SNLP, each position $(x, y)$ in the feature map $P_i$ corresponds to the coordinate $((x + \frac{1}{2}) s_i, (y + \frac{1}{2}) s_i)$ on the input image $I$. The $p(x, y \mid P_i)$ indicates the probability that a ship exists at this location. A 1×1 convolution is applied to $P_i$ to obtain the score map of ship existence, and the score map is processed by a sigmoid layer to generate the probability map $p(x, y \mid P_i)$. We take the locations on $p(x, y \mid P_i)$ where the value is higher than the predefined threshold ε as the $(x_a, y_a)$ at which anchors are placed. The first term of the product in (7) is thus obtained by the SNLP, and the SNSP predicts the $(w_a, h_a)$ of the anchor at $(x_a, y_a)$. According to [15] and [19], the form of (6) can be used to obtain a more stable anchor shape; therefore the SNSP predicts the transformation scale $(w_t, h_t)$, and the $(w_a, h_a)$ of the anchor is decoded from it in the exponential form of (6) with a scale factor σ, where $w_t$ and $h_t$ come from a two-channel map generated by applying a 1×1 convolution on $P_i$. The $(w_a, h_a)$ is combined with the anchor center $(x_a, y_a)$ from the SNLP to obtain the anchor $(x_a, y_a, w_a, h_a)$ that can better capture ships. It is worth noting that only one anchor is associated with each location and that anchors are placed only at the locations selected by the SNLP, so the $N_{\mathrm{anchor}_i}$ on $P_i$ is no longer given by (4). Compared to (4), the number of anchors drops significantly after applying the GAM, and the numbers of positive and negative samples become more balanced.
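The two branches of the GAM can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the exact shape-decoding equation is not reproduced here, so the sketch borrows the form $w_a = \sigma s_i e^{w_t}$, $h_a = \sigma s_i e^{h_t}$ from the guided anchoring approach of [35], and the value `sigma=8.0`, the function name, and the toy inputs are hypothetical.

```python
import math

def generate_anchors(score_map, shape_map, stride, eps=0.01, sigma=8.0):
    """Generate at most one anchor per feature-map location.

    score_map: 2-D list of raw SNLP scores (before the sigmoid).
    shape_map: 2-D list of (w_t, h_t) pairs from the SNSP.
    The shape decoding (sigma * stride * exp(.)) is an assumed form
    borrowed from guided anchoring, not a confirmed equation of this article.
    """
    anchors = []
    for y, row in enumerate(score_map):
        for x, score in enumerate(row):
            prob = 1.0 / (1.0 + math.exp(-score))   # sigmoid -> p(x, y | P_i)
            if prob <= eps:                          # keep only locations above eps
                continue
            w_t, h_t = shape_map[y][x]
            x_a = (x + 0.5) * stride                 # center on the input image
            y_a = (y + 0.5) * stride
            w_a = sigma * stride * math.exp(w_t)     # assumed decoding form
            h_a = sigma * stride * math.exp(h_t)
            anchors.append((x_a, y_a, w_a, h_a))
    return anchors

# Toy 2x2 feature map: only the top-right location has a high ship score.
scores = [[-10.0, 2.0], [-10.0, -10.0]]
shapes = [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]]
print(generate_anchors(scores, shapes, stride=4))
```

In this toy case only one of the four locations survives the threshold, which mirrors how the anchor count falls below the dense preset count.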

C. RFEM
To enlarge the receptive field of the elements in the feature map, a pooling operation [38] or atrous convolution [39] can be performed. Although the pooling operation does not increase the number of parameters, it makes the feature map easily disturbed by noise, especially by the strong speckle interference in SAR images. In addition, the pooling operation reduces the resolution of the feature map. Therefore, atrous convolution is used in our method to enhance the receptive field while preserving the spatial information of the feature maps. For a 2-D feature map $P$, the $Q$ obtained after atrous convolution can be expressed as
$$Q(i, j) = \sum_{m=1}^{K} \sum_{n=1}^{K} P(i + r \cdot m,\; j + r \cdot n)\, W(m, n)$$
where $(i, j)$ are coordinates on the feature map, $W$ is a convolution filter of size $K \times K$ indexed by $(m, n)$, and $r$ is the dilation rate. When the dilation rate is 1, the atrous convolution degenerates into an ordinary convolution. In the atrous spatial pyramid pooling (ASPP) structure, the input feature map is responsible for predicting targets whose sizes lie within a range, so the dilation rates of the atrous convolutions chosen to enlarge the receptive field are {6, 12, 18}. However, the FPN assigns targets to feature maps of different resolutions according to their scale, and the proportion of the image taken up by ships is relatively low in ship detection. Therefore, using atrous convolution with large dilation rates on the high-level feature maps easily introduces redundant information irrelevant to the ships. In the structure of our designed RFEM, the number of atrous convolutions is gradually reduced as the resolution of the feature map decreases. The lowest-resolution level requires only one atrous convolution with the minimum dilation rate. The structure of the RFEM is shown in Fig. 4; the RFEM is embedded after the channel reduction operation of the 1×1 convolution in the FPN. The $C_2$ with the highest resolution uses atrous convolutions with dilation rates $\{1, \alpha, 2\alpha, 3\alpha, 4\alpha\}$ in parallel, where $\alpha$ is the step size of the dilation rate.
To ensure that the multiple output feature maps can be concatenated, zero padding is applied before the atrous convolutions so that every output has the same shape as $C_2$. The feature map with 256 × 5 channels obtained by concatenation is passed through a 1×1 convolution for the interaction between feature maps of different receptive fields, so the resulting enhanced $C_2 \in \mathbb{R}^{256 \times \frac{H}{4} \times \frac{W}{4}}$ has stronger representation ability. The operation for $C_3$ is similar to that for $C_2$, except that the dilation rates of the atrous convolutions are $\{1, \alpha, 2\alpha, 3\alpha\}$; by analogy, the dilation rates of the atrous convolutions used for $C_5$ are $\{1, \alpha\}$. The FPN and the RFEM are connected in cascade, so each $C_i$ in Fig. 2 is replaced by the corresponding RFEM output in our method. The feature maps extracted from the backbone have their receptive fields enhanced by the RFEM, and then complete the fusion of multiresolution features through the top-down path. The new feature maps output by the FPN not only enable the network head to achieve better detection performance, but also help the GAM to generate higher quality anchors.
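The per-level dilation rates and their effect on the receptive field can be checked with a short sketch. The function names are hypothetical, and the kernel size of 3 used in the example is an assumption, since the text only states a filter of size $K \times K$. The span of a dilated kernel follows the well-known relation $k + (k - 1)(r - 1)$.

```python
def rfem_dilation_rates(alpha=2):
    """Dilation rates per backbone level as described in the text.

    C2 uses {1, a, 2a, 3a, 4a}; each coarser level drops the largest rate,
    so C5 keeps {1, a}. A rate of 1 is an ordinary convolution.
    """
    levels = ("C2", "C3", "C4", "C5")
    rates = {}
    for idx, name in enumerate(levels):
        n_atrous = 4 - idx  # 4, 3, 2, 1 dilated branches
        rates[name] = [1] + [alpha * k for k in range(1, n_atrous + 1)]
    return rates

def effective_kernel(kernel_size, rate):
    """Side length covered by a dilated kernel: k + (k - 1)(r - 1)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

rates = rfem_dilation_rates(alpha=2)
for level, level_rates in rates.items():
    spans = [effective_kernel(3, r) for r in level_rates]  # K = 3 is assumed
    print(level, level_rates, spans)
```

With $\alpha = 2$ and an assumed 3×3 kernel, the widest branch on $C_2$ covers a 17×17 window, while $C_5$ never exceeds 5×5, which matches the intent of avoiding large dilation on coarse maps.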

D. Loss
Our proposed ship detection network is optimized end to end via a multitask loss. The multitask loss function Loss contains the loss function of the SNLP $L_{\mathrm{SNLP}}$ and the loss function of the SNSP $L_{\mathrm{SNSP}}$ from the GAM, and the classification loss $L_{\mathrm{cls}}$ and the regression loss $L_{\mathrm{reg}}$ from the base framework. In the training of the network, $L_{\mathrm{cls}}$ and $L_{\mathrm{reg}}$ are the cross entropy loss and the smooth L1 loss, respectively. The training of the SNLP requires the region occupied by ships as the label to calculate $L_{\mathrm{SNLP}}$; this label can be obtained directly from the GT boxes of ships. Since a higher initial IoU value appears when the centers of the anchor and the GT box are closer, the locations in the center region of the GT boxes on the feature map are regarded as positive samples. In addition, we wish to set as few anchors as possible in the regions far from the centers of the GT boxes. First, the GT box $(x_g, y_g, w_g, h_g)$ must be mapped to the scale of the feature map $P_i$ to get $(x'_g, y'_g, w'_g, h'_g)$. A rectangular region is defined as $R(x, y, w, h)$ in the same way as a bounding box, and the three types of sample labels on the feature map are defined in Table I. The CR usually occupies a small portion of the feature map, so we use the focal loss [15] as $L_{\mathrm{SNLP}}$. Since the GAM is cascaded after the FPN, the assignment scheme of GT boxes in the FPN is still used when generating the binary label map.
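For reference, the focal loss of [15] at a single feature-map location can be sketched as follows. The defaults `alpha=0.25` and `gamma=2.0` are the values commonly used with the focal loss and are assumptions here, as is the convention of marking ignored locations with the label -1 (the three label types themselves are defined in Table I).

```python
import math

def focal_loss(prob, label, alpha=0.25, gamma=2.0):
    """Binary focal loss for one location of the SNLP probability map.

    label: 1 for a positive location, 0 for a negative one, and -1 for an
    ignored location (hypothetical encoding) that contributes no loss.
    """
    if label == -1:
        return 0.0
    p_t = prob if label == 1 else 1.0 - prob
    a_t = alpha if label == 1 else 1.0 - alpha
    p_t = max(p_t, 1e-12)  # guard against log(0)
    # (1 - p_t)^gamma down-weights locations that are already well classified
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

print(focal_loss(0.9, 1))   # confident positive: small loss
print(focal_loss(0.1, 1))   # missed positive: large loss
print(focal_loss(0.5, -1))  # ignored location: zero loss
```

The modulating factor is what makes this loss suitable when positive locations are rare, since the many easy negatives in the ocean area contribute little.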
In training, the basic framework assigns anchors to GT boxes according to the maximum IoU value to calculate the loss. However, this is no longer applicable in the GAM, where $w$ and $h$ are variables. This problem is solved by approximating the IoU with the variable IoU (vIoU) in the GAM
$$\mathrm{vIoU}(a_{wh}, gt) = \max_{a_{wh} \in a_{\mathrm{sample}}} \mathrm{IoU}(a_{wh}, gt)$$
where $\mathrm{IoU}(a_{wh}, gt)$ is the IoU between an anchor with $(w, h)$ and the GT box $gt$, and $a_{\mathrm{sample}}$ is the set of anchors with common $(w, h)$ obtained by sampling. The nine sampled anchor shapes in our experiments are the same as in [15], i.e., the aspect ratios of the anchors on $P_i$ are ratio = {0.5, 1, 2}, and the base scales of the anchors are base_scale = $2^{m/3}$, $m = 0, 1, 2$. Compared with optical photos and optical remote sensing images, SAR images lack the rich color boundary information that helps the GAM predict the shape of anchors. Due to interference such as sea clutter, predicting the shape of the anchor alone may lead to an error between the aspect ratio of the anchor and the ideal one. Additionally, the shape of ships places higher requirements on the aspect ratio of the anchors.
As a result, the bounding box and the GT box are not matched accurately enough, and it is difficult to meet the requirements of scenes that demand a high IoU value between the bounding box and the GT box. Therefore, we design the aspect ratio loss as the supervision for generating the anchor, and the $L_{\mathrm{SNSP}}$ is built from the smooth L1 loss, denoted $L_1$, and the mean square error (MSE) loss.
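The vIoU approximation can be illustrated with a small sketch that samples nine anchor shapes at a location and keeps the best IoU with the GT box. Boxes are given as (center x, center y, width, height). The parameter `base_size`, the convention that the ratio is height over width, and the function names are assumptions made for this example.

```python
import math

def iou_xywh(a, b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def sampled_shapes(base_size, ratios=(0.5, 1.0, 2.0), n_scales=3):
    """Nine (w, h) pairs: three ratios times three scales 2^(m/3)."""
    shapes = []
    for m in range(n_scales):
        size = base_size * 2 ** (m / 3)
        for r in ratios:  # r is taken as h / w (an assumed convention)
            shapes.append((size / math.sqrt(r), size * math.sqrt(r)))
    return shapes

def viou(center, gt, base_size):
    """Maximum IoU with gt over the sampled shapes centered at `center`."""
    cx, cy = center
    return max(iou_xywh((cx, cy, w, h), gt) for w, h in sampled_shapes(base_size))

print(viou((50.0, 50.0), (50.0, 50.0, 64.0, 32.0), base_size=32.0))
```

Taking the maximum over a fixed set of shapes gives a location-level matching score even though the anchor shape at that location is still being learned.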

III. EXPERIMENTS
All of our experiments are run on a computer with an NVIDIA RTX 3090 GPU. The operating system is Ubuntu 20.04, and the installed DL framework is PyTorch. Furthermore, our method is implemented based on the MMDetection Toolbox [40].

A. Datasets and Settings
To verify the effectiveness of our proposed modules and test the performance of our ship detection network, we conduct hyperparameter experiments, ablation experiments, and comparison experiments with other popular networks on the SSDD. The SSDD has multiple data sources (RadarSat-2, TerraSAR-X, and Sentinel-1), and its resolution covers the range of 1-15 m. The SSDD includes 1160 samples with side lengths between 500 pixels and 600 pixels, with a total of 2456 ships in these samples. Furthermore, to demonstrate the generalization of our method, we also conduct comparative experiments with other popular networks on the HRSID. The resolutions of the samples in the HRSID are 0.5, 1, and 3 m, and the resolution of most samples is 3 m. The sample size of the HRSID is 800×800 pixels, and these samples come from TanDEM-X, TerraSAR-X, and Sentinel-1. The HRSID has a total of 5604 samples and 16 951 ship labels, with an average of three ships per sample. The two datasets also provide labels for inshore and offshore samples, which can test the performance of our method in complex environments. For the SSDD, we divide the training set and the testing set according to [36]. The raw images whose file numbers end in 1 or 9 are used as the testing set, and the rest are used as the training set, i.e., 232 samples are used for testing and 928 samples are used for training. To facilitate the input of the network, the samples in the SSDD are resized to 512×512 pixels. To keep the aspect ratio of the samples, zero padding is used when resizing them. The sample division of the HRSID follows [37], that is, about 65% (3642 images) form the training set and 35% (1962 images) form the testing set. Following [21], we directly input the HRSID samples into the network without resizing and padding.
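The SSDD split rule can be verified with a few lines of Python. The sketch assumes that the file numbers run consecutively from 1 to 1160, which is a hypothetical numbering used only to check the stated counts.

```python
def is_test_sample(file_number):
    """A sample belongs to the testing set if its file number ends in 1 or 9."""
    return file_number % 10 in (1, 9)

# Assumed numbering: 1..1160, one number per SSDD sample.
numbers = range(1, 1161)
test_set = [n for n in numbers if is_test_sample(n)]
train_set = [n for n in numbers if not is_test_sample(n)]

print(len(test_set), len(train_set))
```

Under this assumption the rule yields 232 testing samples and 928 training samples, in agreement with the split described above.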
In our experiments, the samples do not undergo any augmentation operations, so that the performance of the networks is demonstrated directly. The backbone of all networks except the network of [41] is the ResNet-50 network [42] loaded with the ImageNet-pretrained weights from torchvision. All networks are trained on the GPU with a batch size of 1 for 12 epochs, and the configuration of the optimizer is given in Table II. To help the network converge, we also adopt a linear warm-up strategy for the learning rate. The number of warm-up iterations and the warm-up rate of this strategy are 500 and 0.001, respectively. NMS is applied to filter redundant bounding boxes from the outputs of the network, and the IoU threshold of NMS is set to 0.5. We set $\lambda_1 = 1.0$ and $\lambda_2 = 0.8$ to balance the loss terms in our method. The rest of the hyperparameters follow the default settings of the MMDetection Toolbox.
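One common form of linear warm-up can be sketched as follows: the learning rate starts at the warm-up rate times the base learning rate and grows linearly to the base value over the warm-up iterations. The base learning rate of 0.01 is a hypothetical placeholder, since the optimizer settings are listed in Table II, and the function name is chosen for illustration.

```python
def warmup_lr(base_lr, cur_iter, warmup_iters=500, warmup_ratio=0.001):
    """Learning rate under a linear warm-up schedule.

    Starts at base_lr * warmup_ratio at iteration 0 and reaches base_lr at
    iteration warmup_iters; afterwards the base learning rate is returned.
    """
    if cur_iter >= warmup_iters:
        return base_lr
    k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
    return base_lr * (1 - k)

base_lr = 0.01  # hypothetical value, not taken from Table II
for it in (0, 125, 250, 375, 500):
    print(it, warmup_lr(base_lr, it))
```

A small initial rate limits the size of the first updates, which is useful when the newly added GAM branches start from random weights while the backbone is pretrained.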

B. Evaluation Metrics
To objectively evaluate the performance of the method, we adopt several quantitative metrics in this section, such as the most widely used recall ($r$) and precision ($p$). The precision-recall curve (PRC) shows precision and recall together and describes their relationship. Therefore, we introduce the average precision ($AP$), $AP_{50}$, and $AP_{75}$ from the evaluation metrics of COCO [43] to quantify the PRC for a more comprehensive evaluation than precision alone. In addition, small ships have always been a difficulty for SAR ship detection, so we adopt $AP_s$ and $AP_{50}^{s}$ to evaluate the detection performance on small ships. The $AP_s$ and $AP_{50}^{s}$ are the $AP$ and $AP_{50}$ of small ships, respectively. The recall and precision are defined by
$$r = \frac{N_{TP}}{N_{TP} + N_{FN}}, \quad p = \frac{N_{TP}}{N_{TP} + N_{FP}} \tag{16}$$
where $N_{TP}$, $N_{FN}$, and $N_{FP}$ are the numbers of true positives (TP), false negatives (FN), and false positives (FP), respectively. In SAR ship detection, a TP is a correctly detected ship whose bounding box has an IoU with the GT box higher than 0.5. An FP is a false alarm or a bounding box predicted for a ship whose IoU with the GT box is lower than 0.5, and an FN is a missed ship. The $AP$ represents the area under the PRC. In addition, calculating the $AP$ requires setting the IoU threshold between the GT box and the bounding box predicted for the ship to determine the TPs. The $AP_{50}$ is the area under the PRC with an IoU threshold of 0.5, and $AP_{75}$ is defined likewise with an IoU threshold of 0.75. It is worth noting that the $AP$ in the following content is the average over multiple IoU thresholds as defined by COCO [43]. Our definition of small ships follows COCO, i.e., ships whose GT box has an area smaller than $32^2$ pixels are considered small ships.
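The recall, the precision, and the small-ship criterion can be written out as a short sketch. The function names and the example counts are hypothetical; returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall(n_tp, n_fp, n_fn):
    """Precision and recall from the counts of TP, FP, and FN."""
    precision = n_tp / (n_tp + n_fp) if (n_tp + n_fp) else 0.0
    recall = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    return precision, recall

def is_small_ship(width, height):
    """COCO-style small object: GT box area below 32^2 pixels."""
    return width * height < 32 ** 2

# Example: 9 correct detections, 1 false alarm, 3 missed ships.
p, r = precision_recall(n_tp=9, n_fp=1, n_fn=3)
print(p, r)
print(is_small_ship(20, 30), is_small_ship(32, 32))
```

In the example, the single false alarm lowers the precision while the three missed ships lower the recall, which is why both quantities are tracked along the PRC.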

C. Hyperparameter Experiment
To maximize the effectiveness of the designed modules in our method, we conduct hyperparameter experiments for ε in the GAM and the step size of the dilation rate ($\alpha$) in the RFEM. Except for the different values of the hyperparameters studied in each group of experiments, the rest of the network parameters, training settings, and datasets used are the same, and the performance evaluation indicators are $AP$, $AP_{50}$, and $AP_{75}$.
1) ε in the GAM: We take ε ∈ [0:0.005:0.02] to carry out comparative experiments. An ε of 0 means that no location of the feature map is filtered out, so anchors are set at all locations. As shown in Table III, the highest $AP$ and $AP_{75}$ are obtained by setting ε to 0, where $AP$ and $AP_{75}$ are at least 0.8% and 0.3% higher than with other ε, but $AP_{50}$ is 0.5% lower than the highest value. The highest $AP_{50}$ is obtained at an ε of 0.01, which is at least 0.2% higher than with other ε, but $AP$ and $AP_{75}$ are 0.8% and 0.3% lower than the highest values, respectively. In the target detection task, when the IoU of the bounding box and the GT box is higher than 0.5, the prediction of the bounding box is considered correct. The $AP_{50}$ commonly used in the evaluation of Pascal VOC is also calculated with an IoU threshold of 0.5. Although the highest $AP$ and $AP_{75}$ are obtained when ε is 0, the lower $AP_{50}$ indicates that fewer ships are detected. In addition, an ε of 0 generates a large number of redundant negative samples and increases the computational cost, so we choose 0.01 as ε in the GAM.
2) $\sigma_1$ and $\sigma_2$ in the GAM: Since the docking direction of ships is arbitrary, the proportion of the manually annotated horizontal bounding box occupied by the ship is random. To obtain hard negative samples, we set $\sigma_2$ to 0.5 or 0.6 for the hyperparameter experiments [35], [44]. We set $\sigma_1$ ∈ [0.1:0.1:$\sigma_2$ - 0.1], and the experimental results are shown in Table IV. The highest $AP_{50}$ and $AP_{75}$ are obtained by setting $\sigma_1 = 0.2$ and $\sigma_2 = 0.5$, where $AP_{50}$ and $AP_{75}$ are at least 0.1% and 0.4% higher than with other combinations of $\sigma_1$ and $\sigma_2$, but the $AP$ is 0.4% lower than the highest value. The highest $AP$ is obtained when $\sigma_1 = 0.3$ and $\sigma_2 = 0.6$, with an $AP_{50}$ of 95.1%. This $AP_{50}$ is 0.6% lower than the highest value, which means that fewer ships are detected than with $\sigma_1 = 0.2$ and $\sigma_2 = 0.5$. Therefore, the $\sigma_1$ and $\sigma_2$ in our method are set to 0.2 and 0.5, respectively.
3) α in the RFEM: In SAR ship detection, small ships occupy a very small proportion of the SAR image, so we choose α ∈ {2, 3, 4, 5} to carry out comparative experiments. In Table V, we can observe that the highest AP and AP 75 are obtained at α = 2, while AP 50 is only 0.1% lower than that at α = 3. The network obtains the highest AP 50 when α is set to 3; however, its AP and AP 75 are 0.4% and 0.3% lower than those at α = 2, respectively. Table V also shows that AP and AP 75 tend to decrease as α increases. To sum up, setting α to 2 yields the most accurate bounding boxes while keeping AP 50 within 0.1% of the best value, so we set α in the RFEM to 2.
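To give a sense of why larger dilation rates can hurt small targets, the following sketch computes the footprint of a single dilated (atrous) convolution. A kernel of size k with dilation rate d spans k + (k - 1)(d - 1) pixels per side. The 3 × 3 kernel and the helper name `effective_kernel_size` are assumptions made for illustration; how α determines the dilation rate at each pyramid level is defined with the RFEM and is not reproduced here.

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length covered by a dilated convolution kernel on its input."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Footprint of an assumed 3x3 kernel for the dilation rates 1 to 5.
footprints = {d: effective_kernel_size(3, d) for d in range(1, 6)}

for d, size in footprints.items():
    print(f"dilation {d}: {size} x {size} footprint")
```

The footprint grows linearly with the dilation rate, so on a coarse feature map a large rate quickly covers far more area than a small ship occupies. This is consistent with, though not a proof of, the decreasing AP and AP 75 observed for larger α.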

D. Ablation Experiment
In order to fairly verify the effectiveness of the two components in our method, we conduct ablation experiments under the same experimental setup and data configuration. The baseline is faster R-CNN, and five indicators are used in this experiment. AP, AP 50, and AP 75 evaluate the improvement brought by each component on the overall dataset, while AP s and AP 50 s measure the improvement in its detection ability for small ships.
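AP 50 and AP 75 differ only in the IoU threshold at which a predicted box counts as a correct detection (0.5 and 0.75, respectively, following the usual COCO-style convention). The sketch below, with hypothetical boxes in (x1, y1, x2, y2) form and illustrative function names, shows a prediction that is accepted at the looser threshold but rejected at the stricter one.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so that non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, gt_box, threshold):
    """A prediction counts as correct when its IoU exceeds the threshold."""
    return iou(pred_box, gt_box) > threshold


gt_box = (10, 10, 30, 20)    # hypothetical ground-truth ship box
pred_box = (14, 10, 34, 20)  # same size, shifted 4 pixels to the right

overlap = iou(pred_box, gt_box)
print(f"IoU = {overlap:.3f}")
print("counted for AP 50:", is_correct(pred_box, gt_box, 0.5))
print("counted for AP 75:", is_correct(pred_box, gt_box, 0.75))
```

A gain in AP 50 therefore mainly reflects more ships being found, whereas a gain in AP 75 reflects tighter localization of the ships that are found.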

1) Effect of GAM: We first investigate the effect of adding the aspect ratio loss to the loss function of the GAM. The results are shown in Table VI. After adding the aspect ratio loss, AP and AP 75 improve by 1.5% and 1.6%, respectively, while AP 50 decreases by 0.1%. This indicates that the network, benefiting from the aspect ratio loss, predicts more accurate bounding boxes, which means that the aspect ratio loss supervises the GAM to generate higher quality anchors.
We then evaluate the performance of adding the GAM to the baseline. Fig. 5 shows the distribution of proposals predicted on the input SAR image by the RPN of the baseline with and without the GAM. We can observe that after inserting the GAM, the overall number of proposals is greatly reduced compared to the baseline, and most proposals are concentrated on the ships. The distribution of proposals in the marine area is sparse, and the shape of the proposals on the ships is closer to the GT boxes. Without the GAM, too many redundant proposals are fed into the head of the network, resulting in extra computation. The results in Table VII show that the GAM increases AP, AP 50, and AP 75 by 3.0%, 3.2%, and 1.9%, respectively. This means that the high-quality anchors generated by the GAM can adapt to longer or wider ships, allowing the network head to regress more refined bounding boxes. The GAM improves AP s by 2.6% and AP 50 s by 2.9% over the baseline, because it predicts anchors that match small ships better in position and shape than the preset anchors do, as shown in the top row of Fig. 5.
2) Effect of RFEM: We explore the impact of the RFEM by adding it to the baseline. As shown in Table VII, the RFEM improves AP, AP 50, and AP 75 by 1.4%, 0.9%, and 0.9%, respectively. The RFEM expands the receptive field of each element in the feature map, so the information around the ships is collected and helps the network detect the ships. The top row of Fig. 7 shows a near-shore scenario in which the baseline produces false positives in the area close to the shore. The RFEM uses information from the ocean and coastal areas around the ships to eliminate these false positives and helps the network predict more accurate bounding boxes. The RFEM increases AP s by 0.8% and AP 50 s by 1% over the baseline, which shows that it improves the sensitivity of the network to small ships. The bottom row of Fig. 7 shows a scene in which small ships are densely docked. After the RFEM is inserted into the baseline, the semantic information of the feature map becomes more abundant, and the number of ships missed by the baseline is reduced by three, so the RFEM can further reduce the miss rate for small ships.
The above analysis verifies the effectiveness of the two components, the GAM and the RFEM, individually. As shown in Table VII, combining the two components further improves all metrics compared with using either component alone. Compared with the baseline, our method improves AP, AP 50, and AP 75 by 3.5%, 4.2%, and 3.4%, respectively. The corresponding PRC in Fig. 6 displays more comprehensive results. These results mean that our method can detect more ships and predict more accurate bounding boxes of ships. In addition, AP s and AP 50 s increase by 3.3% and 4.1%, respectively, which means that our components greatly improve the detection ability for small ships.

E. Comparison and Discussion
In this experiment, we fairly compare the performance of our method with some CNN-based methods, using exactly the same experimental settings and data configuration for all comparison experiments. The CNN-based methods we compare are divided into anchor-based methods and anchor-free methods. The anchor-based methods are RetinaNet [15], faster R-CNN [19], cascade R-CNN [45], Libra R-CNN [46], GA R-CNN [35], and HR-SDNet [41]; the last five are built on the two-stage network framework, like our method. The anchor-free methods are FCOS [23], CP-FCOS [44], AutoAssign [47], and ATSS [48]; the last two optimize the strategy of sample assignment. HR-SDNet and CP-FCOS are specifically designed for ship detection in SAR images.
In order to compare the performance of each method more comprehensively, we add AP, AP 50, and AP 75 of inshore and offshore ships, and the results are shown in Tables VIII-X. Our method achieves an advantage in AP 50 on the entire, inshore, and offshore ships of both datasets. The recall and AP 50 s of our method are the highest on both datasets compared with the other methods. These results mean that our method can detect more ships in complex scenes and is more capable of detecting small ships than these CNN-based methods. Although both RetinaNet and our method can address the imbalance between positive and negative samples, RetinaNet is inferior to our method on all metrics due to the lack of high-quality anchors. Compared with these anchor-based and anchor-free networks, our method also has advantages in AP and AP 75, indicating that the receptive field enlarged by the RFEM and the high-quality anchors generated by the GAM help the network regress more accurate bounding boxes of ships. It is worth noting that some AP and AP 75 values of our method on SSDD are slightly lower than those of cascade R-CNN. This is because our method lacks the multilevel refinement of bounding boxes used by cascade R-CNN, which has more parameters and a longer inference time. Cascade R-CNN can obtain more accurate bounding boxes for easily detectable ships; however, our method is much higher than cascade R-CNN in AP 50 on both datasets. Moreover, our method obtains better AP and AP 75 on the entire and inshore samples of HRSID, which contains a larger number of samples, so our method is more practical for ship detection in SAR images.
As shown in Figs. 8 and 9, we visualize the detection results of some methods in four types of complex scene. We select three representative methods for comparison to better highlight the strengths and weaknesses of our method: ATSS, faster R-CNN, and cascade R-CNN. ATSS has the highest AP among the anchor-free methods in Tables VIII and IX.
Comparison with faster R-CNN, the baseline network of our method, shows the improvement that the designed GAM and RFEM bring to ship detection. Cascade R-CNN is a multistage detection network with more parameters that uses faster R-CNN as its baseline network, like our method; comparison with it shows that our method improves detection performance without greatly increasing the number of parameters. The first row of Fig. 8 shows densely distributed small ships. ATSS detects all targets but gives false alarms. Both faster R-CNN and cascade R-CNN miss ships, whereas our method detects all small ships without false alarms. In the same type of scene in the first row of Fig. 9, our method detects all small ships, while the other three methods miss ships. This means that our method has a stronger detection ability for small targets, which is consistent with the results in Tables VIII and IX. In the scene where ships are docked side by side (the second row in Figs. 8 and 9), both ATSS and the baseline produce missed ships and false alarms on the SSDD. On the HRSID, our method detects the adjacent ships with confidence close to one and without false alarms. In the scene where the ships appear near an island (the third row in Figs. 8 and 9), our method overcomes the interference caused by the island and detects all ships accurately. ATSS and faster R-CNN regard the small island as a ship on the SSDD. On the HRSID, ATSS misses small ships near the small islands, while faster R-CNN and cascade R-CNN mispredict the island as a ship. The detection of inshore ships is very prone to false alarms and missed detections. The fourth row of Figs. 8 and 9 shows that our method detects all ships and predicts their bounding boxes most accurately, without false alarms, compared to the other three CNN-based methods. In general, our method achieves better detection results for small targets.
In complex scenes, our method can detect more ships and predict more accurate bounding boxes of ships without adding further stages of box refinement.

IV. CONCLUSION
In this article, we propose a two-stage network that generates its own anchors for detecting small ships in SAR images. We introduce the GAM and purposely add an aspect ratio loss to its loss function for capturing ships. The redesigned GAM generates higher quality anchors, which are more conducive to regressing the bounding boxes of ships. In addition, we propose an RFEM and embed it into the FPN. The RFEM sets atrous convolutions with different dilation rates for feature maps of different resolutions, which expands the receptive field of the elements in the feature map and enriches their semantic information. The information about the region around the ships is collected to help the network locate ships more accurately. The experimental results show the effectiveness of the designed components. Compared with some CNN-based methods, our method detects more ships, and its detection ability for small ships is stronger than that of the compared state-of-the-art networks, which shows the superiority of our method.