Improved YOLOv4 Based on Attention Mechanism for Ship Detection in SAR Images

Ship detection in synthetic aperture radar (SAR) images is an important and challenging task in the field of image processing. Traditional detection algorithms usually rely on handcrafted features or predefined thresholds, so their performance varies with the amount of prior knowledge available, and they struggle to take advantage of big data. Recently, deep learning algorithms have found wide application in ship detection from SAR images. However, due to complex backgrounds and multiscale ships, it is hard for deep networks to extract representative target features, which limits ship detection performance to a certain extent. To tackle these problems, we propose an improved YOLOv4 (ImYOLOv4) based on an attention mechanism. Firstly, to achieve the best trade-off between detection accuracy and speed, we adopt the off-the-shelf YOLOv4 as our basic framework because of its fast detection speed. Secondly, a thresholding attention module (TAM) is introduced to suppress the adverse effect of complex backgrounds and noise. Besides, we embed a channel attention module (CAM) into an improved BiFPN, which serves as the feature pyramid network (FPN), to better enhance the discrimination of multiscale target features. Finally, a decoupled head with two parallel branches improves the performance of classification and regression. The proposed method is evaluated on a public SAR dataset, and the experimental results demonstrate that it has higher efficiency and feasibility than other mainstream methods, yielding an accuracy of 94.16% at an intersection over union of 0.5 and 58.19% at an intersection over union of 0.75.


I. Introduction
With the continuous improvement of space remote sensing imaging technology, high-resolution and wide-scale remote sensing images are becoming increasingly abundant and facilitate a wide range of applications. Remote sensing applications turn remote sensing images into plug-and-play products, which are widely used in all aspects of social and economic life, such as traffic control [1]-[2], geological and mineral exploration [3], environment monitoring [4], and urban construction [5]. Because ships are key targets of marine monitoring and wartime attack, ship detection has important practical value in both civil and military fields [6]-[10]. In recent years, much research in this field has prioritized synthetic aperture radar (SAR) images, and ship detection in SAR images has become one of the most important remote sensing applications [11]-[16]. Unlike optical sensors, SAR is an active microwave remote sensing imaging sensor with all-day and all-weather surveillance capabilities, making it possible to continuously monitor targets at sea [17]-[20]. Therefore, it is very important to study ship detection in SAR images.
Many studies on ship detection in remote sensing images have been carried out in recent years [21]-[24]. Traditional feature extraction methods are usually based on handcrafted features such as scale-invariant feature transform (SIFT) [25], histogram of oriented gradients (HOG) [26], and local binary pattern (LBP) [27], followed by shallow classification modules, e.g., support vector machine (SVM) [28], extreme learning machine (ELM) [29], and AdaBoost [30]. Most traditional algorithms perform well on ideal-quality images. However, they depend heavily on manual feature extraction and on prior knowledge such as predefined thresholds and the distribution of sea clutter, and they are further degraded by complex backgrounds and noise. As a result, their generalization ability is weak, and their detection performance is far from satisfactory.
In recent years, driven by the abundance of remote sensing images, deep learning methods have achieved great success in object detection. State-of-the-art deep learning-based ship detection methods include one-stage and two-stage detectors. One-stage detectors directly convert object detection into a regression problem, which makes them fast. You Only Look Once (YOLOv1) [31], an end-to-end object detection algorithm, processes the input image only once, which reduces computational redundancy and improves detection speed. Single Shot Detector (SSD) [32], RetinaNet [33], YOLOv2 [34], YOLOv3 [35], and the more recent YOLOv4 [36] are typical one-stage detection algorithms. In two-stage detectors, the first stage generates a set of candidate proposals while filtering out the majority of negative locations, and the second stage classifies the proposals as background or foreground. Region CNN (R-CNN) [37] introduces deep learning methods to the field of object detection and outperforms most traditional detection methods. Subsequently, a series of two-stage algorithms were proposed, such as Faster R-CNN [38], Mask R-CNN [39], and Cascade R-CNN [40]. Compared with one-stage detectors, two-stage detectors offer higher localization accuracy at a lower running speed.
With the rapid development of SAR sensors, the volume of SAR images is growing and the data are easier to obtain, which makes deep learning algorithms feasible for SAR object detection. However, some challenges still exist: 1) complex backgrounds on land and strong backscatter usually result in missed detections and false alarms, and 2) ships are often clustered, and the shapes of targets in SAR images have extreme aspect ratios. Above all, small ship objects make it hard for deep networks to extract representative target features, which further limits ship detection performance. Researchers in the deep learning community have made many attempts to exploit CNN-based frameworks for ship detection in SAR images. Based on the original Faster R-CNN, researchers have made some typical improvements such as adding hard negative mining [41] and dense connection [42]. Other methods build more complex structures to improve performance on tough problems such as dense small ships [43]. Zhao proposes a cascade coupled convolutional network with attention mechanism to detect ships, which shows promising results for small objects [44]. A novel dense pyramid network with attention weighting is used to address multiscale ship detection [45]. Besides, training techniques such as training from scratch have also been introduced to SAR ship detection, and the final results outperform other pretrained ship detectors [46]. To achieve real-time ship detection in SAR images, methods based on one-stage detectors have gradually been explored. For instance, Wang [47] applies the end-to-end RetinaNet to SAR ship detection and constructs a multi-resolution, complex-background dataset, achieving high detection accuracy.
Du [48] uses two identical sub-networks to extract features from the input SAR image and the corresponding saliency map at the same time, and then integrates the salient features into the deep CNN features. Zhang [49] introduces a channel attention module and a spatial attention module into a high-speed and high-precision SAR ship detection network and obtains excellent detection performance. As far as we know, most studies focus on either high accuracy or high speed, and only a few focus on both. However, both indicators are very important for SAR ship detection.
In this paper, we propose a novel one-stage ship detector named improved YOLOv4 (ImYOLOv4) based on attention mechanism [50] for accurate ship detection in SAR images. Firstly, to achieve the best trade-off between detection accuracy and speed, we adopt the off-the-shelf YOLOv4 as the basis of our detection framework. Secondly, we design a thresholding attention module (TAM) that is embedded in the very first layer of the network to perform denoising at the image level. The TAM block adaptively learns a set of thresholds from the global information of the image to suppress noise, avoiding invalid data flow through the network. Besides, to improve the detection performance for multiscale ships, we obtain the optimal sizes of multiscale anchors by K-means [51] clustering on the SAR dataset, and we improve the state-of-the-art feature pyramid network (FPN), BiFPN [52], with a channel attention module (CAM) to complete the fusion operations. Finally, we use a decoupled head structure to handle the ship classification and bounding box regression tasks separately. With these techniques, our experiments on the public SAR Ship Detection Dataset (SSDD) [53] show that ImYOLOv4 significantly improves the detection performance on multiscale ship targets against complex backgrounds.
The main contributions of this paper are as follows:
(1) A novel one-stage ship detector named ImYOLOv4, based on attention mechanism, is proposed, which meets the requirement for both high-accuracy and high-speed detection.
(2) We design an embedded TAM block to perform denoising, motivated by the complex backgrounds and strong backscatter encountered in SAR ship detection.
(3) We integrate the CAM block with the BiFPN module as the feature pyramid structure to better complete the fusion operations for the salient feature maps. The CAM block helps ImYOLOv4 pay more attention to the targets of interest, which ensures the effectiveness of detecting small ships.
(4) We replace the YOLO head with a decoupled head to handle the ship classification and bounding box regression tasks separately. The decoupled head is validated on a public SAR dataset, and the comparison results confirm that it improves detection performance.
The rest of this paper is organized as follows. Section 2 briefly reviews the related work that is close to our method. Section 3 introduces the framework of our proposed method in detail. The dataset and implementation settings are described in Section 4. A series of experiments and results are presented in Section 5. Finally, we summarize this paper in Section 6.

II. Related Work
Deep learning-based methods have made significant advances in the field of SAR ship detection, and researchers have introduced a range of methods to obtain better detection results. Ma [54] designs an Accelerated-YOLOv3 method, which aims to reduce the computational time while keeping relatively competitive detection accuracy by constructing a new architecture with fewer layers and channels. Chang [55] proposes an enhanced GPU-based deep learning method called YOLOv2-reduced to detect ships in SAR images, and the authors show that the method substantially improves detection performance. These models with fewer layers sacrifice accuracy to achieve a trade-off between detection accuracy and speed. To achieve accurate detection under poor image quality and complex backgrounds, some improvements have been proposed. Han [56] studies how detection performance varies across images with different complexity, backgrounds, surroundings, and quality. Fu [57] designs a fast ship detection method that consists of two cascaded deep convolutional networks, a scene classification network (SCN) and a single shot detector (SSD): the SCN quickly eliminates the sub-images that may not contain ships, and the remaining sub-images are then input into the SSD for refined ship target detection. Sun [58] introduces a category-position module based on attention mechanism to improve the positioning performance in complex scenes by generating guidance vectors. Wang [59] proposes a mask to guide attention maps, which performs well in the instance segmentation field. In ship detection, masks are used to enhance ship position information and to eliminate the influence of complex backgrounds.
These improvements usually bring a large amount of redundant information that greatly affects detection efficiency. Different from these related works, we design a lightweight embedded TAM based on attention mechanism to filter out the adverse effect of noise. To ensure the ability to detect multiscale ships, Lin [60] proposes a new network architecture based on Faster R-CNN that uses a squeeze-and-excitation mechanism to enhance the salient features of ship targets. Kang [61] presents a contextual region-based convolutional neural network with multilayer fusion; the framework fuses deep semantic and shallow high-resolution features, improving the detection performance for small ships. Sun [62] introduces a novel bi-directional feature fusion module into the YOLO framework to efficiently aggregate multiscale features, which helps in detecting multiscale ships. Cui [63] designs a feature pyramid network integrating a dense attention mechanism, so that the features extracted by the network contain rich resolution and semantic information, and the proposed method proves suitable for multiscale ship detection. A receptive pyramid network extraction strategy combined with attention mechanism has also proved effective in the ship detection task, but its processing efficiency is low due to the complex model structure [64]. Although CNN-based detection algorithms can automatically capture the features of ships, the detection performance of these existing methods still needs to be improved. In this paper, the proposed ImYOLOv4 integrates the CAM block with the BiFPN module as the feature pyramid structure to better complete the fusion operations for multiscale ship detection, and the salient feature maps will not make the deep CNN features disappear. The details of the ImYOLOv4 model are introduced in Section 3.

III. Methodology
The proposed method is described in detail in this section. First, the overall framework of ImYOLOv4 is introduced. Afterwards, the mechanism of each key module is explained. Finally, other strategies that proved effective for detection, such as K-means clustering for anchor boxes, are described.

A. Overall Framework
The overall scheme of the proposed method and the network architecture of ImYOLOv4 are illustrated in Figure 1. Firstly, the resized input image (taking 416 as an example) is sent into the TAM to perform denoising. Next, we adopt CSPDarknet53 [36] as the backbone to extract feature maps at three different branches. Then, the multiscale feature maps are fed into the FPN structure to obtain fused features. Specifically, the outputs (P3, P4, and P5) of CSPDarknet53 are passed to the ImBiFPN module to generate the corresponding salient feature maps (P3', P4', and P5'). In the ImBiFPN module, we apply up-sampling and down-sampling operations by a factor of 2 and merge the feature maps of the same spatial resolution via concatenation. Because different inputs should have different weights, we design the CAM_Concat Unit, which uses CAM to obtain a channel-wise coefficient tensor while concatenating. In the end, the decoupled head with two parallel branches is used to predict a 3D tensor encoding the bounding box, objectness, and class predictions. The whole detection pipeline of ImYOLOv4 is a single network, so it can be optimized end-to-end directly.

B. Thresholding Attention Module
Because of its unique imaging technique, the radar receives echo signals from the ground that include both ground clutter and the targets to be detected. As a coherent imaging system, SAR inevitably generates speckle noise from complex backgrounds, resulting in missed detections of weak ship targets. Besides, the metal materials and the superstructure of ships usually produce strong backscatter, which reshapes the appearance of ships in SAR images and interferes with the detection process. Figure 2(a) and 2(b) show these two kinds of noise, respectively. Considering the adverse effect of this noise, we design an embedded TAM block to perform denoising at the image level. In the TAM block, we integrate the thresholding algorithm and attention mechanism to automatically learn a set of thresholds that are used to set near-zero values to zero for signal reconstruction. Compared with traditional SAR feature enhancement methods, TAM does not require high expertise in signal processing, and its lightweight architecture has the additional advantage of lower computational complexity and memory consumption.
A SAR image obtained by a radar system can be decomposed, as in equation (1), into the considered scene X and a noise matrix N of the same size as X, which denotes the difference between the reconstructed image and the real scene. Considering the sparsity of SAR images, we can recover the considered scene by solving a sparsity-based optimization problem. This problem can be solved by an iterative thresholding algorithm; however, the number of iterations has a great impact on the sparsity and precision of the recovered scene. Inspired by the LeakyReLU [65] activation function, we instead use the thresholding function given by equation (3), where μ is the threshold used to filter the noise and α provides a non-zero gradient so that useful negative features are well preserved.
Figure 3 illustrates the detailed architecture of the TAM block, which is designed upon the transformation mapping between the input $X \in \mathbb{R}^{C \times H \times W}$ and its reconstructed feature map $\hat{X} \in \mathbb{R}^{C \times H \times W}$. We adopt the channel attention module to generate a channel-wise threshold tensor $\mu \in \mathbb{R}^{C \times 1 \times 1}$. Specifically, we first squeeze the input along the spatial dimension H × W by using both average pooling and max pooling to obtain two channel tensors in $\mathbb{R}^{C \times 1 \times 1}$. Then, we merge the two tensors via element-wise summation and forward the output s to a network that consists of two fully connected (FC) layers. To reduce the complexity of TAM, the activation size of the first FC layer is set to $\mathbb{R}^{C/r \times 1 \times 1}$, where r is the reduction ratio. A sigmoid function is employed at the end of the network as a simple gating mechanism to obtain a scaled output tensor z whose elements lie in (0, 1). Finally, to prevent the threshold from being negative or too large, we obtain μ by element-wise multiplication of the scaled tensor z and the global information tensor s. Therefore, the threshold is expressed as:
$$\mu = z \odot s$$
Figure 3. The architecture of TAM.
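The TAM computation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the exact form of the thresholding function is not reproduced in this text, so `leaky_soft_threshold` assumes a soft-threshold shape (zero inside [-μ, μ], shifted identity above μ, slope α below -μ), and the ReLU between the two FC layers, the bias-free weights `w1` and `w2`, and all function names are hypothetical.

```python
import math


def leaky_soft_threshold(x, mu, alpha=0.1):
    """Assumed thresholding shape: zero inside [-mu, mu], shifted identity
    above mu, and a leaky slope alpha below -mu (LeakyReLU-like)."""
    if x > mu:
        return x - mu
    if x < -mu:
        return alpha * (x + mu)
    return 0.0


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_thresholds(x, w1, w2):
    """x is a C x H x W nested list; w1 has shape (C/r) x C and w2 has shape
    C x (C/r). Returns one threshold per channel, mu = z * s."""
    s = []
    for channel in x:
        flat = [v for row in channel for v in row]
        # Global information s: average pooling plus max pooling.
        s.append(sum(flat) / len(flat) + max(flat))
    # First FC layer with reduction ratio r; the ReLU here is an assumption.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, s))) for row in w1]
    # Second FC layer followed by a sigmoid gate, so each z lies in (0, 1).
    z = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    return [zi * si for zi, si in zip(z, s)]


def tam(x, w1, w2, alpha=0.1):
    """Apply the learned channel-wise thresholds to every pixel."""
    mu = channel_thresholds(x, w1, w2)
    return [[[leaky_soft_threshold(v, m, alpha) for v in row] for row in channel]
            for channel, m in zip(x, mu)]
```

Because z lies in (0, 1), each threshold is a fraction of the pooled statistic s, so for a non-negative input it stays non-negative and bounded by s, which matches the stated motivation for multiplying z by s.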

C. Feature Pyramid Network
For deep learning-based detection methods, FPN [66] plays an important role in solving multiscale problems. It acts as a feature extractor that combines low-level, high-resolution features with high-level, low-resolution semantic features. In general, denser sampling captures more detailed features, while sparser sampling more clearly reflects the overall trend. Fusing features of different scales captures ample semantic information, which helps improve the accuracy of ship detection. After the multiscale feature maps are extracted by the CSPDarknet53 network, we forward them to the ImBiFPN structure to complete the fusion operations for salient feature maps. As depicted at the bottom left of Figure 1, there are two main data flows in ImBiFPN: the bottom-up down-sampling pathway and the top-down up-sampling pathway. The CAM_Concat Unit completes the fusion of features of the same spatial resolution. During concatenation, we apply the CAM block to automatically learn channel-wise attention coefficients, which denote the importance of the different inputs. As shown in Figure 4, we first squeeze the concatenated feature map along the spatial dimension H×W by using max pooling to focus on what is important in the given input. Then, two FC layers and a simple gating mechanism via a sigmoid function are employed to obtain the final channel attention map Xc. Finally, we also add a residual input to prevent the vanishing gradient problem. After element-wise multiplication and summation, we generate the refined output Xo of the CAM block:
$$X_o = X_c \otimes X + X$$
where X denotes the concatenated input feature map.
In summary, there are two differences between BiFPN and our ImBiFPN. The first is that the input of ImBiFPN is the 3-level multiscale feature maps obtained by the CSPDarknet53 network, while the input of BiFPN is 5-level features; the same goes for the outputs of both FPN structures. The second is that we design a weight generator that uses the CAM block to assign different importance to the inputs while concatenating. These improvements reduce the number of network parameters while maintaining the performance of BiFPN.
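As a rough illustration of the CAM block described above (max pooling, two FC layers, a sigmoid gate, then element-wise multiplication plus a residual input), the following plain-Python sketch operates on a C × H × W nested list. The ReLU between the FC layers, the absence of biases, and the names `cam`, `w1`, and `w2` are assumptions made for the example; the original layer details are not given in this text.

```python
import math


def cam(x, w1, w2):
    """x is a C x H x W nested list (the concatenated feature map);
    w1 has shape (C/r) x C and w2 has shape C x (C/r).
    Returns the refined map: attention-weighted input plus residual input."""
    # Squeeze each channel to a single value with global max pooling.
    pooled = [max(v for row in channel for v in row) for channel in x]
    # Two FC layers; the ReLU in between is an assumption of this sketch.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    # Sigmoid gate gives one attention coefficient in (0, 1) per channel.
    x_c = [1.0 / (1.0 + math.exp(-t)) for t in logits]
    # Element-wise multiplication followed by the residual summation.
    return [[[v * a + v for v in row] for row in channel]
            for channel, a in zip(x, x_c)]
```

With the residual term, each channel is scaled by a factor between 1 and 2, so the attention can emphasize channels without ever zeroing out a feature map.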

D. Decoupled Head
In object detection, the conflict between the classification and regression tasks is a well-known problem. In the YOLO head, the two tasks share almost the same parameters, which can hurt the detection process. Our design is inspired by the natural insight that, for one instance, the features in some salient areas may carry rich information for classification, while those around the boundary may be better suited to bounding box regression. Based on this observation, we design a decoupled head with two branches to handle the two objective functions in different spatial dimensions. As depicted at the bottom right of Figure 1, we first use a convolutional layer with kernel size 1×1 to perform dimension reduction. Then, in the upper branch, a two-layer fully connected network is employed to obtain the classification-specific output Cls. In the lower branch, two shared 3×3 convolutions and two 1×1 convolutions are used to obtain the regression-specific outputs Reg and Obj. Finally, the outputs of the two branches are merged into a tensor for ship prediction.

E. K-Means Clustering
The anchor box mechanism for object detection was proposed to solve the problem of multiple targets falling into one predicted box and has been used in many detectors. Our method uses 9 predefined anchor boxes for detection at different scales. K-means clustering is applied to the overall SSDD data to automatically find the prior boxes. Most ships in SAR images are small and weak targets, which occupy few pixels and have low contrast. If we use the standard Euclidean distance of the conventional K-means algorithm, larger bounding boxes generate more error than smaller ones, which leads to missed detections of small and sparse ships. What we want in the final detection are priors that lead to high intersection over union (IoU) scores; thus, the distance metric in this paper is expressed as:
$$d(\text{anchor box}, \text{cluster centroid}) = 1 - \mathrm{CIoU}(\text{anchor box}, \text{cluster centroid})$$
where d(anchor box, cluster centroid) is the new distance metric to be minimized, and CIoU(anchor box, cluster centroid) is the CIoU [67] value between the anchor box and each cluster centroid. The specific sizes of the anchor boxes for the three scales are shown in Table 1. The optimal cluster centroids obtained by K-means differ significantly from previous hand-picked anchor boxes and perform better in both precision and recall on SAR ship detection.
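A minimal sketch of anchor clustering with the 1 - CIoU distance is given below. It makes several assumptions that the text does not specify: boxes are compared by width and height only, as if they shared a common center (so the center-distance term of CIoU vanishes), centroids are initialized from evenly spaced boxes sorted by area, and centroids are updated with the mean width and height. The function names are hypothetical.

```python
import math


def ciou_wh(a, b):
    """CIoU of two boxes given as (width, height), assumed to share the same
    center, so only the IoU term and the aspect-ratio penalty remain."""
    (w1, h1), (w2, h2) = a, b
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / union
    v = (4.0 / math.pi ** 2) * (math.atan(w1 / h1) - math.atan(w2 / h2)) ** 2
    denom = 1.0 - iou + v
    weight = v / denom if denom > 0 else 0.0
    return iou - weight * v


def kmeans_anchors(boxes, k, iters=100):
    """Cluster (width, height) pairs using d = 1 - CIoU as the distance."""
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    # Deterministic initialization: evenly spaced boxes in area order.
    centroids = [ordered[int((i + 0.5) * step)] for i in range(k)]
    assignment = None
    for _ in range(iters):
        new_assignment = [
            min(range(k), key=lambda j: 1.0 - ciou_wh(box, centroids[j]))
            for box in boxes
        ]
        if new_assignment == assignment:
            break  # converged: no box changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [b for b, a in zip(boxes, assignment) if a == j]
            if members:
                centroids[j] = (sum(w for w, _ in members) / len(members),
                                sum(h for _, h in members) / len(members))
    return sorted(centroids, key=lambda c: c[0] * c[1])
```

For example, `kmeans_anchors([(10, 10), (12, 12), (11, 11), (50, 20), (52, 22), (48, 18)], k=2)` separates the small square boxes from the large elongated ones. Because the distance depends on overlap ratio rather than absolute size, small boxes are not dominated by large ones as they would be under Euclidean distance.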

A. Dataset
The dataset used in this paper is a SAR dataset for ship detection published by the Digital Earth Laboratory of the Aerospace Information Research Institute, Chinese Academy of Sciences. SSDD is generated from 102 Gaofen-3 [68] images and 108 Sentinel-1 [69] images. For Gaofen-3, the image resolutions include 3 m, 5 m, 8 m, and 10 m. The SSDD has 43819 ship chips and 59535 ship targets in total. Each chip is 256×256 pixels. The ship targets are annotated in a format similar to Pascal VOC [70]. The statistical distribution of ship sizes over the SSDD is presented in Table 2, where "Size", "Min" and "Max" mean ship pixels, minimum ship size and maximum ship size, respectively. "Number" represents the total number of ships, and "Percentage" denotes the share of all ship targets.
From Table 2 and Figure 5, we can see that the dataset has the following characteristics. Firstly, the chips contain multiscale SAR ships, and the range of ship sizes is large. Small and medium ships account for a large proportion of all targets. Secondly, the ship chips contain complex backgrounds: some ships are on the open sea, and some are in port. All of these make ship detection more difficult and place higher demands on detection performance. In the experiments, we randomly split the data into training, validation, and testing sets at a ratio of 7:2:1. The training and validation sets are used for training models, and the testing set is used for testing models.

B. Evaluation Metrics
In order to quantitatively evaluate the detection performance of ImYOLOv4, we adopt four widely used criteria, namely precision, recall, mAP (mean Average Precision), and F1 score. Precision measures the fraction of detections that are true positives, and recall measures the fraction of ground truths that are correctly detected:
$$\text{precision} = \frac{TP}{TP + FP} \tag{7}$$
$$\text{recall} = \frac{TP}{TP + FN} \tag{8}$$
where TP, FP and FN represent the number of true positives, false positives and false negatives, respectively. For detection, both a higher precision and a higher recall are expected. However, the two metrics trade off against each other: raising precision typically lowers recall, and raising recall typically lowers precision. The F1 score is therefore used to combine precision and recall; a higher F1 score indicates better detection performance. The F1 score is defined as the harmonic mean of precision and recall:
$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
Precision, recall and F1 score are all calculated at a single threshold. AP overcomes this limitation and gives an indicator that reflects global performance. AP is obtained by integrating precision over the interval from recall=0 to recall=1, that is, the area under the precision-recall (PR) curve.
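The metrics above can be computed as in the following sketch. The precision, recall, and F1 formulas follow the definitions in the text; the AP routine is one common way to approximate the area under the PR curve (detections sorted by confidence, with a monotone precision envelope), which is an assumption here since the exact interpolation scheme is not stated. The function and argument names are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def average_precision(scores, is_true_positive, num_ground_truths):
    """Approximate area under the precision-recall curve.

    scores: confidence of each detection.
    is_true_positive: whether each detection matched a ground truth
    at the chosen IoU threshold (e.g. 0.5 for AP50).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truths)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap
```

For instance, with 8 true positives, 2 false positives, and 2 false negatives, precision, recall, and F1 are all 0.8. Unlike these single-threshold values, `average_precision` sweeps over every confidence cutoff, which is why AP reflects global performance.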

C. Implementation Settings
All experiments are implemented using the deep learning framework PyTorch and executed on a PC with a TITAN XP GPU (11G memory) running Ubuntu 16.04. At the beginning of network training, we use parameters pre-trained on ImageNet to initialize the network. Then, we use an end-to-end training strategy to train our model, in which the gradient descent algorithm is used to fine-tune the network weights. The weight decay and momentum are set to 0.0001 and 0.9, respectively. The reduction ratio r and the parameter α used for gradient preservation in the TAM block are set to 16 and 0.1, respectively, as explained in the following experiments. The Smooth-L1 [36] loss function is applied to calculate classification loss, and a total of 2k iterations are performed to train our ImYOLOv4 model.

A. Performance of TAM
In this section, we first examine the impact of the parameters r and α and select the best combination for the TAM module. The parameter r is designed to decrease the computational complexity of the fully connected layers, and α guarantees that most neurons will not die during the training process. We measure AP50 (IoU=0.5) and AP75 (IoU=0.75) for different parameter values and list the results in Table 3 and Table 4. As the results show, introducing the parameters improves both AP50 and AP75 compared with the case of r = 1 and α = 0, and the combination of r = 16 and α = 0.1 obtains the best detection precision. The reduction ratio r avoids, to a certain extent, the overfitting caused by too many training parameters, and α expands the values of the activation function in the region below the threshold -μ, which suggests that avoiding dead neurons is more important than obtaining sparsity. To verify the effectiveness of TAM, we conduct experiments comparing the detection performance of YOLOv4, ImYOLOv4 without TAM (DeTImYOLOv4), and ImYOLOv4. For a fair comparison, we keep the other hyperparameters consistent across the experiments. The results are displayed in Table 5. As the results show, adding the TAM block brings increments of 3.18%, 0.05, 2.29% and 8.15% in AP50, F1 score, precision and recall over DeTImYOLOv4, and outperforms YOLOv4 by 0.47%, 0.01, 2.00% and 1.00% in AP50, F1 score, precision and recall, respectively. When IoU is set to 0.75, adding the TAM block brings increments of 9.49%, 0.05, 7.60% and 3.47% in AP75, F1 score, precision and recall over DeTImYOLOv4, and outperforms YOLOv4 by 7.77%, 0.03, 6.08% and 0.81% in AP75, F1 score, precision and recall, respectively. We also present some denoising results of ImYOLOv4 to further demonstrate the validity of TAM.
We visualize the spatial response of the input and output feature maps of the TAM block as heatmaps, where blue denotes a low spatial response and red indicates a high response. We resize the heatmaps to the size of the SAR image, and the results are shown in Figure 6. By comparing Figure 6(b), (e) with Figure 6(c), (f), we can see that the complex background triggers a very low response and that the irrelevant information brought by the background is effectively suppressed by TAM. With the noise suppressed, ImYOLOv4 can focus on and extract more discriminative target features, which is very helpful for ship detection.
In addition, as shown in Table 5, we also compare our TAM with some state-of-the-art attention modules, such as ECA [71], BAM [72] and CBAM [73]. We replace the TAM block with each attention module while keeping the other subnets consistent with ImYOLOv4. The results show that TAM and ECA obtain better performance than the other two modules, mainly because BAM and CBAM were proposed for optical images, and irrelevant spatial features would be falsely enhanced for SAR images. The TAM block can adaptively learn channel-wise thresholds from the global information of the image, and the experimental results demonstrate its suitability for the SAR ship detection task.

B. Performance of FPN
We also conduct an experiment to validate the performance of the FPN. The FPN from YOLOv3 (YOLOv3FPN), PANet [36], and BiFPN are each embedded into ImYOLOv4 as substitutes for our FPN. YOLOv3FPN simply contains an up-sampling pathway for fusing features at different resolutions. PANet was originally applied in the field of image segmentation and adds a down-sampling pathway on top of YOLOv3FPN. BiFPN introduces a weighted feature fusion strategy to better balance the feature information of different resolutions. The comparison results are listed in Table 6. As Table 6 shows, different feature fusion methods yield different detection performance, and our FPN and BiFPN achieve better performance in salient feature extraction, which contributes to ship detection. Apart from precision, we also evaluate the models by running speed. Unlike BiFPN, our FPN uses the CAM block as the weight generator, and this improvement gives our FPN a better trade-off between accuracy and efficiency.

C. Performance of Decoupled Head
In this part of the experiments, we design several variants of the decoupled head and compare them with the YOLO head baseline. The variants are described as follows:
1) YOLO-Head (baseline): The coupled head is widely used in YOLO series detectors; the classification and regression tasks are solved by a single network.
2) Decoupled-Head (ours): The head splits classification and regression across a fully connected head and a convolution head, respectively.
3) Decoupled-Conv-FC-Head: The head splits classification and regression across a convolution head and a fully connected head, respectively.
4) Decoupled-FC-Head: Double fully connected heads, which have the same structure as the upper branch of our Decoupled-Head.
5) Decoupled-Conv-Head: Double convolutional heads, which have the same structure as the lower branch of our Decoupled-Head.
The comparison results for the variants are listed in Table 7. From the results, we can observe that the decoupled head performs better than the single-network baseline for ship detection. This is mainly because classification and regression focus on different problems, and using different branches for different tasks is conducive to improving performance. This observation motivates us to rethink the architecture of the decoupled head. By comparing the variants of the decoupled head, we can conclude that the fully connected head is more suitable for classification, while the convolutional head has the advantage in regression.

D. Comparison with State-of-the-Art Methods
In this section, we compare our ImYOLOv4 model with several state-of-the-art object detection models on SSDD, including RetinaNet, CenterNet [74], YOLOv3, YOLOv4, and Faster-RCNN. The experimental results are displayed in Table 8, and Figure 7 shows the precision-recall curves of all the detectors. As reflected by Figure 7, the precision-recall curve of our method lies above those of the state-of-the-art methods, which further shows the superiority of ImYOLOv4 over the others. In terms of running speed, our ImYOLOv4 is slower than CenterNet, YOLOv3, and YOLOv4 with 42 fps, but it is faster than RetinaNet and Faster-RCNN. In short, ImYOLOv4 achieves a better trade-off between detection accuracy and running speed, and we believe that the efficiency and simplicity of our method will benefit ship detection applications in future research.
To further demonstrate the effectiveness of ImYOLOv4 in multiscale ship detection, we divide SSDD into three sub-datasets according to Table 2 and calculate the evaluation metrics APL, APM, and APS for large, medium, and small objects, respectively. From the results shown in Table 8, we find that the models present different detection abilities for multiscale ships. This is mainly because the ships in SSDD have relatively extreme aspect ratios and, as the network deepens, the features of ships, especially small ones, become weak, so detection accuracy is hard to guarantee. Moreover, to achieve better performance, the models should take into account the effect of complex backgrounds and noise. We embed the TAM block to perform denoising and design the FPN structure to extract salient feature maps of small ships, which ensures effective detection of small ships against complex backgrounds.
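As a rough illustration of the evaluation described above, the sketch below computes an average precision from ranked detections and assigns ships to size buckets. It is a minimal sketch under assumptions: the area thresholds follow the common COCO convention (32 x 32 and 96 x 96 pixels) and may differ from those in Table 2, the matching of detections to ground truth at a given IoU threshold is assumed to be done already, and the function names are hypothetical.

```python
def average_precision(detections, num_gt):
    """Area under the interpolated precision-recall curve.

    detections: list of (score, is_true_positive) pairs for one size bucket.
    num_gt:     number of ground-truth ships in that bucket.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def size_bucket(width, height, small_max=32 * 32, medium_max=96 * 96):
    """Assign a box to 'small', 'medium', or 'large' by pixel area (assumed thresholds)."""
    area = width * height
    if area < small_max:
        return "small"
    if area < medium_max:
        return "medium"
    return "large"

# Three ranked detections against two ground-truth ships.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
print(round(ap, 4), size_bucket(20, 10), size_bucket(50, 50), size_bucket(100, 100))
```

Computing `average_precision` separately on the detections and ground truths of each bucket gives per-size scores analogous to APS, APM, and APL.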

E. Analysis on Missing Ships and False Alarms
To illustrate the detection performance of ImYOLOv4, we test it on some typical SAR images, and the detection results are displayed in Figure 8. The environmental conditions include calm sea, sea with waves, inshore land, backscatter noise, and small ship clusters. Rectangles of different colors represent different detection results: green, red, blue, and yellow rectangles denote the ground truth, detected targets, false alarms, and missing targets, respectively. In Figure 8, (a) is the original SAR image and (b) represents the ground truth; (c)-(h) show the detection results of RetinaNet, Faster-RCNN, CenterNet, YOLOv3, YOLOv4, and ImYOLOv4, respectively. It is clear that our ImYOLOv4 model can distinguish the ship targets better than the state-of-the-art methods, even under the interference of complicated conditions. Although our method achieves excellent performance on SSDD, a few missing ships and false alarms still exist. As shown in the first and third columns of row (h), a non-ship object is recognized as a ship target due to similar features, and some ships are detected as one target because they are close to each other. For missing ships, Soft-NMS [75] may improve the performance by adjusting the scores of neighboring detection boxes so that close targets are not eliminated in the process. For false alarms, a sea-land semantic segmentation method [76] could serve as a supplementary image preprocessing step that helps reduce them.
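The score-adjusting suppression mentioned above can be sketched as follows. This is a minimal sketch assuming the Gaussian score decay used in Soft-NMS; the `sigma` and `score_thresh` values and the function names are illustrative assumptions, not settings from this paper.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of deleting them.

    Returns the kept (box, score) pairs in selection order.
    """
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Pick the highest-scoring remaining box.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best)
        kept.append((best_box, best_score))
        # Decay the remaining scores according to their overlap with it.
        rescored = []
        for box, score in candidates:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_thresh:
                rescored.append((box, score))
        candidates = rescored
    return kept

# Two closely spaced ships plus one isolated ship.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (100, 100, 110, 110)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
print([(box, round(score, 3)) for box, score in kept])
```

With a hard IoU threshold of 0.5, the second box (IoU of about 0.82 with the first) would be removed outright; here it survives with a reduced score, which is the behavior that helps when ships lie close together.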

VI. CONCLUSION
In this paper, we propose a one-stage ship detector named improved YOLOv4 (ImYOLOv4), based on the attention mechanism, for accurate ship detection in SAR images. First, to achieve high ship detection accuracy, we adopt YOLOv4 as the basic framework and apply CSPDarknet53 to extract multiscale feature maps. Then, the TAM module is designed based on the attention mechanism to enhance the representational power of the network through dynamic feature denoising and recalibration. In addition, we construct a new FPN structure that combines meaningful semantic information to address the problem of multiscale ship detection. Finally, we design a decoupled head with two branches to resolve the conflict between the classification and regression tasks. Extensive experimental results demonstrate that ImYOLOv4 achieves promising performance in detecting ships in SAR images while maintaining a fast speed. We hope this work will be useful to researchers in future studies.