A Survey on Deep-Learning-Based Real-Time SAR Ship Detection

Recently, deep learning has greatly promoted the development of synthetic aperture radar (SAR) ship detection. However, the detectors are usually heavy and computation intensive, which hinders their use on edge devices. To solve this problem, many lightweight networks and acceleration ideas have been proposed. In this survey, we review the papers on real-time SAR ship detection. We first introduce the model compression and acceleration methods: pruning, quantization, knowledge distillation, low-rank factorization, lightweight networks, and model deployment. They are the source of innovation in real-time SAR ship detection. Then, we summarize the real-time object detection methods: two-stage, single-stage, anchor-free, trained-from-scratch, and model compression and acceleration methods. Researchers in SAR ship detection usually learn from these ideas. We then devote most of the article to reviewing 70 real-time SAR ship detection papers. The years, datasets, journals, deep-learning frameworks, and hardware are introduced first. After that, 10 public datasets and the evaluation metrics are presented. Then, we survey the 70 papers according to the categories of anchor free, trained from scratch, YOLO series, constant false alarm rate + convolutional neural network, lightweight backbone, pruning, quantization, knowledge distillation, and hardware deployment. The experimental results show that the algorithms have improved greatly in speed and accuracy. Finally, we point out the problems of the 70 papers and the directions to be studied in the future. This article can enable researchers to quickly understand the research status in this field.


I. INTRODUCTION
SYNTHETIC aperture radar (SAR) is an all-day and all-weather sensor that is widely used in military, agricultural, geological, ecological, marine, and other fields. SAR ship detection is an important tool for marine monitoring in illegal fishing, oil spill detection, and maritime traffic management [1], [2], [3].
The traditional detection methods are based on constant false alarm rate (CFAR). CFAR first models the clutter and then obtains the threshold according to the false alarm rate. The pixels above the threshold are regarded as ship pixels, and those below the threshold are regarded as background. The performance of this method largely depends on the statistical modeling of sea clutter and the parameter estimation of the selected model. In general, when the scene is simple, the CFAR method can achieve good results. However, for complex offshore scenarios, it produces more false positives and performs poorly [4]. After CFAR, a discrimination step is used to classify ship and nonship areas. The features used include length, width, HOG, SURF, and LBP, and the classifiers used include SVM and MLP. However, since the revival of deep learning in 2012 [5], these methods have shown a disadvantage in accuracy. The deep-learning method works in an end-to-end manner: it does not need to optimize multiple steps individually but optimizes the whole detection system simultaneously. It can adapt to various complex scenes and has strong robustness [6], [7], [8]. Because of these advantages, SAR researchers began to use these methods.
The authors are with the Naval Submarine Academy, Qingdao 266000, China (e-mail: lgm_jw@163.com; grass2009@163.com; hagooddemon@163.com; qdqyzt@163.com; yulu_china@163.com; cheng.chihhu@163.com).
Digital Object Identifier 10.1109/JSTARS.2023.3244616
The starting point was the emergence of SSDD [9]. Based on it, researchers can train and test their deep-learning-based detectors and compare their algorithms with each other under the same evaluation metrics. Since then, a large number of datasets and algorithms have been proposed, further promoting the development of this field. Li et al. [10] reviewed these achievements comprehensively.
The above deep-learning-based detectors all rely on the strong feature extraction ability of convolutional neural networks (CNNs). To achieve good performance, most CNNs are deep and wide, which leads to high computational complexity and memory consumption. For example, AlexNet has 60 million parameters with five convolutional layers and three fully connected layers [11]. It is difficult to deploy such large models on edge devices with limited computing resources and memory. To realize real-time SAR ship detection, researchers usually take three approaches: lightweight network design, model compression and acceleration, and hardware deployment, as shown in Fig. 1 [12].
The contributions of this article are as follows. First, we summarize the model compression methods, acceleration methods, and real-time object detection methods that are often used in real-time SAR ship detection. Second, the datasets and evaluation metrics that are commonly used are systematically summarized. Third, 70 papers related to real-time SAR ship detection are divided into 7 categories and reviewed thoroughly. Finally, future directions are given for subsequent work. As far as we know, this is the first comprehensive survey in this field.
The rest of this review is arranged as shown in Fig. 2. Section II briefly analyzes work related to this article. Section III summarizes the model compression and acceleration methods, mainly pruning, quantization, knowledge distillation, low-rank factorization, lightweight networks, and model deployment. Section IV introduces the real-time detection methods, mainly two-stage, single-stage, anchor-free, trained-from-scratch, and compression and acceleration methods. Section V is the main content of the article. We first introduce the general situation of the 70 papers. Then the datasets, evaluation metrics, anchor free, trained from scratch, YOLO series, CFAR+CNN, lightweight backbone, pruning, quantization and knowledge distillation, and hardware deployment are analyzed in turn. Section VI discusses the problems and future directions of this field. Section VII concludes the article.

II. RELATED WORK
As far as we know, there is no comprehensive survey of real-time SAR ship detection. This is mainly because the field is relatively new and has few achievements, which makes it difficult to carry out a review. At present, the review papers related to this article are shown in Table I. Mao et al. [83], Stefanowicz et al. [84], Zhang et al. [85], and Li et al. [86] summarize SAR ship detection based on deep learning. Mao et al. [83] compared the classical deep-learning-based detectors on SSDD, providing a benchmark for researchers in this area. In essence, it is not a survey of papers. Stefanowicz et al. [84] surveyed the papers on SAR ship detection from 2015 to 2020, covering the CFAR, CNN, GLRT, feature extraction, weighted information entropy, and variational Bayesian inference methods. Deep-learning-based SAR ship detection is not studied as thoroughly there as in other papers. Zhang et al. [85] provided an official version of SSDD. It described the drawbacks and successes of SSDD, provided three versions of the dataset, and set out seven standards for using the official SSDD. However, it only reviewed the papers related to SSDD; the papers using other datasets were not covered. Li et al. [86] completed the first comprehensive survey of SAR ship detection. It analyzed the past, present, and future of this area through 177 published papers. It can help researchers better understand these algorithms and encourage more researchers to enter this field. However, it introduced real-time SAR ship detection only briefly, which is not sufficient now that real-time SAR ship detection is becoming an increasingly hot topic.
Jiao et al. [87], Liu et al. [88], and Jiao et al. [89] summarize the object detection papers in computer vision and include comprehensive summaries of the real-time object detection algorithms. Real-time SAR ship detection can learn from these. Cheng et al. [90], Goel et al. [91], Mishra et al. [92], and Zhang et al. [93] review the compression and acceleration of deep-learning-based computer vision. These achievements point out the direction for realizing real-time SAR ship detection.
In short, this article is different from the above surveys. It is the first comprehensive review of real-time SAR ship detection.

III. MODEL COMPRESSION AND ACCELERATION METHODS
CNN-based deep-learning models have shown great advantages in computer vision tasks, especially object detection. However, CNNs are heavy in computational cost and memory storage, which hinders their use in some real-time applications, for example, processing on the edge (airplane and satellite). Many methods have therefore been proposed to compress and accelerate CNN models. They can be divided into six categories: parameter pruning, quantization, knowledge distillation, low-rank factorization, lightweight networks, and model deployment, as shown in Fig. 3. These methods can be used alone or in combination.
In modern CNNs, the fully connected layers have gradually disappeared, and the convolutional layers account for most of the storage and computation. Therefore, the main targets of the above model compression and acceleration methods are the convolutional layers. Achieving the goal generally takes several steps. For example, Han et al. [94] proposed a method to compress CNNs in three steps: pruning, quantization, and Huffman encoding. With this method, AlexNet could be compressed by 35× without a drop in accuracy, as shown in Fig. 4.

A. Pruning
Generally, trained CNNs are overparameterized and have millions or even billions of parameters. For example, ResNet-50 needs 95 MB of memory for storage and over 3.8 billion floating-point multiplications to process an image. Many of the weights or neurons are redundant, that is, unimportant or unnecessary [95], [96], [97]. Therefore, to compress a CNN, we can prune the weights and neurons that are less important. After pruning, we obtain a smaller network that still works as usual but requires fewer parameters and less computation time, as shown in Fig. 5. The importance of a weight can be measured by its L1 norm or L2 norm. The importance of a neuron can be measured by the number of times its output was nonzero after training. After ranking all weights and neurons by importance and removing the unimportant ones, we obtain a smaller network.
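As a minimal sketch of the magnitude criterion described above, the following pure-Python function zeroes the fraction of weights with the smallest absolute value. The function name `prune_by_magnitude`, the `sparsity` argument, and the toy weight values are illustrative assumptions and are not taken from any of the surveyed papers.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest |w|.

    `weights` is a flat list of floats. Returns (pruned_weights, mask).
    Hypothetical helper, for illustration only.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Rank weight indices by importance (absolute value), least important first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    mask = [0 if i in pruned else 1 for i in range(len(weights))]
    return [w * m for w, m in zip(weights, mask)], mask


w = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]   # toy weights (assumed values)
pruned_w, mask = prune_by_magnitude(w, sparsity=0.5)
print(mask)  # [1, 0, 1, 0, 1, 0]
```

In practice the mask would be kept fixed while the surviving weights are retrained, which corresponds to the retraining step discussed in this section.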
Generally, the classification or detection accuracy decreases after pruning, so the CNN model should be retrained to recover the accuracy. To maintain good accuracy, we should not prune too much at once [98]; instead, the network should be pruned gradually, as shown in Fig. 6.
Pruning will increase the sparsity of parameters, thereby reducing storage and computing. According to different pruning objectives, pruning can be divided into fine-grained pruning, vector-level pruning, kernel-level pruning, group-level pruning, and filter-level pruning [99].
Fine-grained pruning methods remove parameters in an unstructured manner [100]. The parameters of CNNs can be pruned thoroughly and become very sparse, so these methods achieve a better compression ratio. However, because the resulting CNNs are unstructured, they contain many fragmented operations that are hard to accelerate on hardware such as GPUs. In other words, although there are fewer parameters, the computation is not necessarily faster. More attention should therefore be paid to structured pruning. Vector-level methods prune vectors in the convolutional kernels. Kernel-level pruning methods prune convolutional kernels in the filters. They are seldom used, as most pruning methods focus on fine-grained pruning or filter-level pruning. Group-level pruning methods prune the parameters according to a sparse pattern on the filters. In this way, convolutions can be implemented as multiplications of thinned dense matrices, so BLAS can be used to achieve a higher speed-up. Filter-level pruning methods prune the filters or the channels, which makes CNNs much lighter. After filter-level pruning, the number of input channels of the next layer is also smaller. Filter-level pruning methods are structured, involve fewer fragmented operations, and are easy to process on hardware (CPU or GPU). Filter-level pruning methods are therefore the most efficient for compressing CNNs.
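The effect of filter-level pruning on the following layer can be sketched as follows. This is a toy example under stated assumptions: layers are stored as nested Python lists, filters are ranked by L1 norm, and the helper names `l1_norm` and `prune_filters` are hypothetical.

```python
def l1_norm(filt):
    """Sum of absolute values over all weights of one filter ([in_ch][k*k])."""
    return sum(abs(v) for channel in filt for v in channel)


def prune_filters(layer, next_layer, keep):
    """Keep the `keep` filters of `layer` that have the largest L1 norm.

    layer:      list of filters, each shaped [in_ch][k*k]
    next_layer: list of filters, each shaped [len(layer)][k*k]
    Returns the two thinned layers, which are still dense.
    """
    ranked = sorted(range(len(layer)), key=lambda i: l1_norm(layer[i]), reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original channel order
    new_layer = [layer[i] for i in kept]
    # Removing filter i removes output channel i, so every filter of the
    # next layer loses the matching input channel.
    new_next = [[f[i] for i in kept] for f in next_layer]
    return new_layer, new_next


# Toy layer: 3 filters, 1 input channel, 1x1 kernels (assumed values).
conv1 = [[[0.9]], [[0.01]], [[-0.5]]]
# Next layer: 2 filters, 3 input channels, 1x1 kernels.
conv2 = [[[1.0], [2.0], [3.0]], [[4.0], [5.0], [6.0]]]
p1, p2 = prune_filters(conv1, conv2, keep=2)
print(len(p1), len(p2[0]))  # 2 2
```

Because both layers remain dense after pruning, no sparse data structure is needed, which is why this form of pruning maps well onto general-purpose hardware.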
The pruning process is tedious, as CNNs must be pruned and retrained several times. Fine-grained pruning requires custom hardware or special data structures for sparse matrices, whereas filter-level pruning has no such requirement and can be used on general-purpose processors. Pruning is usually combined with quantization and encoding to further accelerate CNNs.

B. Quantization
Network quantization compresses CNNs by using fewer bits to represent each weight. It can significantly reduce memory and computation with little loss of accuracy [101], [102], [103], [104].
Generally, the weights in CNNs are saved as 32-bit floating-point numbers. The number of bits, however, has only a weak impact on the accuracy of deep-learning tasks, so we can use fewer bits to represent the weights. This process is called quantization. It reduces the computation and memory size. For example, if we use 16 bits rather than 32 bits to store a parameter, the model size of the network is halved. The weights can be quantized to 16 bits, 8 bits, 4 bits, or even 1 bit [105]. Quantization can also be applied to gradients and activations. Gradient quantization can accelerate the training stage, whereas weight and activation quantization can accelerate the inference stage.
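To make the bit-width reduction concrete, the following sketch applies symmetric uniform quantization to a small list of weights. The scheme (a single per-tensor scale, signed integers) is one common choice and is an assumption here; the function names and weight values are hypothetical.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


w = [0.5, -1.27, 0.254, 0.0]                # toy weights (assumed values)
q, scale = quantize_symmetric(w, num_bits=8)
w_hat = dequantize(q, scale)
print(q)  # [50, -127, 25, 0]
```

Each weight is now stored in 8 bits instead of 32, and the reconstruction error of any weight is at most half a quantization step (`scale / 2`).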
We can also cluster the weights so that all the weights falling into the same cluster share one value. When storing the network, the real values do not need to be saved. We only need to record the cluster ID of each weight and use the mean of the weights in each cluster to represent their real values [106]. This compresses the network but decreases the performance.
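Weight sharing by clustering can be sketched with a one-dimensional k-means, shown below. The linear initialisation of the centroids, the fixed number of iterations, and the helper name `kmeans_1d` are assumptions made for illustration.

```python
def kmeans_1d(values, k, iters=20):
    """Cluster scalar weights; return (centroids, cluster_id_per_weight)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    lo, hi = min(values), max(values)
    # Linear initialisation across the value range (an assumed choice).
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def assign():
        return [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]

    for _ in range(iters):
        ids = assign()
        for c in range(k):
            members = [v for v, i in zip(values, ids) if i == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = sum(members) / len(members)
    return centroids, assign()


w = [-1.0, -0.9, 0.0, 0.1, 1.0, 1.1]        # toy weights (assumed values)
codebook, ids = kmeans_1d(w, k=3)
shared = [codebook[i] for i in ids]          # weights as seen at inference
print(ids)  # [0, 0, 1, 1, 2, 2]
```

Only the cluster IDs and the small codebook of cluster means need to be stored, instead of one 32-bit float per weight.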
The parameters can be represented by a codebook and a set of quantization codes [107]. Frequent clusters are represented by fewer bits and rare clusters by more bits. Huffman encoding, for example, represents common tokens with fewer bits and rare tokens with more bits. In this way, the network can be compressed further. It can achieve a 4-6× speed-up and a 15-20× compression ratio with little accuracy loss.
These methods can also be categorized into two main groups: quantization after training and quantization during training. The former is used to reduce the inference time and save energy. The latter is used to reduce the network size and make the training process more computationally efficient.
On custom hardware, quantization allows shift or XNOR operations to replace multiply-accumulate operations [108], thus reducing energy consumption. However, the CNNs need to be trained several times, which makes the training process tedious.

C. Knowledge Distillation
Hinton et al. first put forward the concept of knowledge distillation in "Distilling the Knowledge in a Neural Network," introducing the soft targets of the teacher to guide the training of the student network. Knowledge distillation is classified into three categories: logits transfer, teacher assistant, and domain adaptation [109].
Generally speaking, the teacher model has strong ability and performance, while the student model is compact. The knowledge distillation methods transfer the generalization ability of the teacher model to the compact student model to improve its performance with less complexity. The basic idea of knowledge distillation is to transfer the dark knowledge in the complex teacher model to the simple student model. These methods match or outperform the teacher's performance, while requiring notably fewer parameters and multiplications [110], [111], [112], as shown in Fig. 7.
The parameter T represents temperature. Generally, T is 1. When T is larger, a softer probability distribution is obtained. There are two loss functions. The first requires that the student model and the teacher model use the same T when computing the softmax layer. The second requires the student model's T to be set to 1. The overall loss function is the weighted average of the two objective functions. Soft predictions carry more useful information than hard predictions. Knowledge distillation can yield a lightweight CNN model with high accuracy [113].
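The softmax function is formulated as follows:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \tag{1}$$

where $z_i$ is the logit for class $i$, $q_i$ is the corresponding output probability, and $T$ is the temperature.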
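For illustration, the temperature-scaled softmax and the weighted distillation loss described in this section can be sketched in Python. The weighting factor `alpha`, the default temperature of 4, and the logit values are assumptions chosen for the example, not settings from the surveyed papers.

```python
import math


def softmax_t(logits, T=1.0):
    """Softmax with temperature T; a larger T gives a softer distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """Weighted average of a soft-target loss and a hard-label loss.

    Soft term: cross-entropy between teacher and student, both at temperature T.
    Hard term: cross-entropy of the student with the true label at T = 1.
    Hinton et al. additionally scale the soft term by T**2; omitted for brevity.
    """
    p_teacher = softmax_t(teacher_logits, T)
    p_student = softmax_t(student_logits, T)
    soft = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    hard = -math.log(softmax_t(student_logits, 1.0)[label])
    return alpha * soft + (1.0 - alpha) * hard


teacher = [2.0, 1.0, 0.1]                    # toy logits (assumed values)
student = [1.0, 0.5, 0.2]
p1 = softmax_t(teacher, T=1.0)
p4 = softmax_t(teacher, T=4.0)
loss = distillation_loss(student, teacher, label=0)
```

Comparing `p1` and `p4` shows the softening effect: the largest probability shrinks and the smaller ones grow as T increases, which is how the teacher exposes the relative similarity between classes.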

D. Low-Rank Factorization
Low-rank factorization is a straightforward way to compress and accelerate models. It is based on the fact that the weight vectors are mainly distributed in a few low-rank subspaces, so a small number of bases can be used to reconstruct the weight matrix. Low-rank decomposition factorizes multidimensional tensors (in convolutional and fully connected layers) into smaller matrices to eliminate redundant computation. For example, we can decompose a K×K convolution into two separable convolutions of size 1×K and K×1. In this way, we can remove redundancy and reduce the number of weight parameters. Low-rank factorization can reduce the computation cost in CNNs. It can be used in both convolutional layers and fully connected layers, and it causes only a small accuracy loss [114].
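The separable case can be checked numerically. The sketch below assumes a rank-1 3×3 kernel (here a Sobel-like kernel built as the outer product of a column and a row vector) and verifies that applying the K×1 and 1×K kernels in sequence gives the same result as the full K×K kernel. The helper `conv2d_valid` and the test image are illustrative assumptions.

```python
def conv2d_valid(img, ker):
    """Plain 'valid' 2-D cross-correlation on nested lists."""
    kh, kw = len(ker), len(ker[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + a][j + b] * ker[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)]
            for i in range(oh)]


col = [[1.0], [2.0], [1.0]]                  # K x 1 kernel
row = [[1.0, 0.0, -1.0]]                     # 1 x K kernel
# Rank-1 K x K kernel: outer product of the column and the row.
full = [[c[0] * r for r in row[0]] for c in col]

# Small integer-valued test image (assumed values).
img = [[float((3 * i + 5 * j) % 7) for j in range(6)] for i in range(5)]

direct = conv2d_valid(img, full)                         # K*K = 9 weights
separable = conv2d_valid(conv2d_valid(img, col), row)    # 2*K = 6 weights
print(direct == separable)  # True
```

The equality is exact only for rank-1 kernels; for a general kernel, the decomposition is an approximation whose error depends on how much energy lies outside the retained components.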
Convolution operations account for most of the computation in CNNs, so the inference process can be made faster by decomposing the convolutional layers. A convolutional kernel can be represented as a 4-D tensor of size w×h×c×n, where w, h, c, and n are the kernel width, kernel height, number of input channels, and number of output channels, respectively. Methods based on tensor decomposition derive from the intuition that there is a significant amount of redundancy in the 4-D tensor, which makes decomposition a particularly promising way to remove it. Based on how many components the filters are decomposed into, the low-rank methods can be divided into three categories: two-component decomposition, three-component decomposition, and four-component decomposition [115].
The fully connected layers can be viewed as 2-D matrices, and they contain around 89% of the parameters in CNNs such as AlexNet. Low-rank factorization can also be applied to the fully connected layers, which makes the model storage-friendly [116].
Singular value decomposition (SVD) is a common and popular factorization scheme for reducing the number of parameters. Other commonly used factorization techniques include canonical polyadic decomposition, batch normalization decomposition, and Tucker-2 decomposition [117].
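The storage saving from a rank-r factorization of a fully connected layer is simple arithmetic: an m×n weight matrix is replaced by an m×r and an r×n factor. The sketch below assumes a 9216×4096 layer (the shape commonly quoted for the first fully connected layer of AlexNet) and a rank of 256; both numbers are illustrative assumptions.

```python
def dense_params(m, n):
    """Number of weights in a full m x n matrix."""
    return m * n


def low_rank_params(m, n, r):
    """Number of weights after factorizing into (m x r) and (r x n)."""
    return r * (m + n)


m, n, r = 9216, 4096, 256        # assumed layer shape and rank
full = dense_params(m, n)
low = low_rank_params(m, n, r)
print(full, low, round(full / low, 2))  # 37748736 3407872 11.08
```

The factorization saves storage only when r is smaller than m·n / (m + n); the accuracy cost depends on how quickly the singular values of the trained matrix decay.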
Low-rank factorization is performed layer by layer. After the parameters of one layer are factorized, subsequent layers are factorized based on some reconstruction error. It is difficult to use on deep CNNs because the search space of the decomposition hyperparameters becomes huge as the depth increases.

E. Lightweight Networks
Since AlexNet won first place in ILSVRC 2012, deep CNNs have enjoyed another revival. Many CNNs with excellent performance have been proposed, for example, VGGNet [118], Inception [119], ResNet [120], ResNeXt [121], and DenseNet [122]. These CNNs mainly consider accuracy rather than speed, so they are computation intensive and very large in model size.
In order to deploy them on edge devices, a lot of ideas about designing lightweight CNNs have been proposed as shown in Fig. 8. The development of lightweight CNN also promotes the progress of real-time object detection.
1) SqueezeNet Series: SqueezeNet [123] has 98% fewer parameters than AlexNet and achieves the same accuracy on the ImageNet dataset. It follows three design strategies, realized through a new building block called the fire module: it replaces 3×3 convolutions with 1×1 convolutions, it reduces the number of input channels to the 3×3 convolutions, and it performs downsampling late in the network so that the convolutional layers have larger activation maps. SqueezeNext [124] uses a two-stage squeeze operation to achieve a significant reduction in channels. With separable 3×3 convolutions, it further reduces the model size. By using element-wise addition similar to ResNet, a deeper network can be trained without the vanishing gradient problem. SqueezeNext has fewer parameters and higher accuracy than SqueezeNet.
2) MobileNet Series: MobileNetv1 [125] decomposes the standard convolution into a depthwise convolution and a pointwise (1×1) convolution, which effectively reduces the amount of computation and the number of model parameters. This depthwise separable convolution significantly reduces the number of parameters and improves the speed.
Depthwise convolution splits an N×H×W×C feature map into C groups, one per channel, and applies a 3×3 convolution to each group, which is equivalent to collecting the spatial features of each channel. Pointwise convolution applies k ordinary 1×1 convolutions to the N×H×W×C feature map, which is equivalent to collecting the features of each point across channels.
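The saving from this decomposition can be worked out by counting multiply-accumulate operations (MACs). The sketch below compares a standard convolution with its depthwise separable counterpart; the layer shape (3×3 kernel, 64 input channels, 128 output channels, 56×56 output map) is an assumed example.

```python
def standard_conv_macs(k, c_in, c_out, h, w):
    """MACs of a standard k x k convolution producing an h x w output map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k, c_in, c_out, h, w):
    """MACs of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in * h * w      # one k x k filter per input channel
    pointwise = c_in * c_out * h * w      # c_out ordinary 1 x 1 convolutions
    return depthwise + pointwise


std = standard_conv_macs(3, 64, 128, 56, 56)
sep = depthwise_separable_macs(3, 64, 128, 56, 56)
ratio = sep / std                          # equals 1/c_out + 1/k**2
print(round(ratio, 4))  # 0.1189
```

The ratio simplifies to 1/c_out + 1/k², so for 3×3 kernels the separable form needs roughly one eighth to one ninth of the computation of the standard convolution.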
MobileNetv2 [126] was proposed by Google in 2018. The model reduces the amount of computation and memory usage while maintaining accuracy. It has three characteristics: inverted residuals, linear bottlenecks, and depthwise convolution. The MobileNetv2 block is spindle shaped, wide in the middle and narrow at both ends. This is because MobileNetv2 first uses a 1×1 convolution to increase the number of channels and then applies a 3×3 depthwise convolution, which keeps the amount of computation low: although there are many intermediate channels, the cost of the depthwise convolution is small. MobileNetv2 also removes the last ReLU of the bottleneck layer. The model is compact, requires little computation, and has good classification performance. It also performs well in detection and segmentation tasks.
MobileNetv3 [127] improves on MobileNetv2 and explores how neural architecture search (NAS) and manual design can work together and complement each other. It first uses MnasNet [128] to search for the coarse structure, with reinforcement learning selecting the optimal configuration from a set of discrete choices, and then uses NetAdapt to fine-tune the architecture.
3) ShuffleNet: Although pointwise convolution reduces the number of parameters, its computational efficiency is low, because a large number of 1×1 convolutions consume considerable computing resources. ShuffleNet uses pointwise group convolution to reduce the amount of computation, but there is then no connection between groups, which affects the performance of the network. Therefore, channel rearrangement (channel shuffle) is used to strengthen the connection between different groups.
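The channel rearrangement can be sketched on a list of channel indices. The implementation below follows the usual reshape, transpose, and flatten recipe; representing channels as a flat Python list and the function name `channel_shuffle` are simplifying assumptions.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so that each output group mixes all input groups.

    Equivalent to reshaping to (groups, n // groups), transposing,
    and flattening back to a list of n channels.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
print(shuffled)  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, the next group convolution receives channels from every group of the previous one, which restores the information flow between groups at no arithmetic cost.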
Owing to pointwise group convolution and channel rearrangement, ShuffleNet is more computationally efficient. Compared with ResNet and ResNeXt, ShuffleNet has the lowest computational complexity. For example, given an input of size c×h×w and a bottleneck of m channels, a ResNet unit needs h×w×(2×c×m+9×m×m) FLOPs and a ResNeXt unit needs h×w×(2×c×m+9×m×m/g) FLOPs, whereas a ShuffleNet unit needs only h×w×(2×c×m/g+9×m) FLOPs, where g is the number of convolution groups.
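The three expressions can be evaluated for a concrete case. The values used below (a 28×28 feature map, c = 240, m = 60, and g = 3) are assumed for illustration only.

```python
def resnet_unit_flops(h, w, c, m):
    """h*w*(2cm + 9m^2): two 1x1 convolutions and one 3x3 convolution."""
    return h * w * (2 * c * m + 9 * m * m)


def resnext_unit_flops(h, w, c, m, g):
    """h*w*(2cm + 9m^2/g): the 3x3 convolution is grouped."""
    return h * w * (2 * c * m + 9 * m * m / g)


def shufflenet_unit_flops(h, w, c, m, g):
    """h*w*(2cm/g + 9m): grouped 1x1 convolutions and a depthwise 3x3."""
    return h * w * (2 * c * m / g + 9 * m)


h, w, c, m, g = 28, 28, 240, 60, 3           # assumed example values
resnet = resnet_unit_flops(h, w, c, m)
resnext = resnext_unit_flops(h, w, c, m, g)
shuffle = shufflenet_unit_flops(h, w, c, m, g)
print(resnet, resnext, shuffle)  # 47980800 31046400.0 7949760.0
```

For these values the ShuffleNet unit needs about one sixth of the FLOPs of the ResNet unit, which is why a wider network can be afforded under the same budget.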
ShuffleNetv1 [129] uses pointwise group convolution and channel shuffle to greatly reduce the amount of computation while maintaining accuracy. It has a lower error rate than MobileNet (the top-1 error is lower by an absolute 7.8%) and is 13 times faster than AlexNet on the ARM chip of mobile devices.
ShuffleNetv2 [130] proposes that two factors must be considered in the structural design of CNNs: first, the direct metric (speed) should be used rather than the indirect metric (FLOPs); second, the measurement needs to be carried out on the target platform. The article gives four practical guidelines: 1) equal numbers of input and output channels minimize the memory access cost (MAC); 2) excessive use of group convolution increases the MAC; 3) network fragmentation reduces parallelism; and 4) element-wise operations cannot be ignored. According to these guidelines, the authors designed ShuffleNetv2, which achieves a good tradeoff between speed and accuracy.
4) Others: PeleeNet [131] uses two-way dense layers, a stem block, a dynamic number of channels in the bottleneck, transition layer compression, and conventional post-activation to reduce the computation cost and increase the speed. The stem block is used to alleviate information loss. PeleeNet consists of a stem block, four stages of modified dense and transition layers, and finally the classification layer.
The CNN structures mentioned above are basically designed manually from experience, and these networks are not necessarily optimal. NAS [132] can find lightweight network models by using a search strategy to automatically find the optimal CNN structure in the search space. MnasNet is an automated NAS approach. It formulates the search problem as a multiobjective optimization aimed at both high accuracy and low latency. MnasNet is almost twice as fast as MobileNetv2 while having better accuracy.
The above lightweight CNNs rely on depthwise separable convolution, which lacks an efficient implementation on most hardware. It should be noted that when these CNNs are deployed on hardware devices, some models seem lightweight, but the hardware does not support some of their operations, resulting in poor real-time performance.

F. Model Deployment
Computational acceleration is another way to achieve real-time performance. It includes FFT-based convolution and fast convolution using the Winograd transformation. FFT-based convolution exploits the fact that convolution in the time domain is equivalent to pointwise multiplication in the frequency domain. The Winograd transformation is another equivalent formulation of convolution that improves the computation speed.
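The equivalence between convolution and pointwise multiplication in the frequency domain can be checked with a small example. For self-containment, the sketch below uses a naive O(n²) discrete Fourier transform written with `cmath` rather than a true FFT, so it demonstrates the identity and not the speed-up; the function names and the input sequences are assumptions.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (a real FFT would be O(n log n))."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def linear_conv_direct(a, b):
    """Direct linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def linear_conv_dft(a, b):
    """Same result through zero-padding, transform, multiply, inverse transform."""
    n = len(a) + len(b) - 1
    fa = dft(list(a) + [0.0] * (n - len(a)))
    fb = dft(list(b) + [0.0] * (n - len(b)))
    product = [x * y for x, y in zip(fa, fb)]   # pointwise multiplication
    return [v.real for v in dft(product, inverse=True)]


a = [1.0, 2.0, 3.0]
b = [0.0, 1.0, 0.5]
direct = linear_conv_direct(a, b)               # [0.0, 1.0, 2.5, 4.0, 1.5]
via_dft = linear_conv_dft(a, b)
```

The two results agree up to floating-point rounding. Zero-padding to length len(a) + len(b) − 1 is what turns the circular convolution implied by the transform into the linear convolution used in CNN layers.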
The key to realizing real-time object detection is to deploy lightweight CNN models on customized hardware. Model deployment refers to deploying the trained model produced by deep learning to various cloud, edge, and end devices so that it runs efficiently and the algorithm can be applied to real-world tasks. The common process of model deployment is shown in Fig. 9. It includes four components: the deep-learning framework, the intermediate representation, the inference engine, and the embedded platform.
The deep-learning framework is used to define the network structure and determine the parameters of the network through training. TensorFlow [133], PyTorch, and Caffe [134] are commonly used deep-learning frameworks.
The intermediate representation solves the problem of converting neural network models between different training and inference frameworks. It only describes the structure and parameters of the network. ONNX (Open Neural Network Exchange) is one such intermediate representation [135]. It defines an extensible computation graph model together with a series of standard data types and operators. Currently, most training frameworks, deployment frameworks, and inference acceleration engines from hardware manufacturers support the ONNX format.
The inference engine is written with high-performance programming frameworks (such as CUDA and OpenCL) so that it can efficiently execute the operators in deep-learning networks. ONNX Runtime, TensorRT, MNN, NCNN, and OpenVINO are inference engines that can be used [136], [137].
The embedded platform is used to realize real-time CNN processing under limited power consumption and computing power. The NVIDIA Jetson TX2, Xilinx Ultra96 with UltraScale and ZU3, Huawei Atlas 200 with Hi3559, and Baidu EdgeBoard with FPGA are embedded platforms that can be used [138], [139].
The best solution for realizing real-time processing is algorithm and hardware codesign, as shown in Fig. 10. The algorithms for efficient inference include pruning, quantization, low-rank factorization, the Winograd transformation, and so on. The hardware should be efficient in inference: it should minimize memory access and support lightweight operations. The hardware also needs to optimize some special operations, for example, 1×1 convolution, 3×3 convolution, group convolution, and depthwise convolution.

IV. REAL-TIME OBJECT DETECTION METHODS
The methods for realizing real-time object detection are summarized in Fig. 11. They are two-stage, single-stage, anchor-free, trained-from-scratch, and compression and acceleration methods.
According to the number of stages, deep-learning-based object detection algorithms can be divided into single-stage detectors and two-stage detectors. Throughout the development of object detection, both single-stage and two-stage detectors have continued to pursue speed and accuracy. For example, from R-CNN [140], SPP-Net [141], and Faster R-CNN [142] to R-FCN [143] and Light-head R-CNN [144], the detectors have become faster and more accurate. The same trend is reflected in YOLOv1 [145], YOLOv2 [146], YOLOv3 [147], YOLOv4 [148], YOLOv5 [149], and YOLOX [150]. We introduce these detectors in detail below.
The anchor-free detectors have been widely studied, and a large number of achievements have emerged recently. They show great potential in real-time SAR ship detection because ships in SAR images are very sparse, so most of the anchors in anchor-based detectors are invalid. In addition, the trained-from-scratch technique also shows great advantages in realizing lightweight detectors. This is because SAR images differ from the optical images of ImageNet, so the loaded pretrained parameters are not suitable and are redundant for SAR ship detection.
Besides the above approaches, the compression and acceleration of CNNs (pruning, quantization, and knowledge distillation) are also applied to object detection methods.

A. Two-Stage Real-Time Detectors
The principles of two-stage detection algorithms are shown in Fig. 12. The two-stage detectors use a CNN to classify and regress the anchor boxes twice to obtain the detection results.
Classical two-stage detectors are faster R-CNN, R-FCN, feature pyramid networks (FPN) [151], cascade R-CNN [152], mask R-CNN [153], and so on [154]. Faster R-CNN is the foundation work, and most of the two-stage detectors are improved based on it.
The evolution of R-CNN, fast R-CNN, and faster R-CNN is shown in Fig. 13. We can see that the CNN is used not only for extracting features but also for generating candidate regions.
R-CNN uses the selective search algorithm to generate candidate boxes for the objects and inputs them as samples into the CNN. The CNN generates the features of the positive and negative samples and forms the corresponding feature vectors. Then, a support vector machine classifies the feature vectors. After regression, the category and location are output. R-CNN achieves an mAP of 53.3%, which is 13.4% higher than the best traditional detector. R-CNN shows the great advantage of deep learning.
SPP-Net proposed spatial pyramid pooling (SPP) to solve the problem of feature extraction in R-CNN. SPP-Net can obtain fixed-length feature vectors from the CNN regardless of the input size.
Fast R-CNN adopts a multitask loss function, so the classification loss and the regression loss are trained jointly. It no longer requires additional hard disk space to store the intermediate-layer features, and the gradient can be propagated directly through the RoI pooling layer. Fast R-CNN processes one image
Faster R-CNN adopts a shared CNN to predict the region proposals. It consists of a region proposal network (RPN) and fast R-CNN. The RPN is used to generate candidate windows. It uses the anchor box mechanism, which greatly reduces the amount of computation by generating candidate windows directly on the feature map. Faster R-CNN slightly improves the accuracy and greatly improves the speed [17 frames per second (FPS)]. It can be trained and run in an end-to-end way, and the GPU can be used to accelerate the computation throughout the whole process.
Fast R-CNN and faster R-CNN apply a per-region subnetwork many times to classify and regress the targets, which is time consuming. R-FCN uses a fully convolutional structure in which all the computation is shared on the entire image, so it is more accurate and efficient. The deeper layers in a convolutional network are translation-invariant, which makes them ineffective for localization tasks. R-FCN proposed position-sensitive score maps to address this dilemma between translation invariance in image classification and translation variance in object detection. R-FCN achieves 83.6% mAP on PASCAL VOC 2007 with ResNet-101 as the backbone.
Faster R-CNN has two fully connected layers for RoI recognition, and R-FCN has large score maps. They both perform intensive computation after or before RoI warping. Even when the backbone is lightweight, the above two-stage detectors are still slow. Light-head R-CNN addresses this shortcoming of faster R-CNN and R-FCN. The head of light-head R-CNN uses a thin feature map and a cheap R-CNN subnet, which makes the head of the detector as light as possible. It gets 30.7 mAP at 102 FPS on COCO.
From R-CNN, SPP-Net, fast R-CNN, and faster R-CNN to R-FCN and light-head R-CNN, we can find that the two-stage detectors pursue improvements in accuracy and speed by sharing features on the backbone or the head layers. However, the two-stage detectors locate and classify targets through two stages, which increases the amount of computation, so they are inherently less suitable for real-time object detection. Researchers usually seek solutions in single-stage detectors when designing real-time object detectors.

B. Single-Stage Real-Time Detectors
The principles of single-stage detection algorithms are shown in Fig. 14. They use a fully convolutional network to classify and regress the boxes once to get the detection results. The single-stage detectors only need to look at the picture once to predict what the object is and where it is, which is similar to human eyes. So they are faster than two-stage detectors. Classical single-stage detectors are YOLO, SSD [155], RetinaNet [156], and CornerNet [157]. YOLO is the most popular family of single-stage detection algorithms, and most of the subsequent single-stage works are based on it.
Among the single-stage detectors, the YOLO series is popular in real-time applications due to its excellent speed and accuracy tradeoff. The structures of the YOLO series object detectors are shown in Table II.
YOLOv1 regards object detection as a regression problem, and it outputs the spatially separated bounding boxes and the related class probabilities simultaneously. YOLOv1 divides the input picture into S × S grids, and each grid cell predicts B boxes, the confidence scores corresponding to these boxes, and C class probabilities. For PASCAL VOC, S = 7, B = 2, and C = 20. The final prediction output is a 7 × 7 × 30 tensor, where 30 = 5 × 2 + 20.
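The output size above follows directly from S, B, and C. A minimal sketch of the arithmetic is given here; the function name `yolov1_output_shape` is hypothetical and only serves this illustration.

```python
def yolov1_output_shape(s: int = 7, b: int = 2, c: int = 20) -> tuple:
    """Shape of the YOLOv1 prediction tensor: S x S x (5 * B + C)."""
    # Each predicted box carries 5 numbers: x, y, w, h, and a confidence score.
    depth = 5 * b + c
    return (s, s, depth)


# PASCAL VOC setting quoted in the text: S = 7, B = 2, C = 20.
print(yolov1_output_shape())  # (7, 7, 30)
```

For a single-class task such as ship detection, the same formula with C = 1 would give a depth of 11 per grid cell.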
YOLOv2 uses the multiscale training method. It predicts offsets relative to the anchor boxes rather than the box parameters themselves. It uses an anchor mechanism in which the anchor box parameters are obtained by clustering the object sizes in the dataset, and every cell predicts five anchor boxes. The backbone network is DarkNet-19. The detection head has changed from 7 × 7 to 13 × 13. Batch normalization, pass-through, high resolution, and multiscale training are used to improve the performance further.
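The clustering of object sizes mentioned above can be sketched as a small k-means over (width, height) pairs. This is a minimal sketch, assuming the commonly used 1 − IoU distance between a box and an anchor that share the same center; the function names, the random initialization, and the toy box sizes are all assumptions of this illustration, not details taken from the surveyed papers.

```python
import random


def iou_wh(box, anchor):
    """IoU of two boxes given as (w, h), assuming they share the same center."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors; the closest anchor has the highest IoU."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # initialize anchors with k distinct boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:  # keep an anchor that attracted no boxes
                new_anchors.append(anchors[i])
                continue
            mean_w = sum(m[0] for m in members) / len(members)
            mean_h = sum(m[1] for m in members) / len(members)
            new_anchors.append((mean_w, mean_h))
        if new_anchors == anchors:  # assignments no longer change
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Toy ship sizes in pixels (hypothetical values for illustration only).
toy_boxes = [(10, 12), (11, 13), (12, 11), (50, 20), (52, 22), (48, 19), (100, 40), (98, 42)]
print(kmeans_anchors(toy_boxes, k=3))
```

The result depends on the initialization, so in practice the clustering is usually repeated with several seeds and the anchor set with the highest mean IoU is kept.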
YOLOv3 uses DarkNet-53 as the backbone. DarkNet-53 downsamples the output feature map to 1/32 of the input size; it is more powerful than DarkNet-19 and more efficient than ResNet-101 and ResNet-152. The predictions are made on three different branches: 13 × 13, 26 × 26, and 52 × 52. The anchor mechanism of YOLOv3 is the same as that of YOLOv2. YOLOv3 can process an image of 320 × 320 in 22 ms.
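The three grid sizes above are consistent with strides of 32, 16, and 8 applied to a 416 × 416 input. The sketch below reproduces this arithmetic; the input size of 416 and the three anchors per scale are assumptions taken from the common YOLOv3 configuration rather than from the text, and the function name is hypothetical.

```python
def yolov3_grids(input_size=416, strides=(32, 16, 8), anchors_per_scale=3):
    """Return the grid size of each prediction branch and the total number of predicted boxes."""
    grids = [input_size // stride for stride in strides]
    total_boxes = sum(g * g * anchors_per_scale for g in grids)
    return grids, total_boxes


print(yolov3_grids())  # ([13, 26, 52], 10647)
```

Under the same assumptions, a 320 × 320 input gives 10 × 10, 20 × 20, and 40 × 40 grids, which is one reason the smaller input runs faster.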
YOLOv4 consists of a CSPDarknet-53 backbone, an SPP + path aggregation network (PAN) based neck, and the head of YOLOv3. It uses two anchors for one ground truth, while YOLOv3 uses only one anchor for one ground truth. It also uses several techniques to achieve state-of-the-art results. The bag-of-freebies improves the performance without increasing the inference time, for example, CutMix and mosaic data augmentation, DropBlock regularization, class label smoothing, CIoU loss, and self-adversarial training. The bag-of-specials improves the performance at the cost of only a small amount of additional computation, for example, Mish activation, cross-stage partial (CSP) connections, SPP, the spatial attention module (SAM), PAN, and DIoU nonmaximum suppression (NMS). YOLOv4 gets 43.5 mAP with 65 FPS on MS COCO.
YOLOv5 adopts adaptive anchors and uses the network to learn the anchor parameters. Its backbone is based on DarkNet-53 with focus and CSP. In the neck part, the structures of FPN and PAN are adopted. Its prediction head is the same as that of YOLOv3 and YOLOv4. YOLOv5 is a state-of-the-art object detection algorithm with fast inference speed and high accuracy.
YOLOX introduces an advanced anchor-free method to improve the performance of the detector, and it significantly outperforms YOLOv5 in terms of precision. YOLOX uses a decoupled head to generate two-way feature maps by two separate 1 × 1 convolutional layers, which improves the performance of the detector. SimOTA is used for dynamic label assignment. YOLOX-L achieves 50.0% AP on MS COCO with 68.9 FPS on Tesla V100, which surpasses YOLOv5-L by 1.8% AP with roughly the same number of parameters.
The YOLO series shows great potential for real-time object detection, and a lot of researchers have applied it to SAR ship detection with good results. We will review them in Section V.

C. Anchor-Free Detectors
Deep-learning-based object detection algorithms can be divided into anchor-based and anchor-free algorithms. The anchor-based algorithms use anchor boxes as references to search for the regions that may contain an object. The scales and aspect ratios of the anchor boxes are designed according to the statistics of the dataset. The anchor-based detectors, for example, faster R-CNN, YOLOv2, YOLOv3, and SSD, have made a great contribution to the development of deep-learning-based object detection. However, the anchor boxes are a discrete sampling and need to be designed manually by experience, so there are the following disadvantages [161]:
1) The anchor boxes need to be "carefully" designed according to different datasets. This introduces many hyperparameters that need to be optimized, such as the number, size, and aspect ratio of the anchor boxes. If these parameters change, the detection performance will fluctuate.
2) In order to obtain a good recall rate, a large number of anchor boxes is generated. For example, SSD has 8732 anchor boxes, of which only a few match the ground truth and most are invalid, so there is a large redundancy.
3) There are few positive samples, and most of the samples are simple negative samples. This imbalance between positive and negative samples reduces the performance of the model.
4) When the dataset changes, it is necessary to change the anchor boxes according to the size and shape of the targets in the dataset.
The anchor-free detector opens up another idea by eliminating the predefined anchor boxes. It directly predicts several key points of the target from the feature map. These algorithms explore how to efficiently use points to represent a bounding box. CornerNet uses the upper left and lower right points of the box to represent the box. FCOS [162] and FoveaBox [163] perform the detection by per-pixel prediction.
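The figure of 8732 anchor boxes quoted for SSD in the list above can be reproduced by counting anchors per feature map. This is a minimal sketch, assuming the feature map sizes and the number of boxes per location of the standard SSD300 configuration; these settings come from the original SSD design, not from this survey, and the names below are hypothetical.

```python
# Assumed SSD300 configuration: side length of each square feature map
# and the number of default boxes placed at every location of that map.
SSD300_FEATURE_SIZES = (38, 19, 10, 5, 3, 1)
SSD300_BOXES_PER_LOCATION = (4, 6, 6, 6, 4, 4)


def count_anchors(feature_sizes, boxes_per_location):
    """Total number of anchor boxes over all feature maps."""
    return sum(f * f * b for f, b in zip(feature_sizes, boxes_per_location))


total = count_anchors(SSD300_FEATURE_SIZES, SSD300_BOXES_PER_LOCATION)
print(total)  # 8732
```

For a SAR image that contains only a handful of ships, nearly all of these boxes end up as negative samples, which is the redundancy and imbalance described in items 2) and 3).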
The anchor-free detectors avoid the various problems of anchor box generation and have great application potential in SAR ship detection. In this special scene, due to the small size and sparse distribution of ships, most of the candidate anchor boxes are invalid negative samples. The anchor-free detectors do not generate these invalid anchors and reduce the number of predicted boxes, and thus can improve the accuracy and speed simultaneously.

D. Trained From Scratch Detectors
Since Girshick et al. proposed R-CNN, object detection algorithms have needed to load weights pretrained on a classification dataset (ImageNet) and fine-tune the parameters to adapt to the new detection task. This transfer learning gives the detection algorithm a better initialization and makes up for the problem of insufficient samples. But there are the following problems [164]:
1) The loss function and category distribution of classification and detection are different, so the transferred parameters are not well suited to detection, which makes the detection algorithm less than optimal.
2) Detection includes two subtasks, classification and localization, which are optimized at the same time. However, these two tasks are contradictory in nature. For classification, translation invariance is required, but for localization, translation variance is required. It is therefore questionable to use a model pretrained for classification in object detection.
3) Most networks produce large receptive fields through multiple downsampling operations in the later layers, which is good for classification. However, this sacrifices the spatial resolution of the feature map and makes it difficult to locate objects accurately.
4) Most detection algorithms directly borrow the model structure and parameters after pretraining and cannot modify the structure. This hinders researchers from designing CNNs flexibly according to their needs.
5) At present, the networks are generally designed for three-channel natural images. For single-channel SAR images, such networks have more channels than necessary and their parameters are redundant.
In order to solve the problems of transfer learning, algorithms trained from scratch are proposed, for example, DSOD, DetNet [166], ScratchDet [167], and so on [168].
DSOD and GRP-DSOD [165] realize training from scratch through a well-designed backbone network and front-end network. The parameters of the detection algorithm are greatly reduced, and the accuracy is equivalent to that of the most advanced detection algorithms at that time. DSOD summarizes four principles for designing backbone networks: single-stage, dense prediction structure, stem unit, and deep supervision.
DetNet designs a backbone network for detection tasks. Considering the contradiction between detection and classification tasks, the backbone network retains a larger scale in the last few layers, which can retain more location information.
ScratchDet showed that increasing the learning rate while using BN in each layer can make the detection algorithm more robust and converge faster. At the same time, based on ResNet-18, the article proposes a backbone network for detection, Root-ResNet, which uses three stacked 3 × 3 convolutions instead of the 7 × 7 convolution and removes the max-pool layer at the front to reduce the information loss.
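One reason why three stacked 3 × 3 convolutions can stand in for a single 7 × 7 convolution is that they cover the same receptive field. The sketch below checks this with the usual receptive-field recurrence; the stride of 1 for every layer and the helper name `receptive_field` are assumptions of this illustration, not details taken from ScratchDet.

```python
def receptive_field(layers):
    """Receptive field of a stack of convolutions given as (kernel_size, stride) pairs."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # distance in input pixels between adjacent outputs
    return rf


single_7x7 = receptive_field([(7, 1)])
stacked_3x3 = receptive_field([(3, 1), (3, 1), (3, 1)])
print(single_7x7, stacked_3x3)  # 7 7
```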
He et al. [168] use group normalization and synchronized BN so that the normalization statistics are not limited by the small per-GPU batch size, which makes the direction of gradient descent more accurate. So it can accelerate convergence and improve the accuracy at convergence.
The model trained from scratch not only has high accuracy, but also greatly reduces the size and amount of calculation of the model. Due to the above advantages, it is also used in SAR ship detection.

E. Compression and Acceleration Methods
To detect objects on platforms with limited computing power and memory resources, researchers also apply CNN compression and acceleration methods to object detection, for example, pruning, quantization, knowledge distillation, and low-rank factorization. They are usually applied together with inference engines, such as TensorRT, ONNX Runtime, NCNN, and OpenVINO. Generally, knowledge distillation, pruning, and quantization are applied in this order, possibly multiple times. Among them, knowledge distillation is frequently used in real-time object detection [169], [170], [171], [172], [173].

V. REAL-TIME SAR SHIP DETECTION
As far as we know, there are 70 public papers about real-time SAR ship detection. We divide them into the following seven categories, as shown in Fig. 15: anchor free, trained from scratch, YOLO series, CFAR+CNN, lightweight backbone network, model compression and acceleration, and hardware deployment. The datasets and evaluation metrics used in this area are also reviewed in this part.

A. Seventy Public Papers About Real-Time SAR Ship Detection
TABLE IV: TIMES THAT DATASETS USED
TABLE V: DEEP LEARNING FRAMEWORKS THAT ARE USED
this area in 2017. We can also find that the numbers show an increasing trend. This shows that more and more people are paying attention to real-time SAR ship detection. Table IV shows the number of times that each dataset is used. We can find that SSDD is used 46 times among the 70 papers. SSDD is popular because it is the first public dataset (it is more than 1 year earlier than the second dataset) and is friendly to use. But with the emergence of other large datasets, SSDD gradually shows some drawbacks, and the new datasets are better than SSDD to some extent. SAR-Ship-Dataset and HRSID are also frequently used in this area. AIR-SARShip-1.0/2.0 and LS-SSDD-V1.0 are used less; this is partly because they consist of large SAR images and are hard to use.
From Table IV, we can see that researchers have many datasets to choose from when conducting experiments. This makes fair comparison difficult and is not good for the development of algorithms. What is more, models trained on the small datasets are prone to overfitting. So in the future, it is necessary to merge the several public datasets into a large one.
Among the 70 papers, 18 papers are from conferences and 52 papers are from journals. The IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING (Letters) and MDPI Remote Sensing are the two most frequent journals.
The deep-learning frameworks that are used are shown in Table V. We can find that PyTorch is the most popular framework. This is because it is easy to use in research.
Among the 70 papers, 6 papers trained on GPU and tested on the edge, e.g., NVIDIA Jetson TX2, FPGA, and so on, while the other papers trained and tested on GPU. In the following part, we will survey them in detail.
[179], and RSDD-SAR [180]. SSDD is the first dataset that was used in SAR ship detection. It brought SAR ship detection into the deep-learning era. Since then, many researchers have used deep-learning methods to detect ships in SAR images. As the deep-learning methods are data hungry, they need lots of images to train the huge models, and SSDD faced the problem of insufficient data volume. So other datasets were proposed successively. The samples of them are shown in Figs. 12-23. Table VI shows the other information of the 10 public datasets. From the table, we can see that most datasets are annotated with vertical bounding boxes. SSDD+, SRSDD-v1.0, and RSDD-SAR are annotated with oriented bounding boxes. HRSID and Official-SSDD are annotated with polygons. In the following part, we will introduce the details of the datasets and evaluate their advantages and drawbacks.
2) SSDD, SSDD+, and Official-SSDD: SSDD was proposed at the BIGSARDATA 2017 conference on December 1, 2017. It has 1160 images and 2456 ships. The data sources are RadarSat-2, TerraSAR, and Sentinel-1 with resolutions from 1 to 15 m. The length or the width of the images is about 600 pixels. The samples of SSDD are diverse, which is helpful for training a robust detector. The statistics of the length, width, and aspect ratio of the ship bounding boxes in SSDD are provided, which is helpful for designing anchor boxes for the detectors. SSDD+ is the improved version of SSDD with oriented bounding boxes. SSDD and SSDD+ share the same images.
Due to the insufficient understanding of deep-learning object detection algorithms at that time, there are some problems in SSDD, for example, the coarse annotations and ambiguous standards of use. They hinder fair comparisons and effective academic exchanges in this field. In order to solve this problem, Zhang et al. [85] proposed Official-SSDD. It provides bounding box, rotatable bounding box, and polygon segmentation annotations. Five standards of use are also formulated: the training-test division, the inshore-offshore protocol, a reasonable definition of ship size, the determination of densely distributed small ship samples, and the determination of ship samples densely berthed in parallel at ports. Official-SSDD is beneficial for fair method comparison and effective academic exchanges in the future.
to 800 × 800 pixels small images with overlapped ratio of 25%. HRSID has 5604 cropped SAR images and 16 951 ships. A total of 65% of the SAR images form the training set, and the other 35% form the test set. It follows the principle of MS COCO in annotation and scale division. Small, medium, and large ships account for 54.5%, 43.5%, and 2% of the ships, respectively. The bounding box areas of small, medium, and large ships account for 0%-0.16%, 0.16%-1.5%, and more than 1.5% of the SAR image, respectively. So HRSID has the characteristics of small and sparsely distributed ships in large scenes.
(25), dredger ships (263), and container ships (89). The image size is set to 1024 × 1024. The annotation format is the same as that of DOTA. The coordinates of the four corners of the box, the category, and whether the ship is difficult to identify are given in the annotation files. It contains 666 images. A total of 420 images with 2275 ships include land cover, and 246 images with 609 ships contain only sea in the background. It has six categories.
8) RSDD-SAR: RSDD-SAR was proposed on July 7, 2021. The data sources are Gaofen-3 and TerraSAR with resolutions of 2-20 m. RSDD-SAR is built from 84 GF-3 scenes, 41 TerraSAR-X scenes, and uncropped large images, and it includes 7000 slices and 10 263 ships. It is annotated with oriented bounding boxes by automatic annotation followed by manual correction. The angle of the ships in the dataset is evenly distributed between 0° and 180°, and the aspect ratio is concentrated between 2 and 6. The training set has 5000 samples and the testing set has 2000 samples. It has a large number of small ships. It contains vast sea areas, ports, docks, waterways, and other scenes with different resolutions, which makes it suitable for practical applications.

C. Evaluation Metrics
The confusion matrix is shown in Fig. 24. From this, we can see the concepts of TP, FP, FN, and TN. For example, TP means that the ground truth is positive and the prediction is also positive, while FP means that the ground truth is negative but the prediction is positive.
Based on the confusion matrix, the false positive rate (FPR) and the true positive rate (TPR) are calculated as follows:
$$\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{TPR} = \frac{TP}{TP + FN}.$$
The accuracy is generally used to evaluate the global accuracy of a model. It does not contain much information and cannot comprehensively evaluate the performance of a model. It is calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
Precision represents the proportion of correctly detected ships among all positive detection results. It is calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
Recall represents the proportion of correctly detected ships among the ground truth. It is calculated as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
Precision and recall generally trade off against each other, that is to say, the higher the recall, the lower the precision tends to be. In order to consider precision and recall at the same time, the F1 score is proposed, which is the harmonic mean of precision and recall, with a maximum of 1 and a minimum of 0. It is calculated as follows:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
The precision recall curve (PRC) is also usually used in object detection. The x-axis and y-axis of the PRC are recall and precision, respectively. Average precision (AP) is calculated as the area under the PRC:
$$\mathrm{AP} = \int_{0}^{1} P(R)\, dR.$$
Among them, R represents the recall rate and P represents the precision. AP is the average of the values obtained at IoU thresholds from 0.5 to 0.95 at intervals of 0.05. AP50 is the AP calculated when the IoU threshold is 0.5. mAP is the average of the AP over multiple categories.
The evaluation metrics of object detection that are usually used are precision, recall, and AP.
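The metrics above can be computed from detection counts and from a ranked list of detections. The following is a minimal sketch under stated assumptions: each detection has already been matched to the ground truth at a chosen IoU threshold and flagged as a true or false positive, and the area under the PRC is taken with the common all-point interpolation. The function names and the toy inputs are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(scored_hits, num_gt):
    """Area under the precision-recall curve.

    scored_hits: list of (confidence, is_true_positive) for every detection.
    num_gt: number of ground-truth ships.
    """
    ranked = sorted(scored_hits, key=lambda item: -item[0])
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the best precision at any higher recall,
    # so that the curve is monotonically non-increasing.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


print(precision_recall_f1(tp=8, fp=2, fn=2))
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2))
```

Averaging `average_precision` over IoU thresholds from 0.5 to 0.95 gives the COCO-style AP, while using only the 0.5 threshold gives AP50.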
The model size and the FLOPs are usually used as indirect indicators of the running speed of a detector. For a convolution layer, suppose its kernel size is h × w × c × n, where h × w is the spatial size of the kernel, c is the number of input channels, and n is the number of output channels, and suppose the size of the output feature map is H × W.
The parameter quantity of the convolution layer is
$$\mathrm{Params} = n \times (h \times w \times c + 1),$$
where the additional 1 accounts for the bias of each output channel. The FLOPs of the convolution layer is
$$\mathrm{FLOPs} = H \times W \times n \times (h \times w \times c + 1).$$
But FLOPs is an indirect indicator. The direct metric that we really care about is the speed or latency, often reported as FPS. The correlation of latency with FLOPs and parameter quantity is weak. For example, ShuffleNetv2 has a high number of parameters but a low latency. The discrepancy between the indirect and direct metrics can be attributed to two main reasons. First, the memory access cost (MAC) constitutes a large portion of the runtime in certain operations like group convolution. This cost should not be simply ignored during network architecture design. Second, some operators are not optimized for the hardware, for example, depthwise convolution, pointwise convolution, 1 × 1 convolution, and so on. They have a small model size and few parameters but do not run fast. Therefore, using only the indirect metrics for computation complexity is insufficient and could lead to suboptimal design. So when designing real-time object detection algorithms, we should consider not only the indirect indicators but also the direct indicators.
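To make the two indirect indicators concrete, the sketch below counts the parameters and FLOPs of one convolution layer. It assumes a common convention: one bias per output channel, and one multiply-accumulate counted as one operation; other papers count it as two, so absolute numbers may differ by a factor of about 2. The function names and the example layer are hypothetical.

```python
def conv_params(kernel_h, kernel_w, c_in, c_out, bias=True):
    """Number of learnable parameters of a standard convolution layer."""
    return c_out * (kernel_h * kernel_w * c_in + (1 if bias else 0))


def conv_flops(kernel_h, kernel_w, c_in, c_out, out_h, out_w, bias=True):
    """FLOPs of the layer: every output position repeats the per-filter work."""
    return out_h * out_w * conv_params(kernel_h, kernel_w, c_in, c_out, bias)


# Example layer: 3 x 3 kernel, 64 input channels, 128 output channels, 56 x 56 output map.
print(conv_params(3, 3, 64, 128))          # 73856
print(conv_flops(3, 3, 64, 128, 56, 56))   # 231612416
```

Neither number accounts for memory access or operator-level optimization, which is why measured latency on the target hardware remains the deciding metric.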

D. Anchor-Free-Based SAR Ship Detectors
Section IV-C describes the motivation and advantages of anchor-free detectors. It also indicates that the anchor-free detectors are especially suitable for SAR ship detection. This is because the ships in SAR images are very sparse, so most anchor boxes are redundant and add to the computation burden. What is more, the ships in SAR images are small, so it is hard to match the anchor boxes with the ground truth, which leads to poor performance on small ships in large scenes. Last but not least, the anchor-free detectors have great potential to realize real-time SAR ship detection with high accuracy. So a lot of researchers use anchor-free ideas to detect ships in SAR images. We will survey them in the following part. There are 18 papers on anchor-free-based SAR ship detectors. They are shown in Table VII. Table VII shows the dates, authors, titles, journals/conferences, datasets, and performances of the 18 papers. We can find that the earliest paper was published on April 3, 2020, which is far later than December 1, 2017, when the first dataset was opened to the public. This is because the anchor-free detectors were proposed and became popular in 2019. After the anchor-free detectors appeared in large numbers, the researchers of SAR ship detection gradually drew lessons from them. So the date is later than 2019.
We can also find that there are 16 journal papers and 2 conference papers among the 18 papers. The two most frequent journals are Remote Sensing and IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING. This suggests that the anchor-free detectors are advanced and highly innovative. We further classify the above 18 papers into four classes, as shown in Fig. 25. They are CenterNet-based SAR ship detectors, FCOS-based SAR ship detectors, YOLOX-based SAR ship detectors, and others. Next, we summarize each of them.
1) CenterNet Based: CenterNet estimates the center point of the bounding box and regresses the width and height of the object. It is an anchor free, end-to-end differentiable, simpler,
Cui et al. [22] introduced the spatial shuffle-group enhance (SSE) attention module to improve the performance of CenterNet. SSE divides the feature map into G groups along the channel dimension. The channels are then shuffled to improve the interaction of different groups. In this way, stronger semantic features are extracted, which can suppress the noise and reduce false positives. The experiments on the SAR-Ship dataset demonstrate the effectiveness of the proposed method. It achieved 94.7% AP50 with 18 ms testing time on a 1080Ti GPU. Guo et al. [35] improved CenterNet with three modules and called the new detector CenterNet++. The feature refinement module is used to extract multiscale information, which is helpful for detecting small ships. The feature pyramids fusion module is used for producing features with more semantic information. The head enhancement module is used for balancing the ratio of foreground and background. CenterNet++ achieved AP50 of 73.9%, 95.1%, and 95.4% on AIR-SARShip, SSDD, and SAR-Ship, respectively, with 33 ms testing time on a TITAN RTX GPU. Wang et al. [38] introduced the spatial groupwise enhance (SGE) attention module to CenterNet to detect densely docked SAR ships. SGE uses the similarity of global and local features to generate an attention mask with strong semantic information. It reduces the calculations and improves the spatial features of each group. It achieved AP50 of 93.9% on SAR-Ship with a GTX 1080Ti GPU. Jiang et al. [51] proposed R-Centernet+, which is dedicated to handling the sparse and small properties of ships in SAR images with rotatable bounding boxes. The convolutional block attention module is used to improve the attention on the small ships. The foreground enhance module is used to reject the disturbance of the background.
It achieved AP50 of 95
2) FCOS Based: FCOS is a fully convolutional single-stage anchor-free detector. It predicts objects in a per-pixel way. The anchor-free idea avoids the computation about anchor boxes
Fu et al. [23] proposed the feature balancing and refinement network (FBR-Net) to solve the sparsity and anchor setting problems of anchor-based detectors. FBR-Net improved FCOS in the following aspects. It uses an attention-guided balanced pyramid to fuse features across different levels, which is helpful for detecting small ships in SAR images. The feature refinement module is used to prevent the interference near the ship, which can improve the accuracy of localization. FBR-Net achieved 94.1% and 84.6% AP50 on SSDD and AIR-SARShip-1.0, respectively, with 32.5M parameters and 40.1 testing time on an RTX 2080Ti GPU. Mao et al. [29] proposed ResSARNet with 0.69M parameters as the backbone and improved FCOS in four aspects: centerness on the bounding box regression branch, center sampling, generalized intersection over union, and adaptive training sample selection. It achieved 61.5% AP with 1.17M parameters on SSDD and a GTX 1080 GPU. Sun et al. [48] improved FCOS with a category-position (CP) module to make it more suitable for detecting small and complex ships. CP can produce a guidance vector from the classification branch to improve the localization performance. The classification and regression are redesigned to prevent the interference of fuzzy areas. It achieved 96.01% AP50 on HRSID with 32.1M parameters, 228 MB model size, and 64 ms testing time on a GTX 1080Ti GPU. Zhu et al. [66] used FCOS to improve the ability to detect sparse and small ships and ships under interference. A new sample definition is proposed to replace the IoU according to the characteristics of ships, which is effective for improving the accuracy. The same resolution feature convolution module, multiresolution feature fusion module (FF-Module), and feature pyramid module are proposed to improve the feature representation for small ships. The complete intersection over union loss is used to improve the localization accuracy. It achieved 97.8% and 75.5% AP50 on SSDD and LS-SSDD-v1.0, respectively, with 32 FPS on an RTX 2080Ti GPU. Zhu et al. [74] used FCOS and ATSS to improve the ability to detect small ships and ships under complex scattering interferences. The improved residual module and deformable convolution are used in the backbone to improve the performance of feature extraction. The combined classification score and localization quality is used to address the inconsistency between them. It achieved 89.8% AP50 on HRSID with 60.8 FPS on an RTX 2080Ti GPU. Xiao et al. [82] proposed a power transformations and feature alignment guided network to extract multiscale features. The power-based convolution block is used for suppressing speckle noise. The feature alignment block is used for avoiding the dislocation problems. Experiments show that it can achieve 96.35% and 89.74% AP50 on SSDD and HRSID, respectively, with 136 MB model size and 31 FPS on NVIDIA RTX 2080Ti GPUs.
3) YOLOX Based: YOLOX is an anchor-free detector with a decoupled head and the leading label assignment strategy SimOTA. It achieves good results in accuracy and speed. For example, YOLOX-L has the same number of parameters as YOLOv5-L, but it achieves 50.0% AP on COCO with 68.9 FPS on Tesla V100, surpassing YOLOv5-L by 1.8% AP.
Feng et al. [73] proposed a lightweight position-enhanced anchor-free SAR ship detection algorithm called LPEDet based on YOLOX. The lightweight backbone called NLCNet with separable convolution is used for balancing the speed and accuracy. The position-enhanced attention strategy is used for suppressing clutter by adding position information to the channel attention. It achieved 97.4% and 89.7% AP50 on SSDD and HRSID, respectively, with 18.38G FLOPs, 5.68M parameters, and 7.01 ms testing time on an RTX 2060 GPU. Peng et al. [80] proposed an anchor-free detector for detecting small, sparse, and densely arranged ships. It used ICEIoU to improve the regression. The adaptive-NMS and atrous convolution are used to improve the performance further. It achieved 91.76% AP50 on HRSID with 11 ms testing time on an RTX 3060Ti GPU. Yu et al. [81] proposed a lightweight ship detector based on YOLOX. It uses only one level of the FPN to get a higher efficiency. The receptive field and the semantic information of the one-level feature are expanded to relieve the decrease of accuracy. With four branches with different dilation rates, it can capture various ships in complex backgrounds. The center-based uniform matching is used to tackle the imbalance problem in the training stage. It achieved 95.5% and 88.39% AP50 on SSDD and HRSID, respectively, with 10.3 MB model size and 7.1 ms testing time on a Quadro P6000 GPU.

4) Others: Mao et al. [19] proposed an anchor-free detector to improve the efficiency and avoid the massive hyperparameters. Its backbone is based on the simplified U-Net and contains only 0.47 million learnable weights. It achieved 94% AP50 on SSDD with 0.93M parameters. In order to avoid tuning of anchor-related parameters, reduce the computation, and improve the results on small ships, Gao et al. [25] proposed an anchor-free detector with dense attention feature aggregation. The inverted residual blocks with depthwise separable convolution, dense attention feature aggregation, and the spatial and channel squeeze and excitation block are proposed to improve the feature extraction ability of the detector. It achieved 86.99% AP50 on AirSARShip-1.0 with 0.83M parameters and 33 ms testing time on a Tesla K20c GPU. An et al. [37] proposed an anchor-free rotatable detector with a flexible structure for ships in SAR images. It achieved 90.75% AP50 on SSDD. Hu et al. [67] proposed an anchor-free balance attention network to improve the accuracy and generalization ability for multiscale ship detection. The local attention module is based on the deformable convolution and is used to obtain local information of ships. The nonlocal attention module is used to extract the nonlocal features of the SAR image. It achieved 95% AP50 on HRSID with 14 FPS. He et al. [72] proposed an anchor-free detector to detect small ships. The adaptive feature encoding module fuses deep semantic features into shallow layers and realizes the adaptive learning of the spatial fusion weights. The Gaussian guided detection head is used to assign different weights to the detected bounding boxes at different locations in the training process. It achieved 96.5% and 92% AP50 on SSDD and HRSID, respectively, with 0.356 s testing time on CPU.

E. Trained From Scratch Based SAR Ship Detector
In order to solve the problems of transfer learning, detectors trained from scratch have been proposed, for example, DSOD, DetNet, ScratchDet, and so on. A model trained from scratch not only has high accuracy, but also greatly reduces the size and amount of calculation of the model. Due to these advantages, it is also used in SAR ship detection. Eleven papers train their detectors from scratch. They are listed in Table VIII, which shows the dates, authors, titles, journals/conferences, datasets, and performances of the 11 papers. The earliest paper was published on February 4, 2019, whereas the first trained-from-scratch detector in computer vision (DSOD) was proposed in 2018. All 11 papers are journal articles, which suggests that trained-from-scratch detectors are advanced and highly innovative. We further classify the above 11 papers into four classes, as shown in Fig. 26. They are DSOD-based SAR ship detectors, CenterNet-based SAR ship detectors, DetNet-based SAR ship detectors, and others.
DSOD is the first detector that is trained from scratch. It summarized several principles for training detectors from scratch: deep supervision, anchor free, stem block, and dense prediction structure. It achieved better results than other detectors with smaller models. Inspired by these ideas, Deng et al. [13] and Han et al. [26], [31] proposed several methods to improve the performance of trained-from-scratch detectors on SAR ship detection. Deng et al. [13] proposed a condensed backbone in which the earlier layers receive additional supervision from the objective function, which makes it easy to train. It can be freely designed and trained from scratch without a large number of SAR images. The feature reusing strategy, cross-entropy loss, and position-sensitive score maps are used to improve the performance further. It achieved 73% AP on OpenSARShip with 18.4M parameters. Han et al. [26] added an asymmetric and square convolution block to SSD. It can be trained from scratch with fewer parameters and computations without serious damage to detection accuracy. It achieved 81.17% AP on SSDD with 18.9M parameters, 19.73G FLOPs, and 72.1 MB model size. Han et al. [31] added an asymmetric and square convolution feature aggregation block and an asymmetric and square convolution feature fusion block to DSOD. It achieved 79.79% AP on SSDD+SAR-Ship-Dataset with 8.22M parameters and 7.94G FLOPs.
DetNet is a backbone network specially designed for object detection. Compared with traditional backbone networks for image classification, it includes extra stages while maintaining high spatial resolution in deeper layers. Due to these advantages, Zhao et al. [43] used DetNet with stacked convolution to solve the problem of small object detection. It achieved 92.1% precision, 87.5% recall, and 89.8% F1 score on SSDD with 9.7 FPS.
Zhang et al. [16] designed a lightweight feature optimizing network, which can be trained from scratch and reduces the testing time without an accuracy cost. To this end, it uses a simpler structure (LSSD), a bidirectional FF-Module, and an attention mechanism. The experiments on SSDD show that it reaches 80.12% AP and 9.28 ms testing time on a GTX 1080Ti with 300 × 300 input.
Zhang et al. [21], [24], [40] and Sun et al. [47] designed ShipDeNet-20, HyperLi-Net, ShipDeNet-18, and DSDet, respectively, to detect ships in SAR images by training from scratch. We will review them in Section V-H. Besides the above papers, Guo et al. [35] and Peng et al. [80] adopted CenterNet and YOLOX, respectively, as the basic detector to train SAR ship detectors from scratch.
Through the above analysis, we know that the core of training from scratch is to design a good backbone network, because training from scratch requires the backbone network to have strong feature expression ability and strong supervision information. In SAR ship detection, there are also many papers that train detectors from scratch by elaborately designing the backbone. We review them in the following sections.

F. YOLO Series Based SAR Ship Detector
Two-stage detectors are seldom used in real-time SAR ship detection because of their heavy computation. Most real-time SAR ship detectors are single-stage. Among them, the YOLO series algorithms are naturally designed for real-time detection. Therefore, for real-time SAR ship detection, many researchers use YOLO-based algorithms, which are summarized here. There are 37 papers on YOLO-based SAR ship detectors. They are listed in Table IX, which shows the dates, authors, titles, journals/conferences, datasets, and performances of the 37 papers. Most of the papers were published in 2020-2021, and SSDD is the most frequently used dataset. The AP50 on SSDD varies from 88.04% to 99.1%. The numbers of parameters vary from 42.6M to 0.857M. The test times vary from 228 to 3.9 ms. The model sizes vary from 31.34 to 2.38 MB.
We further classify the above 37 papers into five classes, as shown in Fig. 27. YOLOv1 and YOLOv2 are used less, and YOLOv3 is used more. There are two reasons. First, they were proposed in 2015, 2016, and 2018, respectively, and deep learning was introduced into SAR ship detection in December 2017, so researchers after that date naturally adopted the more advanced YOLOv3 instead of YOLOv1 and YOLOv2. Second, YOLOv3 is the most innovative and gives better results, whereas YOLOv4 and YOLOv5 did not make major changes to the network structure. In the following, we survey the corresponding papers.
The innovations of YOLOv3 can be summarized as the DarkNet-53+CSP backbone, the FPN neck, and the multibranch prediction. Researchers in this area have also improved these components.
Zhang et al. [15] proposed a grid CNN based on YOLO and depthwise separable convolution. It consists of a backbone CNN and a detection CNN, and it improved the detection speed. It achieved 90.16% AP50 and 10.94 ms testing time on SSDD with an NVIDIA GTX1080 GPU. Zhang et al. [17] proposed a depthwise separable convolution neural network for high-speed SAR ship detection. It consists of a depthwise convolution and a pointwise convolution. The multiscale mechanism, concatenation, and anchor box mechanism are also used. It improved the detection speed. It achieved 94.13% AP50 and 9.03 ms testing time on SSDD with an NVIDIA GTX2080 GPU. In order to realize real-time SAR ship detection, Zhang et al. [20] improved YOLOv3 in the following aspects: reducing the size of the network, deleting the repeated layers, and adding two feature concatenation paths. It achieved 90.08% AP50 on SSDD.
Li et al. [27] improved YOLOv3 by adopting dense connection and spatial separation FPNs, which reduces the parameters and optimizes the network. Zhou et al. [28] proposed Lira-YOLO based on LiraNet. The backbone LiraNet includes dense connections, residual connections, group convolution, and stem blocks. The prediction part uses a two-layer YOLO prediction layer and adds a residual module for better feature delivery. It achieved 2.980 BFLOPs, 4.3 MB model size, and 85.46% AP50 on SSDD. Wang et al. [34] proposed SSS-YOLO for detecting multiscale ships. The backbone is redesigned to enrich the spatial and semantic information. The path augmentation fusion network is used to fuse the top-down and bottom-up information. Together they enhance the detection of small ships. It achieved 67.24% AP and 25.84 ms test time on SAR-Ship-Dataset. Chen et al. [42] used predefined anchor boxes, Darknet-53 with residual units, a top-down pyramid structure, soft NMS, mix-up, mosaic, multiscale training, and hybrid optimization to balance the accuracy and speed of the SAR ship detector. It achieved 95.52% AP50 with 72 FPS on SSDD using a Tesla V100 GPU. Hong et al. [46] improved YOLOv3 in the following aspects. The anchor boxes are redesigned by linear scaling based on the k-means++ algorithm. Gaussian parameter uncertainty estimators are used for localization. Every scale has four anchor boxes rather than three because of the differences in ship sizes. It achieved 95.52% AP50 and 21.3 ms test time on SAR-Ship-Dataset. Zhang et al. [53] proposed LSSNet for detecting ships in SAR images. Depthwise separable convolution is used in the early layers, and stacked dense blocks are used in the deep layers. It achieved high speed (10.1 ms test time on a GeForce GTX 1660 GPU) and high accuracy (98.6% AP50 on SSDD). Zhang et al. [54] proposed a high-speed and high-accuracy detector for balancing accuracy and speed. Fewer convolutional layers, CSP, and rectangle filling are responsible for the high speed. SPP, bottom-up path augmentation, and mosaic data augmentation are responsible for the high accuracy. It achieved 95.52% AP, 3.6 ms testing time, and 278 FPS on SSDD, and 92% AP50, 3.9 ms testing time, and 256 FPS on HRSID. Yu [33] and Yash et al. [36] adopted YOLOv3 without other improvements.
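The papers above report speed either as testing time per image or as FPS. The following minimal Python sketch shows the conversion between the two, assuming one image per forward pass (batch size 1) and that the reported testing time covers the whole pipeline. The function names are illustrative and do not come from any of the surveyed papers.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Frames per second for a detector that needs latency_ms per image (batch size 1 assumed)."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def latency_ms_from_fps(fps: float) -> float:
    """Per-image testing time in milliseconds for a given frame rate (batch size 1 assumed)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    # Testing times quoted in this section, converted to frame rates.
    for ms in (3.6, 3.9, 9.03):
        print(f"{ms} ms per image is about {fps_from_latency_ms(ms):.1f} FPS")
```

Under these assumptions, 3.6 ms and 3.9 ms per image correspond to roughly 278 and 256 FPS, which agrees with the figures quoted for [54]. Such a check is useful when comparing papers that report only one of the two quantities.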
The innovations of YOLOv4 can be summarized as the FPN+PAN+SPP neck and the DIoU loss. It also uses bag-of-freebies and bag-of-specials to improve the performance further. Researchers in this area have also improved these components.
Jiang et al. [44] proposed YOLO-V4-light for real-time SAR ship detection. It greatly reduced the number of convolutional layers in CSPDarkNet53 and has only two prediction branches. YOLO-V4-light reduces the parameter count from 60 million to 6 million, resulting in a significant reduction in model size and prediction time. It achieved 88.08% AP50 with 22.5 MB model size on SSDD. Xu et al. [45] attached YOLOv4 after CFAR to obtain more accurate final results. YOLOv4 itself was not improved in this paper. Lin et al. [50] proposed an improved YOLOv4. Cosine annealing, label smoothing, and mosaic are used. The anchor boxes are selected by K-means clustering on SSDD. It achieved 95.2% AP50 with 13.94 FPS on SSDD using a GTX1050TI GPU, an improvement of 2.87% over the original YOLOv4. Zhou et al. [55] proposed a lightweight YOLOv4 for SAR ship detection. The backbone is MobileNetv2. Depthwise separable convolution is used to reduce the parameters. It achieved 95.5% AP50 on SSDD, and the number of parameters is reduced by 40% compared with the original YOLOv4. Gao et al. [56] proposed a high-precision, high-efficiency detector based on YOLOv4. The backbone is SAR-Net, which is similar to CSPDarkNet53 except for the input channel. The neck balances the relevance of multiscale semantic information for detecting targets of different sizes. The three-branch head is redesigned with classification and regression tasks. It achieved 87.49% AP50 on HRSID and 76.2% AP50 on LS-SSDD-V1.0 with 42.6M parameters. Ma et al. [57] compressed YOLOv4 through sparsity training, pruning, and knowledge distillation. YOLOv4 itself was not improved in this article. Miao et al. [65] brought an attention mechanism into YOLOv4. The threshold attention module is introduced to suppress the adverse effect of complex backgrounds and noise. The channel attention module is embedded into the FPN to enhance the discrimination ability. The decoupled head with two parallel branches improves the performance of classification and regression. It achieved 94.16% AP50 on SSDD with 42 FPS. Liu et al. [68] proposed a lightweight detector based on YOLOv4-Lite. The backbone is MobileNetv2. The receptive field block is used to improve the feature extraction ability. It achieved 95.03% AP50 with 47.16 FPS and 49.34M model size on SSDD. Yu et al. [76] proposed an efficient lightweight network for SAR ship detection. The ECIOU is proposed to improve the localization accuracy and convergence speed. The SCUPA module is proposed to enhance the multiplexing of picture feature information. The GCHE module is proposed to strengthen the network's ability to extract feature information. It achieved 93.56% AP50 with 68.52 FPS and 31.34M model size on SSDD.
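Several of the detectors above keep or modify the IoU-based regression loss of YOLOv4. As a point of reference, the sketch below computes the plain IoU and the commonly used DIoU loss for two axis-aligned boxes. It assumes boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1, and it is not the ECIOU of [76] or any other variant proposed in the surveyed papers.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def diou_loss(pred, target):
    """DIoU loss: 1 - IoU + (squared center distance) / (squared diagonal of the enclosing box)."""
    # Squared distance between the two box centers.
    pcx, pcy = (pred[0] + pred[2]) / 2.0, (pred[1] + pred[3]) / 2.0
    tcx, tcy = (target[0] + target[2]) / 2.0, (target[1] + target[3]) / 2.0
    center_dist2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
    # Squared diagonal of the smallest box that encloses both boxes.
    ex1, ey1 = min(pred[0], target[0]), min(pred[1], target[1])
    ex2, ey2 = max(pred[2], target[2]), max(pred[3], target[3])
    diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    penalty = center_dist2 / diag2 if diag2 > 0 else 0.0
    return 1.0 - iou(pred, target) + penalty


if __name__ == "__main__":
    print(diou_loss((0, 0, 2, 2), (1, 1, 3, 3)))    # partially overlapping boxes
    print(diou_loss((0, 0, 1, 1), (5, 5, 6, 6)))    # disjoint boxes still give a useful penalty
```

Unlike the plain IoU loss, the center-distance term stays informative when the boxes do not overlap, which is one reason such losses are attractive for small ships whose predicted and true boxes often miss each other early in training.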
The innovations of YOLOv5 can be summarized as the DarkNet53+Focus+CSP backbone, the FPN+PAN+ SPP+CSP neck, and the CIoU Loss.
Tang et al. [41] proposed N-YOLO to detect ships under noise. YOLOv5 is used after the SAR target potential area extraction module and is not improved itself. Zhu et al. [60] proposed DB-YOLO to detect small ships and improve the speed. The backbone of DB-YOLO is a single-stage network with cross-stage partial blocks, which are helpful for real-time detection. The neck of DB-YOLO uses the duplicate bilateral FPN to fuse the semantic and spatial information. The head of YOLO takes the bounding boxes and confidence scores as the inputs. It achieved 97.8% AP50 and 64.9% AP on SSDD, and 94.4% AP50 and 72.0% AP on HRSID, with 10.8M parameters, 25.6G FLOPs, and 48.1 FPS on an RTX 2060 GPU. Zhou et al. [63] proposed a multiscale ship detection network based on YOLOv5s for detecting small ships in SAR images. The cross-stage partial network is used for fusing feature maps adaptively. The FPN with fusion coefficients module is used to choose the best features to fuse for small ship detection. It achieved 95.6% AP50 and 61.1% AP on SSDD, and 95.1% AP50 and 60.1% AP on SAR-Ship, on a 2080Ti GPU. Xu et al. [64], [69] proposed Lite-YOLOv5 and L-YOLO based on YOLOv5 for lightweight on-board SAR ship detection. They have small model sizes and fewer FLOPs, and they run on-board without sacrificing accuracy. Xiao et al. [71] proposed YOLO-v5-Light based on YOLO-v5 for detecting ships on an embedded platform. The backbone uses separable convolution to reduce the amount of computation and 1 × 1 convolution to fuse channels. A lightweight attention mechanism is also used. The parameters, model complexity, and weight file size are reduced to 41.4%, 30.3%, and 43.0% of the original network, respectively. It achieved 88.7% AP50 on SSDD. Xie et al. [75] proposed YOLO coordinate attention SAR ship for real-time on-board SAR ship detection. It shows advantages in efficiency and performance. It achieved 65.6% AP and 97.0% AP50 on SSDD.
The innovations of YOLOX can be summarized as the decoupled head, the anchor-free design, and the SimOTA label assignment strategy.
Feng et al. [73] proposed a lightweight position-enhanced detector. NLCNet is the lightweight feature enhancement backbone with depthwise separable convolution, which balances speed and accuracy. The position-enhanced attention strategy is used for suppressing background clutter. It achieved 97.4% AP50 on SSDD and 89.7% AP50 on HRSID with 18.38G FLOPs, 5.68M parameters, and 7.01 ms testing time on a GeForce RTX 2060 GPU. Peng et al. [80] improved YOLOX with corner efficient intersection over union, adaptive NMS, atrous convolution, and a coordinate attention mechanism for detecting sparse and small ships. It achieved 91.76% AP50 on HRSID with 11 ms testing time on an RTX3060ti GPU. Yu et al. [81] proposed a detector based on YOLOX-s for on-board ship detection. The one-level feature is used for higher efficiency. The residual asymmetric dilated convolution is used for enlarging the semantic information. Center-based uniform matching is used as the balanced label assignment strategy. It achieved 95.5% AP50 and 67.47% AP on SSDD, and 88.39% AP50 and 63.66% AP on HRSID, with 10.3 MB model size and 7.1 ms testing time on a Quadro P6000 GPU.

G. CFAR+CNN Based SAR Ship Detector
In fact, the ocean occupies most of a SAR image, and most areas contain no ships. These pure backgrounds should be discarded before they are input into deep-learning-based detectors, which can significantly reduce the amount of computation. Fortunately, the CFAR detector can distinguish pure backgrounds from ship areas with low computation. Thus, it is necessary to integrate the traditional CFAR methods with deep-learning-based detectors when conducting on-board SAR ship detection. The process of CFAR+CNN is shown in Fig. 28.
Through the above process, the CFAR can quickly exclude chips without ships and prevent the computation-heavy CNN from wasting computing resources. For the chips that do contain targets, the CNN, which is accurate but slow, can identify and locate them precisely. There are four papers based on CFAR+CNN, as shown in Table X. Li et al. [62] proposed a CFAR+CNN model for balancing the speed and accuracy of the SAR ship detector. The CFAR is used for detecting candidate ship chips, and the CNN is used for removing false alarms generated in the CFAR step. This combination follows the traditional idea, and deep-learning-based object detectors are not used, so the accuracy is lower. In order to improve the accuracy and speed of SAR ship detection, Souad Chabbi et al. [78] proposed a CFAR-CNN detector. The generalized gamma distribution is used for modeling the sea clutter. The CNN local detector is applied to improve the accuracy. However, it is also a traditional combination. Xu et al. [64] proposed Lite-YOLOv5 for detecting ships on-board. It includes a histogram-based pure backgrounds classification module, a shape distance clustering module, a channel and SAM, and a hybrid SPP module to improve detection performance. It is also transplanted to the embedded platform NVIDIA Jetson TX2. Xu et al. [45] combined CFAR and a lightweight deep-learning method for detecting ships on-board. CFAR is used for finding potential ships, and YOLOv4 is used to obtain more accurate final results. It achieved 93.46% AP50 on the SAR-Ship dataset with 22.4 MB model size.
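The screening step described above can be sketched as follows. This is a minimal cell-averaging CFAR in Python with NumPy, not the detector of any surveyed paper: the guard and training window sizes, the fixed threshold factor `scale`, and the chip size are assumptions chosen for illustration, and a real system would derive the threshold from a clutter model and a target false alarm rate. The sketch also assumes the image is larger than the guard window, so that every pixel has some background cells.

```python
import numpy as np


def box_sum(img, r):
    """Sum of each (2r+1) x (2r+1) window centered on every pixel, with zero padding."""
    k = 2 * r + 1
    padded = np.pad(img, r)  # zero padding around the image
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    return integral[k:, k:] - integral[:-k, k:] - integral[k:, :-k] + integral[:-k, :-k]


def ca_cfar_mask(img, guard=2, train=6, scale=5.0):
    """Flag pixels brighter than `scale` times the mean of the surrounding background ring."""
    img = np.asarray(img, dtype=np.float64)
    ones = np.ones_like(img)
    outer = guard + train
    # Background ring = outer window minus guard window (the cell under test lies in the guard).
    bg_sum = box_sum(img, outer) - box_sum(img, guard)
    bg_cnt = box_sum(ones, outer) - box_sum(ones, guard)  # valid cells only, handles borders
    return img > scale * (bg_sum / bg_cnt)


def chips_with_candidates(img, chip=64, **cfar_kwargs):
    """Return top-left (row, col) of every chip that contains at least one CFAR detection."""
    mask = ca_cfar_mask(img, **cfar_kwargs)
    keep = []
    for y in range(0, mask.shape[0], chip):
        for x in range(0, mask.shape[1], chip):
            if mask[y:y + chip, x:x + chip].any():
                keep.append((y, x))
    return keep


if __name__ == "__main__":
    scene = np.ones((128, 128))      # flat synthetic "sea"
    scene[70:73, 80:83] = 50.0       # one bright 3 x 3 synthetic "ship"
    print(chips_with_candidates(scene))  # only these chips would be passed to the CNN
```

In this toy scene only one of the four 64 × 64 chips is forwarded to the detector, which illustrates where the computational saving of CFAR+CNN comes from when most of the image is open sea.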

H. Lightweight Backbone Networks for SAR Ship Detection
The backbone network is used to extract features and is an important part of a detection algorithm. It accounts for a very large share of the computation. Therefore, in order to realize real-time detection, many researchers study how to design small and powerful backbone networks, such as MobileNets, ShuffleNets, and so on. Many similar results have been achieved in real-time SAR ship detection, as shown in Table XI. We introduce them one by one below.
Zhang et al. [21] designed a lightweight SAR ship detector, ShipDeNet-20, with only 0.82 MB model size. The FF-Module, feature enhance module, and scale share feature pyramid module (SSFP-Module) are used to compensate for the accuracy loss. The backbone has 15 layers, the FF-Module has 2 layers, and the SSFP-Module has 3 layers. All the convolution layers are depthwise convolution, which makes it more lightweight. It is also trained from scratch. It achieved 97.07% AP50, 233 FPS, and 0.82 MB model size on SSDD. Zhang et al. [24] proposed HyperLi-Net for high-accuracy and high-speed SAR ship detection. HyperLi-Net used five modules to ensure the accuracy: multireceptive-field, dilated convolution, channel and spatial attention, feature fusion, and feature pyramid. It used five further modules to ensure the speed: region-free, small kernel, narrow channel, separable convolution, and batch normalization fusion. It is also trained from scratch. It achieved 96.08% AP50, 222 FPS, 0.69 MB model size, and 4.51 ms testing time on SSDD with an RTX2080Ti GPU. Zhang et al. [40] proposed ShipDeNet-18 with only 1 MB model size for lightweight SAR ship detection. It has fewer layers and fewer kernels. The deep and shallow FF-Module and the feature pyramid module are used for fusing different features. It is also trained from scratch. It has 228 246 parameters, 456 042 FLOPs, and 1 MB model size. It achieved 93.78% AP50 on SSDD. Zhang et al. [17] proposed a high-speed SAR ship detector based on depthwise separable convolution. The conventional convolution is replaced by a depthwise convolution and a pointwise convolution. The multiscale, concatenation, and anchor box mechanisms are used for real-time detection. It achieved 94.13% AP50 with 9.03 ms testing time and 111 FPS on SSDD with an NVIDIA RTX2080Ti GPU. Feng et al. [73] proposed LPEDet for real-time SAR ship detection. The backbone discarded the squeeze-and-excitation module and designed a lightweight convolution block. It shows advantages in accuracy and speed over other methods on SSDD and HRSID.
Most of the above papers are based on depthwise convolution, which is not optimized on hardware such as GPU, FPGA, and DSP. Therefore, a lightweight detection algorithm does not necessarily run in real time on hardware. We can also find that training from scratch and lightweight detection network design are strongly correlated. Training from scratch requires designing the backbone network, and a newly designed lightweight network generally needs to be trained from scratch to avoid the disadvantages brought by pretraining on ImageNet.
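To make the "lightweight" claim concrete, the sketch below counts parameters and multiply-accumulate operations (MACs) for a standard convolution and for its depthwise separable replacement. It ignores biases and batch normalization, and the layer sizes in the example are arbitrary assumptions rather than values from any surveyed network. These counts are indirect indicators only: as noted above, a lower count does not guarantee a lower measured latency on a given device.

```python
def standard_conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of a k x k standard convolution (bias ignored)."""
    params = k * k * c_in * c_out
    macs = params * h_out * w_out
    return params, macs


def depthwise_separable_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    params = depthwise + pointwise
    macs = params * h_out * w_out
    return params, macs


if __name__ == "__main__":
    # Hypothetical layer: 64 -> 128 channels, 3 x 3 kernel, 32 x 32 output map.
    std_params, std_macs = standard_conv_cost(64, 128, 3, 32, 32)
    dws_params, dws_macs = depthwise_separable_cost(64, 128, 3, 32, 32)
    print(std_params, dws_params)       # parameter counts of the two variants
    print(dws_params / std_params)      # equals 1 / c_out + 1 / k**2
```

For a 3 × 3 kernel the ratio approaches 1/9 as the number of output channels grows, which explains why so many of the detectors in Table XI rely on this substitution.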

I. Pruning, Quantization, and Knowledge Distillation on SAR Ship Detection
The deployment of deep CNNs in real-time SAR ship detection is largely hindered by huge storage and computational cost. Model compression and acceleration are necessary approaches to realize real-time target detection. The real-time detection of ship targets in SAR images also requires pruning, quantization, and knowledge distillation on large models to obtain lightweight and high-accuracy detection models. There are six papers in this direction, which are shown in Table XII. Chen et al. [18] slimmed the SAR ship detector by pruning and knowledge distillation. The pruning gives the backbone channel-level sparsity. The network weights and scaling factors are jointly trained with L1 regularization in a channelwise scheme. The FIR-KD is proposed to make up for the accuracy decline caused by pruning. It redefines the extracted knowledge as the relationship between different levels of feature maps, and then transfers it from a large network to a smaller network. It achieved 94.6% AP50, 1.94G FLOPs, 258.6 FPS, 3.9 ms testing time, 0.6M parameters, and 2.8 MB model size on SSDD. Mao et al. [30] also slimmed the SAR ship detector by pruning and knowledge distillation. The detector is pruned at the filter level to obtain lightweight models. Knowledge distillation based on the Kullback-Leibler divergence is proposed to train a small student network with a large teacher network (YOLOv3@EfficientNet-B7) to make up for the accuracy decline. It achieved 92.6% AP50, 56.5% AP, 61.5M parameters, and 17.27 ms testing time on SSDD with a GTX 1080Ti GPU. Ma et al. [57] compressed YOLOv4 to design an edge-device-oriented lightweight detector. Sparsity training on channels and layers is performed with L1 regularization. Channel pruning and layer pruning remove the less important parts, which reduces the width and depth. Then knowledge distillation is used to improve the accuracy, and the model is quantized to FP16 to accelerate it further. Finally, it is deployed on NVIDIA Jetson TX2. It achieved 93% AP50, 15.12 FPS, 5.183G FLOPs, 3.5 MB model size, and 0.857M parameters on SSDD with 416 × 416 input size. Chen et al. [61] proposed a lightweight detector by feature-map-based knowledge distillation. When training the lightweight student network, the similarity between pixels is treated as the transferred knowledge in heatmap distillation. It achieved 80.71% AP50 with only 9.07M model parameters on HRSID. Yang et al. [79] proposed an efficient and lightweight detector with soft quantization for real-time SAR ship detection. The split bidirectional FPN is used to compensate for the lack of accuracy. The soft quantization simulates the quantization process during training and learns variable parameters to adjust the pixel values of each channel, so that the distribution of the feature map stays as similar as possible to that of the original feature map. It achieved 97.0% AP50 on SAR ship detection dataset with less than 15× parameters and less than 6× the FLOPs. Xu et al. [64] designed a lightweight cross-stage partial module to reduce the amount of calculation and pruned it for a more compact detector. The detector is first trained with sparse regularization. Then it is pruned to obtain sparse channels. Finally, the model is fine-tuned to restore the accuracy by iterating the pruning procedures.
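Two ideas recur in the papers above: selecting channels to prune from the magnitude of their scaling factors after L1-regularized training, and recovering accuracy with a Kullback-Leibler distillation loss. The NumPy sketch below illustrates both in a generic form. The global quantile threshold, the temperature, and the T² scaling of the loss are common conventions assumed here, not the exact procedures of [18], [30], [57], or [64].

```python
import numpy as np


def channel_keep_masks(scaling_factors, prune_ratio):
    """Per-layer boolean masks that keep channels whose |scaling factor| exceeds a global threshold.

    scaling_factors: list of 1-D arrays, one per layer (e.g., batch-norm scale parameters).
    prune_ratio: fraction of all channels to remove, between 0 and 1.
    """
    magnitudes = [np.abs(np.asarray(g, dtype=np.float64)) for g in scaling_factors]
    threshold = np.quantile(np.concatenate(magnitudes), prune_ratio)
    masks = [m > threshold for m in magnitudes]
    for m, mask in zip(magnitudes, masks):
        if not mask.any():                  # never remove a whole layer
            mask[np.argmax(m)] = True
    return masks


def log_softmax(logits, temperature):
    """Numerically stable log-softmax over the last axis at the given temperature."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def kd_kl_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened class scores, scaled by T**2."""
    log_p = log_softmax(teacher_logits, temperature)   # teacher
    log_q = log_softmax(student_logits, temperature)   # student
    kl = np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)
    return float(np.mean(kl) * temperature ** 2)


if __name__ == "__main__":
    gammas = [np.array([0.9, 0.01, 0.5, 0.02]), np.array([0.03, 0.7])]
    print([m.tolist() for m in channel_keep_masks(gammas, prune_ratio=0.5)])
    print(kd_kl_loss([[2.0, 0.0]], [[0.0, 2.0]]))
```

In practice the masks would be used to rebuild a thinner network, which is then fine-tuned, optionally with the distillation loss added to the detection loss. The sketch covers only the selection and loss computation, not that training loop.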

J. Hardware Deployment of SAR Ship Detector
Hardware deployment is the last step of SAR ship detection. GPU is the common hardware for desktop real-time target detection. NVIDIA TX2 and FPGA are the commonly used hardware for end-to-end real-time object detection. There are six papers about the hardware deployment of SAR ship detection, as shown in Table XIII. We survey them in the following.
Xu et al. [45] proposed an on-board ship detection method based on CFAR and lightweight deep learning. It can be used by the on-board computing platform of a SAR satellite. The Jetson TX1 is used as the hardware to realize on-board ship detection. The intelligent application module is used on the HISEA-1 satellite. The on-board ship detection method extracts the ship chips and position information and transmits them to the ground. The model parameters in the satellite's intelligent processing unit can be updated. It shows good results on HISEA-1 SAR images. Jerzy et al. [52] described modern FPGA SoCs, SAR systems, and on-board detection systems. It proposed a SAR ship detection system on SoC-based radar payloads. The paper concludes with a few observations on how implementing such a system could affect existing radar platforms. Xu et al. [64] proposed Lite-YOLOv5 based on YOLOv5 and other lightweight ideas. It conducted on-board ship detection with 4.44G FLOPs and 73.15% AP50 on LS-SSDD-v1.0. It is transplanted to NVIDIA Jetson TX2. Xu et al. [69] proposed a lightweight SAR ship detector named L-YOLO. It is also transplanted to NVIDIA Jetson TX2 to validate its practicability. It achieved 73% AP50 with 8.1G FLOPs on LS-SSDD-v1.0. Ma et al. [57] proposed Light-YOLOv4 for edge-device-oriented SAR ship detection. Sparsity training, channel and layer pruning, knowledge distillation, and quantization are used to compress the detector further. It is also deployed on NVIDIA Jetson TX2. The experiments on SSDD show that the detection speed is increased to 4.2×. Yang et al. [70] proposed an algorithm and hardware codesign method for real-time SAR ship detection. The proposed OSCAR-RT is the first end-to-end algorithm and hardware codesign method for real-time on-satellite CNN-based SAR ship detection. It proposed a fully pipelined interlayer flow acceleration architecture, in which all layers of the CNN model can use on-chip FPGA resources for concurrent processing. It proposes a progressive and structured pruning method that is guided by hardware metrics. Coarse-grained and fine-grained filter pruning and mixed-precision quantization are also used. A highly optimized CNN component library is designed. The trimmed CNN model is mapped to these hardware library components in a fully pipelined interlayer flow manner. The proposed method achieves an AP50 of 94% on SSDD, with a speed of 652 FPS on a Xilinx VC709 FPGA while consuming only 5.8 W of power.

VI. DISCUSSION
Here, we reviewed the papers about real-time SAR ship detection. The model compression and acceleration methods and the real-time object detection methods are introduced first. The years, datasets, journals, deep-learning frameworks, public datasets, and evaluation metrics of the 70 papers are introduced second. These 70 papers are reviewed from the following aspects: anchor free, trained from scratch, YOLO series, CFAR+CNN, lightweight backbone, model compression, and hardware deployment. The reported speed and accuracy show the rapid development of these algorithms in recent years. Based on the above review, we find that a real-time SAR ship detector should have the following attributes: single-stage, trained from scratch, anchor free, lightweight backbone and head networks, model compression and acceleration, and optimization for and deployment on edge hardware. Researchers should follow most of these attributes when realizing real-time SAR ship detection. In addition, because ships in SAR images are extremely sparse and most areas contain no targets, CFAR+CNN shows great potential for handling this situation. Furthermore, lightweight networks with many depthwise and pointwise convolutions will not necessarily be fast, because these operations are not optimized on hardware such as GPU, FPGA, and DSP, so they should be used sparingly. Researchers should consider both the direct (speed) and indirect (FLOPs) indicators. Compared with computer vision, real-time SAR ship detection is less popular, with fewer researchers and fewer achievements. Most of the achievements are sporadic borrowings from computer vision and lack thorough innovation. In the future, we should pay more attention to this area and produce more results.

VII. CONCLUSION
This article gives a comprehensive overview of real-time SAR ship detection. First, we introduce the model compression and acceleration methods, which are the sources of innovation in real-time SAR ship detection. The principle and research status of pruning, quantization, knowledge distillation, low-rank factorization, lightweight networks, and model deployment are introduced in detail. Second, we introduce the real-time object detection methods, which also provide inspiration for real-time SAR ship detection. The two-stage real-time detectors, single-stage real-time detectors, anchor-free detectors, trained-from-scratch detectors, and compression and acceleration in object detection are introduced in turn. Third, 70 published papers about deep-learning-based real-time SAR ship detection are reviewed comprehensively. Ten public datasets and the commonly used evaluation metrics are introduced first. Then the 70 papers are categorized into seven types and reviewed in detail: anchor free, trained from scratch, YOLO series, CFAR+CNN, lightweight backbone networks, model compression, and hardware deployment. The principle, innovation, and performance of each are reviewed. Finally, the problems existing in this field and the future directions are described. In the future, we should pay more attention to lightweight CNN design, model compression and acceleration, and hardware deployment in this field.
As far as we know, this is the first review on real-time SAR ship detection. It can serve as a reference for researchers who work in this area or are interested in it, and it can help them quickly understand the research status.