Light-SDNet: A Lightweight CNN Architecture for Ship Detection

Ship detection plays a vital role in monitoring and managing maritime safety. Recently proposed learning-based object detection methods have achieved marked progress in detection accuracy, but these models are too large to be deployed on mobile devices with limited resources. Although some compact models have been presented in previous studies, they achieve unsatisfactory results in ship detection, especially under extreme weather conditions. To address these challenges, this article presents a lightweight convolutional neural network (CNN) called Light-SDNet to perform end-to-end ship detection under different weather conditions. In the proposed model, we introduce the improved CA-Ghost, C3Ghost, and DepthWise Convolution (DWConv) into You Only Look Once version 5 (YOLOv5) to reduce the number of model parameters while retaining its powerful feature expression ability. We use parallel attention to highlight the features that contribute to ship detection in marine surveillance. To enhance the adaptability of the proposed model, a hybrid training strategy that generates synthetically degraded images is proposed to augment the volume and diversity of the original datasets. The proposed strategy enables Light-SDNet to improve the ship detection results under severe weather conditions such as haze, rain, and low illumination. We compare Light-SDNet with other competitive approaches on a large-scale ship dataset called SeaShips. We show that Light-SDNet achieves a better balance between detection accuracy and model complexity. The ship detection results on degraded marine images demonstrate the superior performance of the proposed model in terms of detection accuracy, robustness, and efficiency.


I. INTRODUCTION
It is increasingly important to enhance maritime traffic safety with the development of offshore economic activities and the exploration of marine resources. In particular, ship collision accidents occur frequently under extreme weather conditions. The Automatic Identification System (AIS) [1] has achieved remarkable results in maritime surveillance. However, the Class A AIS that can send self-ship information is only mandatory on ships that can load more than 300 tons, so it may lead to omissions in the detection of small and medium-sized ships. Meanwhile, some illegal ships deliberately turn off related equipment in an attempt to evade detection and surveillance. Thus, the video surveillance system is essential to further improve maritime supervision and security. Maritime supervisors can obtain intuitive visual information by observing surveillance video images, but the visual fatigue caused by long-term observation may lead them to overlook important information. With the rapid development of deep-learning technology, many advanced ship detection methods have been proposed, which provide strong support for building an intelligent maritime video surveillance system.
The associate editor coordinating the review of this manuscript and approving it for publication was Kegen Yu.
Unlike traditional methods that require hand-crafted features and suffer from poor detection results, learning-based methods achieve end-to-end object detection and better performance by extracting useful features automatically [2], [3]. Currently, learning-based methods can typically be divided into two categories. One is the single-stage framework, such as SSD [4] and the YOLO series [5], [6], [7], [8], [9], and the other is the two-stage framework, such as R-CNN [10], Fast R-CNN [11], and Faster R-CNN [12]. The former enables faster detection with a lower computational burden, while the latter tends to achieve more accurate results at the cost of slower detection speed [13], [14]. The aforementioned methods are not lightweight enough, so they are unsuitable for maritime video surveillance systems with limited memory and computation power.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To address this problem, many efforts have been made to develop compact and efficient CNN models. EfficientNet [15] adopts a compound scaling method, which markedly reduces the model parameters and improves the classification speed via scaling up any dimension of the network (width, depth or resolution). MobileNetV3 [16] is a very lightweight and low-latency model obtained by neural architecture search, whose internal modules are inherited from the depthwise separable convolution [17] and the inverted residual structure with a linear bottleneck [18]. In addition, it uses a SENet [19] attention module after the pointwise convolution to enhance important features. Considering the detection accuracy and processing speed, ShuffleNetV1/V2 [20], [21] manages the exchange of information between the groups via channel shuffle operations. GhostNet [22] uses Ghost modules to extract more features from cheap operations. The aforementioned architectures are capable of extracting effective features, but at the expense of decreased detection accuracy.
In reality, the visual quality of images captured by maritime surveillance systems is generally affected by poor imaging conditions, such as rain, haze, and low illumination [23]. Image deterioration adversely affects vessel traffic safety and security, and accurate ship detection in degraded maritime images becomes intractable. To improve the accuracy of ship detection under bad weather conditions (e.g., rain, fog, low illumination), the degraded maritime images are typically recovered before ship detection using image restoration algorithms [24], [25], [26]. However, using these restored images tends to reduce the accuracy and robustness of ship detection due to the loss of detailed features.
To make ship detection more robust and accurate under different weather conditions, a hybrid data training strategy is introduced to enlarge the diversity and volume of the original dataset. In addition, we propose a compact and efficient network based upon an improved YOLOv5 for ship detection on mobile or embedded devices. By combining the proposed model and the hybrid data training strategy, the proposed method has great potential to obtain reliable ship detection with higher accuracy, efficiency, and robustness. The contributions of this study can be summarized as follows:
(1) We propose a lightweight network for real-time ship detection named Light-SDNet. To assign greater weights to more valuable information, both the coordinate and parallel attention mechanisms are introduced into the proposed lightweight network. Specifically, the attention-guided CA-Ghost module and the C3Ghost module extract features in the Backbone and fuse features in the Neck, respectively.
(2) Extensive experiments on the large ship dataset called SeaShips show that the proposed Light-SDNet can achieve higher detection accuracy with comparable model parameters and computation burden, which makes it suitable for mobile terminals or embedded systems with limited computation power and memory capacity.
(3) A hybrid data training strategy is proposed to solve ship detection in adverse weather conditions. Experimental results on degraded ocean images demonstrate the superior performance of our proposed model in terms of detection accuracy, robustness, and efficiency.
The remainder of the paper is organized as follows. Section II briefly reviews existing ship detection methods. Section III presents the proposed YOLOv5-enhanced ship detection framework. Section IV describes the proposed hybrid training strategy and exhibits extensive experimental results on the SeaShips dataset. We finally summarize the main contributions of this study in Section V.

II. RELATED WORK
A. YOLOv5
The YOLOv5 algorithm [9] adopts various enhancement techniques at the input, such as mosaic, adaptive image scaling, and adaptive anchors. The main purpose of the first convolution in the Backbone is to reduce model parameters, floating point operations (FLOPs), and memory overhead, so that the forward and backward speed is increased with marginal effects on detection accuracy. By applying gradient change to the feature map, the C3 module is capable of tackling the issue of repeated gradient information in the Backbone of a large-scale neural network. Additionally, the network incorporates Spatial Pyramid Pooling-Fast (SPPF) and Path Aggregation Network (PANet) to enhance its feature fusion capabilities.

B. ATTENTION MECHANISM
The attention mechanism is a resource allocator that adaptively assigns weights to features via channel or spatial modeling. SENet [19] captures cross-channel information with global average pooling, but extracting all channel features may be inefficient and unnecessary. To improve the cost-effectiveness of network models, ECA-Net [37] adopts a local cross-channel interaction, i.e., a one-dimensional convolution is used to screen the strong inter-channel dependencies. In this way, ECA-Net markedly improves the network performance while effectively reducing the number of model parameters. However, both methods merely focus on the relationship between channels, ignoring the significance of spatial features. In contrast, BAM [38] and CBAM [39] can extract both the inter-channel relationship and the intra-spatial relationship of features, so they obtain richer high-level features for vision tasks. In addition, coordinate attention (CA) [40] helps to localize objects of interest more accurately by embedding location information into channel attention. The CA can also be flexibly inserted into mobile networks with nearly no computational overhead.

C. SHIP DETECTION IN MARITIME SURVEILLANCE SYSTEM
Ship detection plays an important role in maritime traffic safety, so extensive efforts have been made in the field of automatic detection of moving ships. For example, Zhang et al. proposed a ship detection method based upon the discrete cosine transform (DCT) [41]. The detection method primarily includes three stages, i.e., background modelling, background subtraction, and horizon detection, and can achieve robust detection results under complex sea conditions with surface waves. According to the visual attention model, Shi et al. obtained saliency maps for ship detection by fusing directional features, color features, and motion features [42]. Chen et al. proposed a real-time ship detection and tracking system based on mean shift, which is able to achieve good automatic tracking performance [43]. However, conventional ship detectors typically suffer from unsatisfactory detection accuracy under severe marine imaging conditions (e.g., haze, rain, and low luminance) because they depend heavily on hand-crafted features.
CNNs provide a new avenue for accurate and efficient detection of moving ships owing to their powerful feature extraction capability. A number of CNN-based methods have recently been developed for ship detection in different maritime environments. For example, Cui et al. improved CenterNet with a spatial shuffling attention module to achieve large-scale ship detection in synthetic aperture radar (SAR) images [44]. To address the difficulty of deploying existing models on edge devices with limited memory resources, Ma et al. proposed a lightweight object detector by compressing YOLOv4 [45]. To effectively identify ships with various scales in high-resolution optical remote sensing images, Li et al. generated candidate ships from the feature maps using a region-proposal network [46]. So far, most detection tasks have been implemented on SAR and optical remote sensing images, which typically suffer from a low signal-to-noise ratio (SNR) that makes small objects difficult to detect.
To address this problem, considerable efforts have been made to develop efficient CNN-based models for ship detection in natural images. For example, Shao et al. proposed a saliency-aware CNN framework to achieve accurate and real-time ship detection in surveillance video images [47]. The framework includes coastline priors, deep features, and saliency maps. In addition, coarse-to-fine cascaded CNNs for ship detection and tracking have received extensive attention, leading to autonomous maritime surveillance [48]. To develop a robust ship detector under severe weather conditions, an enhanced YOLOv3 was proposed with data augmentation training, and its results demonstrate its effectiveness for ship detection [49]. The existing CNN-based ship detection methods have achieved marked progress, but they are typically unsuitable for embedded devices and mobile terminals with limited computation power and storage capacity because of their high computational complexity and large model size.
To achieve a better balance between model complexity and detection accuracy, we aim to develop a lightweight CNN architecture for ship detection in maritime video surveillance by improving YOLOv5 and introducing a hybrid training strategy. We also provide an ablation study to show the functions of the critical components of Light-SDNet, and describe extensive results to verify its good performance in ship detection under different maritime environments, especially under extreme weather conditions.

III. THE PROPOSED SHIP DETECTION FRAMEWORK
To solve the problems of low detection accuracy and difficult deployment of redundant networks in maritime surveillance, we propose a lightweight ship detection network (Light-SDNet) based on YOLOv5s. In this section, we describe the proposed method's exploration trajectory and overall framework.

A. EXPLORATION TRAJECTORY
We fine-tuned the YOLOv5s network in this part. On the premise of ensuring detection accuracy, the network parameters are compressed to reduce the computation burden. Figure 1 shows the exploration trajectory from YOLOv5s to Light-SDNet.
To achieve a lightweight and powerful network, the following changes have been made to the original YOLOv5s:
(1) To extract better location features in the shallow network, the Ghost module [50] with CA replaces the common convolution module of the Backbone to perform 2× downsampling.
(2) DWConv replaces the convolution module used in the Neck network, reducing computational bottleneck and memory overhead.
(3) The C3Ghost replaces the C3 module as the main feature fusion module of the Neck network, guaranteeing the lightweight nature and detection accuracy of the target network.
(4) Parallel attention (PA) is additionally introduced to further enhance the ability of Neck feature fusion.
In addition, the depth multiplier of the target network is increased from 0.33 to 0.50 to enhance its learning ability. The final architecture of the proposed model is shown in Table 1 and Figure 2.
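The effect of raising the depth multiplier can be illustrated with a small sketch. It assumes the repeat-scaling rule used by the public YOLOv5 code base (round the scaled count and keep at least one module), and the base repeat counts below are hypothetical values chosen for illustration; the actual per-stage repeats of Light-SDNet are those listed in Table 1.

```python
# Sketch of how a depth multiplier scales the number of stacked modules per stage.
# BASE_REPEATS is an illustrative assumption, not a value taken from this paper.
BASE_REPEATS = [3, 6, 9, 3]


def scaled_repeats(base_repeats, depth_multiple):
    """Scale each repeat count, keeping at least one module per stage."""
    return [max(round(n * depth_multiple), 1) if n > 1 else n
            for n in base_repeats]


repeats_small = scaled_repeats(BASE_REPEATS, 0.33)   # original YOLOv5s multiplier
repeats_deeper = scaled_repeats(BASE_REPEATS, 0.50)  # multiplier adopted in the text
print(repeats_small, repeats_deeper)
```

Under these assumptions the larger multiplier adds roughly one extra module to each stage, which is the sense in which the network becomes deeper.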

B. BACKBONE
In the Backbone network, the ship image first goes through a convolution with a size of 6 × 6, a stride of 2, and a padding of 2 to perform downsampling. Then it goes through four stages, all of which contain the Ghost module with CA and C3 module. The process is summarized as follows: the CA-Ghost module performs 2× downsampling of the input from the previous stage, and the C3 module performs feature extraction to obtain a total of four feature maps with different scales.
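As a quick check of the resolutions implied by this description, the following sketch applies the standard convolution output-size formula to the stem and then halves the resolution once per stage. The 640-pixel input matches the experimental settings reported later; the function names are illustrative only.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (no dilation, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def backbone_sizes(input_size=640, num_stages=4):
    """Feature-map side length after the stem and after each 2x-downsampling stage."""
    sizes = [conv_out_size(input_size, kernel=6, stride=2, padding=2)]
    for _ in range(num_stages):
        sizes.append(sizes[-1] // 2)  # each CA-Ghost stage halves the resolution
    return sizes


print(backbone_sizes())
```

The last four entries correspond to the four feature maps with different scales produced by the four stages.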
As feature maps pass through the Backbone, scale compression and channel expansion gradually transfer spatial information into the channel dimension. To compensate for the loss of shallow features, the CA [40] is introduced into the Ghost module to construct the CA-Ghost module. We focus on the width and height of feature maps to improve model performance at a low cost. Figure 3(a) provides the four types of CA-based Ghost modules we designed based on the location of the embedded CA, namely Ghost-a, Ghost-b, Ghost-c, and Ghost-d. Experimental results show that we obtain the largest mAP when integrating CA after DWConv in Ghost. Thus, we use Ghost-c as CA-Ghost. Figure 3(b) depicts the structure of the CA block. The input of the CA-Ghost module goes through two branches: the left branch passes through DWConv to reduce the size of feature maps, and then a convolution module is used to double the number of channels; the right branch first performs a convolution operation with the Ghost module, then uses DWConv under the guidance of CA for downsampling, and finally applies the Ghost module again to double the number of channels. The two branches are then directly added together as the output. As shown in Table 1, the parameters of the CA-Ghost module are reduced by more than a factor of two relative to those of the ordinary convolution, which reduces computation cost and reinforces positional features.
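The idea of attending to width and height separately can be sketched with the two one-dimensional average poolings on which coordinate attention [40] is built. This is only the pooling step on a single-channel map stored as nested lists; the subsequent convolutions and sigmoid gating of the CA block are omitted, and the function name is illustrative.

```python
def directional_pooling(feature_map):
    """Average a single-channel H x W map along each spatial axis.

    Returns (row_means, col_means): H values pooled along the width
    and W values pooled along the height.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    row_means = [sum(row) / width for row in feature_map]
    col_means = [sum(feature_map[i][j] for i in range(height)) / height
                 for j in range(width)]
    return row_means, col_means


fmap = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
rows, cols = directional_pooling(fmap)
print(rows, cols)
```

Because each pooled vector keeps one spatial axis intact, the resulting attention weights can encode where along the height and width a ship appears, which a single global average cannot.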

C. NECK
When the feature map enters the Neck, the channel dimension reaches its maximum, while the resolution reaches its minimum. The SPPF module then focuses on spatial information to handle large variations in object scale. As shown in Table 1, we also replaced the common convolution of the Neck with DWConv. Unlike traditional convolution, DWConv assigns one convolution kernel to each channel and convolves channel by channel, which can markedly reduce the model size. However, this comes at the expense of some feature fusion capability.
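The size reduction from channel-by-channel convolution can be made concrete by counting weights. The sketch below assumes a grouped convolution whose group count is the greatest common divisor of the input and output channels (so that each channel has its own kernel when the two are equal), and it ignores bias and normalization parameters; the channel width of 256 is a hypothetical example rather than a value from Table 1.

```python
import math


def conv_weight_count(c_in, c_out, kernel, groups=1):
    """Number of weights in a 2D convolution (bias and normalization excluded)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kernel * kernel


def depthwise_weight_count(c_in, c_out, kernel):
    """Grouped convolution with groups = gcd(c_in, c_out)."""
    return conv_weight_count(c_in, c_out, kernel, groups=math.gcd(c_in, c_out))


standard = conv_weight_count(256, 256, 3)
depthwise = depthwise_weight_count(256, 256, 3)
print(standard, depthwise, standard // depthwise)
```

The same count also shows why fusion capability drops: with one kernel per channel, no weight mixes information across channels.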
The specific structures of the three types of C3 modules are shown in Figure 4.
To focus more on the features of the ships in the image, we add a parallel attention mechanism after the C3Ghost module. As shown in Figure 5, the parallel attention module is implemented by combining BAM [38] with ECA-Net [37]. That is, the spatial attention comes from BAM, and the channel attention comes from ECA-Net. By combining the channel attention Mc(F) and the spatial attention Ms(F) from the two attention branches, we can generate the 3D attention map M(F) by taking the sigmoid function. To obtain a refined feature map, this 3D attention map is element-wise multiplied by the input feature map F and then added to the original input feature map. Furthermore, the ablation study in Section IV shows that the parallel structure of ECA-Net and BAM is more efficient than the sequential structure, so we adopt a parallel design in the proposed attention module.
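The refinement described above can be written compactly as follows. The text does not state how the two branches are combined before the sigmoid; the broadcast sum used here is an assumption that follows the BAM formulation [38], with σ denoting the sigmoid function and ⊗ element-wise multiplication.

```latex
% Assumed combination rule (broadcast sum, as in BAM):
% F   \in \mathbb{R}^{C \times H \times W}  input feature map
% M_c \in \mathbb{R}^{C \times 1 \times 1}  channel attention (ECA-Net branch)
% M_s \in \mathbb{R}^{1 \times H \times W}  spatial attention (BAM branch)
M(F) = \sigma\bigl(M_c(F) + M_s(F)\bigr), \qquad
F' = F + F \otimes M(F)
```

The residual form means that attention can only re-weight features on top of the identity path, so the refined map F' never discards the original input.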

D. HEAD
In the detection Head, three sets of output feature maps are processed to generate a final output vector with class probability scores, bounding boxes, and confidence scores. Using Non-Maximum Suppression (NMS), the outputs of the three detection layers are screened to obtain the final detection results.
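The screening step can be illustrated with a minimal greedy NMS in plain Python. This is a generic sketch rather than the implementation used in the experiments; boxes are assumed to be (x1, y1, x2, y2) tuples, and the IoU threshold of 0.45 is an illustrative default.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Return indices of kept boxes, ordered from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In the example, the second box overlaps the first heavily and is suppressed, while the distant third box survives.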
The loss function of the proposed method consists of three parts: classification loss (cls_loss), localization loss (loc_loss), and confidence loss (obj_loss). The formula is as follows:
$$\text{Loss} = \lambda_1\,\text{cls\_loss} + \lambda_2\,\text{loc\_loss} + \lambda_3\,\text{obj\_loss} \tag{1}$$
where λ1, λ2, and λ3 are coefficients that weight the loss contributions, with values of 0.5, 0.05, and 1.0, respectively.
The confidence loss and the classification loss are calculated using the binary cross-entropy (BCE) loss with logits, and the CIoU loss is used to evaluate the localization loss between the predicted box and the ground-truth box.
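For reference, the CIoU term can be sketched in plain Python using its standard definition: IoU minus a normalized centre-distance penalty minus an aspect-ratio penalty. The box format (x1, y1, x2, y2) and the small stabilizing constant `eps` are assumptions of this sketch, not details taken from the experiments.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """Return 1 - CIoU for two (x1, y1, x2, y2) boxes."""
    # Intersection over union.
    iw = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    ih = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = iw * ih
    w1, h1 = pred[2] - pred[0], pred[3] - pred[1]
    w2, h2 = target[2] - target[0], target[3] - target[1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    # Squared centre distance over the squared diagonal of the enclosing box.
    cw = max(pred[2], target[2]) - min(pred[0], target[0])
    ch = max(pred[3], target[3]) - min(pred[1], target[1])
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((pred[0] + pred[2] - target[0] - target[2]) ** 2 +
            (pred[1] + pred[3] - target[1] - target[3]) ** 2) / 4.0
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(w2 / (h2 + eps)) -
                              math.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - (iou - rho2 / c2 - alpha * v)


print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))
print(ciou_loss((0, 0, 10, 10), (20, 20, 30, 30)))
```

Unlike plain IoU, the loss keeps growing with centre distance for non-overlapping boxes, which still gives a useful gradient when a predicted box misses the ship entirely.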

IV. EXPERIMENTAL RESULT AND ANALYSIS
To assess the performance of Light-SDNet, we compare it with other methods qualitatively and quantitatively. To get good detection results in both normal weather and severe weather conditions, we propose an end-to-end mixed-data training strategy. The proposed training strategy has also been implemented to demonstrate good performance under poor imaging conditions.

A. IMPLEMENTATION DETAILS 1) DATASET DESCRIPTION AND SETTINGS
We use the public ship dataset named SeaShips [51] as the original dataset. The dataset includes 7000 images that cover six types of ships: ore carriers, bulk carriers, general cargo ships, container ships, fishing boats, and passenger ships. The maritime images originate from video camera surveillance systems that track all ships near shore. Bad weather conditions, such as fog, rain, and low light, tend to markedly deteriorate the quality of the images captured by a marine surveillance system, so we constructed three degraded datasets based on the classic SeaShips dataset. The extended degradation datasets are described in Section IV-C. The SeaShips dataset and its degraded versions are each randomly divided into training, validation, and test sets with a 3:1:1 ratio for the experiments. The SeaShips_fog dataset is used to explore the model structure, while the SeaShips dataset and its degraded versions are used to measure the impact of severe weather conditions on ship detection performance. The effectiveness of the proposed hybrid training strategy is verified below.
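A random 3:1:1 split of the 7000 images can be sketched as follows. The fixed seed and the use of integer image indices are assumptions made for reproducibility of the example; the actual split used in the experiments is not specified beyond the ratio.

```python
import random


def split_dataset(items, ratios=(3, 1, 1), seed=0):
    """Shuffle items and split them into train/val/test by the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(7000))
print(len(train), len(val), len(test))
```

With 7000 images this yields 4200 training, 1400 validation, and 1400 test images.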

2) EXPERIMENTAL ENVIRONMENT AND PARAMETER SETTINGS
Our ship detection experiments use the PyTorch (1.8.0) software library installed on Ubuntu 18.04. Specifically, all experiments are performed on a computer with an Intel(R) Xeon(R) Silver 4210R CPU @2.40 GHz and an NVIDIA GeForce RTX 3090 GPU. For the hyperparameters used in our network, the base learning rate, momentum, and weight decay are set to 0.01, 0.937, and 0.0005, respectively. In all experiments, the size of input images is 640 × 640 pixels, the number of epochs is set to 300, and the batch size is set to 16. All the remaining parameters take the default values in the original YOLOv5.

B. METRICS
We follow the same criteria as PASCAL VOC [52] to evaluate the performance of Light-SDNet.
Precision is used to evaluate whether the prediction of ships is accurate, and reflects the proportion of actually positive samples over all predicted positive samples. Recall is used to evaluate whether all ships in the test dataset have been predicted correctly, and reflects the proportion of positive samples predicted correctly by the model over the total positive samples. The F1 score is the harmonic mean of precision and recall. mAP@0.5 and mAP@.5:.95 are comprehensive indicators that measure the precision and robustness of ship detection. Correct detecting ratio (CDR) is the proportion of correctly predicted samples over all samples. False alarm ratio (FAR) is the proportion of negative cases that are incorrectly classified as positive over all predicted positive samples.
Besides the above evaluation indicators, we provide the model parameters, FLOPs, and training time to characterize the efficiency of Light-SDNet. The fewer the model parameters and FLOPs, the lower the cost of the detection model.
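The count-based metrics can be expressed directly in terms of true positives (TP), false positives (FP), and false negatives (FN). The sketch below covers precision, recall, F1, and FAR as defined above; the counts in the example are hypothetical and are not results from this paper.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, F1 and false alarm ratio from detection counts."""
    predicted = tp + fp   # all predicted positives
    actual = tp + fn      # all ground-truth positives
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    far = fp / predicted if predicted else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "far": far}


# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=90, fp=10, fn=30)
print(metrics)
```

With FAR defined over the predicted positives, it is simply the complement of precision, so the two always sum to one.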

C. DATA AUGMENTATION AND THE HYBRID DATA TRAINING STRATEGY
Complex maritime environments such as fog, rain, and low light typically cause the captured images to be blurred and occluded, which makes automatic ship surveillance very difficult. To explore the impact of weather conditions on ship detection, we synthetically simulated degraded images and constructed three degraded datasets that simulate fog, rain, and low-light environments based on the classic SeaShips dataset. This study is intended to support the practical application of ship detection under severe weather conditions. We also propose an end-to-end hybrid data training algorithm aimed at achieving good detection performance in normal and multiple severe weather conditions.

1) GENERATING SYNTHESIZED DEGRADED IMAGES
In the extended SeaShips_fog dataset, hazy sea images are generated based upon the atmospheric scattering model expressed as (2):
$$I(x, y) = J(x, y)\,t(x, y) + A\,\bigl(1 - t(x, y)\bigr) \tag{2}$$
where I(x, y) is a hazy image, J(x, y) is a haze-free image, A is the global atmospheric light, and t(x, y) is the medium transmission map, which decays exponentially with increasing distance and is formulated as (3):
$$t(x, y) = e^{-\beta d(x, y)} \tag{3}$$
where β is the medium attenuation coefficient and d(x, y) is the scene depth. Several synthetically degraded samples are shown in Figure 6.
The rainy images can be synthesized by superimposing simulated raindrop trajectories on a clear image. Thus, a synthetically degraded image Z(x, y) with rain streaks can be formulated as
$$Z(x, y) = J(x, y) + B(x, y) \tag{4}$$
where J(x, y) is a latent sharp image and B(x, y) is the raindrop noise layer. As shown in Figure 7, different rainy images can be synthesized by adjusting the lengths and angles of rain streaks. In the experiments, the number of raindrops is set to 800, the length of the rain streaks ranges from 20 to 80 pixels, and the angle of the rain streaks is randomly chosen between −50 and 50.
A low-light maritime image is synthesized based on the Retinex theory, which assumes that an original image S is the product of the reflection image R and the illumination image L, i.e.,
$$S = R \circ L \tag{5}$$
where R may be seen as the latent sharp image and L represents the spatially smooth intensities of light on the objects. To synthesize the low-light maritime images, we first convert the original RGB images into HSV images. The ocean images are then visually degraded by multiplying the V layer of the original images by different attenuation coefficients ω ∈ (0, 1). As shown in Figure 8, the low-light images are generated with ω = 0.1, 0.2, 0.3, 0.4, and 0.5, respectively.
Table 2 shows the results of Light-SDNet trained with the three artificially synthesized degraded datasets to evaluate the impact of different imaging conditions on ship detection (i.e., mAP@.5/mAP@.5:.95), including normal, hazy, low-light, and rainy conditions. The results in Table 2 reveal that the precision of ship detection increases markedly if the imaging conditions of the training and testing datasets are the same. Inspired by this [53], we propose a hybrid data training strategy.
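The fog and low-light degradations can be sketched at the level of a single pixel. The sketch assumes the standard atmospheric scattering form I = J·t + A·(1 − t) with t = exp(−β·d), RGB values normalized to [0, 1], and a white atmospheric light A = 1.0; the depth, β, and ω values are illustrative only, and the rain-streak layer is omitted.

```python
import colorsys
import math


def add_fog(pixel, depth, beta, airlight=1.0):
    """Blend one RGB pixel with the atmospheric light using t = exp(-beta * depth)."""
    t = math.exp(-beta * depth)
    return tuple(c * t + airlight * (1.0 - t) for c in pixel)


def darken(pixel, omega):
    """Scale the V channel (HSV) of one RGB pixel by omega in (0, 1)."""
    h, s, v = colorsys.rgb_to_hsv(*pixel)
    return colorsys.hsv_to_rgb(h, s, v * omega)


clear = (0.2, 0.4, 0.6)
foggy = add_fog(clear, depth=2.0, beta=0.5)
dark = darken(clear, omega=0.3)
print(foggy, dark)
```

Applying the same functions to every pixel, with a per-pixel depth map for the fog case, gives a full degraded image; larger β or depth pushes pixels toward the atmospheric light, and smaller ω produces darker scenes.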
Each image has a probability of 3/4 of being randomly degraded with varying degrees of fog or rain, or of being converted to a low-light image, before it is fed to the network for model training. To detect ships more effectively in dense fog and very low light, we generate a wider range of fog concentrations and lower illumination levels to simulate ocean scenes. A rainy scene is additionally simulated so that the detector also performs well on rainy days.

2) THE HYBRID DATA TRAINING STRATEGY
In Algorithm 1, we describe the hybrid data training process in detail.
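The sampling step of the hybrid strategy can be sketched as follows. Only the overall degradation probability of 3/4 comes from the text; the equal split among fog, rain, and low light, the function names, and the fixed seed are assumptions of this sketch, and the random choice of degradation severity is omitted.

```python
import random

# Degradation types named in the text; the equal split among them is assumed.
DEGRADATIONS = ("fog", "rain", "low_light")


def choose_degradation(rng, p_degrade=0.75):
    """Return the degradation for one training image, or None to keep it clean."""
    if rng.random() >= p_degrade:
        return None
    return rng.choice(DEGRADATIONS)


def hybrid_plan(image_ids, rng):
    """Pair every image with the degradation it receives in this epoch."""
    return [(image_id, choose_degradation(rng)) for image_id in image_ids]


rng = random.Random(0)
plan = hybrid_plan(range(10000), rng)
clean_fraction = sum(1 for _, kind in plan if kind is None) / len(plan)
print(round(clean_fraction, 3))
```

Because the choice is redrawn every time an image is loaded, the same ship can appear clean in one epoch and foggy, rainy, or dark in another, which is what enlarges the effective diversity of the training set.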

D. QUANTITATIVE EVALUATION
To verify the effectiveness of the proposed Light-SDNet, a quantitative evaluation was performed by comparison with state-of-the-art (SOTA) algorithms, including the YOLO series and the SOTA lightweight Backbone series. The YOLO series includes YOLOv3-tiny, YOLOv4-tiny, YOLOv5n, and YOLOv5s, while the popular lightweight Backbones include GhostNet [22], EfficientNet-lite [15], MobileNetv3s [16], and ShuffleNet-v2 [21], each used to replace the Backbone of YOLOv5s. The performance comparison was carried out on the synthetic SeaShips_fog dataset. Figure 9 shows the curves of mAP, Precision, and Recall for all detectors during model training. As shown in Figure 9(a) and (b), Light-SDNet is better than other models due to its slightly higher mAP values, while Figure 9(c) shows that the performance of YOLOv3-tiny and YOLOv4-tiny is much lower than that of other models since YOLOv5 improves their feature extraction network and data augmentation techniques. As can be seen from Figure 9, all curves rise gently and converge rapidly, indicating that the models are well trained without overfitting.
Table 3 shows that Light-SDNet achieves the best performance among the YOLO series since it improves mAP performance by 1.8% compared with the original YOLOv5s and 10.6% compared with YOLOv3-tiny. Moreover, the size of Light-SDNet is only 4.93 MB, accounting for 68.1%, 83.7%, and 56.8% of YOLOv5s, YOLOv4-tiny, and YOLOv3-tiny, respectively. As shown in Table 4, Light-SDNet achieves higher detection accuracy than the other lightweight Backbone networks, although its model size is not the smallest among the compared models. The comparison also reveals that the detection accuracy of Light-SDNet is the highest for ship detection in adverse weather conditions. In terms of detection speed, the inference time of Light-SDNet is 2.0 ms per image (500 frames per second, fps), indicating that Light-SDNet enables real-time ship detection.
The detection precision and computational burden of the compared models are visualized in Figure 10. We can see that Light-SDNet is more cost-effective than its competitors because it uses multi-feature fusion and a channel-spatial parallel attention mechanism for ship detection. Table 5 shows the performance comparison between Light-SDNet with the hybrid training strategy and its competitors. We can observe that, compared with its competitors, the proposed method achieves a marked improvement in the precision of ship detection in maritime surveillance images. The reason is that Light-SDNet adopts the CA-Ghost module and the C3Ghost module guided by the attention mechanism, which achieves fuller shallow feature extraction and effective multiscale feature fusion. The hybrid training strategy is used to enhance the diversity of the training data and improve the robustness of Light-SDNet, which can further improve target detection accuracy.

E. QUALITATIVE EVALUATION
To qualitatively compare Light-SDNet with other models, we conduct experiments on the synthetic sea fog dataset. The results are shown in Figures 11 and 12, where the rectangular boxes mark the ships detected by different detectors. Special scenarios such as the simultaneous appearance of multiple ships, large overlapping areas of ships, small ships, and dense fog make it more difficult to detect ships, resulting in unreliable monitoring of maritime traffic. Figures 11-12 indicate that most of the compared models achieve unsatisfactory results on ship detection under complex conditions because missed detections and false detections occur frequently. We can see that YOLOv3-tiny cannot accurately identify bulk carriers when multiple ships appear concurrently, and MobileNetv3s misidentifies bulk carriers as ore carriers. Vessel detection in severe weather conditions is a challenge because bad weather markedly reduces the quality of the images captured by maritime surveillance systems. YOLOv3-tiny and YOLOv4-tiny cannot effectively detect ships in dense fog because they are sensitive to unstable imaging. Small ship detection is also a challenge. Small ships occupy few pixels in the original image, which leaves too few discriminative features. As a result, the detector cannot identify these small target ships accurately once their features have been blurred by many convolutional layers. Water surface reflections and ocean waves, which are particular to marine imaging scenes, also cause confusion and interfere with imaging, increasing the difficulty of feature extraction for small target ships. However, GhostNet and Light-SDNet show good performance for small target ship detection because the Ghost module embedded into the model lets channels complement each other's information and retains more low-level information beneficial to small target detection.
As shown in Figures 11-12, all models except YOLOv4-tiny accurately detect multiple ships when the occlusion rate is moderate. In contrast, Light-SDNet achieves robust detection of moving ships under various surveillance conditions, providing strong support for maritime surveillance systems.

F. ABLATION STUDY
To verify the effectiveness and efficiency of the proposed strategy, we conducted experiments on the components of the Light-SDNet architecture and on the proposed hybrid training strategy. Details of the experimental configurations of the Light-SDNet architecture are presented in Table 6. We can observe that the original YOLOv5s yields the lowest mAP, resulting in unsatisfactory detection results. The improvement in detection accuracy brought by DWConv is not obvious, but the computational burden is reduced dramatically. As shown in Table 6, applying Ghost and C3Ghost in YOLOv5s improves the detection accuracy markedly due to the optimization of feature maps. Moreover, the results indicate that an appropriate increase in network depth provides a 0.6% improvement in mAP. The comparison also shows that coordinate attention and parallel attention improve the detection accuracy of the original YOLOv5s. Overall, Light-SDNet improves the mAP by 1.8% compared with the original YOLOv5s. Thus, the proposed framework improves detection performance markedly by integrating multiple functional modules with YOLOv5s and appropriately increasing network depth.
To describe the impact of different ship types on model performance, the PR curve and the F1-score curve of Light-SDNet are shown in Figure 13 (a) and (b), respectively. Figure 13 (a) shows that Light-SDNet is an effective detector, since it maintains high precision as the recall rate increases. Figure 13 (b) shows that Light-SDNet achieves the highest F1 score on container ship images due to the salient characteristics of container ships. Meanwhile, it keeps a reasonable F1 score on all classes of ship images as the confidence increases, especially on small fishing ships. Thus, the proposed detector possesses good generalization performance.
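For reference, each point of an F1-score curve is the harmonic mean of the precision and recall measured at one confidence threshold. The sketch below shows this relation and how the best operating threshold can be read off such a curve. The function names and the sample (confidence, precision, recall) triples are hypothetical and do not reproduce the values in Figure 13.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_f1(points):
    """Return (best F1, confidence threshold) over an iterable of
    (confidence, precision, recall) triples."""
    return max((f1_score(p, r), conf) for conf, p, r in points)


# Hypothetical operating points for one ship class.
curve = [
    (0.2, 0.60, 0.90),  # low threshold: many detections, lower precision
    (0.5, 0.80, 0.80),
    (0.8, 0.95, 0.40),  # high threshold: few detections, lower recall
]

score, threshold = best_f1(curve)
print(f"best F1 = {score:.3f} at confidence {threshold}")
```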
To validate the effectiveness of the proposed hybrid training strategy, we compared the performance of Light-SDNet with and without the proposed training method. The results are shown in Table 7. We can observe that the AP of each ship type is significantly improved, the recall rate is increased by 2.5%, and the mAP is increased by 4.6%. Thus, the comparison indicates that the hybrid training strategy improves the detection performance of the proposed detector markedly.
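As a reminder of how the AP and mAP figures relate, the sketch below computes the AP of one class as the area under its precision-recall curve and the mAP as the mean of the per-class APs. It assumes all-point interpolation, which is one common convention; this section does not state which interpolation was used for Table 7. The class names and the numeric values are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve with all-point interpolation.

    `recalls` must be sorted in ascending order and paired with `precisions`.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall,
    # which makes the curve monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical precision-recall points for two classes.
ap = {
    "class_a": average_precision([0.5, 1.0], [1.0, 0.5]),
    "class_b": average_precision([1.0], [1.0]),
}
print(ap, mean_average_precision(ap))
```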
To examine the robustness of the proposed method, we conducted experiments in different imaging environments, and the results are shown in Figure 14. The proposed detector can detect moving ships accurately even when the visual quality is degraded markedly by bad weather, because the proposed hybrid training strategy with synthetically-degraded images improves the diversity of the training datasets markedly. Thus, the learning and generalization abilities of Light-SDNet are improved in practice. Figure 14 shows that the proposed method achieves reliable, efficient, and accurate ship detection under poor imaging conditions. The reliable detection of Light-SDNet contributes to tracking maritime objects and detecting abnormal behavior, leading to enhanced management in intelligent maritime surveillance systems.
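The image generation model behind the synthetically-degraded images is not restated in this section. As a minimal sketch of how a hazy training image could be synthesized, the code below assumes the widely used atmospheric scattering model I = J t + A (1 - t) with transmission t = exp(-beta d), where J is the clear image, d the scene depth, beta the scattering coefficient, and A the atmospheric light. The function name `add_haze`, the parameter defaults, and the toy image are all hypothetical; a practical pipeline would normally use an array library instead of nested lists.

```python
import math


def add_haze(image, depth, beta=1.0, airlight=0.9):
    """Apply the atmospheric scattering model to a clear image.

    image: rows of (r, g, b) tuples with values in [0, 1]
    depth: rows of scene depths with the same layout as `image`
    beta: scattering coefficient (larger means denser haze)
    airlight: global atmospheric light A in [0, 1]
    """
    hazy = []
    for image_row, depth_row in zip(image, depth):
        row = []
        for pixel, d in zip(image_row, depth_row):
            t = math.exp(-beta * d)  # transmission decays with depth
            row.append(tuple(c * t + airlight * (1.0 - t) for c in pixel))
        hazy.append(row)
    return hazy


# Toy 1x2 image: the same color seen at depth 0 and at a large depth.
clear = [[(0.2, 0.4, 0.6), (0.2, 0.4, 0.6)]]
depth = [[0.0, 50.0]]

hazy = add_haze(clear, depth, beta=1.0, airlight=0.9)
print(hazy[0][0])  # near pixel is left unchanged
print(hazy[0][1])  # distant pixel approaches the atmospheric light
```

Sampling beta and the atmospheric light at random per image is one simple way to diversify the synthetic haze levels seen during training.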

V. CONCLUSION
In this study, we proposed a lightweight CNN framework and a hybrid training strategy for ship detection. The proposed network makes full use of shallow location features by introducing the CA-Ghost module to improve the feature extraction capability of the Backbone. The C3Ghost module guided by the attention mechanism is introduced in the Neck network to achieve more effective feature fusion. In addition, we presented the hybrid training strategy to enhance the diversity of the training data and improve the robustness of Light-SDNet in adverse weather conditions. Compared with recently proposed SOTA models, our method achieves a balance between model complexity and detection accuracy and can detect different types of moving ships in real time with high detection accuracy. Extensive experimental results have demonstrated the good detection performance of Light-SDNet under adverse weather conditions, such as hazy, rainy, and low-light conditions. This study can be extended in the following directions to make ship detection more reliable and robust.
(1) The proposed hybrid training strategy directly synthesizes degraded images by using a simplified image generation model. However, the synthetic ocean images differ from real ones in color and structure. The next step will focus on generating more realistic degraded images.
(2) Accurate detection of small moving ships in a maritime surveillance system is still a challenge for Light-SDNet. It is hard for a monitoring camera situated at a distance from the ships to capture high-resolution maritime images, which leads to unreliable detection in terms of robustness and accuracy. We will improve small-scale ship detection by adding detection Heads for small target objects [55].
(3) Bad weather typically degrades the quality of the images captured by marine surveillance systems, which makes accurate multi-ship detection difficult. Thus, there is potential for combining imagery data with oceanographic radar technology to detect and classify multiple targets [56].
Although the proposed method leaves considerable room for further improvement, it is still worth exploring because it can detect moving ships accurately and rapidly under severe weather conditions while retaining its lightweight nature, thereby achieving a better balance between detection accuracy and model size. Light-SDNet has the potential to be put into practical applications to enhance maritime safety and management.