YOLO-OSD: Optimized Ship Detection and Localization in Multiresolution SAR Satellite Images Using a Hybrid Data-Model Centric Approach

With the advancements in space technology and the development of lightweight synthetic aperture radar (SAR) satellites by commercial companies, such as ICEYE, Capella Space and Umbra, SAR images have become available on a wide scale. Ship detection is a classic problem in the interpretation and analysis of satellite images and has its significance both in maritime as well as defense applications. In the case of SAR images, ship detection becomes even more challenging due to the presence of large-scale distortions as well as interclass similarity signature problem. Moreover, the state-of-the-art (SOTA) object detection models have weak generalization capability over SAR datasets. To overcome these challenges, we propose a You Only Look Once (YOLO)-based optimized ship detection model called YOLO-OSD. Our optimized ship detector is based on a hybrid data-model centric approach, which utilizes the statistical characteristics of the datasets under observation and has an efficient model architecture. We also carry out a detailed comparative analysis of our proposed model with other SOTA deep learning models on three well-known publicly available datasets. Our results show that the proposed YOLO-OSD outperforms YOLO5, YOLO7, and RetinaNet on all datasets under observation in terms of F1 score and mean average precision. YOLO-OSD also has approximately 16% fewer network parameters as compared with the original YOLO5. Moreover, our proposed model is at least 37.7% faster than YOLO7 and 41.02% faster than the YOLO8 model in terms of training time and thus suitable for real-time satellite-based SAR ship detection.


The authors are with the Department of Electrical Engineering and Computer Science, Institute of Space Technology, Islamabad 44000, Pakistan (e-mail: farhan.humayun2014@gmail.com).
Digital Object Identifier 10.1109/JSTARS.2024.3365807

I. INTRODUCTION
SPACE technology has shown remarkable progress in the past few decades, from the advent of reusable launch vehicles to the development of compact and small-scale satellites. Satellite-based remote sensing has also received growing attention due to the development of lightweight and less power-hungry satellites capable of taking high-resolution images of Earth's surface. With the launch of new firms, such as ICEYE, Capella Space, and Umbra, the availability of high-resolution synthetic aperture radar (SAR) satellite images has increased manifold in both public and private sector domains [1]. The problem of object detection in satellite images has important applications in civil as well as military domains. Small objects, including ships, cars, and aircraft, can be detected using high-resolution satellite imagery, whereas satellite images with relatively medium resolution can be used to detect larger objects, such as airports, fields, and buildings [2], [3]. Among remote sensing satellites, SAR sensors, also known as active sensors, work on the principle of radar. They transmit electromagnetic waves and pick up the waves reflected from the target object in order to form the shape of the object [4]. They operate in the microwave region of the electromagnetic spectrum, and therefore it is not possible to generate true-color red, green, and blue images with SAR sensors. However, SAR imaging is advantageous in the sense that it can penetrate clouds and even capture images during nighttime. Due to this capability, SAR sensors are also called all-weather, all-time sensors [5]. Fig. 1 shows a visual comparison between electro-optical (courtesy Maxar Technologies) and SAR (courtesy Capella Space) satellite images of the same area.
Satellite-based ship detection is of significant importance to maritime domain awareness and can assist in various applications, including monitoring of illegal fishing activities, pirate threats, vessel traffic management along the coastlines, and identification of noncooperative ships with no automatic identification system data [6], [7], [8]. SAR sensors are considered particularly useful for maritime applications due to their all-weather and day-night imaging capability. In addition, they have wider swath coverage than their optical counterparts. This enables them to monitor large areas of oceans and coastlines within shorter amounts of time [9], [10]. Despite these benefits, various challenges associated with SAR images remain, which make satellite-based SAR ship detection a relatively arduous task. These are summed up below.
1) Deformed object boundaries due to various distortions and uneven scattering phenomena.
2) For applications related to the maritime domain, SAR images may involve complex backgrounds including sea clutter, islands, harbors, and ports, which negatively affect detection accuracy.
3) No specific color information is included in SAR images, so details in different bands cannot be leveraged as is the case in optical images.
4) The issue of interclass similarity signature, i.e., two very different objects may have similar reflectance patterns and thus appear similar in a SAR image.
5) Most of the state-of-the-art (SOTA) object detection algorithms are tailored for optical images and thus possess weak generalization ability over SAR data.
6) Lack of publicly available, labeled SAR datasets as compared with optical datasets.
In this article, we propose a You Only Look Once optimized ship detection model, named YOLO-OSD, for multiresolution SAR satellite images using a hybrid data-model centric approach. Our main contributions are as follows.
1) We carry out a detailed statistical analysis of three popular open source SAR ship datasets, namely SSDD, SAR-Ships, and iVision-MRSSD, to perform anchor box customization for ease of model training and to achieve better intersection over union (IoU) and mean average precision (mAP) scores.
2) We optimize the network architecture of the YOLO5 model to significantly reduce the number of network parameters and the training time while simultaneously improving the ship detection accuracy in satellite-based SAR images.
3) We conduct extensive experiments, including detailed comparative and cross-dataset validation analysis, to assess the performance of our proposed YOLO-OSD approach. Our evaluations include quantitative and qualitative comparisons with SOTA object detectors, including YOLO5, YOLO7, YOLO8, and RetinaNet.
The rest of this article is organized as follows. Section II includes a summary of past and recent works related to the domain. Section III discusses the methodology of the proposed YOLO-OSD approach, including statistical analysis of the datasets involved. Section IV describes the experimental setups and details. Section V shows the quantitative and qualitative results of the experiments. Section VI provides a discussion on the results and analysis. Finally, Section VII concludes this article with current limitations and future research directions.

II. RELATED WORKS
Satellite-based SAR ship detection has been a focus of researchers for the past two decades. Traditional approaches for ship detection in SAR images are based on the constant false alarm rate (C-FAR) algorithm [11], [12], [13], [14]. It is an algorithm based on the statistical distribution of SAR image pixels and uses an adaptive thresholding strategy based on the false alarm rate. The thresholding mechanism exploits the fact that, in a SAR image, ships are normally characterized by the brightest pixels, and other pixels can be treated as background, sea clutter, or other land-based features. However, the method relies heavily on predefined distributions to make detections and becomes unreliable with changing backgrounds and imaging conditions, which occur frequently in satellite-based SAR images. Hence, techniques based on C-FAR algorithms generally have low detection performance, especially in complex, in-shore scenes. Recently, Zhang et al. [15] proposed a novel ship detection method based on adaptive C-FAR for fully polarized SAR images with better detection performance and a lower false alarm rate as compared with previous methods.
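The adaptive thresholding idea behind C-FAR can be illustrated with a minimal one-dimensional cell-averaging sketch. It is not the method of [11]-[15]: the function name `ca_cfar`, the window sizes, and the exponential clutter model used to derive the scaling factor are assumptions made only for illustration.

```python
def ca_cfar(power, num_train=8, num_guard=2, pfa=1e-3):
    """Return indices whose power exceeds a cell-averaging C-FAR threshold.

    Assumes exponentially distributed clutter power, for which the
    threshold factor is alpha = N * (pfa ** (-1 / N) - 1), where N is the
    total number of training cells on both sides of the cell under test.
    """
    n = 2 * num_train
    alpha = n * (pfa ** (-1.0 / n) - 1.0)
    half = num_train + num_guard
    detections = []
    for i in range(half, len(power) - half):
        # Training cells lie outside the guard cells on each side.
        left = power[i - half : i - num_guard]
        right = power[i + num_guard + 1 : i + half + 1]
        noise = (sum(left) + sum(right)) / n
        if power[i] > alpha * noise:
            detections.append(i)
    return detections


# Hypothetical range profile: flat clutter with one bright ship-like cell.
profile = [1.0] * 60
profile[30] = 50.0
hits = ca_cfar(profile)
print(hits)
```

Because the threshold is tied to an assumed clutter distribution, a change in background statistics (for example, near a harbor) alters the effective false alarm rate, which is the weakness noted above.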
A lot of work has also been carried out on automatic detection of objects, including ships, using deep learning (DL) techniques. In the past decade, models based on convolutional neural networks (CNNs) have become popular in the field of computer vision, especially since AlexNet [16] was proposed. Since then, the majority of the work has been directed toward the development of deeper and more complex neural network models in order to attain better accuracy [17], [18]. With the advancements in DL, various improved models have been proposed, including region-based R-CNN [19], visual geometry group (VGG) [20], you only look once (YOLO) [21], fast R-CNN [22], single shot detector (SSD) [23], mask R-CNN [24], RetinaNet [25], fully convolutional one stage (FCOS) detector [26], YOLO-R [27], and gate recurrent CNN (GR-CNN) [28]. These models can be grouped into two broad categories, i.e., two-stage detectors, such as R-CNN, and one-stage detectors, such as SSD and YOLO. Table I lists the popular models/algorithms used for object detection/image classification over the past decade and the years in which they were proposed.
Detailed works have also been carried out on SAR ship detection based on the DL models listed in Table I. For example, Fan et al. [29] studied ship detection in polarimetric SAR images using a modified faster RCNN model. Similarly, the authors in [30] and [31] performed SAR ship detection using modified architectures of YOLO2 and YOLO5, respectively. Wu et al. [32] carried out ship detection using a modified version of mask R-CNN. Wang et al. [33] discussed the possibility of combining the SSD model with transfer learning for SAR ship detection. Xu et al. [34] studied large scale ship detection in SAR images using the lite-YOLO5 model. Yang et al. [35] proposed a detection model based on coordinate attention and enhanced receptive fields and compared its performance with faster RCNN, SSD, FCOS, RetinaNet, and YOLO models. Similarly, Cui et al. [36] proposed a model based on a dense attention pyramid network and also compared its performance with faster RCNN and SSD. Comprehensive surveys have also been conducted on SAR ship detection using DL techniques [37], [38].
More recent research has focused on the development of lightweight models with fewer layers and fewer network parameters to reduce computational costs. For instance, Pang et al. [39] proposed a lightweight model called YOLO5-MNE by replacing the sigmoid linear unit (SiLU) activation functions with rectified linear unit (ReLU) activation functions and also added a channel attention module to compensate for the loss in accuracy. The authors performed experiments on the SSDD and AirSAR-Ship-1.0 datasets and compared the performance of their proposed approach with YOLO4 and YOLO5 models in terms of precision, network parameters, and GPU memory utilized. Yan et al. [40] proposed LssDet to reduce the floating point operations (FLOPs) and the number of network parameters by introducing a cross sidelobe attention module as well as a lightweight path aggregation feature pyramid network module. They performed experiments on the SSDD and Ls-SSDD-v1.0 datasets and performed a comparative analysis between different YOLO models in terms of average precision, network parameters, and FLOPs. Yang et al. [41] discussed a soft quantization approach to make the overall ship detection model small. The authors proposed a split bidirectional feature pyramid network to improve accuracy and a feature fusion module based on linear transformation to reduce the network size. They also performed extensive experiments on the SSDD, SAR-Ships, and AirSAR-Ship datasets and analyzed the results in terms of precision, recall, mAP, and network parameters. Similarly, Zhang and Zhang [42] proposed a ship detection model with only 20 convolutional layers for real-time SAR ship detection applications and performed experiments on the SSDD dataset for comparative analysis. Yang et al. [43] proposed an algorithm/hardware co-design framework for on-board SAR ship detection, with a focus on increasing detection accuracy by increasing output feature sizes while minimizing the need for computational resources through the use of less expensive operations. They also performed experiments on the SSDD dataset using the evaluation metrics of average precision and network parameters. Gao et al. [44] provided a comprehensive overview of on-board processing of satellite images as well as information fusion, including current challenges and future prospects in the domain.
Other works have focused on developing models with better feature extraction and integration strategies to improve the performance of SAR ship detection in complex cases. For instance, Zhao et al. [45] proposed a novel visual transformer-based network for extraction of global features in the case of multisatellite SAR images. Ai et al. [46] proposed a mechanism to extract low-level features based on a modified C-FAR algorithm and fuse them with high-level features extracted from CNNs. Cui et al. [47] proposed a spatial shuffle group enhance attention model to extract better semantic features and suppress unnecessary features. Zhou et al. [48] proposed MSSDNet, comprising modules for multiscale feature extraction as well as adaptive feature fusion. Wang et al. [49] also proposed a feature transformer module for CNNs to extract global features from SAR ship images. Similarly, Gao et al. [50] proposed a dualistic cascade CNN comprising a basic geometric feature extraction network and a polarization feature enhancement network for comprehensive feature extraction and fusion.
Apart from that, anchor-free ship detection models have also been proposed to remove the dependency of detection on anchor box sizes and thereby reduce computational cost [51], [52]. Similarly, researchers have also explored the possibility of optical-to-SAR transfer learning to augment ship detection in SAR images. Bao et al. [53] proposed an optical-SAR pretraining approach that transfers characteristics of optical images to SAR images through a common representation to improve model learning. Gao et al. [54] also proposed a novel method comprising a dense connection module and a convolutional block attention module for enhanced feature extraction in the case of optical-to-SAR transfer learning for sparsely labeled datasets.
One of the major issues related to the applications of SAR imagery is the sparse availability of public datasets as compared with optical datasets. The majority of the traditional pretrained object detection models are trained on optical data or simple camera images, which makes them less relevant in the case of SAR images. For these reasons, researchers have also published their own satellite-based SAR image datasets for ship detection. Significant satellite-based SAR ship datasets include SSDD [55], SAR-Ships dataset [56], AirSAR-Ship-1.0 [57], HRSID [58], Ls-SSDD-v1.0 [59], SRSDD-v1.0 [60], and the latest iVision-MRSSD [61], [62]. SRSDD-v1.0 is based on rotated bounding boxes (B-Boxes), whereas the rest of the datasets comprise regular upright B-Boxes. Table II provides a comparative analysis of the publicly available SAR ship datasets in terms of various parameters, including the number of images, individual image size, number of sensors used, and coverage of imaging frequencies. From Table II, it can be inferred that the latest iVision-MRSSD dataset is the most diverse in terms of satellite sensors and frequency bands involved.

III. MATERIALS AND METHODS
This section describes the main idea of the proposed YOLO-OSD approach. Section III-A presents a detailed statistical analysis of the three SAR ship detection datasets under observation for the generation of customized anchor boxes with respect to each dataset. Section III-B discusses the proposed architectural changes pertaining to YOLO-OSD and compares them with the original YOLO5 architecture.

A. Custom Anchor Box Generation Strategy
We consider three different datasets, namely SSDD, SAR-Ships, and iVision-MRSSD, to implement our proposed strategy. The details of these datasets are described in Table II. The core idea of this strategy is to optimize the initial anchor box sizes used by the SOTA object detection models based on the inherent dataset distributions. Anchor boxes play a vital role in object detection algorithms, such as YOLO and RetinaNet. They represent a set of predefined B-Boxes with different aspect ratios that facilitate the detection of objects of different sizes and dimensions. A careful and systematic selection of anchor box sizes can allow the models to learn robust features and better adapt to the diversity within the datasets. It can also facilitate faster training.
Initially, we extract the areas of the ship B-Boxes from the labels of each dataset and group them into small, medium, and large categories to visualize their distribution. Fig. 2 shows the area-wise categorization of B-Boxes in SSDD, SAR-Ships, and iVision-MRSSD, respectively. It can be seen that the proportion of small B-Boxes (covering less area) in all the datasets is large compared with that of large B-Boxes.
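The area-wise categorization described above can be sketched as follows. The size thresholds are not stated in the text; the values below follow the common COCO convention (32 x 32 and 96 x 96 pixels) and are an assumption, as are the normalized label format, the tile size, and the sample boxes.

```python
from collections import Counter

SMALL_MAX = 32 ** 2   # assumed small/medium boundary in pixels^2 (COCO convention)
MEDIUM_MAX = 96 ** 2  # assumed medium/large boundary in pixels^2 (COCO convention)


def box_area(w_norm, h_norm, img_w, img_h):
    """Pixel area of a box given its normalized width and height."""
    return (w_norm * img_w) * (h_norm * img_h)


def categorize(area):
    """Map a pixel area to a size category."""
    if area < SMALL_MAX:
        return "small"
    if area < MEDIUM_MAX:
        return "medium"
    return "large"


# Hypothetical labels: (normalized width, normalized height) on 512 x 512 tiles.
labels = [(0.03, 0.05), (0.10, 0.08), (0.40, 0.30), (0.02, 0.02)]
areas = [box_area(w, h, 512, 512) for w, h in labels]
counts = Counter(categorize(a) for a in areas)
print(dict(counts))
```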
Fig. 3 shows the point-wise distribution of bounding box areas in each dataset. It can be seen that the majority of the points lie close to the x-axis, again indicating a significant number of small ships.
To categorize the B-Boxes into a suitable number of clusters, we perform Elbow analysis. It is a method for determining a suitable number of clusters for a given dataset in a clustering problem. Elbow analysis runs K-Means clustering on the given dataset for different values of K, i.e., the number of clusters. In our case, the value of K for the Elbow analysis ranged between 2 and 14. It then plots, for each K, the sum of squared differences between the data samples and their assigned cluster centers, also known as inertia. The point where the trend of the graph changes is called the elbow point, and the number of clusters at that point is considered suitable for the dataset under observation. Fig. 4 shows the plot of inertia with respect to the number of possible clusters for each dataset. It can be seen that all the plots follow a similar trend, with a typical elbow-like shape being formed around 3 to 4 clusters. Based on the trend in the elbow plots, we determine 4 to be a suitable number of clusters for all datasets.
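A minimal sketch of the Elbow analysis follows, using a plain one-dimensional K-Means (Lloyd's algorithm) on box areas. The deterministic quantile initialization, the function name `kmeans_1d`, and the synthetic areas are assumptions for illustration; the text does not specify the implementation used.

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on scalar data; returns (centers, inertia)."""
    data = sorted(values)
    n = len(data)
    # Deterministic initialization at evenly spaced quantiles (an assumption).
    centers = [data[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: (v - centers[c]) ** 2)
            groups[nearest].append(v)
        # Keep the previous center if a cluster becomes empty.
        new_centers = [sum(g) / len(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    # Inertia: sum of squared distances to the closest cluster center.
    inertia = sum(min((v - c) ** 2 for c in centers) for v in data)
    return centers, inertia


# Hypothetical box areas forming four well-separated size groups.
areas = [c * (0.9 + 0.2 * j / 24)
         for c in (100, 1000, 5000, 20000) for j in range(25)]

# Inertia for K = 2..14, as in the analysis above.
inertias = {k: kmeans_1d(areas, k)[1] for k in range(2, 15)}
for k, value in inertias.items():
    print(k, round(value))
```

Plotting `inertias` against K gives the elbow curve; the K at which the decrease flattens is taken as the number of clusters.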
After the Elbow analysis, we apply K-Means clustering with K = 4 to form suitable clusters. Fig. 5 shows a visualization of 4 possible B-Box clusters for each dataset.
We further analyze the ship B-Boxes on the basis of their width-height ratios. Fig. 6 depicts the width-height bounding box plots of SSDD, SAR-Ships, and iVision-MRSSD, respectively, with width on the x-axis and corresponding height plotted on the y-axis. It can be seen that most of the points are concentrated in the lower left corner of each plot, indicating a higher concentration of relatively small B-Boxes corresponding to small ships. This is a crucial finding, which also influenced the idea behind our proposed architectural changes in the YOLO5 backbone.
We again apply K-Means clustering with K = 4 on the width-height data of B-Boxes to determine the final clusters. The argument behind this is that clusters based on width-height ratios are a better representation of the data than clusters based on area, because detection performance is sensitive to the width-height ratios of B-Boxes and not just their area. Fig. 7 shows a visualization of probable B-Box clusters in terms of width-height ratios for the three datasets under observation.
It is important to note that the original YOLO5 model uses a total of 9 initial anchor box sizes, 3 for each detection scale (small, medium, and large). Based on our findings, we propose 12 anchor boxes, i.e., 4 anchor boxes for each scale of detection. Fig. 8 shows the outputs of K-Means clustering with K = 12.
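The step from width-height clustering to 12 anchors (4 per detection scale) can be sketched as below. This is an illustrative reading of the procedure, not the authors' code: it assumes plain Euclidean K-Means on (width, height) pairs, a deterministic initialization, and an assignment of the sorted cluster centers to scales by area. The box sizes are synthetic.

```python
def kmeans_wh(boxes, k, iters=100):
    """Plain Euclidean K-Means on (width, height) pairs; returns k centers."""
    data = sorted(boxes, key=lambda b: b[0] * b[1])
    n = len(data)
    # Deterministic initialization at evenly spaced area quantiles (an assumption).
    centers = [data[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for w, h in data:
            j = min(range(k), key=lambda c: (w - centers[c][0]) ** 2
                                            + (h - centers[c][1]) ** 2)
            groups[j].append((w, h))
        # Keep the previous center if a cluster becomes empty.
        new_centers = [
            (sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g))
            if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def anchors_per_scale(boxes, per_scale=4, num_scales=3):
    """Cluster into per_scale * num_scales anchors and group them by area."""
    centers = kmeans_wh(boxes, per_scale * num_scales)
    ordered = sorted(centers, key=lambda c: c[0] * c[1])
    rounded = [(round(w), round(h)) for w, h in ordered]
    return [rounded[i * per_scale:(i + 1) * per_scale]
            for i in range(num_scales)]


# Hypothetical B-Box sizes in pixels.
boxes = [(8 + 3 * i, 6 + 2 * i + (i % 5)) for i in range(60)]
scales = anchors_per_scale(boxes)
for name, group in zip(("small", "medium", "large"), scales):
    print(name, group)
```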
Table III summarizes the final values of anchor box dimensions achieved after minor adjustments with respect to all the datasets. The default values for the original YOLO5 model are also included for comparison purposes.

B. Network Architecture of the Proposed YOLO-OSD
This section focuses on the architectural aspects of the newly proposed model. We propose an optimized SAR ship detector, suitable for real-time ship detection in multiresolution SAR satellite images, by modifying the YOLO5 backbone to achieve faster execution/training time as well as improved model performance in terms of evaluation metrics including precision, recall, F-1 score, and mAP. The architectures of YOLO-based object detectors comprise three primary parts: the network backbone, neck, and head. The backbone, comprising various convolutional layers, is responsible for feature extraction and learning. In YOLO5, the backbone is based on the cross stage partial network [63]. The neck of YOLO5 is based on the path aggregation network [64], which involves a feature pyramid network to enhance learning of the low-level features. The YOLO5 head, responsible for the final detection results, is based on the principle of multiscale detection and uses three different sizes of feature maps to detect small, medium, and large objects. This is very important in our case, as the sizes of ships vary throughout the datasets from very small boats to large ships, as described in the previous section. To develop an optimized SAR ship detection model, we first reduce the number of repetitive C3 modules in the YOLO5 backbone. C3 modules consist of triple convolution operations and are computationally very expensive. Reducing the number of C3 modules has two immediate effects: 1) it significantly reduces the overall number of network parameters, thus making the model lightweight, and 2) it results in an increased size of feature maps, because reducing the C3 modules means reducing the number of convolution operations. The size of intermediate feature maps and the number of convolution layers have an inverse relationship with one another. This is because convolution layers typically involve strided and down-sampling operations to reduce computational costs, but this also reduces the spatial size of the feature maps. With multiple layers of convolutions, the spatial size of the intermediate feature maps is reduced too much; features pertaining to small objects therefore do not propagate well through the network, effectively blocking the learning of small objects. This in turn leads to increased false negatives in the case of small objects [43], [47], [65]. This issue becomes even more important in the case of ship detection, as the detailed anchor analysis of ship datasets in the previous section indicates the presence of a large number of small ships. Fig. 9 shows the typical structure of the YOLO5 model consisting of C-Bn-Si, triple convolution (C3), and spatial pyramid pooling fast blocks. C-Bn-Si stands for Convolution-2D, batch normalization, and SiLU activation function, respectively. It can be seen in Fig. 9 that the C3 module is repeated a total of 21 times. Also, the default number of anchors in YOLO5 is 9.
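The effect of strided convolutions on spatial size can be made concrete with the standard output-size formula, floor((size + 2 x padding - kernel) / stride) + 1. The sketch below assumes a 640 x 640 input, five stride-2 convolutions with 3 x 3 kernels and padding 1, and a 16-pixel ship; these values are illustrative assumptions and are not taken from the text.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


size = 640  # assumed input resolution
sizes = [size]
for _ in range(5):  # five stride-2 convolutions (assumed)
    size = conv_out(size, kernel=3, stride=2, padding=1)
    sizes.append(size)
print(sizes)  # [640, 320, 160, 80, 40, 20]

# Footprint of a hypothetical 16-pixel ship on maps with total stride 8, 16, 32.
ship_px = 16
footprints = {stride: ship_px / stride for stride in (8, 16, 32)}
print(footprints)  # {8: 2.0, 16: 1.0, 32: 0.5}
```

At a total stride of 32, such a ship covers only half a feature-map cell, which illustrates why small targets are easily lost on coarse feature maps.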
Apart from that, we also replace the normal C3 blocks with triple cross-convolution (C3x) blocks. The C3 modules involve triple convolutions with a concatenation function at the end, whereas the C3x modules are built from cross convolutions. The basic idea behind cross convolutions, also termed asymmetric or spatial shuffle convolutions, is that a normal 2-D convolution is decomposed into two 1-D convolutions. For instance, a regular 2-D convolution filter of size n × n is split into two convolution filters of size 1 × n and n × 1, respectively. This results in a significant decrease in the overall network parameters, hence making the model lightweight [66]. In our case, the C3x modules involve 1 × 3 and 3 × 1 convolutions instead of single 3 × 3 operations. These are also considered better feature extractors than regular convolutions when dealing with target objects oriented at varying angles, which is indeed the case with ship detection [67]. Fig. 10 shows the comparison of the C3 and C3x computation blocks.
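The parameter saving from this decomposition can be checked with simple arithmetic. The sketch below counts weights for a single n × n convolution versus a 1 × n convolution followed by an n × 1 convolution; it ignores biases and normalization layers and assumes the intermediate channel count equals the output channel count, which the text does not specify.

```python
def conv_params(c_in, c_out, k_h, k_w):
    """Weight count of a 2-D convolution without bias."""
    return c_in * c_out * k_h * k_w


def cross_conv_params(c_in, c_out, n):
    """Weights of a 1 x n convolution followed by an n x 1 convolution.

    The intermediate channel count is assumed to equal c_out.
    """
    return conv_params(c_in, c_out, 1, n) + conv_params(c_out, c_out, n, 1)


channels = 128  # hypothetical channel width
full = conv_params(channels, channels, 3, 3)
cross = cross_conv_params(channels, channels, 3)
saving = 1 - cross / full
print(full, cross, round(saving, 3))  # 147456 98304 0.333
```

For equal input and output channels C, the count drops from n²C² to 2nC², i.e., by one third for n = 3. This is a per-layer figure and is separate from the overall reduction reported for the whole network.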
The modified backbone of YOLO-OSD is shown in Fig. 11. It can be seen that we replaced the simple triple convolution blocks with C3x blocks to enhance the feature extraction process. Moreover, we reduced the number of triple convolution blocks from 21 to 15. This reduced the number of network parameters by approximately 16%, making the model lightweight as compared with the original YOLO5 model without compromising the detection performance. Table IV compares the network attributes of YOLO5 and YOLO-OSD. It can be seen that the number of layers in YOLO-OSD is reduced to 326 as compared with 368 in the original YOLO5 model. Consequently, the network parameters are also reduced significantly. Similarly, our model also requires fewer giga floating point operations (GFLOPs).

IV. EXPERIMENTAL DESIGN AND SETUP
This section describes various experimental aspects of our work. Specifically, Section IV-A delineates the environmental setup and Section IV-B discusses the evaluation metrics used to assess the performance of different DL models on the datasets under observation.

TABLE IV COMPARISON OF NETWORK ATTRIBUTES OF YOLO5 AND YOLO-OSD

A. Environmental Setup
All the experiments, including statistical analysis, model training, validation, and testing of the datasets, were performed on a local machine with the Windows 10 Pro operating system, an Intel Core i7 CPU, 32 GB RAM, a 2 TB HDD, and an NVIDIA GeForce RTX 2080 GPU with 8 GB graphics memory. All the programming was done in Python using the Anaconda development environment. Moreover, a train:validation:test ratio of 70:20:10 was used throughout the experiments. For the SAR-Ships dataset, the training, validation, and test sets were generated by randomly picking tiles in the proposed ratios. The iVision-MRSSD and SSDD datasets already contained training, validation, and test sets, so no further processing was required. The batch size for SSDD and iVision-MRSSD was set to 16, whereas for SAR-Ships, it was set to 32 image tiles per batch. All the models were trained for 150 epochs.
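A random 70:20:10 split of image tiles, as described for the SAR-Ships dataset, might be sketched as follows. The function name, the fixed seed, and the file names are hypothetical; the text does not state how the random selection was implemented.

```python
import random


def split_dataset(items, percents=(70, 20, 10), seed=0):
    """Shuffle items and split them into train, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility (assumed)
    n_train = len(items) * percents[0] // 100
    n_val = len(items) * percents[1] // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical tile file names.
tiles = [f"tile_{i:04d}.jpg" for i in range(1000)]
train, val, test = split_dataset(tiles)
print(len(train), len(val), len(test))  # 700 200 100
```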

B. Performance Evaluation Metrics
The following evaluation metrics were used to assess the performance of SAR ship detection on the datasets involved. Here, TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
1) Precision: It is the fraction of all predicted detections that are correct
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$
2) Recall: It is the fraction of all ground-truth objects that are correctly detected
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$
3) F-1 Score: It merges the two metrics, i.e., precision and recall, into a single metric. To achieve a high F-1 score, both precision and recall need to be high
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}$$
4) mAP: It is the mean of the average precisions (AP) of all N classes, where the average precision is simply the area under the precision-recall curve
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{4}$$
In our case, we have only a single class and hence N = 1.

V. RESULTS AND ANALYSIS
This section discusses the results of various models trained on the three datasets under observation and analyzes their performance, both quantitatively and qualitatively, based on the evaluation metrics. We trained eight models, i.e., two variants each for the YOLO5, YOLO7, and YOLO8 object detectors, one for RetinaNet, and one for our proposed YOLO-OSD, on all three datasets. A total of 24 models were trained. For the YOLO models, one variant was trained from scratch without using any pretrained weights, while the second variant was trained using transfer learning with weights pretrained on the MS COCO dataset. The MS COCO dataset [68] comprises optical images and contains 80 classes. It also includes a class labeled "boat." We did this to analyze the effect of transfer learning on the task of ship detection in SAR images.
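For reference, the evaluation metrics of Section IV-B that underlie the comparisons in this section can be computed as in the following sketch. The counts and the precision-recall points are hypothetical, and the average precision uses a plain trapezoidal area, which is a simplification: detection toolkits usually apply an interpolated precision envelope.

```python
def precision(tp, fp):
    """Fraction of detections that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of ground-truth objects that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision(recalls, precisions):
    """Trapezoidal area under a precision-recall curve (recalls ascending)."""
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area


def mean_average_precision(ap_per_class):
    """Mean of per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical counts for a single "ship" class.
p = precision(tp=80, fp=20)
r = recall(tp=80, fn=10)
f1 = f1_score(p, r)

# Hypothetical precision-recall points.
ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])
m_ap = mean_average_precision([ap])  # N = 1, so mAP equals AP
print(round(p, 3), round(r, 3), round(f1, 3), round(m_ap, 3))
```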

A. Performance Evaluation on iVision-MRSSD Dataset
Table V summarizes the results of the YOLO5, YOLO7, YOLO8, RetinaNet, and YOLO-OSD object detectors on the iVision-MRSSD dataset in terms of various evaluation metrics. All the models were trained for 150 epochs with a batch size of 16. It can be inferred from Table V that the proposed YOLO-OSD outperforms all the models in terms of training time, and it also performs better than YOLO5, YOLO7, and RetinaNet in terms of F1 score and mAP. The YOLO8 (pretrained) model has the best overall detection performance, and the YOLO7 model shows the lowest performance in terms of mAP scores. It can also be seen that the pretrained models of YOLO7 and YOLO8 have slightly better performance than the models trained from scratch. Apart from that, the training time for the RetinaNet model is the highest among all models.
Similarly, Fig. 12(a)-(c) shows F1-Confidence curves, mAP@0.5, and mAP@0.95 graphs of all the models with respect to the number of training epochs on the iVision-MRSSD dataset. The plots pertaining to YOLO8 in red color show the highest F1 and mAP values. Our proposed YOLO-OSD (brown color) is a very close second to YOLO8 on these values. Fig. 13 shows the detection performance of various DL object detectors on a small subset of the iVision-MRSSD test set. B-Boxes in blue color represent ground truths, whereas yellow, red, cyan, and green colors represent detections from YOLO7, 5, 8, RetinaNet, and YOLO-OSD, respectively.

B. Performance Evaluation on SAR-Ships Dataset
We also trained all the YOLO variants and RetinaNet on the open source SAR-Ships dataset for comparison purposes. Table VI sums up the results of the object detectors on the SAR-Ships dataset. It can be seen from Table VI that the YOLO8 (pretrained) model also outperforms other models on the SAR-Ships dataset, with our proposed YOLO-OSD model also outperforming YOLO5, YOLO7, and RetinaNet. On the other hand, the training time of YOLO-OSD is significantly less than that of all the other models.
Fig. 14(a)-(c) shows F1, mAP@0.5, and mAP@0.95 plots of all models with respect to the number of training epochs on the SAR-Ships dataset. Again, the graphs of YOLO8 in red and purple colors depict the highest values, with our proposed YOLO-OSD in brown color coming a very close second. Fig. 15 shows the detection performance of various DL object detectors on a small subset of the SAR-Ships test set. B-Boxes in blue color represent ground truths, whereas yellow, red, cyan, and green colors represent detections from YOLO7, 5, 8, RetinaNet, and YOLO-OSD models, respectively.

TABLE V PERFORMANCE COMPARISON OF YOLO VARIANTS ON IVISION-MRSSD DATASET

TABLE VI PERFORMANCE COMPARISON OF YOLO VARIANTS ON SAR-SHIPS DATASET

C. Performance Evaluation on SSDD Dataset
Similarly, we also trained the seven variants of YOLO and one RetinaNet on the open source SSDD dataset for comparison purposes. Table VII summarizes the results on the SSDD dataset. All the models were trained for 150 epochs.
Table VII also confirms the superiority of the YOLO8 (pretrained) model on the SSDD dataset, as it gives the best performance in terms of all detection metrics when compared with other models. Our YOLO-OSD outperforms all the models in terms of training time, and it has better detection performance than YOLO5, YOLO7, and RetinaNet in terms of F-1 score and mAP.
Similarly, Fig. 16(a)-(c) shows F1, mAP@0.5, and mAP@0.95 plots of YOLO5, 7, 8, RetinaNet, and YOLO-OSD models with respect to the number of training epochs on the SSDD dataset. Similar trends can be seen in the performances, as YOLO8 and the proposed YOLO-OSD outperform others on SSDD as well.
Fig. 17 shows the detection performance of various DL object detectors on a small subset of the SSDD test set. B-Boxes in blue represent ground truths, whereas yellow, red, cyan, and green represent detections from YOLO7, YOLO5, YOLO8, RetinaNet, and YOLO-OSD models, respectively.

D. Ablation Study of the Proposed YOLO-OSD
This section aims to determine the performance improvement brought by each modification in the proposed YOLO-OSD. As mentioned in the previous section, a total of three modifications have been made in the newly proposed approach: 1) optimization of the initial anchor box sizes based on the statistical analysis of the datasets; 2) reduction of the repetitive triple convolution (C3) modules in the YOLO5 backbone; and 3) replacement of the C3 blocks with C3x modules. We carried out ablation experiments on all three datasets under observation. Table VIII summarizes the results of the ablation study. The first row for each dataset in Table VIII corresponds to the results of the unmodified YOLO5. The next three rows show the effect of adding each individual modification as described above. It can be inferred from the results that the reduction of C3 modules significantly reduces the training time of the model, with similar detection performance in the case of the iVision-MRSSD dataset, whereas the detection performance drops slightly for the SAR-Ships and SSDD datasets. Following that, the replacement of C3 modules with C3x modules further reduces the training time while minimally improving the detection performance in the case of the SAR-Ships and SSDD datasets. Finally, the inclusion of optimized anchor boxes along with the other modifications leads to the best detection performance on all datasets, with significant improvements in training time as well, signifying the efficacy of our proposed approach.

E. Cross Validation Analysis
This section focuses on verifying how well the DL models generalize to unseen data distributions and on finding the most effective combination of models and datasets. We performed cross-dataset validation of all the YOLO models on the three datasets under observation. Effectively, each YOLO model trained on one dataset was validated on the validation sets of the other two datasets. Table IX summarizes the results of the cross-validation analysis.
It can be seen from Table IX that the YOLO5 model trained on the iVision-MRSSD dataset performs comparatively better when evaluated on the validation sets of the SAR-Ships and SSDD datasets. Similar trends can be seen in the case of the YOLO7, YOLO8, and YOLO-OSD models. On the other hand, models trained on the SAR-Ships and SSDD datasets have relatively low F1 scores and mAP values when evaluated on the validation set of iVision-MRSSD. From these results, it can be concluded that the iVision-MRSSD dataset is more diverse and complex than the other datasets. This is also evident from the fact that it comprises data from six different satellite sensors and covers a large range of spatial resolutions and imaging frequencies. Therefore, models trained on the iVision-MRSSD dataset learn features that generalize well to unseen data with varying distributions. Similarly, it can be seen that the YOLO-OSD model, especially in combination with the iVision-MRSSD dataset, has a better F1 score than the other models in the cross-validation results.
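The cross-dataset protocol described above can be sketched as an enumeration of train/evaluate pairs. This is only an illustration of the bookkeeping, not the authors' code: the function name `cross_validation_runs` is hypothetical, and the model and dataset names are simply the labels used in this section.

```python
from itertools import permutations

# Labels taken from this section; they are plain strings, not real model objects.
DATASETS = ["iVision-MRSSD", "SAR-Ships", "SSDD"]
MODELS = ["YOLO5", "YOLO7", "YOLO8", "YOLO-OSD"]


def cross_validation_runs(models, datasets):
    """Yield (model, train_set, eval_set) triples with train_set != eval_set."""
    for model in models:
        # permutations(..., 2) gives every ordered pair of distinct datasets,
        # so each model trained on one dataset is evaluated on the other two.
        for train_set, eval_set in permutations(datasets, 2):
            yield model, train_set, eval_set


runs = list(cross_validation_runs(MODELS, DATASETS))
for model, train_set, eval_set in runs[:3]:
    print(f"{model}: train on {train_set}, validate on {eval_set}")
```

With four models and three datasets, this yields six ordered dataset pairs per model, which is the grid that a table such as Table IX has to cover.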

VI. DISCUSSION
From the different experiments described in the previous section, it can be deduced that YOLO-OSD has better detection results on all three datasets than YOLO5, YOLO7, and RetinaNet, and that it has significantly shorter training times than all the other models. Both the data-centric and the model-centric approaches contribute to the improvement of results. The data-centric approach involves the customization of the initial anchor boxes based on the statistical analysis of the data distributions, whereas the model-centric approach focuses on the optimization of the YOLO5 backbone by replacing the C3 blocks with C3x blocks and reducing the overall number of repetitive blocks. This change in architecture reduced the number of layers as well as the overall network parameters while facilitating the learning of small targets by increasing the size of the intermediate feature maps.
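This section does not spell out how the anchor sizes were derived from the data statistics. One common way to fit anchors to a dataset is to cluster the labelled box widths and heights with k-means. The sketch below assumes that technique, using plain Euclidean k-means in NumPy; the function name `kmeans_anchors` and the toy box sizes are hypothetical and are not taken from the paper.

```python
import numpy as np


def kmeans_anchors(wh, k, iters=50, seed=0):
    """Cluster (width, height) pairs into k anchor sizes with plain k-means."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct boxes picked at random from the data.
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # Assign every box to its nearest center (Euclidean distance in w-h space).
        dists = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its boxes; keep it if its cluster is empty.
        new_centers = np.array([
            wh[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Return anchors sorted by area, smallest first.
    return centers[np.argsort(centers.prod(axis=1))]


# Hypothetical toy data: small, medium, and large ship boxes in pixels.
boxes = [(8, 10), (9, 11), (10, 9), (30, 40), (32, 38), (28, 42),
         (90, 60), (95, 65), (88, 58)]
anchors = kmeans_anchors(boxes, k=3)
print(anchors)
```

Because k-means depends on its starting points, a practical run would typically repeat the clustering with several seeds and keep the anchors that best match the labelled boxes.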
In terms of detection accuracy, YOLO8 has better performance as it is an anchor-free algorithm, but it has significantly longer training times. Overall, the RetinaNet model has the lowest performance in terms of almost all evaluation metrics. It also has the highest training times on all datasets. This is because RetinaNet has a deeper and more complex network architecture based on ResNet and utilizes two subnetworks for classification and regression, in contrast to the highly optimized YOLO architectures. Due to its complex and large backbone, RetinaNet requires more computational resources and training time to achieve results comparable to the YOLO models. For the sake of fair comparison and benchmarking, we set the number of training epochs for each model to 150 in our experiments, and RetinaNet needs a significantly larger number of training epochs to produce better results. The inference speed of RetinaNet is also lower than that of the YOLO models. Therefore, it can be suggested that it is not suitable for real-time ship detection in SAR images. Furthermore, it can be seen from Figs. 13, 15, and 17 that RetinaNet is not very robust against small targets and hence has a significantly high number of false negatives for small ships. This also explains the remarkably low recall values of RetinaNet compared with the other models. A possible reason for the high number of false negatives is that RetinaNet uses a focal loss function, which is designed to address the class imbalance problem by down-weighting the easy examples. However, this potentially also reduces the sensitivity of the model to hard examples, such as small ships in our case, thus leading to low recall values. Apart from that, it is evident from Tables V-VII that the pretrained versions of all models (on the MS COCO dataset), especially YOLO7 and YOLO8, perform slightly better than the models trained from scratch. This hints that certain features from the MS COCO dataset are still relevant for the task of ship detection in SAR satellite images and that the initial weights trained on MS COCO slightly facilitate the learning of ships in SAR datasets.
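The down-weighting of easy examples mentioned above for RetinaNet can be made concrete with the standard focal loss, FL(p_t) = -alpha (1 - p_t)^gamma log(p_t), where p_t is the predicted probability of the true class. The sketch below compares it with plain cross-entropy for one easy and one hard example. The values gamma = 2 and alpha = 0.25 are the defaults commonly quoted for focal loss and are assumptions here, as are the two example probabilities; none of them is reported in this section.

```python
import math


def cross_entropy(p_t):
    """Cross-entropy loss for a predicted true-class probability p_t."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """Focal loss: cross-entropy scaled by alpha * (1 - p_t) ** gamma."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Assumed example probabilities: a confidently correct (easy) prediction
# and a poorly predicted (hard) one.
easy, hard = 0.95, 0.10

# Ratio of focal loss to cross-entropy, i.e. the effective weight of each example.
ratio_easy = focal_loss(easy) / cross_entropy(easy)
ratio_hard = focal_loss(hard) / cross_entropy(hard)

print(f"easy example weight: {ratio_easy:.6f}")
print(f"hard example weight: {ratio_hard:.6f}")
```

The ratio equals alpha * (1 - p_t)^gamma, so the easy example keeps a far smaller share of its cross-entropy loss than the hard one. This shows only how the modulating factor rescales individual losses; it does not by itself establish the effect on recall for small ships.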
Moreover, the cross-validation analysis of all the models on the different datasets shows that the newly proposed iVision-MRSSD dataset can provide SOTA DL models with good generalization capability due to its rich diversity in terms of satellite sensors, spatial resolutions, imaging frequencies, and scene types. Our proposed YOLO-OSD also has a slightly better capability of learning generalized features when trained on a complex and diverse dataset, such as iVision-MRSSD.
We chose the three latest models of the YOLO series for the comparative analysis, rather than other object detection models, because previous studies have already established that YOLO models outperform others on the task of ship detection in SAR satellite images [30]. The authors in [32] and [41] compared the performance of YOLO3 and YOLO4 with Faster-RCNN, SSD512, and MobileNetV3 on different SAR ship datasets and showed that YOLO4 outperforms the other models in terms of the evaluation metrics. We also trained the RetinaNet model using the Detectron2 library and, as expected, it underperforms on almost all evaluation metrics when compared with the YOLO models. Moreover, we chose YOLO5 for further optimization because initial experiments showed that, owing to its simpler and relatively lightweight architecture, it had the lowest training times compared with YOLO7 and YOLO8, which is also evident from the results summarized in Tables V-VII. This choice supports the case of real-time SAR ship detection without compromising detection accuracy. Furthermore, since YOLO8 is an anchor-free model, the idea of anchor box adjustment could not be incorporated into it.

VII. CONCLUSION
Ship detection in SAR satellite images poses various challenges, including distorted ship boundaries, interclass similarity issues, and weak generalization capabilities of SOTA object detection algorithms. To address some of these challenges, we proposed an optimized ship detection model named YOLO-OSD based on a hybrid data-model centric approach. The newly proposed model has approximately 16% fewer network parameters than YOLO5 and is approximately 40% faster than YOLO8 with comparable detection performance. Moreover, the proposed YOLO-OSD outperforms YOLO5, YOLO7, and RetinaNet in terms of F1 score and mAP on all three datasets included in the study. We further analyzed the effect of transfer learning on the problem of SAR ship detection and conclude that models pretrained on the MS COCO dataset have slightly better detection performance than models trained from scratch with randomly assigned weights. We also performed cross-validation of the models on all datasets under observation and determined that the iVision-MRSSD dataset has rich diversity and, in combination with our proposed YOLO-OSD, provides better generalization capabilities over unseen data with different distributions. Our current work is limited in the sense that it analyzes the results of only one anchor-free model, YOLO8. In the future, we would like to investigate the possibility of anchor-free optimized SAR ship detection. Furthermore, we would also like to address the case of ship detection using oriented B-Boxes, which can significantly increase the IoU and mAP scores.

1) Precision: Determines how many of the positively detected ships are actually correct

$$\text{Precision} = \frac{\text{True Detections}}{\text{True Detections} + \text{False Detections}}$$

2) Recall: Also termed the sensitivity of the model, it determines how many of the actual ships have been detected correctly

$$\text{Recall} = \frac{\text{True Detections}}{\text{True Detections} + \text{False Omissions}}$$
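As a small worked example of the two definitions above, the sketch below computes precision and recall from detection counts, together with the F1 score in its standard form as the harmonic mean of the two. The function name `detection_scores` and the counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def detection_scores(true_detections, false_detections, false_omissions):
    """Return (precision, recall, f1) from raw detection counts.

    Assumes at least one true detection, so that no denominator is zero.
    """
    precision = true_detections / (true_detections + false_detections)
    recall = true_detections / (true_detections + false_omissions)
    # Standard F1 score: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 ships found correctly, 10 false alarms, 30 ships missed.
precision, recall, f1 = detection_scores(90, 10, 30)
print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```

With these counts, precision is 90/100 and recall is 90/120, so a detector that misses many ships is penalized in recall and F1 even when most of its detections are correct.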

TABLE I POPULAR DL-BASED OBJECT DETECTORS IN THE PAST DECADE

TABLE II COMPARISON OF PUBLICLY AVAILABLE SATELLITE BASED SAR SHIP DATASETS

TABLE III SET OF FINAL ANCHOR BOX SIZES BASED ON DATA DISTRIBUTION

TABLE VII PERFORMANCE COMPARISON OF YOLO VARIANTS ON SSDD DATASET

TABLE VIII SUMMARY OF ABLATION EXPERIMENTS RELATED TO THE PROPOSED YOLO-OSD MODEL

TABLE IX CROSS VALIDATION RESULTS OF YOLO5, 7, 8 AND YOLO-OSD MODELS