A Deep Learning Model to Extract Ship Size From Sentinel-1 SAR Images

This study develops a deep learning (DL) model, named SSENet, to extract the ship size from Sentinel-1 synthetic aperture radar (SAR) images. We employ a single shot multibox detector (SSD)-based model to generate a rotatable bounding box (RBB) for the ship. We design a deep-neural-network (DNN)-based regression model to estimate the accurate ship size. The hybrid inputs to the DNN-based model include the initial ship size and orientation angle obtained from the RBB and the abstracted features extracted from the input SAR image. We design a custom loss function named mean scaled square error (MSSE) to optimize the DNN-based model. The DNN-based model is concatenated with the SSD-based model to form the integrated SSENet. We employ a subset of the OpenSARShip, a data set dedicated to Sentinel-1 ship interpretation, to train and test SSENet. The training/testing data set includes 1500/390 ship samples. Experiments show that SSENet is capable of extracting the ship size from SAR images end to end. The mean absolute errors (MAEs) of length and width are 7.88 and 2.23 m, respectively, both under 0.8 pixels. The hybrid input significantly improves the model performance. Compared with the mean square error (MSE) loss function, the MSSE reduces the MAE of length by nearly 1 m and increases the MAE of width by 0.03 m. Compared with the well-performing gradient boosting regression (GBR) model, SSENet reduces the MAE of length by nearly 2 m (18.68%) and that of width by 0.06 m (2.51%). SSENet shows robustness on different training/testing sets.


I. INTRODUCTION
SHIP detection is of great significance to marine activities, such as marine transportation, fishery management, and maritime safety [1]. Spaceborne synthetic aperture radar (SAR) can monitor targets under all-day and all-weather conditions, making it one of the most critical marine surveillance tools [2]. Ship detection from SAR images has long been a hotspot in marine applications [3]-[6]. With the advent of a new generation of satellites and fast-growing image analysis technology, it is feasible to extract more detailed ship information from SAR images in addition to detecting the ship [7]. The ship's length and width provide essential information for ship classification and marine surveillance [8]. In most cases, it is difficult to identify the ship's type directly from the SAR image, and the size information helps. In addition, detailed geometric parameter estimation is meaningful to SAR imagery interpretation. With the rapid increase in SAR data volume, an efficient and accurate ship size extraction method offers a new avenue for SAR image interpretation. Generally, as metallic objects, ships reflect the electromagnetic waves of SAR sensors much more strongly than the surrounding water. Therefore, one can perceive most ships on SAR images as bright targets with high backscattering intensity, characterized by high normalized radar cross section (NRCS) values. The geometric features of the ship's NRCS signature, such as the length and width of the minimum bounding rectangle (MBR), provide an initial size for estimating a ship's ground size. Meanwhile, the ship's superstructure, the environment, the sea-ship interaction, and the imaging system can influence the NRCS from the ship [8]. These factors lead to a large gap between the initial size and the ground size. Therefore, it is challenging to extract the ship size from SAR images accurately.
SAR ship size extraction algorithms in the literature generally follow three steps: 1) binarization, 2) initial size extraction, and 3) accurate size estimation. The first step processes the SAR image and divides the pixels into ship signatures and nonship signatures. The second step extracts the initial ship size from the binary results, such as the MBR of the ship signature. The third step estimates the accurate size based on the initial ship size and other relevant factors. Stasolla and Greidanus [9] employed the constant false alarm rate (CFAR) for the first step. The CFAR family [10]-[12] is widely used in ship detection on SAR images to separate the ship signature from the background. For the second step, they employed mathematical morphology to refine the signature and extract the MBR of the ship. They did not develop the third step and adopted the MBR's length and width as the vessel's estimated length and width. The method was tested with 127 available ship samples from Sentinel-1 SAR images. The mean absolute error (MAE) of length is 30 m (relative error: 16%), and the MAE of width is 11 m (relative error: 37%). To further reduce the estimation error, researchers took the initial ship size of the second step as the initial information and performed the third step by statistical/machine learning (ML) methods, such as multiple linear regression [13], the kernel-based method [7], and nonlinear regression [8]. Among them, the nonlinear regression model is a typical representative. In 2018, Li et al. [8] used OpenSARShip [14], a large-volume data set dedicated to Sentinel-1 ship interpretation, to extract the ship's length and width. First, they performed the binarization step by a threshold-based method and obtained the ship signature. Second, they refined the ship signature by an image segmentation operation and obtained the initial ship size. Third, they fused the dual-polarization information and considered the initial ship size, the environment information, the sensor information, etc., to construct a nonlinear regression model based on gradient boosting. Gradient boosting is an ensemble learning method that allows the optimization of arbitrary differentiable loss functions [15]. The nonlinear regression model established an accurate mapping between the input factors and the ship length and width. They validated their results against ships longer than 80 m. Experiments showed that the MAEs of length and width are 8.80 m (relative error: 4.66%) and 2.17 m (relative error: 7.01%), respectively. As the pixel spacing of the Sentinel-1 images in their database is 10 m, the model made noticeable progress and pushed the error under one pixel spacing.
Overall, the accuracy of the extracted size has improved continuously. With the rapid increase in satellite data, however, the traditional three-step procedure shows an obvious limitation: it is extremely complex. In most cases, the first two steps, binarization and initial size extraction, need sophisticated image operations to serve the subsequent estimation step [8]. For the third step, selecting influence factors to construct conventional ML-based regression models requires high-level prior expert knowledge and manual engineering [16], which is also a big challenge. The errors generated in each step accumulate and ultimately affect the final size extraction accuracy. Therefore, in the era of big data, it is feasible to develop new methods to improve ship size extraction accuracy and efficiency.
ML has gradually evolved into deep learning (DL), which brings new ideas for addressing the above challenge [17]. A typical DL model consists of deep neural networks (DNNs), accepts input data in a raw format, and automatically learns the required features to achieve classification or prediction [18]. This process is known as end-to-end learning. Compared with conventional ML, DL significantly simplifies feature engineering and is suitable for modeling big data and complex relationships. In recent years, DL has been successfully applied in oceanography, geography, and remote sensing, which has helped people gain further process understanding of Earth system science problems [19]-[25].
A deep convolutional neural network (CNN) is a particular type of DNN composed of CNN layers. A CNN layer connects to the previous layer's local patches through convolution kernels to extract local spatial features [26]. Since deep CNN models have achieved great success in image classification [27], researchers have proposed various ship detection models based on typical deep CNN frameworks. Representatives include the faster region-based convolutional network (Faster-RCNN) [28], [29], the single shot multibox detector (SSD) [30], [31], you only look once (YOLO) [32], and other integrated detection frameworks [33], [34]. These DL-based models achieved good performance in detecting ships from SAR images, with an average precision (AP) over 80%, which is an obvious improvement over the classical models. To further improve the performance, researchers introduced a rotatable bounding box (RBB) to replace the traditional nonrotating bounding box in the DL detection frameworks. Liu et al. optimized the traditional SSD [35] by rotating the prior box [36] and developed DRBox. Compared with the Faster-RCNN and the traditional SSD, DRBox improves the AP in detecting densely arranged ships by 11.77% and 11.17%, respectively. Furthermore, DRBox-v2 [37] optimized DRBox by integrating the feature pyramid network (FPN) [38] and focal loss (FL) [39]. Validation experiments based on the SAR ship detection data set (SSDD) showed that the AP of DRBox-v2 reaches 92.81%, which is 9.36% and 6.4% higher than those of the rotation dense feature pyramid networks (R-DFPNs) [40] and DRBox, respectively.
DL has become a mainstream solution for typical ship detection tasks. Compared with conventional models, DL-based models significantly simplify feature engineering and achieve end-to-end detection with higher accuracy and better robustness. Inspired by this, we construct an end-to-end DL model to replace the traditional three-step procedure for ship size extraction from SAR images. Our study makes two contributions. First, we develop an end-to-end DL-based model, SSENet, to extract the ship size from SAR images. SSENet employs DRBox-v2 to generate the ship's RBB from the SAR image and constructs a DNN-based regression model to estimate the accurate ship size. To the best of our knowledge, SSENet is the first end-to-end model for ship size extraction from satellite images. Second, we design a hybrid input and a loss function named mean scaled square error (MSSE) for the DNN-based regression model, which significantly improve ship size estimation accuracy.
The rest of this article is organized as follows. Section II describes the data. Section III presents SSENet in detail. In Section IV, experiments are conducted to evaluate the effectiveness of SSENet. Section V discusses the performances of different regression models. The factors that influence the model performances are also discussed. Finally, Section VI concludes this article.

II. DATA

A. OpenSARShip Database
The experimental data are a subset of the OpenSARShip. The OpenSARShip (http://opensar.sjtu.edu.cn/) is a data set dedicated to Sentinel-1 ship interpretation, providing 11 346 SAR ship chips integrated with the automatic identification system (AIS) messages. The Sentinel-1 images are ground range detected (GRD) products acquired in the interferometric wide swath (IW) mode. The spatial resolution is about 20 m, and the pixel spacing is 10 m.
The preprocessing procedures, such as radiometric calibration and terrain correction, are carried out with SNAP 3.0. Each SAR chip contains one ship and is stored as a matrix of pixel amplitude values for the VH (vertical emitting and horizontal receiving) and VV (vertical emitting and vertical receiving) polarizations. Here, we use 1890 samples in the VV mode as the experimental data set. Fig. 1(a), (d), (g), and (j) show four types of ships (cargo, tanker, tug, and fishing) as examples. The AIS messages provide the ground size for each ship, which has been integrated into the OpenSARShip and can be obtained directly.

B. Labeling
The ground truths for training and testing SSENet include two parts. The first one is the ground ship size, which can be obtained directly from the OpenSARShip. The other one is the ground RBB for each ship. An RBB is a rotated bounding box surrounding the ship's signature on the SAR image. DRBox-v2 is the first part of SSENet and generates the RBB of each ship. The ground RBB is used to train DRBox-v2 to generate an accurate RBB. As the OpenSARShip has no RBB, we manually label the RBB for each ship with a MATLAB tool shared with DRBox-v2 [37]. The rule for labeling an RBB is to surround the ship's signature on the SAR image as precisely as possible. Fig. 1(b), (e), (h), and (k) show the labeled RBB of each ship. The angle range of the labeled RBBs is (0°, 180°). It is worth noting that there is an apparent gap between the size of the labeled RBB and the ship's ground size [Fig. 1(b)/(c), (e)/(f), (h), and (k)]. Therefore, as stated earlier, even after a ship has been detected, extracting its size remains challenging.

III. METHOD
The general structure of SSENet includes three steps (see Fig. 2): 1) generating RBBs, 2) estimating the ship size based on a DNN-based regression model, and 3) calculating the MSSE loss and optimizing SSENet. The first step takes the SAR chip as input and automatically detects the ship's RBBs by a deep CNN model, DRBox-v2. Then, the RBB with the highest confidence is selected as the initial RBB. The second step estimates the ship size by a DNN model. We construct a hybrid input for the DNN model. The hybrid input consists of two parts: 1) the initial ship parameters (the ship size and orientation angle) obtained from the initial RBB and 2) the feature map extracted in the first step. The DNN model then predicts the width and length of the ship. The third step calculates the MSSE loss of the estimated size. The MSSE loss is summed with the confidence and location losses calculated in the first step to form the final loss. All trainable parameters of SSENet are then optimized against this final loss.
Overall, the idea of SSENet is consistent with the traditional three-step ship size extraction method: obtaining the initial ship size first (first two steps) and then estimating the accurate ship size (the third step). Since the CNN has advantages in image target detection and the DNN has advantages in nonlinear regression, SSENet combines the two to achieve size extraction in an end-to-end way.

A. Generating RBBs
Generating RBBs for the ship is based on DRBox-v2 [37]. DRBox-v2 is a ship detection model based on the SSD. It optimizes the traditional SSD by integrating the angle of ship orientation into the SSD, and it outputs the RBBs of ships. The input is a 300 × 300 pixel SAR image. We employ VGG16, which consists of five feature extraction modules, to extract feature maps from the input image. The first feature extraction module comprises two CNN layers, and the others are composed of a max-pooling layer and two CNN layers. Five feature maps named $F_1$, $F_2$, ..., $F_5$ are generated. The numbers of channels in the $F_1$-$F_5$ feature maps are 64, 128, 256, 512, and 512. As the kernel of the four max-pooling layers is 2 × 2, the spatial sizes of the $F_1$-$F_5$ feature maps are 300 × 300, 150 × 150, 75 × 75, 38 × 38, and 19 × 19 pixels, respectively. The output modules perform convolution on the feature maps and generate output maps $O_f$ [Fig. 2(b)]. Softmax and sigmoid functions [41] activate $O_f$ to create the confidence of being a ship and the location offsets of each prior RBB. We use three feature maps ($F_2$-$F_4$) to generate $O_f$. FPN is employed to fuse feature maps at different levels. The cross-entropy loss and the smooth L1 loss [35] are used as the confidence loss and location loss for DRBox-v2.
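The feature map sizes quoted above can be checked with a few lines of Python. This is a minimal sketch, not the DRBox-v2 code: it assumes that the convolutions preserve the spatial size and that each 2 × 2 max-pooling rounds odd sizes up, which is the only rounding that reproduces the 75 → 38 step. The function name `vgg16_feature_map_sizes` is hypothetical.

```python
import math


def vgg16_feature_map_sizes(input_size=300, num_modules=5, pool=2):
    """Spatial side length of F1..F5 for a square input.

    Assumptions (not stated in the text): convolutions keep the spatial
    size, and each max-pooling layer rounds up (ceil mode).
    """
    sizes = [input_size]  # F1: the first module has no pooling layer
    for _ in range(num_modules - 1):
        sizes.append(math.ceil(sizes[-1] / pool))
    return sizes


channels = [64, 128, 256, 512, 512]  # channel counts given in the text
for index, (side, depth) in enumerate(zip(vgg16_feature_map_sizes(), channels), start=1):
    print(f"F{index}: {side} x {side} x {depth}")
```

With floor rounding instead, the fourth map would be 37 × 37, so the stated sizes imply the rounding-up behaviour assumed here.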
After the first step, a series of candidate RBBs with confidence are obtained for a ship, which provides initial references for the subsequent accurate size estimation.
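Choosing the initial RBB from the candidates amounts to an argmax over confidence. The sketch below illustrates this with a hypothetical `CandidateRBB` record; the field names and the sample values are assumptions for illustration, not the DRBox-v2 data structures.

```python
from typing import List, NamedTuple


class CandidateRBB(NamedTuple):
    """Hypothetical container for one decoded candidate box."""
    cx: float          # center x (pixels)
    cy: float          # center y (pixels)
    length: float      # long side (pixels)
    width: float       # short side (pixels)
    angle_deg: float   # RBB angle in (0, 180) degrees
    confidence: float  # predicted probability of being a ship


def select_best_rbb(candidates: List[CandidateRBB]) -> CandidateRBB:
    """Return the candidate with the highest confidence."""
    if not candidates:
        raise ValueError("no candidate RBBs were generated")
    return max(candidates, key=lambda rbb: rbb.confidence)


# Illustrative values only.
candidates = [
    CandidateRBB(150.0, 150.0, 20.0, 5.0, 40.0, 0.62),
    CandidateRBB(151.0, 149.0, 22.0, 4.0, 43.0, 0.91),
    CandidateRBB(148.0, 152.0, 18.0, 6.0, 38.0, 0.35),
]
best = select_best_rbb(candidates)
print(best.length, best.width, best.angle_deg)
```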

B. Estimating Ship Size Based on a DNN Model
First, we construct a hybrid input for the DNN model. The first part of the hybrid input is the initial ship size and orientation angle obtained from the best RBB, which provides primary and direct information for accurate ship size regression. In Fig. 2(c), we select the RBB with the highest confidence value from the candidate RBBs as the best RBB. The length and width of the best RBB are employed as the initial ship size. The orientation angle of the best RBB is employed to calculate the ship orientation angle. The orientation angle measures the ship orientation relative to the azimuth direction, as shown in Fig. 3. It affects the ship signature on the SAR image [8], [14]. It does not distinguish between the bow and the stern of a ship, and its range is (−90°, 90°).
The other part of the hybrid input is the feature map extracted from the input SAR image. The SAR image contains the ship's signature and the sea clutter under typical environmental conditions, which provides potential information for ship size estimation. The ship's signature on the SAR image reflects the ship's state, moving or stationary. A moving target point is often located in more than one resolution cell during the SAR integration time. The dispersion of the backscattered energy causes smearing and a degradation of brightness in the SAR image. In addition, the signature of a moving ship shows an azimuth displacement. The SAR system receives the Doppler signal from the scatterer in the azimuth direction. The azimuth position of a stationary ship is consistent with the azimuth position of the SAR platform. For a moving ship, however, there is an additional component in the Doppler shift, resulting in an azimuthal displacement of the ship signature. The environmental conditions during satellite imaging, such as wind fronts, ocean waves, and rain cells, also affect the ship's signature on the SAR image. Under typical conditions, the sea-ship interaction produces a complex ship motion in the real world and relatively different polarimetric scattering mechanisms in the SAR signature. Several studies have analyzed the impacts of complex environmental conditions on the ship's signature [42]-[44]. The correlation between the ship's state, the environmental conditions, and the ship's size has been confirmed in [8]. The abstracted feature map extracted from the input SAR image contains the above factors. Therefore, we transform the high-level feature map $F_5$ in Fig. 2(b) and use it as the other part of the hybrid input.
Second, we construct a DNN model to regress the accurate ship size. The DNN model is composed of several fully connected neural network (NN) layers. The hybrid input is fed into the DNN model. After layer-by-layer transformation, the DNN model outputs the ship's accurate length and width. In the following, we detail the hybrid input and the DNN regression model.

1) Constructing a Hybrid Input for the DNN Model:
Constructing a hybrid input includes two procedures: 1) obtaining the initial ship size and orientation angle and 2) transforming feature map $F_5$ as inputs.
1) Obtaining the initial ship size and orientation angle. A series of candidate RBBs is generated by DRBox-v2. The RBB with the largest confidence value is selected as the best RBB for the ship. The best RBB output by the location regression of DRBox-v2 (SSD) is an encoded RBB, not a standard RBB [35], [37]. A standard RBB consists of five parameters: center coordinates (x, y), length (l), width (w), and orientation angle (θ). An encoded RBB consists of the offsets of these five parameters (x, y, l, w, and θ) [Fig. 4(a)]. To obtain the standard ship size and orientation angle, we decode the encoded RBB to a standard one by the decoding transformation [37] [Fig. 4(b)]. The length ($X_L$) and width ($X_W$) of the decoded RBB are input to the DNN regression model. As the range of the RBB angle is (0°, 180°), we subtract 90° from the angle to obtain the initial ship orientation (θ − 90°) [Fig. 4(c)]. Since the effect of orientation is symmetric about the azimuth direction, we input the cosine of the ship orientation, cos(θ − 90°), into the DNN regression model ($X_{OA}$ in Fig. 2).
2) Transforming feature map $F_5$ as inputs. To reduce the dimension of the DNN's input vector, we compress $F_5$ in the channel and spatial dimensions. First, we transform $F_5$ by a CNN layer whose convolutional kernel size is 1 × 1 and whose filter number is N, obtaining $F_{5S}$. The channel number of $F_5$ is thereby reduced from 512 to N [Fig. 5(a)]. Second, we perform a max-pooling operation of size S on $F_{5S}$ and obtain a new feature map $F_6$ [Fig. 5(b)]. The spatial side length of $F_6$ is about 19/S (for example, 5 × 5 when S = 4). We discuss the N and S values in the experimental section. Finally, we flatten $F_6$ into a 1-D feature vector, which is concatenated with the initial width, length, and orientation to form the DNN regression model's hybrid input.
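The two procedures above can be summarized in a short sketch that computes the orientation feature and the length of the resulting input vector. It is an illustration under stated assumptions: the pooled side length is taken as the ceiling of 19/S (which gives the 5 × 5 map reported for S = 4), the ordering of the concatenated values is arbitrary, and all function names are hypothetical.

```python
import math


def orientation_feature(rbb_angle_deg):
    """cos(theta - 90 deg) for an RBB angle theta in (0, 180) degrees."""
    return math.cos(math.radians(rbb_angle_deg - 90.0))


def hybrid_input_length(n_channels, pool_size, f5_side=19, n_scalars=3):
    """Length of the flattened F6 plus the three scalar inputs.

    Assumes the pooled side length is ceil(19 / S).
    """
    side = math.ceil(f5_side / pool_size)
    return side * side * n_channels + n_scalars


def build_hybrid_input(f6_flat, init_length, init_width, rbb_angle_deg):
    """Concatenate the flattened F6 with length, width, and cos(theta - 90 deg).

    The ordering of the three scalars is an assumption of this sketch.
    """
    return list(f6_flat) + [init_length, init_width, orientation_feature(rbb_angle_deg)]


# N = 3 channels and S = 4 give a 5 x 5 x 3 map, i.e. 75 features plus 3 scalars.
print(hybrid_input_length(n_channels=3, pool_size=4))
# The cosine maps mirrored orientations (for example 60 and 120 degrees) to the same value.
print(orientation_feature(60.0), orientation_feature(120.0))
```

Using the cosine rather than the raw angle means that two ships mirrored about the azimuth direction produce the same orientation input, which matches the symmetry argument in the text.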

2) Regressing Accurate Ship Size Based on a DNN Model:
The feature vector of $F_6$ is concatenated with the size vector to form the hybrid input for the DNN model [Fig. 2(c)]. Here, the DNN model has three hidden NN layers, each containing 256 neurons. The number of hidden NN layers and the number of neurons are obtained by the parameter-tuning experiment detailed in Section IV-F. The activation function of each hidden layer is the rectified linear unit (ReLU). An output layer with two neurons is stacked on the last hidden layer. As we scale the ship size values to 0-1, we employ a sigmoid function to activate the output layer. Finally, the predicted width $W_p$ and the predicted length $L_p$ are obtained [Fig. 2(c)].
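To make the layer structure concrete, the following is a minimal pure-Python forward pass with three ReLU hidden layers of 256 neurons and a two-neuron sigmoid output. It is only a sketch: the weights are random and untrained, the input length of 78 assumes N = 3 and S = 4, the initialization scheme is arbitrary, and the actual model was implemented in TensorFlow.

```python
import math
import random


def init_dense(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)  # arbitrary choice for this sketch
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine transform: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def build_regressor(n_in, hidden=(256, 256, 256), n_out=2, seed=0):
    """Three hidden layers of 256 neurons and a two-neuron output layer."""
    rng = random.Random(seed)
    sizes = [n_in, *hidden, n_out]
    return [init_dense(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict(layers, x):
    """ReLU on hidden layers, sigmoid on the output, so both outputs lie in (0, 1)."""
    for layer in layers[:-1]:
        x = relu(dense(x, layer))
    return sigmoid(dense(x, layers[-1]))


layers = build_regressor(n_in=78)
scaled_size = predict(layers, [0.5] * 78)  # two scaled size values
print(scaled_size)
```

The sigmoid output is the reason the ground-truth sizes must be scaled to 0-1 before training; predictions are rescaled to metres afterwards.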

C. Calculating MSSE Loss and Optimizing SSENet
We design a new loss function, MSSE, for the DNN-based ship size regression model. For regression tasks, the mean square error (MSE) is a widely adopted loss function:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 \tag{1}$$
In (1), $y_i$ represents the ground truth, $\hat{y}_i$ represents the predicted value, and N is the number of values to be predicted. The loss value calculated by MSE does not depend on the magnitude of the ground truth value. For example, suppose that a ship's true length and width are 100 and 50 m, respectively, and that the corresponding predicted length and width are 80 and 30 m. The squared errors for length and width are the same, both 400, so they contribute equally to optimizing the model parameters. However, the length of a ship is usually larger than its width and therefore draws more attention than the width. To improve length estimation accuracy, we want the length loss to contribute more to optimizing the model weights than the width loss.
We therefore propose the MSSE loss function. Unlike MSE, MSSE integrates the ground truth values of ship length and width into the classical MSE: the ground truth is used as a dynamic parameter to scale the square error. MSSE is defined as
$$\mathrm{MSSE} = \frac{1}{N}\sum_{i=1}^{N} y_i\left(y_i - \hat{y}_i\right)^2 \tag{2}$$
where $y_i$, $\hat{y}_i$, and N are the same as in (1). For the example in the previous paragraph, the MSSE losses of length and width are 40 000 and 20 000, respectively. The length loss is much greater than the width loss, so the penalty on length errors increases during training, which promotes the optimization procedure. Replacing $y_i$ and $\hat{y}_i$ in (2) by the ground length and the predicted length of the ith ship sample, we calculate the loss of length $\mathrm{MSSE}_L$. Similarly, replacing $y_i$ and $\hat{y}_i$ by the ground width and the predicted width, we calculate the loss of width $\mathrm{MSSE}_W$. The size loss ($L_{\mathrm{size}}$) is the sum of $\mathrm{MSSE}_L$ and $\mathrm{MSSE}_W$:
$$L_{\mathrm{size}} = \mathrm{MSSE}_L + \mathrm{MSSE}_W \tag{3}$$
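The worked example can be reproduced numerically. The sketch below assumes that MSSE weights each squared error by its ground-truth value, which is the form that yields the 40 000 and 20 000 quoted above; the function names are hypothetical, and the example uses metres as in the text although training uses sizes scaled to 0-1.

```python
def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def msse(y_true, y_pred):
    """Mean scaled square error: each squared error is scaled by its ground truth."""
    return sum(t * (t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def size_loss(length_true, length_pred, width_true, width_pred):
    """Sum of the length and width MSSE terms."""
    return msse(length_true, length_pred) + msse(width_true, width_pred)


# One ship: true length/width 100/50 m, predicted 80/30 m.
print(mse([100.0], [80.0]), mse([50.0], [30.0]))    # equal under MSE
print(msse([100.0], [80.0]), msse([50.0], [30.0]))  # length dominates under MSSE
print(size_loss([100.0], [80.0], [50.0], [30.0]))
```

Because the scaling factor is the ground truth rather than the prediction, it acts as a fixed per-sample weight and does not change the direction of the gradient for any single sample; it only changes how strongly large ships and long dimensions pull on the shared weights.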
In addition to $L_{\mathrm{size}}$, the confidence loss ($L_{\mathrm{conf}}$) and the location loss ($L_{\mathrm{loca}}$) are two further losses calculated in the first step of detecting the ship's RBB [Fig. 2(b)]. $L_{\mathrm{conf}}$ is the cross-entropy loss, and $L_{\mathrm{loca}}$ is the smooth L1 loss [37], [45]. In these two losses, N is the number of predicted targets, $c_i$ is the ground confidence of a sample, $\hat{c}_i$ is the predicted confidence of a sample, and $x_i$ is the element-wise difference between the ground RBB and the predicted RBB. In the training procedure, the three losses, $L_{\mathrm{size}}$, $L_{\mathrm{conf}}$, and $L_{\mathrm{loca}}$, are added to form the final loss that optimizes SSENet integrally.
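The exact expressions for the two detection losses are not reproduced in this text, so the sketch below uses the standard binary cross-entropy and the standard smooth L1 function as stand-ins. These forms, the equal weighting of the three terms, and all function names are assumptions made for illustration.

```python
import math


def cross_entropy(c_true, c_pred, eps=1e-12):
    """Binary cross-entropy averaged over N samples (standard form, assumed)."""
    total = 0.0
    for c, p in zip(c_true, c_pred):
        total += c * math.log(p + eps) + (1.0 - c) * math.log(1.0 - p + eps)
    return -total / len(c_true)


def smooth_l1(x):
    """Standard smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def location_loss(differences):
    """Smooth L1 averaged over the element-wise RBB differences."""
    return sum(smooth_l1(x) for x in differences) / len(differences)


def final_loss(l_size, l_conf, l_loca):
    """The three losses are simply added, as described in the text."""
    return l_size + l_conf + l_loca


l_conf = cross_entropy([1.0, 0.0], [0.9, 0.2])
l_loca = location_loss([0.5, -2.0, 0.1])
print(final_loss(0.001, l_conf, l_loca))
```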

IV. EXPERIMENTS

A. Experimental Setting
The experimental data are a subset of the OpenSARShip. We randomly selected 1890 SAR chips from the OpenSARShip. Each SAR chip contains one ship. We manually label the RBB of each ship, as described in Section II. The corresponding ground length and width of each ship are collected from the AIS information in the OpenSARShip. The values of length and width are scaled to 0-1. We randomly choose 1500 SAR chips as the training set. The remaining 390 chips form the testing set. The chip size is 300 × 300 pixels.
The model runs on a workstation with one GeForce RTX 2070 GPU with 8 GB of memory. The model is coded in Python 3.6 with TensorFlow as the DL package. The training batch size is 6, which is the maximum value that fits in the 8 GB of GPU memory. The initial learning rate is 0.0002. During the training process, the learning rate is halved every 5000 training epochs. The training procedure stops when $L_{\mathrm{size}}$ < 0.001, $L_{\mathrm{loca}}$ < 0.005, and the composite loss < 0.01. The initial learning rate and the training stop condition are set based on experience and fine-tuned according to the training set's loss curve.
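The learning-rate schedule and the stopping rule described above can be written down directly. This is a minimal sketch with hypothetical function names; it assumes the halving is applied as a step function at every multiple of 5000 and that all three thresholds must hold at the same time.

```python
def learning_rate(epoch, initial=0.0002, halve_every=5000):
    """Step decay: the rate is halved after every `halve_every` epochs."""
    return initial * 0.5 ** (epoch // halve_every)


def should_stop(l_size, l_loca, composite_loss):
    """Stop once all three loss thresholds from the text are met."""
    return l_size < 0.001 and l_loca < 0.005 and composite_loss < 0.01


for epoch in (0, 4999, 5000, 12000):
    print(epoch, learning_rate(epoch))
print(should_stop(0.0005, 0.004, 0.009))
```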

B. Evaluation Metrics
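We employ the typical absolute and relative error metrics to evaluate the model performance: MAE and mean absolute percentage error (MAPE). Assuming that $y_i$ is the ground truth, $\hat{y}_i$ is the predicted value, and N is the number of values to be predicted, the definitions of MAE and MAPE are as follows:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$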
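For reference, the two metrics in their standard form can be computed as in the sketch below; the sample values are illustrative and are not taken from the experiments.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the units of the inputs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; ground truths must be nonzero."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative lengths in metres.
truth = [100.0, 50.0]
prediction = [80.0, 30.0]
print(mae(truth, prediction), mape(truth, prediction))
```

The same absolute error of 20 m gives a relative error of 20% for the 100 m value and 40% for the 50 m value, which is why MAPE is more sensitive for small dimensions such as width.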

C. Model Performance Test
As parameter tuning is an essential part of DL modeling, we tune the hyperparameters of SSENet and select a well-trained model. We discuss the tuning procedure in Section IV-F. We use the testing set to evaluate the performance of the well-trained SSENet. SSENet outputs the scaled lengths and widths of all 390 testing ships. We rescale the outputs to their original units and calculate the metrics.
The results of SSENet are shown in Fig. 6(a) and (c). The MAE of length is 7.88 m and that of width is 2.23 m. As the pixel spacing is 10 m, the MAEs of the estimated length and width are less than 0.8 pixel spacing. The MAPEs of the estimated length and width are 5.53% and 8.93%, respectively. As shown in Fig. 6(a) and (c), the $R^2$ scores of the estimated length and width are 0.9773 and 0.9093, respectively. These high values demonstrate that the estimated ship length/width is highly consistent with the ground length/width. The $R^2$ score of width is smaller than that of length, showing that width estimation performs worse than length estimation. There are two reasons. First, the ship's width is much smaller than its length, and the ship's signature on the SAR image is more ambiguous in width than in length [8]. This ambiguity causes random errors in the width of the labeled RBB, affecting the accuracy of the initial width obtained from the first step of SSENet. Second, the MSSE loss function drives the model to pay more attention to length than to width during training, so the model fits width less well than length.
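Two of the quantities used above are easy to verify by hand. The sketch below converts an error in metres to pixel spacings and computes an $R^2$ score; it assumes that $R^2$ denotes the usual coefficient of determination, which the text does not define explicitly, and the function names are hypothetical.

```python
def error_in_pixels(error_m, pixel_spacing_m=10.0):
    """Express an error in metres as a multiple of the pixel spacing."""
    return error_m / pixel_spacing_m


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Reported MAEs of 7.88 m (length) and 2.23 m (width) with 10 m pixel spacing.
print(error_in_pixels(7.88), error_in_pixels(2.23))
# A perfect predictor scores 1; always predicting the mean scores 0.
print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]), r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```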
To demonstrate the necessity of the DNN-based regression model, we plot the relationship between the generated RBB's size and the ship's ground size and the relationship between the labeled RBB's size and the ship's ground size, as shown in Fig. 6(b)/(e) and (c)/(f). The generated RBB is automatically detected from the SAR images in the first stage of SSENet. The labeled RBB is manually labeled for each ship based on the principle presented in Section II-B. As shown in Fig. 6   that of width is more than 50 m. The gap between the RBB's size (whether generated or manually labeled) and the ship's ground size is large. By adding the DNN-based regression model, the MAEs are pushed below 8 m, which demonstrates that the DNN-based model, as the second stage of SSENet, is necessary and effective.
In summary, SSENet can extract the ship size from SAR images end to end and keep the MAE under 0.8 pixel spacing.

D. Effectiveness of the Hybrid Input
To test the effectiveness of the hybrid input for the DNN-based regression model, we design an experiment to test the performance of SSENet with different inputs (Table I). SSENet$_1$ uses only the initial ship length and width as the input, without the feature map $F_6$. SSENet$_2$ uses the initial ship length, ship width, and feature map as the input. SSENet$_3$ uses the initial ship length, ship width, feature map, and initial orientation as the hybrid input. The other settings are unchanged.
The results are shown in Table I and Fig. 7. The MAE and MAPE of SSENet$_1$ are the largest among the three models. By adding a feature map, SSENet$_2$ obtains much better results than SSENet$_1$, reducing the MAE of length by nearly 2 m. As shown in Fig. 7(a) and (b), the $R^2$ of SSENet$_2$ is larger than that of SSENet$_1$. The estimated lengths of SSENet$_2$ are more consistent with the true values than those of SSENet$_1$. This fact demonstrates that the input SAR image's feature map is an essential factor for ship size estimation. As stated in Section III-B, the input SAR image contains the ship's signature and the sea clutter under typical environmental conditions. The ship's signature reflects the ship's state, moving or stationary. Environmental conditions such as wind fronts, ocean waves, and rain cells affect the sea-ship interaction. All these factors are related to the ship's size and are encoded into the extracted feature map. Therefore, using the feature map as an input improves the accuracy of size estimation. Finally, by explicitly adding the ship's initial orientation as another input, the model performance is further enhanced: the MAE of length is under 8 m, which is less than 0.8 pixel spacing. Therefore, the proposed hybrid input effectively contributes to an accurate estimation.

E. Effectiveness of MSSE Loss Function
We integrate the ground truths of the ship length and width into the classical MSE to form the MSSE loss function. To test the effectiveness of MSSE, we compare our model with a model that uses MSE as the loss function. The model with MSE as the loss function is named SSENet$_{\mathrm{MSE}}$. SSENet$_{\mathrm{MSSE}}$ is the proposed model that uses MSSE as the loss function. The structure of the two models is the same except for the loss function. The other experimental settings are unchanged.
The results are shown in Table II. For length, the MAE and MAPE of SSENet$_{\mathrm{MSSE}}$ show apparent advantages over those of SSENet$_{\mathrm{MSE}}$. The length MAE of SSENet$_{\mathrm{MSSE}}$ is nearly 1 m less than that of SSENet$_{\mathrm{MSE}}$, an improvement of about 11%. Since MSSE magnifies the loss for large values and drives the model to focus more on length than on width during training, SSENet$_{\mathrm{MSSE}}$ performs slightly worse than SSENet$_{\mathrm{MSE}}$ on width. However, this drawback does not outweigh the advantages of MSSE. First, the difference between the two width MAEs is a negligible few centimeters. Second, length is a more direct indicator of the ship than width, and the MSSE loss improves length estimation. Thus, our MSSE loss is useful, especially for estimating the ship's length.

F. DL Model Tuning Hyperparameters
Since tuning hyperparameters is an essential procedure for the DL model, we detail this process for four key parameters: the number of neurons and the number of layers of the hidden NN layers, and the channel number (N) and the max-pooling size (S) of $F_6$ introduced in Section III-B.

1) Number of Neurons in the Hidden Layers:
For the DNN regression part, the number of neurons in each hidden NN layer influences the model performance. We keep the depth of the DNN at five (three hidden NN layers, one input layer, and one output layer) and set the number of neurons in the hidden NN layers to 64, 128, 256, and 512 in turn. The other settings are unchanged. The experimental result is shown in Fig. 8(a). For both length and width, the MAEs and MAPEs first decrease and then increase as the number of neurons increases, reaching the minimum when the number of neurons is 256. This result is consistent with the characteristics of typical NN models: too few neurons do not fit the data adequately, and too many neurons lead to overfitting. The MAPE of width fluctuates slightly and is minimal at 128 neurons. For most ships, width is much smaller than length, which magnifies the MAPE and easily causes some fluctuations. Therefore, the appropriate number of neurons for the DNN model is 256.
2) Number of Hidden Layers: This section tries to find the best setting for the number of hidden layers in the DNN regression part. We fix the number of neurons in all hidden layers at 256 and vary the number of hidden layers from one to four in steps of one. The MAEs and MAPEs are shown in Fig. 8(b). When there are three hidden layers, the MAE of length, MAPE of length, and MAE of width obtain the minimum values: 7.88 m, 5.53%, and 2.23 m. When the number of layers is fewer than three, the model performs slightly worse. This phenomenon may be due to the insufficient fitting ability of the shallow NN models. Fewer hidden layers mean fewer parameters and fewer nonlinear transformations (activation functions).
For example, the one-hidden-layer model has about 130 000 (256 × 256 × 2) fewer parameters than the three-hidden-layer model, and it applies two fewer nonlinear transformations. Therefore, the fitting ability of the one-hidden-layer model is weaker than that of the three-hidden-layer model, and the same is true for the two-hidden-layer model. When the number of layers is four, the performance of the model is also lower. This lower performance may be due to gradient vanishing in a deeper model. The core idea of DL is to increase the nonlinear transformation capability of a model by employing as many hidden NN layers as possible, that is, by building a deeper model. However, too many hidden NN layers lead to the gradient vanishing problem, which makes the model difficult to train and reduces its performance. Therefore, the hidden layers of a DNN model cannot be increased indefinitely.
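The parameter gap quoted above can be checked with a short count. The following is a minimal sketch, not the SSENet implementation: the input dimension of 78 (3 × 5 × 5 flattened features plus initial length, width, and cos θ) is an assumption made only for illustration, and the function name `mlp_param_count` is hypothetical. The gap itself does not depend on the input dimension.

```python
def mlp_param_count(n_inputs, hidden_sizes, n_outputs, include_bias=True):
    """Count trainable parameters of a fully connected network."""
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out      # weight matrix between two layers
        if include_bias:
            total += fan_out           # one bias per output neuron
    return total


N_INPUTS = 78    # assumption: 3 x 5 x 5 features + length, width, cos(theta)
N_OUTPUTS = 2    # ship length and ship width

one_layer = mlp_param_count(N_INPUTS, [256], N_OUTPUTS, include_bias=False)
three_layers = mlp_param_count(N_INPUTS, [256] * 3, N_OUTPUTS, include_bias=False)

# Two extra 256 x 256 weight matrices separate the two models.
weight_gap = three_layers - one_layer
print(weight_gap)  # 131072, i.e., 256 * 256 * 2
```

Counting biases as well adds 2 × 256 parameters to the gap, which does not change the "about 130 000" figure.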
3) Channel Numbers of F 6 (N): In Section III-B, we compress F 5 in the channel dimension by a CNN layer with N filters of size 1 × 1. This experiment explores the effects of different N values on the performance of the size estimation. We set the max-pooling layer's kernel size, defined as S in Section III-B, to 4. Then, the spatial dimension of F 6 is 5 × 5. The other settings are unchanged. The results are shown in Fig. 8(c). For length estimation, the model performs poorly when N is 1, with an MAE of 9.21 m. For the other values of N, the estimation errors do not differ much, reaching a minimum of 7.88 m when N is 3. For width estimation, different N values show little impact on the MAEs. Therefore, we set the value of N to 3 in our proposed model.

4) Pooling Size of F 6 (S):
In Section III-B, we compress F 5 in the spatial dimension by a max-pooling layer with S × S filters. This experiment explores the effects of different S values on the performance of the size estimation. We fix the channel number (N) at 3 and change S from 1 to 5 in steps of 1. As the spatial size of F 5 is 19 × 19, the spatial dimension of F 6 is 19/S rounded up to the nearest integer. Therefore, the spatial dimension of F 6 corresponding to filters with sizes 1, 2, 3, 4, and 5 is 19 × 19, 10 × 10, 7 × 7, 5 × 5, and 4 × 4, respectively. The results are shown in Fig. 8(d). The MAE of length fluctuates slightly with different values of S and reaches its minimum of 7.88 m when S is 4. The MAE of width changes very little. Therefore, we set the value of S to 4 in our proposed model.
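The listed spatial dimensions can be reproduced with a few lines. This is a minimal sketch: the rounding-up rule is inferred from the sizes listed in the text (for example, 19/2 gives 10 × 10), and the helper names are hypothetical.

```python
import math

F5_SIZE = 19  # spatial size of F5 stated in the text


def pooled_size(input_size, s):
    """Side length after S x S max pooling, rounded up (inferred rule)."""
    return math.ceil(input_size / s)


def flattened_length(n_channels, s, input_size=F5_SIZE):
    """Number of values in F6 once it is flattened for the DNN input."""
    side = pooled_size(input_size, s)
    return n_channels * side * side


sizes = [pooled_size(F5_SIZE, s) for s in range(1, 6)]
print(sizes)                   # [19, 10, 7, 5, 4]
print(flattened_length(3, 4))  # 75 values for N = 3, S = 4
```

With the chosen N = 3 and S = 4, F 6 therefore contributes 75 values, compared with 1083 values if no pooling were applied (S = 1).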
Finally, the appropriate values of the four hyperparameters are shown in Table III.

V. DISCUSSIONS

A. ML Versus DL in Practical Ship Size Estimation?
The accurate ship size regression part in the proposed model is a DNN model. We chose three typical ML models, gradient boosting regression (GBR) [46], support vector regression (SVR) [47], and linear regression (LR) [48], and discuss their performances on our experimental data. GBR and SVR are two typical nonlinear models that are adopted by existing ship size extraction models in references [8] and [7]. LR is a basic linear model used in reference [13]. Because these three ML models cannot be integrated with the SSD to form an end-to-end model, we manually labeled the RBB of each ship to obtain the initial ship length, width, and orientation and fed them into the three ML models. The training and testing sets are unchanged. We tune the parameters of GBR, SVR, and LR and record their results with minimum MAEs. The DNN model is the one integrated into the proposed DL model.
The results are shown in Table IV, with our DL model results in the last row for reference. For length estimation, GBR obtains the minimum MAE, that is, 9.69 m. For width estimation, GBR also performs the best. LR, SVR, and DNN perform worse than GBR.
GBR works the best. GBR is a typical ensemble learning model that achieves better results than a single learner by training and combining multiple learners. GBR has proved to be a good choice in the three-stage ship size estimation procedure [8]. However, since GBR is an ML model based on decision trees, it cannot automatically extract features from input SAR images or perform the traditional three steps as an integrated whole. It is also not possible to integrate GBR with a DL-based ship detection model, such as DRBox-v2. The premise of using GBR is that we accurately extract the RBB of the ship by binarization and image operations, which is a great challenge, especially in the era of big data. Practically, GBR is unable to achieve an end-to-end size extraction, that is, inputting the SAR image and outputting the ship size. Table IV shows that the DNN model alone does not perform well. However, its advantage is that we can integrate a DNN model with any DL-based ship detection model based on CNN or NN to achieve an end-to-end size extraction from the SAR image. Compared with the traditional three-stage method, all trainable parameters in the DL model are globally optimized. The feature maps automatically extracted by the DL detection model can be input into the DNN regression model to further improve the accuracy of ship size extraction. As shown in Table IV, compared with the GBR model, our SSENet reduces the MAE of length by nearly 2 m, that is, by about 18.68%. Therefore, we design a DNN model for accurate ship size regression.
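For reference, the relative reduction quoted above follows directly from the two length MAEs reported in the text (9.69 m for GBR and 7.88 m for SSENet):

```latex
\[
\frac{\mathrm{MAE}_{\mathrm{GBR}} - \mathrm{MAE}_{\mathrm{SSENet}}}{\mathrm{MAE}_{\mathrm{GBR}}}
= \frac{9.69 - 7.88}{9.69}
= \frac{1.81}{9.69}
\approx 18.68\%
\]
```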

B. Sources of the Errors?
This section analyzes the estimation errors in detail and tries to find out what causes large errors. Based on existing studies [8], [14], we mainly analyze the relationship between the estimation error and the ship orientation and travel speed.
1) Ship Orientation: In Fig. 9, we display the trend of errors with respect to the orientation angle. The orientation angles are ground values obtained from AIS. Fig. 9(a) and (c) shows the results of the model without orientation as an input parameter (SSENet 2 in Table I). Fig. 9(b) and (d) shows the results of the model with orientation as an input parameter (SSENet 3 in Table I). For both models, the error varies with the ship orientation. The errors are higher when the orientation angles are closer to 0° (azimuth direction) in the range of (−45°, 45°). This phenomenon is mainly caused by the ship motion and the unequal resolution after image formation (5 m × 20 m for range and azimuth directions, respectively; the lower resolution enlarges the errors in the azimuth direction) [8]. If the ship moves in a direction consistent with the azimuth direction, the speed component in the azimuth direction is large. The large speed component leads to greater smearing of the ship signature in the SAR image, which increases the estimation error.
By comparing Fig. 9(a) and (b) with Fig. 9(c) and (d), we find that when we add the initial orientation angle (cos θ) into the DNN model as an input, although the overall trend does not change, the error in most angle intervals is reduced. This result further verifies the validity of adding the initial orientation angle as an input in the DNN regression model.
2) Ship Speed: Fig. 10 shows the trend of errors with respect to the ship's speed. As the SAR images of the OpenSARShip are mainly from ports, about 83% of the ships are stationary. Fig. 10(a) and (b) shows that the MAEs of length and width in the range of (0, 1) kn (1 kn = 1.852 km/h) are small. The error fluctuates as the ship speed increases. When the speed is greater than 15 kn (27.78 km/h), the MAEs of length and width increase significantly to 19.04 and 4.71 m, respectively, far greater than the errors of the other speed intervals. However, unlike the ship's initial length, width, and orientation, we cannot obtain a reliable ship speed from its signature on the SAR image. One may get the ship speed from its wake features in the SAR image [49]. However, systematically obtaining ship speed from SAR images under different environmental conditions still needs further research.
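The unit conversion and the per-interval averaging used in this kind of analysis can be sketched as follows. This is only an illustration: the speed and error samples are hypothetical values, not data from OpenSARShip, and the function names and bin edges are assumptions.

```python
KMH_PER_KNOT = 1.852  # 1 kn = 1 nautical mile (1852 m) per hour


def knots_to_kmh(knots):
    """Convert a speed in knots to kilometers per hour."""
    return knots * KMH_PER_KNOT


def mae_by_speed_bin(speeds_kn, abs_errors_m, bin_edges_kn):
    """Mean absolute error for each half-open speed interval [lo, hi)."""
    result = {}
    for lo, hi in zip(bin_edges_kn[:-1], bin_edges_kn[1:]):
        errs = [e for v, e in zip(speeds_kn, abs_errors_m) if lo <= v < hi]
        result[(lo, hi)] = sum(errs) / len(errs) if errs else None
    return result


# Hypothetical samples: AIS speed (kn) and absolute length error (m).
speeds = [0.0, 0.4, 0.8, 6.0, 12.0, 16.5, 18.0]
errors = [5.0, 7.0, 6.0, 9.0, 10.0, 18.0, 20.0]
edges = [0, 1, 15, float("inf")]

binned = mae_by_speed_bin(speeds, errors, edges)
print(round(knots_to_kmh(15), 2))  # 27.78
print(binned)
```

Intervals with no samples are reported as `None` rather than zero, so that an empty speed range is not mistaken for an error-free one.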
3) Ship Size: Fig. 11 shows the absolute and relative errors of each predicted size with respect to the ground ship size. The AE of a predicted size is calculated by setting N in MAE to 1. The relative error is the percentage error (PE), which is obtained by setting N in MAPE to 1. As shown in Fig. 11(a) and (c), there are no apparent dependencies between the AE of the predicted size and the ground size. The PE goes down with the increase in the ground size [see Fig. 11(b) and (d)]. This phenomenon is easy to understand because an increase in the ground value leads to an increase in the PE denominator, resulting in a decrease in the PE. Therefore, the ship size is not a source of errors.
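The relationship between AE and PE described above can be illustrated with a short sketch. The MAE and MAPE functions follow their standard definitions, and the ship lengths are hypothetical values chosen so that every prediction has the same absolute error.

```python
def mae(pred, truth):
    """Mean absolute error over N samples (in the unit of the inputs)."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def mape(pred, truth):
    """Mean absolute percentage error over N samples (in percent)."""
    return 100.0 * sum(abs(p - t) / t for p, t in zip(pred, truth)) / len(truth)


# Hypothetical ground lengths (m) and predictions that are each 8 m off.
truth_lengths = [50.0, 100.0, 200.0]
pred_lengths = [58.0, 108.0, 208.0]

# Setting N = 1 turns MAE into AE and MAPE into PE for a single ship.
ae = [mae([p], [t]) for p, t in zip(pred_lengths, truth_lengths)]
pe = [mape([p], [t]) for p, t in zip(pred_lengths, truth_lengths)]

print(ae)  # the same 8 m error for every ship
print(pe)  # roughly 16%, 8%, and 4%: PE falls as the ground size grows
```

With a constant AE, the PE halves each time the ground length doubles, which is the denominator effect described in the text.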

C. Robustness
To explore the robustness of SSENet, we test SSENet by the leave-one-out strategy. There are 1890 samples. We construct   (0.8-pixel), and the mean MAE of four testing sets is 7.96 m. For width, the MAEs of all testing sets are under 2.30 m, and the mean MAE is 2.10 m. Therefore, SSENet shows robustness on different training/testing sets.

VI. CONCLUSION
This study proposes an end-to-end DL model, SSENet, to extract the ship size from SAR images. We employ an SSD-based model, DRBox-v2, to extract high-level features and generate an RBB of the ship from the input SAR image. Based on the extracted features, we design a DNN-based regression model to estimate the accurate ship size. The DNN-based regression model is concatenated with the SSD-based model to form SSENet. We construct a hybrid input for the DNN-based regression model. The hybrid input includes the initial ship size and orientation angle obtained from the RBB and the high-level features extracted from the input SAR image. To characterize the length and width of a ship accurately, we design a custom loss function, MSSE, to optimize the DNN model. SSENet is trained and validated on the OpenSARShip, a data set dedicated to Sentinel-1 ship interpretation. Experiments show that: 1) SSENet is capable of extracting ship size from SAR images end to end and pushes the MAE under 0.8 pixels, with 7.88 m in length and 2.23 m in width; 2) compared with the single input, the hybrid input significantly reduces the estimation errors, by about 2 m in length and 0.4 m in width; 3) the MSSE reduces the MAE of length by nearly 1 m and increases the MAE of width by 0.03 m compared to the classical MSE loss function; 4) compared with the typical traditional ML model GBR, SSENet simplifies the extraction procedure and reduces the MAE of length by nearly 2 m (18.68%) and the MAE of width by 0.06 m (2.51%), showing the advantages of end-to-end learning; and 5) SSENet shows robustness on four different training/testing sets, with a mean MAE of length of 7.96 m and a mean MAE of width of 2.10 m.
As the ship orientation and travel speed are two main factors impacting the accuracy of size estimation, in the future, we will explore how to optimize the model to eliminate the influence of these two factors. Based on the extracted size, we will also attempt to develop a DL model to identify ship types end to end from SAR images. To test the model's adaptability to different SAR sensors, we will try to build more SAR ship data sets and make some quantitative evaluations.