Improved Sea-ice Identification using Semantic Segmentation with Raindrop Removal

Sea-ice identification is an essential process for safety-critical navigation support of surface vessels in polar waters. Semantic segmentation has drawn much attention as an enabling technique for fast detection of objects in a scene, including sea-ice conditions. Identifying sea-ice is a challenging problem, especially in the presence of raindrops. Raindrops alter the boundaries of the objects in the scene and thus degrade the identification performance. In this work, a raindrop removing framework is developed to enhance the classification performance. Three deep-learning semantic segmentation networks are trained to classify the scene of sea-ice images into ice, water, ship, and sky. The deep-learning networks are VGG-16, the fully convolutional network, and the pyramid scene parsing network. Transfer learning along with data augmentation operations has been implemented to improve the training process. Results illustrate that the data augmentation operations enhance the performance of the three models. Moreover, the raindrop removing framework improves the models' performance; e.g., the average intersection over union of the VGG-16 model is improved from 85.91% to 91.70%.


I. INTRODUCTION
Recently, navigation through sea-ice has attracted more research effort as the need for operating surface vessels and offshore operations in polar waters has increased. Unlike land navigation, where road networks are deterministic, the navigable area in polar waters is continuous, and the ice thickness, shape, and concentration change over time.
Navigation under sea-ice conditions requires highly trained and experienced ice navigators to find a safe route through a rapidly changing environment and to avoid hazardous ice conditions, following established polar operation risk assessment standards [1]. Satellite-based synthetic aperture radar systems provide large-scale, high-spatial-resolution views of the sea-ice floes, which are useful for operational planning through polar waters [2]. However, relying on airborne or spaceborne imagery requires a reliable and sustainable communication channel between the aerial scanner and the earth. Moreover, the quality and temporal frequency of aerially captured images degrade significantly in bad weather conditions such as cloud cover. One advantageous alternative is to obtain the navigation information in situ, without the need for external platforms. Onboard systems are considered to gather and process information in real time, which is important for time-critical decisions in maneuvering, control, and monitoring systems [3], [4]. An example of an onboard system is the marine radar, an in-situ sensing device that operates in the X-band. It has the capability of measuring the backscatter from the polar water surface in space and time, independent of lighting conditions and under different weather conditions. However, marine radars have limited spatial resolution, and it is quite difficult to establish the form and type of ice from the radar information alone. Currently, for in-situ navigational risk assessment, the detailed information about ice is captured by ice navigators primarily using visual observations [5].
Computer-aided scene analysis techniques such as automated image processing and segmentation have paved the way for autonomous navigation systems, which reduce costs, processing time, and human bias in the navigation process. For this application, however, a trained human is still considered superior to any onboard system because of the experience required. Moreover, the resurgence of deep neural networks (DNNs) has dramatically improved the performance of many computer-aided scene analysis techniques such as image classification [6], object detection and localization [7], and semantic segmentation [8].
In contrast to image-level prediction, semantic segmentation generates a fine-grained delineation of objects that embeds their spatial information, which makes it a key enabling technique for diverse remote sensing problems [9]. In the context of sea-ice monitoring, DNNs have been utilized in image-based ice detection techniques such as sea-ice classification in synthetic aperture radar (SAR) images [10], [11], ice object classification in optical close-range images [12], lake ice monitoring [13], and river ice classification in images collected by an unmanned aerial vehicle (UAV) [14].

A. RELATED WORK
Various classification and identification problems are of interest in ocean environments, such as underwater target classification [15], maritime target classification in high-resolution imagery [16], classification of coral reef images [17], and sea-ice condition identification and assessment [4]. Research efforts have been devoted to developing techniques for sea-ice classification [2], [3], [18], [19]. In [2], the authors proposed a remote sensing algorithm that utilizes radar images to estimate the ice-drift velocity vector in a region around a moving ship. Two Kalman filters were integrated with radar image processing to estimate the local drift vector from the vessel motion. In [3], the authors developed an algorithm to quantify ice concentration and to estimate ice thickness. The global Otsu method and the K-means method were utilized to implement the ice concentration analysis. In [18], a sea-ice monitoring model using SAR was designed. This model performs SAR segmentation and classification using Markov random-field theory, such that a region-growing technique keeps refining the segmentation iteratively. In [19], an ice navigation system was presented. This system implements ship-based ice awareness by utilizing a combination of radar, lidar, and video processing for ice detection and classification.
NNs have attracted considerable research attention recently as a promising tool to enable automated sea-ice monitoring solutions [10]–[14], [20]. In [10], a NN algorithm was designed to classify sea-ice in SAR images of the central Arctic. The algorithm classified the images into open water and deformed ice based on the third and fourth central statistical moments, inertia, cluster prominence, energy, and homogeneity of image brightness. In [12], ice objects in optical close-range images were classified into several categories using convolutional neural networks (CNNs). In [11], an algorithm was proposed to classify the scene of SAR images into several classes using CNNs. In [14], a dataset of river ice was collected by a UAV and used to train a semantic segmentation deep network, which classifies the scene into ice, water, and an other class. In [13], a lake ice monitoring algorithm based on semantic segmentation was proposed. The authors utilized video streams acquired by a webcam to generate the dataset with the nomenclature classes of water, ice, and clutter. In [20], two datasets were used to train DNNs. The scene in the first dataset captures four classes, namely ice, vessel, ocean, and sky, while the scene in the second dataset captures more ice classes.
In this paper, we focus on studying the effect of raindrops on sea-ice identification and develop a framework to remedy this effect. Moreover, we study the effect of changes in the camera location and mounting angle on the sea-ice identification performance through data augmentation operations such as cropping and rotating the training set. DNNs are utilized to classify images of sea-ice scenes captured onboard a ship using semantic segmentation. Each sea-ice scene image is classified into four classes, namely ice, water, ship, and sky. The initial dataset of this work consists of 428 training and 23 evaluation images, constructed from images taken by the Nathaniel B. Palmer icebreaker during its expedition through the Ross Sea. Data augmentation operations are applied to increase the size of the dataset, such that the total dataset of this work consists of 10,700 training and 575 evaluation images. Moreover, transfer learning is implemented using the Cityscapes and COCO datasets, which are widely used for training semantic segmentation models. Three DNN models are trained, namely VGG-16, the fully convolutional network (FCN-8), and the pyramid scene parsing network (PSPNet-50). The classification performance of the models is measured using the precision and intersection over union (IoU) of the predicted and ground truth images.
The rest of this paper is organized as follows. Section II introduces the adopted DNN models. The data augmentation operations are discussed in Section III. Section IV introduces the raindrop removing framework. Results are discussed in Section V and Section VI concludes the paper.

II. DNN MODELS
In this work, we consider three models, each representing a different DNN architecture, namely a CNN, a fully convolutional network, and an encoder-decoder network. This section introduces the adopted DNN models.

A. VGG-16
VGG-16 is a CNN model proposed by the Visual Geometry Group at the University of Oxford [6]. The VGG-16 model stacks convolution layers and max pooling layers consistently throughout its architecture, ending with three fully connected (FC) layers followed by a soft-max output. The number 16 in VGG-16 refers to its 16 weight layers (13 convolutional and 3 fully connected).

B. FULLY CONVOLUTIONAL NETWORK (FCN-8)
FCN-8 was the first model to train a network end-to-end for semantic segmentation [8]. It owes its name to its architecture, which is built entirely from convolutional layers; because it contains no fully connected layers, FCN-8 can process images of any size. Figure 1 shows the architecture of the FCN-8 model.

FIGURE 1:
The architecture of the FCN model [8].

C. PYRAMID SCENE PARSING NETWORK (PSPNET-50)
The PSPNet-50 model takes into account the global context of the image when performing local-level predictions [21]; hence, it achieves better performance than the FCN model, which classifies pixels without capturing the context of the whole image. Figure 2 shows the structure of the PSPNet model. It starts with an input image and first uses a CNN to obtain the feature map of the last convolutional layer, as in part (b); a pyramid parsing module is then applied to harvest different sub-region representations, followed by upsampling and concatenation layers that form the final feature representation, which carries both local and global context information, as in part (c). Finally, the representation is fed into a convolution layer to obtain the final per-pixel prediction, as in part (d).
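The pooling-and-concatenation idea of the pyramid parsing module can be illustrated with a toy NumPy sketch that average-pools the feature map at several grid resolutions, upsamples each pooled map back to the input size, and concatenates everything along the channel axis. Note this is an illustration only: the real PSPNet applies learned convolutions to each pooled map before concatenation, and the bin sizes below simply follow the (1, 2, 3, 6) pyramid commonly used with PSPNet.

```python
import numpy as np

def pyramid_pooling(feature_map, bin_sizes=(1, 2, 3, 6)):
    """Toy pyramid parsing module: pool the feature map into n x n grids,
    upsample each pooled map back (nearest neighbour), and concatenate
    the results with the original map along the channel axis."""
    h, w, c = feature_map.shape
    outputs = [feature_map]
    for n in bin_sizes:
        pooled = np.zeros((n, n, c))
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                # Average over one spatial bin of the grid.
                pooled[i, j] = feature_map[ys[i]:ys[i + 1],
                                           xs[j]:xs[j + 1]].mean(axis=(0, 1))
        # Nearest-neighbour upsampling back to (h, w).
        rows = (np.arange(h) * n) // h
        cols = (np.arange(w) * n) // w
        outputs.append(pooled[np.ix_(rows, cols)])
    return np.concatenate(outputs, axis=2)
```

The coarsest level (a 1 × 1 bin) injects the global mean of every channel into each pixel's representation, which is exactly the "global context" the PSPNet discussion above refers to.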

III. DATASET ENHANCEMENT USING DATA AUGMENTATION OPERATIONS

A. IMAGES SOURCE
The dataset is constructed from images taken during the Nathaniel B. Palmer expedition through the Ross Sea, Antarctica [20]. The images were captured in the different light conditions encountered during the voyage, ranging from midday sun to gray skies and setting sun. In addition, some images contain rain. The original dataset consists of 428 training images (380 clear-weather images and 48 rainy-weather images) and 23 evaluation images that comprise different weather conditions such as sunny, cloudy, rainy, and clear weather. The scene in each image consists of the ice, water, ship, and sky classes. As the ship moves, the surrounding ice, water, and to some extent the sky change. However, the ship class does not change much because the images are taken from a fixed location on the ship. Consequently, data augmentation operations are performed to increase the data diversity for a more robust model.

B. DATA AUGMENTATION OPERATIONS
Data augmentation encompasses a suite of operations that enlarge the training and evaluation datasets so that better deep-learning models can be built without the need to collect new data [22]. Two main categories of data augmentation operations are considered in this work, namely geometrical data augmentation and image-effect data augmentation. In the following, we briefly describe the data augmentation operations.
• Geometrical operations: These include rotating, flipping, and cropping the images. Each cropping operation is associated with an appropriate scaling operation to maintain the size of each image before and after cropping.
• Adding noise: This augmentation adds a random value to each pixel; the random value is drawn from a zero-mean normal distribution with variance σ², N(0, σ²). In this work, we consider σ² = 10 and σ² = 20.
• Changing lighting condition: This augmentation is implemented by increasing or decreasing the brightness of the images.
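A few of these operations can be sketched in NumPy. This is a minimal illustration only; the exact crop windows, interpolation method, and brightness offsets used in this work are not specified, so the parameters below are placeholders:

```python
import numpy as np

def add_gaussian_noise(image, variance=10.0, rng=None):
    """Adding-noise augmentation: add a value drawn from N(0, sigma^2)
    to every pixel, then clip back to the valid 8-bit range."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image.astype(float) + rng.normal(0.0, np.sqrt(variance), image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def change_brightness(image, delta):
    """Lighting-condition augmentation: brighten (delta > 0) or
    darken (delta < 0) an 8-bit image."""
    return np.clip(image.astype(int) + delta, 0, 255).astype(np.uint8)

def crop_and_rescale(image, top, left, size):
    """Cropping augmentation: cut out a square window, then rescale it
    (nearest neighbour) back to the original dimensions, so the
    augmented image keeps the same size as before cropping."""
    h, w = image.shape[:2]
    patch = image[top:top + size, left:left + size]
    rows = (np.arange(h) * size) // h
    cols = (np.arange(w) * size) // w
    return patch[np.ix_(rows, cols)]
```

The clipping in the first two functions keeps augmented pixels inside the 8-bit range, so the label maps of the original images remain valid for the augmented copies.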
The original training set consists of 428 images; we add 2 × 428 images through the noise augmentation, 3 × 428 images by applying the cropping operation, and 18 × 428 images by applying the rotation, flipping, and lighting-condition operations. Figure 3 illustrates a sample of some of the augmentation operations.

IV. RAINDROP REMOVING FRAMEWORK
Images captured onboard a ship are subject to weather conditions such as rain. Raindrops degrade the image quality and alter the boundaries of the objects in the image, which reduces the performance of the semantic segmentation models. Merely including rainy-weather images in the training dataset does not improve performance. To cope with this issue, removing the raindrop effects is essential to improve the performance of the DNNs. Figure 4 illustrates the raindrop removing framework, which consists of the following operations:
• Raindrop detection: The first step is to test whether there is any raindrop in the image, so that the morphological operations are applied only to images with raindrops. We trained a deep CNN model as a binary classifier. This model consists of a 2D convolutional layer with 16 filters and a 3 × 3 kernel, with the input size equal to our image dimensions, i.e., 713 × 713 × 3. After that, a stack of 5 max pooling layers is added. Finally, the output is flattened and fed into a fully connected layer and then a sigmoid layer for binary classification. A dataset of 10,700 images is utilized to train the binary classifier, which achieves 99.5% accuracy.
• Morphological operations: Images with raindrops undergo morphological operations to reduce the corresponding effects; these operations are image smoothing and object edge detection.
– Image smoothing is the process of retaining the important objects in the image while leaving out fine-scale structures and rapid patterns. In this work, we adopt a bilateral filter, which smooths out the raindrop while preserving the edges of the objects [23], [24]. The bilateral filter takes a weighted sum of the pixels in a local neighborhood; the weights depend on both the spatial distance and the intensity distance. In this way, edges are preserved well while noise is averaged out. Mathematically, at a pixel location x, the output of the bilateral filter is calculated as

Î(x) = (1/C) Σ_{y ∈ N(x)} exp(−‖y − x‖² / (2σ_d²)) exp(−|I(y) − I(x)|² / (2σ_r²)) I(y),

where I and Î are the input and filtered images, respectively, σ_d² and σ_r² are parameters controlling the fall-off of the weights in the spatial and intensity domains, respectively, N(x) is a spatial neighborhood of x, and C is a normalization constant such that

C = Σ_{y ∈ N(x)} exp(−‖y − x‖² / (2σ_d²)) exp(−|I(y) − I(x)|² / (2σ_r²)),

with |·| and ‖·‖ as the first and second norms, respectively. The selection of σ_d² and σ_r² depends on the image intensity and the size of the objects that need to be smoothed out [24], [25]. In this work, we set σ_d² = 45 and σ_r² = 150.
– Object edge detection: Edge detection is a morphological operation for finding the boundaries of objects within an image. An edge detection algorithm identifies points at which the image brightness changes sharply or, more formally, has discontinuities; such points are typically organized into a set of curved line segments termed edges. In this work, we adopt the Canny edge detection algorithm, one of the most efficient edge detection algorithms [26]. It consists of the following steps: (1) Find the intensity gradients of the image; (2) Apply non-maximum suppression to remove spurious responses, i.e., retain only the points where the gradient magnitude assumes a local maximum in the gradient direction; (3) Apply a threshold to determine potential edges, such that pixels with gradient values greater than the threshold are considered edges; we set the threshold to 100, obtained heuristically through a sensitivity analysis.
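A direct, brute-force NumPy rendering of the bilateral filter above for a single-channel image might look as follows. The σ_d² = 45 and σ_r² = 150 defaults follow the values stated in the text, while the neighborhood radius is an assumption, since the paper does not specify the window size:

```python
import numpy as np

def bilateral_filter(image, sigma_d2=45.0, sigma_r2=150.0, radius=3):
    """Brute-force bilateral filter: each output pixel is a normalized
    weighted sum over a spatial neighborhood, with weights decaying in
    both spatial distance and intensity distance."""
    img = image.astype(float)
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            patch = img[y0:y1, x0:x1]
            yy, xx = np.mgrid[y0:y1, x0:x1]
            # Spatial (domain) weight: Gaussian in pixel distance.
            spatial = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_d2))
            # Range weight: Gaussian in intensity distance.
            intensity = np.exp(-((patch - img[y, x]) ** 2) / (2 * sigma_r2))
            weights = spatial * intensity
            # Dividing by weights.sum() is the normalization constant C.
            out[y, x] = (weights * patch).sum() / weights.sum()
    return out
```

Because the range weight collapses for pixels with very different intensities, the filter averages within smooth regions (suppressing raindrop speckle) while leaving sharp intensity steps, i.e. object boundaries, almost untouched.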
Clear boundaries between the objects in an image enable the DNN models to apply semantic segmentation accurately. Raindrops alter objects' boundaries, and the smoothing filter cannot recover these boundaries. However, because the raindrops are smaller than the objects, they do not alter all the boundary pixels of an object; consequently, the boundary pixels affected by the raindrops can be reconstructed from the neighboring boundary pixels. The mode filter is an edge-preserving filter in which the value of the output pixel is the mode over all pixels within the filter's window; that is, each output pixel is assigned the intensity of the most frequently occurring pixel among the input pixels [27]. The locations of the boundary pixels are determined using the edge detector. The input of the mode filter is the intensity of the boundary pixels in the original image, and its output determines the intensity of the boundary pixels in the resulting image. Figure 5 illustrates a sample of the morphological operations.

Table 1 lists, for each deep-learning network, the number of training epochs, the estimated time of arrival (ETA), which in the context of Keras is the estimated time before the model finishes one epoch, the time of each training step, and the testing time per image. The number of training epochs of each network is selected by evaluating the performance after each iteration, to determine the optimal number of epochs and to avoid over-fitting. It is noticed that the PSPNet-50 performs more epochs. It is worth mentioning that the training is an off-line process and can be performed on powerful computers.

A. COMPLEXITY ANALYSIS
The proposed raindrop removing framework consists of three morphological operations: (1) a bilateral image smoothing filter, which has a complexity on the order of O(N R), where N is the number of pixels and R is the extent of the intensity scale [28, Chapter 4.5]; (2) Canny edge detection, which has a complexity on the order of O(N² log N); and (3) the mode filter, which has a complexity on the order of O(N² log N). Consequently, the computational complexity of the proposed framework is O(N R + N² log N). It is worth mentioning that the execution time of the proposed raindrop removing framework is 55 ms per image, and it is applied only to the images with raindrops.
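The mode-filter repair of boundary pixels described above can be sketched as follows. This is a simplified version: restricting the filter to an explicit edge mask follows the text, while the window radius is an assumption, as the paper does not state it:

```python
import numpy as np
from collections import Counter

def mode_filter(image, edge_mask, radius=2):
    """Edge-preserving mode filter: replace each boundary pixel (as
    located by the edge detector) with the most frequently occurring
    intensity in its local window."""
    result = image.copy()
    h, w = image.shape
    for y, x in zip(*np.nonzero(edge_mask)):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        window = image[y0:y1, x0:x1].ravel()
        # The mode of the window reconstructs the boundary intensity
        # from the unaffected neighboring boundary pixels.
        result[y, x] = Counter(window.tolist()).most_common(1)[0][0]
    return result
```

Because only pixels flagged by the edge detector are touched, the interior of each region, already handled by the bilateral filter, passes through unchanged.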

V. RESULTS
In this section, we evaluate the three DNNs using the original and augmented datasets, with and without the raindrop removal.

A. TRAINING SETTINGS
The three DNN models are trained using the original dataset (428 images) and the augmented dataset (10,700 images); the size of each image is 713 × 713 pixels. Transfer learning has been implemented to train the models with both the original and augmented datasets. To train the DNN models with the original dataset, models pre-trained on the Cityscapes and COCO datasets have been used as starting points. The starting points for training the DNN models with the augmented dataset are the resulting models of the original dataset. The training has been performed on a Lenovo ThinkStation P920 server running Ubuntu 18.04 LTS Linux; the central processing unit is a 24-core Intel Xeon, and 64 GB of RAM is used. The workstation has a GeForce RTX 2080 Ti graphics card with 11 GB of memory.
The classes in each image represent the following:
• Ice: ice visible in the image, including ice pans;
• Water: open ocean water that appears in the image;
• Ship: sections of the ship that appear in the image;
• Sky: visible sky in the image.

B. EVALUATION METRICS
We evaluate the performance of the deep-learning networks using the intersection over union (IoU), precision, recall, and F1-score [29]. IoU is a similarity coefficient representing the ratio of the overlapping area of the ground truth and the predicted area to their total area, and can be expressed as

IoU = |T ∩ P| / |T ∪ P|,

where T and P represent the ground truth image and the image produced by the deep-learning model, respectively. The precision and recall are defined as

precision = P_T / (P_T + P_F)  and  recall = P_T / (P_T + N_F),

respectively, where P_T is the number of true positive pixels, P_F is the number of false positive pixels, and N_F is the number of false negative pixels. Finally, the F1-score is defined as

F1 = 2 · precision · recall / (precision + recall).

C. PERFORMANCE OF THE MODELS WITH THE ORIGINAL DATASET
Figure 6 illustrates a sample image and the results of the VGG-16, FCN-8, and PSPNet-50 models. It is clear that the models are able to classify the classes in the image. For detailed insight, Table 2 summarizes the performance of the three models evaluated on the image in Figure 6. The best performance is achieved by PSPNet-50, and the FCN-8 model outperforms the VGG-16 model. This is in line with the results in [8], where FCN-8 outperforms VGG-16, as the former was designed by upgrading the latter with fully convolutional layers and transferring its learned representations through fine-tuning. The PSPNet architecture achieves the best performance on both the original and the augmented datasets because it takes into account the global context of the image. It is worth mentioning that the values in Table 2 are averages over the four classes ice, water, ship, and sky. Figure 7 illustrates some sample images and the corresponding results based on training the models with the original dataset.
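The four metrics can be computed per class directly from the ground-truth and predicted label maps; a small NumPy sketch (the integer class encoding is an assumption for illustration):

```python
import numpy as np

def segmentation_metrics(truth, pred, cls):
    """Per-class IoU, precision, recall, and F1 from ground-truth and
    predicted label maps (integer class codes)."""
    t = (truth == cls)
    p = (pred == cls)
    tp = np.logical_and(t, p).sum()    # true positive pixels  (P_T)
    fp = np.logical_and(~t, p).sum()   # false positive pixels (P_F)
    fn = np.logical_and(t, ~p).sum()   # false negative pixels (N_F)
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, precision, recall, f1
```

Averaging the per-class values over ice, water, ship, and sky gives the averaged figures reported in the tables.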
Figure 8 summarizes the performance of the three models with the original dataset. The performance of the three models on the water and ice classes is lower than on the other two classes. This is attributed to the fact that the locations of the sky and ship do not change across the images of the original dataset; consequently, these fixed locations enable the models to identify the two classes more accurately. On the other hand, the location, shape, and size of the water and ice classes change throughout the original dataset, which reduces the models' capability to classify them. Motivated by this observation, geometrical data augmentation operations are applied to increase the diversity of the dataset. Figure 9 illustrates an image under different augmentation operations, in which image 1 is the original image and images 2, 3, 4, and 5 represent rotation by 45°, rotation by 150°, darkening, and cropping, respectively. It is noticed that the models can classify the ice, sea, ship, and sky classes.

D. PERFORMANCE OF THE MODELS WITH DATA AUGMENTATION
To study the effect of each augmentation operation, Table 3 lists the average IoU of the three DNN models using the original dataset, the dataset after applying each augmentation operation separately, and the entire augmented dataset. Each operation improves the performance of the three DNN models, with the adding-noise operation providing only a minor improvement. The performance of the three models improves further when all augmentation operations are performed together. For deeper insight into the models' performance with the augmented dataset, Figure 10 summarizes the performance of the three models. Data augmentation remarkably improves the performance of the PSPNet-50 model, which takes into account the global context of the image, whereas the other models perform pixel-by-pixel classification. It is worth noting that with the augmented dataset, the classification performance of the water and ice classes approaches that of the ship and sky classes. This result indicates that the augmented dataset enables the models to gain spatial diversity of the classes.

E. RAINDROP EFFECT
In Figure 7, it is noticed that the three models are incapable of predicting the classes properly in image 4, which contains raindrops. It is clear that such images degrade the performance of the three models dramatically. This motivates the application of a raindrop removing model. Figure 11 shows a sample image with raindrops before and after applying the removal process using the proposed framework. It is noticed that the performance of the models is improved with this framework. To gain deeper insight into the effect of the raindrops, Figure 12 and Figure 13 illustrate the average IoU and precision of the three models when evaluated using images of good and rainy weather conditions (without raindrop removal), respectively. It is clear that rainy-weather images significantly degrade the performance of the three models. Table 4 illustrates the performance of the three models before and after removing the raindrops using the proposed framework, the ℓ0-gradient minimization approach proposed in [30], and the conditional generative adversarial network developed in [31]. Removing the raindrops enhances the performance of the three models. Furthermore, the proposed framework outperforms the approaches in [30] and [31]. The three approaches are also compared in terms of the peak signal-to-noise ratio (PSNR), which is expressed as

PSNR = 10 log₁₀(M² / MSE),

where M is the maximum pixel value and MSE is the mean square error between the input and resulting images. The average PSNR of the proposed framework, the approach proposed in [30], and the model developed in [31] is 31.91 dB, 31.02 dB, and 30.30 dB, respectively.
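For reference, the PSNR formula above is straightforward to compute; taking M = 255 for 8-bit images is an assumption consistent with common practice:

```python
import numpy as np

def psnr(original, processed, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(M^2 / MSE)."""
    mse = np.mean((original.astype(float) - processed.astype(float)) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Higher PSNR indicates the raindrop-removed image stays closer to the input, so the values reported above favor the proposed framework.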

VI. CONCLUSION
In this work, three deep-learning semantic segmentation networks (VGG-16, FCN-8, and PSPNet-50) were applied to identify sea-ice in scenes of ice, sky, water, and ship. The performance of the models has been evaluated using the IoU and precision metrics. Data augmentation operations have been implemented to increase the diversity of the dataset, and a raindrop removing framework has been applied to improve the performance of the models under rainy weather conditions. Results showed that the data augmentation operations enhance the performance of the three models.