MeltPondNet: A Swin Transformer U-Net for Detection of Melt Ponds on Arctic Sea Ice

High-resolution aerial photographs of the Arctic region are a valuable source for recognizing sea ice features, which are crucial for validating, tuning, and improving climate models. Melt ponds on the surface of melting Arctic sea ice are of particular interest because they are sensitive indicators of, and a proxy for, processes in the Arctic climate system. Manual analysis of these remote sensing data is extremely difficult and time-consuming because of the complex shapes and unpredictable boundaries of the melt ponds, which makes automating the process necessary. In this study, we propose a robust and efficient automatic method for melt pond region segmentation and boundary extraction from high-resolution aerial photographs. The proposed algorithm is based on a swin transformer U-Net in which we introduce novel cross-channel attention mechanisms into the decoder design. The framework operates on optical data and classifies imagery into four classes, i.e., sea ice/snow, open water, melt pond, and submerged ice. We use aerial photographs collected during the Healy–Oden Trans Arctic Expedition over Arctic sea ice in the summer of 2005 as test data. The experimental results show that the proposed method is suitable for precise automatic extraction of melt pond geometry and that it can be extended to other optical data sources that involve melt ponds. The approach has promising potential for analyzing changes in melt ponds between years.


Ivan Sudakow, Vijayan K. Asari, Senior Member, IEEE, Ruixu Liu, Member, IEEE, and Denis Demchev, Member, IEEE
a significant effect on atmospheric circulation patterns. Sea ice loss in the Arctic can be tied to the rapid warming trends observed recently in the Arctic, primarily due to the ice-albedo feedback [2]. To define the ice-albedo feedback, we must first define melt ponds. Melt ponds form atop Arctic sea ice during the spring/summer melt season from the melting of the snow layer that builds up over the winter months. Freshwater runoff from this snowmelt begins to percolate through the porous microstructure of the ice, reducing the salinity of the brine in the pore space, which causes it to freeze and block further drainage [3]. As the ice continues to warm, it becomes increasingly permeable, eventually allowing the ponds to drain into the ocean below.
The ice-albedo feedback is the notion that, as the ice begins to melt, ponds form atop the surface and lower its albedo, which encourages more melt; that melt lowers the albedo further, and so on. This effect is important to capture in any sea ice model used for climatology, yet to date it is not well parameterized in climate-scale sea ice models [4].
Not only do melt ponds have a significant effect on the energy budget of the Arctic, but they also affect satellite-derived sea ice observations, in particular sea ice concentration (SIC) from passive microwave radiometry [5], whose resolution is as coarse as 14-25 km, too coarse to resolve melt ponds. The high contrast between the microwave emissivities of sea ice and open water is used to derive ice and water fractions with tuned linear mixing models, which take as inputs satellite radiances at a variety of frequencies and polarizations. A major challenge in locating melt ponds is that they have the same microwave signature as open water. In this way, they obscure the ice beneath them, making it appear as though there is less ice than is actually present. To combat this, data-derived "tie points" are used depending on the season. However, in cases where melt ponds are absent or not sufficiently abundant, this can result in artificially inflated values of SIC. Errors can be as much as 30% [5]. The passive microwave record goes back almost 30 years and can still produce a daily snapshot of the ice cover, giving a long record of the ice pack at high temporal resolution. Improving concentration retrievals during the summer months is crucial for climate statistics, model evaluation, and data assimilation with state-of-the-art models.
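To make the pond-induced bias concrete, the following is a deliberately simplified worked example, not the retrieval of [5]: it assumes a single channel and only two surface types, whereas operational algorithms combine several frequencies and polarizations. The symbols $T_B$ (observed brightness temperature), $T_{\mathrm{ice}}$ and $T_{\mathrm{water}}$ (tie points), $C$ (true ice concentration), and $p$ (ponded fraction of the ice surface) are introduced here only for illustration.

```latex
% Two-surface linear mixing model (illustrative assumption, single channel).
\begin{align}
T_B &= C\,T_{\mathrm{ice}} + (1 - C)\,T_{\mathrm{water}}, \\
C   &= \frac{T_B - T_{\mathrm{water}}}{T_{\mathrm{ice}} - T_{\mathrm{water}}}.
\end{align}
% If a fraction p of the ice surface is ponded and radiates like open water,
% the same inversion returns a reduced concentration:
\begin{align}
T_B = C(1 - p)\,T_{\mathrm{ice}} + \bigl[1 - C(1 - p)\bigr]\,T_{\mathrm{water}}
\quad\Longrightarrow\quad
C_{\mathrm{retrieved}} = C\,(1 - p).
\end{align}
```

Under these assumptions, a true concentration of $C = 0.9$ with $p = 0.3$ would be retrieved as $0.63$. Seasonal tie points are meant to compensate for this bias, which is why they can overcorrect when few ponds are present.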
Melt ponds have long been detected in imagery from many sources, including, but not limited to, TerraSAR-X dual-polarization data and airborne SAR images [6], [7], MODIS images [8], [9], the SHEBA and Healy-Oden Trans Arctic Expedition (HOTRAX) aerial photographs [10], seasonal sea ice monitoring and modeling site (SIMMS) field experiment photographs [11], and ENVISAT WSM images with HH-polarization [12].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
A comprehensive analysis of the literature (see Appendices A and B) shows that researchers prefer classical image processing methods over machine learning approaches. There may be several reasons for this, but the main one is that a method is usually chosen based on the "physics" of the problem being solved. However, we need more universal methods of image analysis that rely not on a set of particular physical parameters but on the general principles of the developing system (geometry, complexity, physics, etc.). The main goal of this study is to develop a machine learning method for robust melt pond detection that is based on general strategies of deep learning.
CNN-based segmentation methods, such as the FCN [13], provide superior performance for natural image segmentation. The state-of-the-art models for image segmentation are variants of the encoder-decoder architecture, such as U-Net [14]. U-Net++ [15] is essentially a deeply supervised encoder-decoder network in which the encoder and decoder subnetworks are connected through a series of nested, dense skip pathways. The transformer [16] is a network architecture originally developed for natural language processing (NLP). Inspired by the success of self-attention layers and transformer architectures in the NLP field, some works employ self-attention layers to replace some or all of the spatial convolution layers in the popular ResNet [17]. The vision transformer (ViT) [18] directly applies a transformer architecture to nonoverlapping medium-sized image patches for image classification. The swin transformer [19] modifies the ViT architecture to achieve the best speed-accuracy tradeoff among these methods on image classification. With its hierarchical feature maps, the swin transformer model can conveniently leverage advanced techniques for dense prediction, such as feature pyramid networks (FPN) or U-Net.
We therefore adopt the swin transformer as our main backbone and integrate it into the U-Net architecture with cross-channel attention. We name the resulting model the melt pond network (MeltPondNet) and use it for melt pond detection on Arctic sea ice.

II. DATA
In this study, we use aerial photographs of Arctic sea ice captured from a helicopter during HOTRAX between 5 August and 30 September 2005 [20]. The flights were typically flown at relatively low altitudes of 150-700 m, to avoid the influence of low clouds, with a Nikon D70 digital camera onboard. A total of 1013 individual scenes over Arctic sea ice were selected for the analysis; they contain highly detailed imagery of individual ice floes, melt ponds, submerged ice, and open water areas. The average photograph size was 3042 × 2048 pixels. Depending on the altitude, the ground resolution ranges from 5 to 25 cm per pixel. By visual expert analysis of the photographs, we defined zones with four classes of surface: 1) sea ice/snow; 2) melt ponds; 3) submerged ice; and 4) open water.

A. Dataset Annotation
Many studies have focused on speeding up image dataset annotation for semantic segmentation tasks. One example is a crowdsourcing method for annotating images for visual object detection [21]. The algorithm involves three steps: 1) drawing; 2) quality verification; and 3) coverage verification. In the drawing step, a worker draws one bounding box around one instance in the given image; in the quality verification step, a second worker verifies whether the bounding box is correctly drawn; and in the coverage verification step, a third worker verifies whether all object instances have bounding boxes.
Our cascaded annotation framework uses an incremental learning approach on a small batch of manually labeled images [22]. It then trains a segmentation model with the labeled data to propose segmentation areas on a batch of unlabeled images. Finally, it asks the annotator to correct any incorrect polygons or label proposals. Thus, human annotators are involved only in the correction stage [23]. Fig. 1 shows the conceptual diagram of our proposed method for cascaded semisupervised semantic segmentation annotation (CSSA). The segmentation encoder and decoder are first trained on the dataset with unsupervised learning by reconstructing the input image [24]. Then, we change the last layer of the decoder so that the network can be fine-tuned to the classes in the dataset. The segmentation model is trained on a small set of manually annotated images. First, a trained model (i.e., model A) predicts pseudo labels on all unlabeled data. Next, model B is trained on the combined labeled and pseudo-labeled data to annotate the unlabeled data. After the first round of training, inference, and correction, the segmentation encoder and decoder are trained on the most recently labeled batch. This process continues in a loop until all unlabeled batches are labeled.
The functional framework of the CSSA method first uses unsupervised learning to obtain a feature encoder. Then, the CSSA model uses an incremental learning approach on a small batch of manually labeled images [25]. After that, we train a segmentation model with the labeled data to propose segmentation regions on a batch of unlabeled images and ask the annotator to correct any incorrect polygons or label proposals. In this process, human annotators are involved only in the correction stage. Hence, our method reduces the tedious work of manual annotation. Algorithm 1 summarizes all the relevant steps of the proposed iterative training method.
The first step in the CSSA procedure is unsupervised training on the whole dataset to obtain a suitable encoder. The second step is to fully annotate, by hand, an initial batch of images from the unlabeled dataset. This stage requires human involvement to draw polygons and provide class labels on images. In this stage, we use a basic segmentation annotation tool (i.e., Labelme), with no extra speedup procedures, to create mask labels. The third step is to train segmentation model A (supervised training) with the fully annotated data (i.e., L). Although any segmentation network can be used for this purpose, we focus on a recent deep learning-based semantic segmentation model, U-Net [14], [26]. The fourth step is to train model B on the human-annotated initial labeled data together with the pseudo-labeled data and then to relabel the pseudo-labeled data. At this point, the system outputs the human-annotated labeled data. Finally, the semisupervised model suggests labels for the unlabeled data (as predicted by network B). Before the cascaded network outputs the fully annotated data, the human annotator needs to correct the mask polygons suggested by model B. The annotated data are shown in Fig. 2.
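The control flow of the annotation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the names `cssa_annotate`, `train`, and `correct` are hypothetical, images and masks are represented by plain strings, and the unsupervised pretraining step is assumed to happen inside whatever `train` does.

```python
from typing import Callable, List, Sequence, Tuple

Image = str   # stand-in for an image (e.g., a file name); an assumption
Mask = str    # stand-in for a segmentation mask; an assumption
Pair = Tuple[Image, Mask]
Model = Callable[[Image], Mask]


def cssa_annotate(
    initial_labeled: Sequence[Pair],
    unlabeled_batches: Sequence[Sequence[Image]],
    train: Callable[[List[Pair]], Model],
    correct: Callable[[Image, Mask], Mask],
) -> List[Pair]:
    """Grow a labeled set batch by batch with model proposals and human fixes."""
    labeled = list(initial_labeled)
    for batch in unlabeled_batches:
        # Model A: supervised training on the human-verified labels only.
        model_a = train(labeled)
        pseudo = [(img, model_a(img)) for img in batch]
        # Model B: trained on labeled plus pseudo-labeled data, then used
        # to relabel the batch.
        model_b = train(labeled + pseudo)
        proposals = [(img, model_b(img)) for img in batch]
        # The human annotator only corrects the proposed masks.
        labeled.extend((img, correct(img, mask)) for img, mask in proposals)
    return labeled


# Toy stand-ins so the loop can be exercised without a real network.
TRUTH = {"a.png": "ice", "b.png": "pond", "c.png": "water", "d.png": "pond"}


def toy_train(data: List[Pair]) -> Model:
    """Toy 'model' that always predicts the first mask it was trained on."""
    first_mask = data[0][1]
    return lambda img: first_mask


def toy_correct(img: Image, proposed: Mask) -> Mask:
    """Toy 'annotator' that replaces every proposal with the true mask."""
    return TRUTH[img]
```

In this sketch the annotator's effort is confined to `correct`, which mirrors the statement that human involvement is limited to the correction stage.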
Consider an image $x \in \mathbb{R}^{H \times W \times 3}$ with a spatial resolution of H × W and three channels (RGB). The network is to predict the corresponding pixel-wise target map of size H × W. The conventional approach is to directly train a U-Net, which first encodes images into high-level feature representations and then decodes them back to the full spatial resolution. Unlike existing approaches, our method introduces self-attention mechanisms into the encoder design [27] through the use of transformers [19]. We first describe how to directly apply a transformer to encode feature representations from decomposed image patches [28], [29]. The overall architecture is shown in detail in Fig. 3.
Transformers split an image into nonoverlapping patches with a patch partition module [18], [30]. Each patch is treated as a "token," and its feature is represented as a vector, so that the image becomes a sequence of vectors. The self-attention mechanism in transformers projects each feature $X$ into corresponding query, key, and value vectors using learned linear transformations $W_Q$, $W_K$, and $W_V$. Thus, the projection of the whole sequence generates the representations $Q$, $K$, and $V$, and, following [19], the self-attention is formulated as
$$Q = XW_Q,\quad K = XW_K,\quad V = XW_V,$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V$$
where $d$ is the query or key dimension and the values in $B$ are taken from a smaller sized bias matrix.
The basic unit of MeltPondNet is the swin transformer block [19]. We use it in place of the traditional convolution layer in the U-Net module. The number of swin transformer layers is always a multiple of two: 1) one is for window multihead self-attention (W-MSA) and 2) the other is for shifted-window multihead self-attention (SW-MSA). Following [19], two consecutive blocks compute
$$\hat{z}^{l} = \text{W-MSA}\bigl(\mathrm{LN}(z^{l-1})\bigr) + z^{l-1},$$
$$z^{l} = \mathrm{MLP}\bigl(\mathrm{LN}(\hat{z}^{l})\bigr) + \hat{z}^{l},$$
$$\hat{z}^{l+1} = \text{SW-MSA}\bigl(\mathrm{LN}(z^{l})\bigr) + z^{l},$$
$$z^{l+1} = \mathrm{MLP}\bigl(\mathrm{LN}(\hat{z}^{l+1})\bigr) + \hat{z}^{l+1}$$
where $\hat{z}^{l}$ and $z^{l}$ are the outputs of the (S)W-MSA module and of the MLP in block $l$, LN(·) denotes layer normalization, and MLP is a multilayer perceptron that has two fully connected layers with a Gaussian error linear unit (GELU) activation function.
MeltPondNet converts the inputs into sequence embeddings for the encoder: the geospatial images are split into nonoverlapping patches with a patch size of 4 × 4. A linear embedding layer then projects the feature dimension to an arbitrary dimension (we used 96 in this study). The patch-merging layer is the same as in the original swin transformer structure. The input patches are divided into four parts and concatenated together by the patch-merging layer. With such processing, the feature resolution is downsampled by a factor of two. Because the concatenation increases the feature dimension by a factor of four, a linear layer is applied to the concatenated features to reduce the feature dimension to twice the original dimension.
Corresponding to the encoder, the symmetric decoder is built based on the swin transformer block.
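As a rough check of the encoder bookkeeping described above, the sketch below tracks the token-grid shape through patch embedding and successive patch-merging layers, and shows the 2 × 2 concatenation that quadruples the feature dimension before the linear reduction. The function names and the choice of three merging stages are assumptions made for illustration; the text specifies only the 4 × 4 patch size and the embedding dimension of 96. Feature vectors are plain Python lists, and the concatenation order is one possible choice.

```python
from typing import List, Tuple


def encoder_stage_shapes(
    height: int,
    width: int,
    patch: int = 4,
    embed_dim: int = 96,
    num_merges: int = 3,  # assumed number of patch-merging layers
) -> List[Tuple[int, int, int]]:
    """Return (rows, cols, channels) of the token grid after each stage."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    h, w, c = height // patch, width // patch, embed_dim
    shapes = [(h, w, c)]
    for _ in range(num_merges):
        if h % 2 or w % 2:
            raise ValueError("token grid must be even to merge 2x2 neighbours")
        # Concatenating a 2x2 neighbourhood gives 4*c channels; the linear
        # layer that follows reduces this to 2*c.
        h, w, c = h // 2, w // 2, 2 * c
        shapes.append((h, w, c))
    return shapes


def patch_merge_concat(grid: List[List[List[float]]]) -> List[List[List[float]]]:
    """Concatenate each 2x2 block of feature vectors (C -> 4C, no linear layer)."""
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid must have even height and width")
    return [
        [
            grid[i][j] + grid[i + 1][j] + grid[i][j + 1] + grid[i + 1][j + 1]
            for j in range(0, cols, 2)
        ]
        for i in range(0, rows, 2)
    ]
```

For a 640 × 640 input, `encoder_stage_shapes(640, 640)` starts from a 160 × 160 grid of 96-dimensional tokens, and each merge halves the grid side while doubling the channel count.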
To this end, in contrast to the patch-merging layer used in the encoder, we use a patch-expanding layer in the decoder to upsample the extracted deep features. The patch-expanding layer reshapes the feature maps of adjacent dimensions into a higher resolution feature map (2× upsampling) and accordingly reduces the feature dimension to half of the original dimension.
The cross-channel attention module consists of three parallel branches, each applying: 1) a dilated convolution; 2) batch normalization; and 3) Mish activation. It selects the important channels using different effective kernel sizes, which are implemented through different dilation rates. We use depthwise separable convolutions in place of standard convolutions to save parameters and speed up processing. The dilated convolutions in the three parallel branches have the same kernel size but different dilation rates. Specifically, the kernel of each dilated convolution is 3 × 3, and the dilation rates d are 1, 2, and 3 for the different branches. Dilated convolutions support exponentially expanding receptive fields without losing resolution or coverage. Unlike in a standard convolution, where all kernel elements are adjacent, the kernel elements of a dilated convolution are spaced apart, and the spacing depends on the dilation rate. In terms of receptive field, dilated 3 × 3 convolutions with rates 1, 2, and 3 are approximately equivalent to standard convolutions with kernel sizes 3 × 3, 5 × 5, and 7 × 7, respectively.
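The correspondence between dilation rates and kernel sizes, and the parameter saving from depthwise separable convolutions, can be verified with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, biases are ignored, and the channel count of 96 used in the usage note is an assumption, since the text does not state the channel width of the cross-channel attention module.

```python
from typing import List


def effective_kernel_size(kernel: int, dilation: int) -> int:
    """Span covered along one axis by a dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def dilated_taps(kernel: int, dilation: int) -> List[int]:
    """Offsets sampled along one axis, relative to the kernel centre."""
    half = kernel // 2
    return [dilation * i for i in range(-half, half + 1)]


def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard 2-D convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise
```

For 3 × 3 kernels, the rates 1, 2, and 3 give spans of 3, 5, and 7 pixels, while the number of weights per kernel stays at nine. With an assumed 96 input and 96 output channels, a standard 3 × 3 convolution has 82 944 weights, whereas the depthwise separable version has 10 080.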

IV. RESULTS AND DISCUSSION
The deep learning architectures U-Net [14], U-Net++ [15], transformer U-Net [31], and the proposed MeltPondNet are implemented in Python 3.8 and PyTorch 1.9. For all training cases, data augmentations, such as flips and rotations, are used to increase data diversity. We train our model on four NVIDIA TITAN RTX GPUs, each with 24 GB of memory. Weights pretrained on ImageNet are used to initialize the model parameters. During training, the batch size is 8, and the SGD optimizer with momentum 0.9 and weight decay 1e-4 is used to optimize the model through error backpropagation.
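For segmentation, a flip or rotation must be applied identically to the image and to its label mask so that the two stay aligned. The sketch below illustrates this with nested Python lists standing in for image arrays; the function names, the restriction to 90-degree rotations and horizontal flips, and the sampling probabilities are assumptions, since the text names only "flips and rotations."

```python
import random
from typing import List, Tuple

Grid = List[List[int]]  # stand-in for a 2-D image or mask; an assumption


def rot90(grid: Grid) -> Grid:
    """Rotate a 2-D grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def hflip(grid: Grid) -> Grid:
    """Mirror a 2-D grid left to right."""
    return [row[::-1] for row in grid]


def augment_pair(image: Grid, mask: Grid, rng: random.Random) -> Tuple[Grid, Grid]:
    """Apply the same random rotation and flip to an image and its mask."""
    quarter_turns = rng.randrange(4)
    flip = rng.random() < 0.5
    for _ in range(quarter_turns):
        image, mask = rot90(image), rot90(mask)
    if flip:
        image, mask = hflip(image), hflip(mask)
    return image, mask
```

Drawing the random parameters once and reusing them for both inputs is the design point here; sampling them separately for the image and the mask would silently corrupt the labels.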

1) Model Performance Evaluation:
We use two labeling strategies: a) three classes and b) four classes. The three classes are a) snow; b) pond; and c) open water; the fourth class is submerged ice. In Table I, we compare the performance of U-Net, U-Net++, transformer U-Net, and MeltPondNet. MeltPondNet shows the best overall performance for both labeling strategies; the details for each class are given in Table II. As quality metrics, we use the Dice similarity coefficient (DSC), which combines the advantages of precision and recall, and the mean intersection over union (mIoU), that is, the mean over the classes of the IoU (a number from 0 to 1 that specifies the amount of overlap between the predicted and ground truth regions). The F1 score is the harmonic mean of precision and recall.
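The metrics above can be computed per class from pixel counts of true positives (TP), false positives (FP), and false negatives (FN): DSC = 2TP / (2TP + FP + FN) and IoU = TP / (TP + FP + FN). The sketch below is a minimal illustration on flattened label lists; the function names are hypothetical, and skipping classes that are absent from both prediction and ground truth is an assumed convention, not one stated in the text.

```python
from typing import Iterable, Sequence, Tuple


def class_counts(pred: Sequence[int], truth: Sequence[int], cls: int) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) pixel counts for one class."""
    tp = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    fp = sum(1 for p, t in zip(pred, truth) if p == cls and t != cls)
    fn = sum(1 for p, t in zip(pred, truth) if p != cls and t == cls)
    return tp, fp, fn


def dice(tp: int, fp: int, fn: int) -> float:
    """Dice similarity coefficient, identical to the per-class F1 score."""
    return 2 * tp / (2 * tp + fp + fn)


def iou(tp: int, fp: int, fn: int) -> float:
    """Intersection over union of predicted and ground truth regions."""
    return tp / (tp + fp + fn)


def mean_iou(pred: Sequence[int], truth: Sequence[int], classes: Iterable[int]) -> float:
    """Average IoU over the classes that occur in pred or truth."""
    scores = []
    for cls in classes:
        tp, fp, fn = class_counts(pred, truth, cls)
        if tp + fp + fn == 0:
            continue  # class absent everywhere: skipped (assumed convention)
        scores.append(iou(tp, fp, fn))
    return sum(scores) / len(scores)
```

Because DSC = 2·IoU / (1 + IoU) for any single class, the two metrics rank per-class results identically, although their averages over classes can differ.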
We can conclude from the obtained results that using three classes provides a more robust classification for all models. This is probably because submerged ice is difficult for a model to detect: its more complex and varying signature leads to ambiguities with other classes. The per-class quantitative results are shown in Figs. 4 and 5.
2) Robustness to Different Resolutions: Since MeltPondNet has a fixed input image resolution, we preprocess the input images by partitioning them into many overlapping patches of 640 × 640 pixels. The segmentation forward pass is then applied independently to each overlapping patch. Finally, the overlapping predictions are merged back into the original size by weighted averaging. With this patch-slicing method, the input image can be of any size. Fig. 6 shows an example of a high-resolution image, which demonstrates the benefit of high resolution in reducing confusion caused by mixed pixels. The legend is shown to the right of the predicted results.
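The slicing and merging steps can be sketched along a single image axis; in two dimensions the same logic applies to rows and columns independently. The function names, the stride of 480 pixels, and the uniform default weights are assumptions made for illustration, since the text specifies only the 640 × 640 patch size and a weighted average.

```python
from typing import List, Optional, Sequence


def tile_starts(length: int, patch: int = 640, stride: int = 480) -> List[int]:
    """Start offsets of overlapping patches that cover one image axis."""
    if length <= patch:
        return [0]  # a shorter axis would need padding, not handled here
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)  # last patch is aligned with the border
    return starts


def merge_overlapping(
    length: int,
    starts: Sequence[int],
    patch_scores: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of per-patch scores back onto the full axis."""
    patch = len(patch_scores[0])
    if weights is None:
        weights = [1.0] * patch  # uniform weights: an assumption
    total = [0.0] * length
    norm = [0.0] * length
    for start, scores in zip(starts, patch_scores):
        for i, (value, weight) in enumerate(zip(scores, weights)):
            total[start + i] += weight * value
            norm[start + i] += weight
    # Every position is covered by at least one patch, so norm is nonzero
    # as long as the weights are positive.
    return [t / n for t, n in zip(total, norm)]
```

With these assumed settings, a 2048-pixel axis is covered by patches starting at 0, 480, 960, and 1408. A weight profile that decreases toward the patch border, instead of the uniform default, would favor predictions made near patch centres.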
3) Robustness to Different Background: Fig. 7 shows results for several different backgrounds selected from our test dataset. MeltPondNet detects the snow, pond, and open water classes very well, regardless of whether the scene is brightly or darkly illuminated. The submerged ice class is challenging to identify because it usually has a color similar to that of open water or melt ponds.
4) Robustness to Image Artifacts: Because melt pond images are usually acquired in areas of inclement weather, the camera may not work normally, as shown in Fig. 8. The white lines on the image are image artifacts or sensor artifacts. MeltPondNet handles those artifacts properly without losing accuracy.
In Fig. 9, the top image was acquired by the optical IceBridge DMS, and the bottom image is from Aerial sRGB [32]. Even though MeltPondNet was never trained on those data, it still predicts an accurate segmentation result.

V. CONCLUSION
In this study, we proposed a segmentation algorithm based on a swin transformer U-Net for accurately extracting the boundaries of melt ponds on the surface of Arctic sea ice from high spatial resolution aerial photographs. The model was trained and assessed using reference data obtained by expert melt pond mapping of aerial photographs taken over sea ice during HOTRAX in the central Arctic. The mapping was performed based on albedo differences for four classes of surface: 1) melt pond; 2) ice/snow; 3) submerged ice; and 4) open water. These classes were used for model training and application, and we observed that melt ponds were separated efficiently from the other surfaces.
The developed method can be applied not only to melt pond detection but also to tracking the corresponding changes between years, which is beneficial for climate studies. The workflow can be adapted to other types of optical data, or potentially to data acquired in other frequency bands, which could extend the application of the algorithm and provide new insights into processes between the ocean and the atmosphere. The results discussed in this study are promising, and future work could include a comprehensive assessment of the algorithm's accuracy under complex climatic transformations. The MeltPondNet architecture could also offer potential for efficient image analysis of geometrically sophisticated tundra lakes on permafrost [33].
He has been an Assistant Professor with the Department of Physics, University of Dayton, for a long time. He is currently a Lecturer of Applied Mathematics with the School of Mathematics and Statistics, The Open University, Milton Keynes, U.K. He is also a Scholar with the Kavli Institute for Theoretical Physics, Santa Barbara, CA, USA. He specializes in data analysis and mathematical modeling for physical and living systems.

ACKNOWLEDGMENT
Dr. Sudakow was awarded the title "Green Talent" by the German Federal Government in 2013 for "his outstanding research of sea ice and his strong commitment to interdisciplinary interaction between mathematics and climate science".
He is currently a Professor of Electrical and Computer Engineering and the Ohio Research Scholars Endowed Chair in wide area surveillance with the University of Dayton, Dayton, OH, USA, where he is also the Director of the Center of Excellence for Computational Intelligence and Machine Vision (Vision Lab). He holds four U.S. patents and has authored or coauthored more than 700 research articles, including an edited book in wide area surveillance and 116 peer-reviewed journal papers in the areas of image processing, pattern recognition, machine learning, deep learning, and artificial neural networks.
Prof. Asari is an elected Fellow of SPIE and a Co-Organizer of several SPIE and IEEE conferences and workshops. He