Deep Semisupervised Teacher–Student Model Based on Label Propagation for Sea Ice Classification

Abstract: In this article, we propose a novel teacher–student-based label propagation deep semisupervised learning (TSLP-SSL) method for sea ice classification based on Sentinel-1 synthetic aperture radar data. For sea ice classification, labeling the data precisely is very time consuming and requires expert knowledge. Our method efficiently learns sea ice characteristics from a limited number of labeled samples and a relatively large number of unlabeled samples. Therefore, our method addresses the key challenge of using a limited number of precisely labeled samples to achieve generalization capability by discovering the underlying sea ice characteristics also from unlabeled data. We perform experimental analysis considering a standard dataset consisting of properly labeled sea ice data spanning over different time slots of the year. Both qualitative and quantitative results obtained on this dataset show that our proposed TSLP-SSL method outperforms deep supervised and semisupervised reference methods.

of the sea ice conditions and how it changes with time is important [4], [5].
For high-resolution sea ice analysis, researchers and ice centers around the world use synthetic aperture radar (SAR) data [6], [7]. These data are not restricted by weather conditions or polar darkness [8]. An important part of sea ice analysis is sea ice classification. Sea ice classification based on SAR data [9] is carried out by classical statistical classification methods, traditional machine learning (TML) methods, and deep-learning-based methods (DLMs). Statistical and TML methods rely on handcrafted features, which may not properly capture the challenging sea ice characteristics [10]. Therefore, their generalization capabilities and their ability to find efficient features that can be applied to various geographic areas and time frames are limited [10]. DLMs, when properly trained on large training datasets, have shown excellent generalization capabilities in many research fields, including several remote sensing applications such as food security monitoring [11], hybrid data-driven Earth observation modeling [12], and flood mapping from high-resolution optical data [13]. Given these achievements, we believe that deep neural networks (DNNs) may also improve performance in automatic sea ice classification [14], [15]. However, scarce training data is the most challenging issue in sea ice data analysis. This problem is particularly challenging in the Arctic, where gathering precise ground-truth observations is expensive, time consuming, and sometimes not feasible [16]. For sea ice classification, archived ice charts are available and provide a large amount of labeled data. Nonetheless, these charts are very coarsely labeled and do not have the quality and detail needed to train a DLM effectively [17].
To extract accurate information from large-scale datasets when only a limited amount of labeled data is available, semisupervised learning (SSL) has been introduced in the technical literature [18]. These methods aim to combine labeled data with unlabeled records. In the past few years, semisupervised models have shown performance improvements in various fields of remote sensing research, such as despeckling of SAR images [19], change detection in heterogeneous remote sensing images [20], and hyperspectral image classification [21]. Considering these successes, we anticipate that deep SSL methodologies could also be favorable in sea ice classification and potentially lead to significant improvements by overcoming the specific challenge of few labeled samples. In fact, a deep SSL technique is halfway between supervised and unsupervised learning. This technique exploits multiple layers to progressively extract higher level features from the raw input data considering both labeled and unlabeled data.

Fig. 1. TSLP-SSL method. We have two models, namely, teacher and student models. The teacher model is trained on labeled data during the first stage, and then, both models are trained on labeled and unlabeled data during the second stage of the training.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
We propose a teacher-student-based label propagation deep semisupervised learning (TSLP-SSL) method. Our architecture consists of two models, namely, a teacher model and a student model. The teacher model is trained in a two-step procedure. Initially, we train the teacher model in a supervised fashion utilizing only the labeled data. We then feed both the labeled and unlabeled samples to the trained teacher model and use the feature space embedding to generate pseudo-labels for the unlabeled data through a label propagation procedure [22]-[24]. The original labels and the pseudo-labels are then used to train the student model, which is subsequently used during the inference stage. The purpose of using the student model is to avoid the problem of the teacher model being biased toward the labeled data, which is likely in the case of a small training set. Our proposed method, hence, effectively exploits a relatively large amount of unlabeled data to improve the final classification performance. The training methodology is depicted in Fig. 1 and is more thoroughly described in Section III. The summary of our contributions is as follows.
1) We propose a novel TSLP-SSL method. One of the major attractions of our proposed method is its capability to deal with a small number of labeled samples. This is a favorable property in the case of sea ice classification using SAR data, where reliable labeled data are scarce.
2) We consider sea ice datasets to train and analyze the generalization capabilities of our proposed method. We compare our method with a supervised method and three state-of-the-art semisupervised methods. Our results show that our proposed method performs better than all the reference methods, especially in cases with a small number of labeled samples.
3) Additionally, we present a comprehensive literature review covering both probabilistic learning methods and DLMs.

The rest of this article is organized as follows. Related work is described in Section II. We present our proposed deep models and training approaches in Section III. Section IV presents the experimental analysis considering a set of SAR images. Finally, Section V concludes this article and presents future work.

II. RELATED WORKS
In general, sea ice classification methods can be divided into two major classes: TML/probabilistic methods and DLMs [25]. The approaches in the latter class fall into two subclasses, namely, supervised deep learning and semisupervised deep learning methods. The literature on semisupervised DLMs is very limited since methods in this subcategory are quite recent and still under development.

A. Probabilistic Methods for Sea Ice Classification
The literature on TML/probabilistic methods is very rich, and we restrict ourselves to a few recent publications. Statistical algorithms often combine probabilistic models and classical classification methods with texture or polarimetric features to produce sea-ice-type maps. An extensive survey is given in [26].
Some specific studies in this category are highlighted below. Examples of machine learning algorithms include the use of standard multilayer perceptrons, as in [14], support vector machines, as in [7], or decision tree methods, as in [15]. Statistical and shallow machine learning methods often rely on extracting the input features in a preprocessing step prior to the classification. Karvonen [27] and Dinessen [28] used probabilistic and statistical features for estimating sea ice concentration from SAR imagery. Johansson et al. [29] used statistical entropy and horizontal-vertical (HV) polarization computations to isolate sea ice from open water and thicker sea ice. Furthermore, Fors et al. [30] investigated the potential of C- and X-band multipolarization SAR features for sea ice segmentation during late summer. Dabboor et al. [31] analyzed a set of compact polarimetric parameters for classifying newly formed ice and multiyear ice. Hong and Yang [32] used the statistical coefficient, incidence angle, environment temperature, and wind speed to improve sea ice and water classification. Johansson et al. [33] used a statistical mixture model to isolate open water from sea ice. Their method is based on a semiautomatic segmentation technique, and they applied the algorithm to explore the sea ice characteristics in Svalbard. Aldenhoff et al. [34] demonstrated that C-band SAR can reliably generate the layout of the ice boundary, whereas the L-band is effective for thin ice and water regions.

B. DLMs for Sea Ice Classification
Deep-learning-based approaches have been widely exploited for addressing the challenge of sea ice classification. Malmgren-Hansen et al. [17] applied a convolutional neural network (CNN) model to predict Arctic sea ice by fusing data from two different satellites. They found that CNNs perform well for multisensor data integration. It is worth noting that they used archived ice chart data for both training and validation. However, these data are coarsely labeled, which leads to undesired effects in the training of the CNN model. Wang et al. [10], [35], [36] exploited CNNs for ice concentration estimation. Tom et al. [37] proposed an ice monitoring model based on Sentinel-1 data with a deep learning approach. Boulze et al. [38] introduced a CNN for detecting different kinds of sea ice [39] using SAR data. They trained the CNN on archived ice chart data and compared it with a random forest classifier using texture features.
SSL methods are proposed for classification when only scarce training data or a limited number of training samples are available. The idea of SSL relies on the assumption that unlabeled samples provide essential information and clues on how the data are distributed. Therefore, a DLM can be trained by considering this distribution. In this sense, different approaches such as teacher-student models [40], graph-based methods [41], pseudo-labeling [42], consistency regularization [43], and generative models (e.g., generative adversarial networks, GANs) [44] have been introduced. Shin [40] proposed a multiteacher single-student method to solve the visual attribute prediction problem. His method learns task-specific domain experts, called teacher networks, and a student network that is forced to imitate the distributions learned by the domain experts. Xie et al. [45] proposed a noisy student method for generating pseudo-labels to train a model in an iterative way. The output of the model trained on the labeled samples is exploited to produce pseudo-labels for the unlabeled samples, which are subsequently used to train another model. They used the teacher-student model to train a larger student model by incorporating noise through data augmentation (DA), dropout, and stochastic depth. Tarvainen and Valpola [46] proposed a mean teacher method that averages model weights instead of label predictions. Their method improves test accuracy and enables training with fewer labeled samples. Salimans et al. [47] trained the semisupervised generative adversarial network (semi-GAN) as a generative model. Kingma et al. [48] exploited a variational autoencoder in the form of a semisupervised model. In their method, a classifier is trained on top of a latent representation to predict the labels. Iscen et al. [24] proposed a transductive label propagation model for deep SSL. This model is trained in an iterative two-step procedure.
In the first phase, a CNN is trained using the labeled part of the dataset in a supervised manner. In the second phase, based on a manifold assumption in the feature space of the CNN, pseudo-labels are produced for the unlabeled data through a label propagation procedure using a nearest neighbor graph. The pseudo-labels extend the set of labeled samples used to train the CNN model in the second phase. Berthelot et al. [49] introduced an SSL approach based on data augmentation. They assumed that the class distribution predicted by a classifier should remain the same for augmented versions of an unlabeled sample, and they used the average prediction over these augmentations to produce pseudo-labels for the unlabeled samples.

C. SSL Methods for Sea Ice Classification
The studies above show that the development of SSL methods is an active topic in the data analysis community. However, the application of SSL architectures to sea ice classification is still very limited. For example, Han et al. [50] investigated an approach for sea ice classification based on active learning (AL) and SSL. They used AL to acquire the most informative samples and exploited these samples in training the SSL method. Staccone [51] introduced an SSL approach based on GANs for sea ice classification. In this work, both labeled and unlabeled data were considered to achieve more accurate results by exploiting the knowledge from both data sources. Li et al. [52] presented an SSL method for ice and water classification based on self-training. Their method combined a contextual model and the self-training approach into a unified framework.
Our proposed method falls into the subcategory of SSL methods. We propose a teacher-student model considering the feature space using the label propagation method, which is summarized in the following section.

III. TEACHER-STUDENT-BASED LABEL PROPAGATION METHOD
As mentioned above, labeled sea ice samples are difficult to acquire, making the training of sea ice classification architectures a difficult task. Therefore, we explore a novel TSLP-SSL method for this application. We utilize a limited number of labeled samples and a much larger number of unlabeled samples to train a deep CNN architecture for extracting sea ice information. Our proposed TSLP-SSL method consists of a teacher model and a student model, which are cooperatively trained in an iterative way during two training stages. Our method is different from the teacher-student models presented in [45] and [46] in two major aspects. First, in our case, features generated by the trained teacher model are extracted before the final classification layer and used in the label propagation process to produce pseudo-labels for the unlabeled samples using a k-nearest neighbor approach. Hence, label propagation is performed in feature space, and not in output label space. Second, the pseudo-labels from the teacher model are exploited, together with the original labels, to train the student model in order to find an optimal decision boundary during a second iterative training stage. Our proposed method is also different from the deep SSL model in [24] in the way it aims to avoid the model being biased toward the labeled data. In fact, the method in [24] is based on a single model, which is trained on only the labeled data, making it susceptible to bias toward these data samples. The biasing problem may be even more significant in the sea ice classification task, considering the small amount of labeled data and noting the fact that texture features are important for discriminating between different ice types.
In our proposed method, both models are represented by a CNN with a 13-layer architecture [24]. During the first training stage, the teacher model is trained on the labeled data only. During the second stage, the teacher model generates pseudo-labels for the unlabeled data. These pseudo-labels, combined with the labeled samples, are used to train the student model. The motivation for considering an additional student model is to handle the problem of the teacher model being biased toward the labeled data, as discussed above [53]. To further elaborate on this issue, the teacher model formulates a decision boundary considering a small set of labeled data. However, this decision boundary may not be the best boundary when also considering the unlabeled data during the second stage, especially if the teacher model is overfitted to the labeled data because of the limited number of samples [54]. The idea is that the student model should discover a more appropriate decision boundary, as illustrated in Fig. 2. Fig. 2 displays a simplified case, in which the triangles represent samples from one arbitrary class and the circles show samples from another class. The red and blue symbols represent labeled data from the two classes, respectively, and the black symbols represent unlabeled data from both classes. Since the teacher model is trained using labeled data only, the decision boundary shown as a blue solid line in Fig. 2 could be a solution. A better decision boundary is discovered by repeatedly training the student model from scratch with both pseudo-labeled and labeled data. In this way, the student model would end up with the decision boundary defined by the green-dashed line, which properly separates both the labeled and unlabeled data from both classes. It is worth noting that this example shows the advantage of using label propagation based on nearest neighbors instead of using the network output as pseudo-labels.

Fig. 2. Complexity of tuning the teacher decision boundary to also take into account the unlabeled data. We show two-class labeled data with red triangles and blue circles. The black markers represent the unlabeled data.
During the second stage of our training, the teacher model generates predictions for the entire dataset. The feature space embedding is subsequently used to construct a nearest neighbor graph and an adjacency matrix, from which we assign pseudo-labels to the unlabeled samples in a transductive label propagation procedure [24].

A. Formulation for the Learning Process
To clearly provide the details of the label propagation process for our teacher model, we present the associated notation in this section, largely following the outline in [24]. We consider a set of $n$ samples denoted by $X := (x_1, \ldots, x_s, x_{s+1}, \ldots, x_n)$ with $x_i \in \mathcal{X}$, where the $s$ samples $x_i$ for $i \in S := \{1, \ldots, s\}$, represented by $X_S$, are labeled according to $Y_S := (y_1, \ldots, y_s)$. Each element in $Y_S$ is $y_i \in G$, where $G := \{1, \ldots, g\}$ is a discrete label set of $g$ classes. The remaining $e := n - s$ samples $x_i$ for $i \in E := \{s+1, \ldots, n\}$, represented by $X_E$, are unlabeled. We consider all samples in $X$ and labels in $Y_S$ to train the CNN to assign class labels to previously unseen samples. The CNN takes an input sample $x_i$ from $\mathcal{X}$ and builds a vector of class probabilities $f_\Lambda(x_i)$, $f_\Lambda : \mathcal{X} \to \mathbb{R}^g$, where $\Lambda$ represents the parameters of our deep model. In this process, the feature extraction stage is represented by the function $\Omega_\Lambda : \mathcal{X} \to \mathbb{R}^d$, which maps the input data to a $d$-dimensional feature vector, where the $i$th sample is represented by $d_i := \Omega_\Lambda(x_i)$. In the next stage, a vector of class probabilities is built by the softmax on top of the fully connected layer applied to $\Omega_\Lambda$. The prediction of the CNN for the $i$th sample is the class of the highest probability, i.e.,

$$\hat{y}_i := \arg\max_{j} f_\Lambda(x_i)_j \tag{1}$$

where $f_\Lambda(x_i)_j$ is the $j$th dimension of the vector. In supervised learning, the loss function in (2) is minimized to train the CNN:

$$\xi_{sup}(X_S, Y_S; \Lambda) := \sum_{i=1}^{s} \varepsilon_{sup}\big(f_\Lambda(x_i), y_i\big) \tag{2}$$

Equation (2) applies only to the labeled samples, i.e., $x_i \in X_S$. In fact, (2) is one term of the loss function in SSL. In classification problems, the cross-entropy loss function is generally used for $\varepsilon_{sup}$, which for a given sample $x_i$ is defined as

$$\varepsilon_{sup}\big(f_\Lambda(x_i), y_i\big) := -\sum_{k=1}^{g} y_k \log f_\Lambda(x_i)_k \tag{3}$$

where $y_k$ is the $k$th component of the one-hot encoding of $y_i \in G$. Pseudo-labeling finds a pseudo-label $\hat{y}_i$ for each sample $x_i$ for $i \in E$. The pseudo-labels for the unlabeled samples in $X_E$ are represented by $\hat{Y}_E = \{\hat{y}_{s+1}, \ldots, \hat{y}_n\}$, and they form an additional loss term formulated as

$$\xi_{pseu}(X_E, \hat{Y}_E; \Lambda) := \sum_{i=s+1}^{n} \varepsilon_{sup}\big(f_\Lambda(x_i), \hat{y}_i\big) \tag{4}$$
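As an illustration of the loss terms in this subsection, the following minimal NumPy sketch evaluates the argmax prediction rule, the supervised loss in (2), and the pseudo-label loss in (4) on made-up class probabilities. The function names, the toy probabilities, and the zero-based class indices (0 for water and 1 for ice, instead of labels in G = {1, ..., g}) are assumptions introduced for this example and are not taken from the released code.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Per-sample cross-entropy: minus the log-probability assigned to the given class."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(np.clip(picked, 1e-12, 1.0))

def supervised_loss(probs_labeled, y_labeled):
    """Sum of the per-sample loss over the labeled samples, as in (2)."""
    return float(cross_entropy(probs_labeled, y_labeled).sum())

def pseudo_loss(probs_unlabeled, y_pseudo):
    """Same per-sample loss, evaluated against pseudo-labels, as in (4)."""
    return float(cross_entropy(probs_unlabeled, y_pseudo).sum())

# Toy softmax outputs for three patches (made-up numbers, two classes).
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
predictions = probs.argmax(axis=1)            # class of highest probability
loss_sup = supervised_loss(probs[:2], [0, 1]) # first two patches are labeled
loss_pseu = pseudo_loss(probs[2:], [0])       # third patch carries a pseudo-label
```

In an SSL setting the two terms are added, which is why the pseudo-label quality directly affects the total loss.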

B. Pseudo-Label Generation and Learning Process
In our method, the CNN is represented by the parameters $\Lambda$, and we formulate the descriptor set as $D := (d_1, \ldots, d_n)$ with $d_i := \Omega_\Lambda(x_i)$. We build a sparse affinity matrix $\Delta \in \mathbb{R}^{n \times n}$, where its elements are represented by

$$\Delta_{ij} := \begin{cases} [d_i^\top d_j]_+^{\gamma}, & \text{if } i \neq j \text{ and } d_i \in N_k(d_j) \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where $N_k$ represents the set of $k$-nearest neighbors in $X$, and $\gamma$ is a hyperparameter. It is worth noticing that building the sparse affinity matrix is computationally efficient even if we have a very large number of samples. We then build a symmetric adjacency matrix $\Theta = \Delta + \Delta^\top$ such that $\Theta \in \mathbb{R}^{n \times n}$. The diagonal of the matrix $\Theta$ consists of zeroes. The rest of the elements of $\Theta$ are nonnegative pairwise similarities between $d_i$ and $d_j$ for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, n$. We formulate the symmetrically normalized counterpart of $\Theta$ as

$$\Xi := \Gamma^{-1/2} \, \Theta \, \Gamma^{-1/2} \tag{6}$$

where $\Gamma := \mathrm{diag}(\Theta 1_n)$ is the degree matrix and $1_n$ is an $n$-dimensional vector with all elements set to 1. We formulate a label matrix $Y$ of size $n \times g$ consisting of the elements

$$Y_{ij} := \begin{cases} 1, & \text{if } i \in S \text{ and } y_i = j \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

The rows of the matrix $Y$ represent one-hot encoded labels for the labeled samples. Subsequently, the diffusion amounts to formulating an $n \times g$ matrix $\psi$ such that

$$\psi := (I - \alpha \Xi)^{-1} Y \tag{8}$$

where $\alpha \in [0, 1)$ is a parameter. The elements of $\psi$ are represented by $\delta_{ij}$. In fact, calculating the matrix $\psi$ according to (8) is impractical for large $n$ because the inverse matrix $(I - \alpha \Xi)^{-1}$ is not sparse. Therefore, we use the conjugate gradient method to solve the linear system

$$(I - \alpha \Xi)\, \psi = Y \tag{9}$$

Solving (9) in this way is fast and valid since the matrix $(I - \alpha \Xi)$ is positive definite. We find the pseudo-labels $\hat{Y}_E = \{\hat{y}_{s+1}, \ldots, \hat{y}_n\}$ for the unlabeled samples as

$$\hat{y}_i = \arg\max_{j} \delta_{ij} \tag{10}$$

where $\delta_{ij}$ is the $(i, j)$th element of the matrix $\psi$. It is worth noting that finding pseudo-labels from the matrix $\psi$ in this way has some unwanted effects. For example, we assign pseudo-labels to all unlabeled samples; however, we are clearly not equally confident in all generated pseudo-labels. Moreover, the pseudo-labels may not represent the same number of samples for each class, which will affect the performance of the learning process.
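The label propagation step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the function names, the L2 normalization of the descriptors, the toy two-dimensional descriptors, and the values k = 2, gamma = 3, and alpha = 0.99 are chosen for the example; dense arrays and a hand-written conjugate gradient loop are used for brevity, whereas a sparse representation would be needed for a realistic number of samples.

```python
import numpy as np

def knn_affinity(D, k, gamma):
    """Affinity [d_i . d_j]_+ ** gamma, kept only for the k nearest neighbors of each row.
    Rows hold the neighbors of sample i; the symmetrization below makes the
    neighbor direction irrelevant."""
    D = D / np.linalg.norm(D, axis=1, keepdims=True)   # assumption: unit-norm descriptors
    sim = np.clip(D @ D.T, 0.0, None)
    np.fill_diagonal(sim, 0.0)                          # zero diagonal, no self-similarity
    A = np.zeros_like(sim)
    for i in range(len(D)):
        nn = np.argsort(-sim[i])[:k]
        A[i, nn] = sim[i, nn] ** gamma
    return A

def conjugate_gradient(M, b, iters=50, tol=1e-10):
    """Solve M x = b for a symmetric positive-definite M."""
    x = np.zeros_like(b)
    r = b - M @ x
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Mp = M @ p
        step = rs / (p @ Mp)
        x += step * p
        r -= step * Mp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def propagate_labels(D, y, labeled, n_classes, k=2, gamma=3.0, alpha=0.99):
    A = knn_affinity(D, k, gamma)
    W = A + A.T                                         # symmetric adjacency matrix
    inv_sqrt_deg = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    S = W * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]   # symmetric normalization
    Y = np.zeros((len(D), n_classes))
    Y[labeled, y[labeled]] = 1.0                        # one-hot rows for labeled samples only
    M = np.eye(len(D)) - alpha * S
    # Solve the linear system in (9) column by column instead of inverting M.
    Z = np.column_stack([conjugate_gradient(M, Y[:, c]) for c in range(n_classes)])
    return Z, Z.argmax(axis=1)                          # pseudo-labels as in (10)

# Toy descriptors: two tight groups, one labeled sample in each (-1 marks unlabeled).
D = np.array([[1.0, 0.05], [0.9, 0.1], [1.0, 0.15],
              [0.05, 1.0], [0.1, 0.9], [0.15, 1.0]])
y = np.array([0, -1, -1, 1, -1, -1])
labeled = np.array([0, 3])
Z, pseudo = propagate_labels(D, y, labeled, n_classes=2)
```

Because the pseudo-labels come from the neighborhood graph rather than from the softmax output, an unlabeled sample inherits the label of the labeled samples it is connected to in feature space.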
To handle the unequal certainty of the pseudo-labels, we assign to each pseudo-label a weight representing the certainty of the prediction. For this purpose, we consider the entropy $\Upsilon$ to compute the level of uncertainty and provide a weight $\omega_i$ to sample $x_i$ formulated as

$$\omega_i := 1 - \frac{\Upsilon(\hat{\delta}_i)}{\log(g)} \tag{11}$$

where $\Upsilon : \mathbb{R}^g \to \mathbb{R}$ is the entropy function, and the weight $\omega_i$ is normalized in $[0, 1]$ because $\log(g)$ is the maximum possible entropy in $\mathbb{R}^g$ [when all datapoints are equally distributed to the clusters, the maximum entropy for $g$ classes is $H = -\sum_{c=1}^{g} \frac{1}{g} \log\frac{1}{g} = \log(g)$]. $\hat{\delta}_i$ is the $g$-dimensional vector given by the row-wise normalized counterpart of the $i$th row of $\psi$, with components formulated as

$$\hat{\delta}_{ij} := \frac{\delta_{ij}}{\sum_{k} \delta_{ik}} \tag{12}$$

To cope with the situation in which the classes contain different numbers of samples, we provide a weight $\upsilon_j$ to class $j$ that is inversely related to class size, formulated as

$$\upsilon_j := \big(|S_j| + |E_j|\big)^{-1} \tag{13}$$

where $|S_j|$ is the number of labeled samples and $|E_j|$ is the number of pseudo-labeled samples in class $j$. We have thus formulated per-sample and per-class weights. We relate the weighted loss to the labeled and pseudo-labeled samples as follows:

$$\xi_{w}(X, Y_S, \hat{Y}_E; \Lambda) := \sum_{i=1}^{s} \upsilon_{y_i}\, \varepsilon_{sup}\big(f_\Lambda(x_i), y_i\big) + \sum_{i=s+1}^{n} \omega_i\, \upsilon_{\hat{y}_i}\, \varepsilon_{sup}\big(f_\Lambda(x_i), \hat{y}_i\big) \tag{14}$$

In fact, (14) is the sum of weighted versions of $\xi_{sup}$ and $\xi_{pseu}$ in (2) and (4), respectively. Iscen et al. [24] used one CNN model to produce the pseudo-labels and then used these labels to train the same model. In contrast to this approach, we use two CNN models in the form of a teacher model and a student model. The teacher model generates the pseudo-labels, which are combined with the labeled samples to train the student model. Therefore, the trained student model is not biased toward the labeled data. To this end, the student and teacher models are trained in parallel, according to (14), in which $\hat{y}_i$ in the student model comes from the teacher model.
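The per-sample certainty weights, the per-class weights, and the weighted loss in (14) can be sketched as follows. The function names and all numbers are made up for illustration, class indices are zero-based, and the small clipping constants are an implementation convenience assumed here to avoid taking the logarithm of zero.

```python
import numpy as np

def certainty_weights(Z):
    """Entropy-based weight per sample: 1 - H(row-normalized scores) / log(g)."""
    Z = np.clip(np.asarray(Z, dtype=float), 0.0, None)
    P = Z / np.maximum(Z.sum(axis=1, keepdims=True), 1e-12)   # row-wise normalization
    H = -(P * np.log(np.clip(P, 1e-12, 1.0))).sum(axis=1)     # entropy of each row
    return 1.0 - H / np.log(Z.shape[1])

def class_weights(labels, n_classes):
    """Inverse of the class size, counted over labeled plus pseudo-labeled samples."""
    counts = np.bincount(labels, minlength=n_classes)
    return 1.0 / np.maximum(counts, 1)

def weighted_loss(probs, labels, is_pseudo, omega, upsilon):
    """Weighted cross-entropy: class weight on every sample, certainty weight on pseudo-labels."""
    ce = -np.log(np.clip(probs[np.arange(len(labels)), labels], 1e-12, 1.0))
    sample_w = np.where(is_pseudo, omega, 1.0)
    return float((sample_w * upsilon[labels] * ce).sum())

# Toy propagation scores: a certain row, a fully ambiguous row, and one in between.
Z = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.2, 0.8]])
omega = certainty_weights(Z)

labels = np.array([0, 0, 1])                 # one true label followed by two pseudo-labels
is_pseudo = np.array([False, True, True])
upsilon = class_weights(labels, n_classes=2)

probs = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.3, 0.7]])               # made-up network outputs
loss = weighted_loss(probs, labels, is_pseudo, omega, upsilon)
```

A fully ambiguous pseudo-label (the second row) receives a weight close to zero and therefore contributes almost nothing to the loss.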
In summary, considering the nearest neighbor graph definition in the form of an affinity matrix, label propagation, sample and class weights, and label and pseudo-label loss terms, our semisupervised method follows an iterative procedure. Initially, we randomly initialize all the parameters. We then train the teacher model using the $s$ labeled samples in $X_S$, considering the supervised loss term. We use the trained teacher model to extract descriptors $D$ for the complete training set $X$. We then find the $k$-nearest neighbors of all samples to build the adjacency matrix $\Theta$ and carry out label propagation by solving (9). We then assign pseudo-labels to the unlabeled samples in $X_E$ according to (10). Subsequently, we train both the teacher and student models for one epoch on the complete training set $X$ using the weighted loss function in (14). This process is repeated for $T$ epochs.
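The control flow of this procedure can be sketched as below. To keep the example short and runnable, several simplifications are assumed that are not part of the method itself: a single softmax layer (`LinearModel`, a hypothetical stand-in) replaces the CNN and uses the raw input as its descriptor, so the propagation result does not change between epochs; parameters are initialized to zero rather than randomly so that the run is deterministic; the linear system is solved directly with `np.linalg.solve` instead of conjugate gradient; samples that receive no propagated mass are given weight zero; and the data, epoch counts, and learning rate are toy values.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class LinearModel:
    """Hypothetical stand-in for the CNN: identity features plus one softmax layer."""
    def __init__(self, dim, n_classes):
        self.W = np.zeros((dim, n_classes))
    def features(self, X):
        return X
    def predict_proba(self, X):
        return softmax(self.features(X) @ self.W)
    def train_epoch(self, X, y, w, lr):
        grad = self.predict_proba(X)
        grad[np.arange(len(y)), y] -= 1.0          # gradient of cross-entropy w.r.t. logits
        self.W -= lr * (X.T @ (grad * w[:, None])) / w.sum()

def propagate(D, y, labeled, g, k=5, gamma=3.0, alpha=0.99):
    """Dense label propagation returning pseudo-labels and certainty weights."""
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    sim = np.clip(Dn @ Dn.T, 0.0, None) ** gamma
    np.fill_diagonal(sim, 0.0)
    idx = np.argsort(-sim, axis=1)[:, :k]          # k nearest neighbors per row
    A = np.zeros_like(sim)
    np.put_along_axis(A, idx, np.take_along_axis(sim, idx, axis=1), axis=1)
    W = A + A.T
    d = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    S = W * d[:, None] * d[None, :]
    Y = np.zeros((len(D), g))
    Y[labeled, y[labeled]] = 1.0
    Z = np.clip(np.linalg.solve(np.eye(len(D)) - alpha * S, Y), 0.0, None)
    mass = Z.sum(axis=1)
    P = Z / np.maximum(mass, 1e-12)[:, None]
    omega = 1.0 + (P * np.log(np.clip(P, 1e-12, 1.0))).sum(axis=1) / np.log(g)
    omega[mass == 0] = 0.0                         # assumption: unreachable samples get no weight
    return Z.argmax(axis=1), omega

def class_weights(labels, g):
    return 1.0 / np.maximum(np.bincount(labels, minlength=g), 1)

# Toy data: two angular clusters on the unit circle, two labeled samples per class.
theta = np.deg2rad(np.concatenate([np.linspace(0, 20, 30), np.linspace(70, 90, 30)]))
X = np.column_stack([np.cos(theta), np.sin(theta)])
y_true = np.repeat([0, 1], 30)
labeled = np.array([0, 29, 30, 59])
is_labeled = np.zeros(len(X), dtype=bool)
is_labeled[labeled] = True
g = 2

teacher, student = LinearModel(2, g), LinearModel(2, g)

# Stage 1: supervised training of the teacher on the labeled samples only.
for _ in range(100):
    teacher.train_epoch(X[labeled], y_true[labeled], np.ones(len(labeled)), lr=0.5)

# Stage 2: propagate labels from teacher descriptors, then train both models for one epoch.
for _ in range(50):
    pseudo, omega = propagate(teacher.features(X), y_true, labeled, g)
    targets = np.where(is_labeled, y_true, pseudo)
    w = np.where(is_labeled, 1.0, omega) * class_weights(targets, g)[targets]
    teacher.train_epoch(X, targets, w, lr=0.5)
    student.train_epoch(X, targets, w, lr=0.5)

student_acc = float((student.predict_proba(X).argmax(axis=1) == y_true).mean())
```

Only the student would be kept for inference; the teacher's role in the second stage is to supply the descriptors from which the pseudo-labels are propagated.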

A. SAR-Based Sea Ice Dataset
We have trained our proposed method considering 31 Sentinel-1 images. The images are acquired from the North of Svalbard with 40 m × 40 m pixel resolution. They are preprocessed using the ESA SNAP software by applying thermal noise removal, calibration using the σ 0 lookup table, and multilooking using a 3 × 3 boxcar filter. After converting the intensity images to dB values, they are clipped and scaled linearly in the range To create a suitable dataset for sea ice classification, we used labeled polygons generated from 31 Sentinel-1 EW scenes from the North of Svalbard. These polygons were carefully labeled manually according to coregistered optical images with as small as possible time gaps. We used these images for training our proposed method. More details can be found in [39]. The dataset consists of five classes, as shown in Table I.
To perform sea ice classification and create a proper dataset [55] for deep learning, we extracted patches with a size of 32 × 32 pixels, corresponding to a ground footprint of 1280 m × 1280 m at the 40-m pixel size, from inside the polygons, with a stride of 10 pixels. The dataset is available online [55]. It is worth mentioning that we analyzed the effect of different patch sizes in a previous work [9]. We found that the validation results improved with increasing patch size. However, this improvement comes at the cost of a lower spatial resolution, as larger patches cover wider areas of the surface. For instance, a larger patch will be classified as water if the majority of the pixels represent water. This would become a significant issue at ice edges, as classification based on larger patches would lead to coarser or nonsmooth edges. Therefore, there is a tradeoff between accuracy and resolution. As a compromise, in our proposed work, we consider a patch size equal to 32 × 32 pixels. We extracted two-channel patches consisting of HH and HV intensities. We also analyzed the effect of different channel compositions (HH, HV, and incidence angle) in our previous work [9]. We found that adding the HV channel to the HH channel gives a large improvement, whereas the improvement resulting from also adding the incidence angle is quite small. In the current work, we do not include the incidence angle, as this also enables a more proper comparison with other SSL methods [48], [49]. These reference SSL methods largely rely on different DA techniques, which make the inclusion of the incidence angle infeasible. Therefore, the patches in our work consist of only HH and HV intensities to maintain consistency. In Table I, we provide ice type codes, following the definitions of the World Meteorological Organization [56], and a brief description of each class. We consider binary sea ice classification. The first class, namely, the water class, consists of open water and leads with water, and the various ice types are grouped together as the ice class. The total number of patches is 9317 for water and 5433 for ice. For now, we are interested in analyzing the performance of DNNs for binary classification. Based on our experience with sea ice classification, our expectation is that if DNNs can perform well in the binary classification case, they may also classify multiple sea ice types properly.
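The patch extraction described above can be sketched as a sliding window over a rasterized polygon mask. The function name, the rule that a patch must lie entirely inside the polygon, and the synthetic scene and mask are assumptions made for this illustration; the text only states the patch size, the stride, and that patches are taken from inside the polygons.

```python
import numpy as np

def extract_patches(image, mask, size=32, stride=10):
    """Collect all size x size windows that lie fully inside the labeled polygon mask.

    image: array of shape (channels, rows, cols), here HH and HV intensities.
    mask:  boolean array of shape (rows, cols), True inside the polygon.
    """
    patches = []
    rows, cols = mask.shape
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            if mask[r:r + size, c:c + size].all():
                patches.append(image[:, r:r + size, c:c + size])
    if not patches:
        return np.empty((0, image.shape[0], size, size), dtype=image.dtype)
    return np.stack(patches)

# Synthetic two-channel scene with one rectangular "polygon".
scene = np.zeros((2, 100, 120), dtype=np.float32)
polygon = np.zeros((100, 120), dtype=bool)
polygon[10:80, 20:110] = True

patches = extract_patches(scene, polygon)

# Ground footprint of one patch side at 40-m pixel size.
footprint_m = 32 * 40
```

With a stride of 10 pixels and a patch size of 32 pixels, neighboring patches overlap by 22 pixels, so the patches extracted from one polygon are strongly correlated.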
For validation, we consider other Sentinel-1 scenes provided by the Norwegian Meteorological Institute [57] from the Danmarkshavn area on the northeastern coast of Greenland and extract 1516 water patches and 1324 ice patches, mostly from challenging areas. In the first experiment, we consider the training dataset from the North of Svalbard and split it into labeled $X_S$ and unlabeled $X_E$ samples. In the next experiment, to show the capability of the proposed method in classifying real unlabeled data, we consider 5000 random patches picked from the Norwegian Meteorological Institute dataset as the unlabeled dataset $X_E$ and use all samples in the training set as the labeled dataset $X_S$. We use different numbers of labeled samples for each class, i.e., 15, 30, 60, 100, 500, and 1000. For the inference results, we apply SAR images from the Norwegian Meteorological Institute dataset [57], which were collected in 2018.
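A split of the training set into labeled and unlabeled subsets for the first experiment could look like the sketch below. It assumes that the stated numbers (15, 30, 60, 100, 500, and 1000) are counts per class and that the labeled samples are drawn uniformly at random; the function name and the fixed seed are hypothetical, and only the class sizes (9317 water and 5433 ice patches) come from the text.

```python
import numpy as np

def split_labeled_unlabeled(y, n_per_class, seed=0):
    """Pick n_per_class labeled indices from every class; the rest is treated as unlabeled."""
    rng = np.random.default_rng(seed)
    labeled = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
        for c in np.unique(y)
    ])
    unlabeled = np.setdiff1d(np.arange(len(y)), labeled)
    return np.sort(labeled), unlabeled

# Class labels of the training patches: 0 = water, 1 = ice.
y = np.array([0] * 9317 + [1] * 5433)

splits = {n: split_labeled_unlabeled(y, n) for n in (15, 30, 60, 100, 500, 1000)}
labeled_15, unlabeled_15 = splits[15]
```

Since the true labels of the "unlabeled" part are known in this experiment, they can later be used to measure the accuracy of the pseudo-labels.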

B. Our Model Configurations
We use the same network architecture for the teacher and student models. Similar to [24], we use the network architecture defined in [46] and shown in Table II. We trained the teacher model for 100 epochs in the first training step. In the second step, we trained the teacher model for 200 epochs based on the label propagation to produce pseudo-labels. These labels are then exploited to train the student model concurrently. The learning rate for the teacher model is 0.0008 for the first step and 0.0001 for the second step. The learning rate for the student model during the second step is 0.002. For DA, we used only rotation in both steps to keep the same physical meaning of all the channels of the SAR data, and we used the same hyperparameter values as in the previous studies [24], [58] in all experiments. We ran the experiments on a single NVIDIA Quadro RTX 5000 with 16-GB memory. The code is available online (see footnote 1).
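For reference, the training settings stated above can be collected in a small configuration object, together with a rotation-only augmentation. The class and field names are hypothetical, and restricting the rotation to multiples of 90 degrees is an assumption of this sketch, since the text only states that rotation is used.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class TrainingConfig:
    """Values as stated in the text; the field names are illustrative."""
    teacher_epochs_stage1: int = 100
    epochs_stage2: int = 200
    teacher_lr_stage1: float = 0.0008
    teacher_lr_stage2: float = 0.0001
    student_lr_stage2: float = 0.002

def random_rotation(patch, rng):
    """Rotate a (channels, H, W) patch by a random multiple of 90 degrees.

    The rotation acts on the spatial axes only, so the HH and HV channels
    keep their values and their physical meaning."""
    return np.rot90(patch, k=int(rng.integers(4)), axes=(1, 2))

cfg = TrainingConfig()
rng = np.random.default_rng(0)
patch = np.arange(2 * 32 * 32, dtype=np.float32).reshape(2, 32, 32)
rotated = random_rotation(patch, rng)
```

Augmentations that alter pixel intensities would change the backscatter values themselves, which is presumably why only a geometric transform is applied.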

C. Results and Discussion
We trained our models with different numbers of labeled samples to assess the performance of our proposed method in comparison with four reference methods. For this purpose, we consider both a supervised CNN model and three semisupervised methods, namely, semi-GANs [47], MixMatch [49], and the label propagation model (LP-SSL) [24]. In the supervised CNN model, we consider the same CNN architecture that we use for both our teacher and student models. We present the validation results in Table III in terms of accuracy for both our proposed TSLP-SSL method and the reference methods. In the first experiment, we use our training data and split it into labeled ($X_S$, $Y_S$) and unlabeled ($X_E$) datasets. For the validation, we use the validation data that were mentioned previously (see Section IV-A). As can be seen in Table III, our proposed method outperforms the fully supervised CNN architecture considering 15, 30, 40, 60, and 100 labeled samples. Similarly, our method also outperforms the semisupervised methods semi-GANs [47], MixMatch [49], and LP-SSL [24] for the different numbers of labeled samples, except in the case of 500 labeled samples. For a comprehensive analysis, we also consider other performance metrics, namely, average precision, average recall, and average F1-score, for both classes: water and ice. We present the results in Table IV. As can be seen, our method also outperforms the supervised and semisupervised methods in most cases. In fact, our method learns more information from the unlabeled data, especially when a very limited number of samples is available. The student model in our approach has the potential to remedy the problem of overfitting of the teacher model when only few samples are available, and it presents comparable validation accuracy when considering 500 and 1000 labeled samples. However, when the number of labeled samples increases, the amount of information extracted from the unlabeled data does not significantly improve the results. It is worth noticing that the quality of the labeled samples can significantly impact the results in the second step. This can be seen when comparing the results of using 15 and 30 labeled samples in Tables III and IV. In fact, our proposed method can learn from the unlabeled data and, thus, improves its performance. It even achieves better validation accuracy than the supervised and LP-SSL models considering 15, 30, 40, 60, and 100 labeled samples. In order to explain the behavior of our method considering 500 and 1000 labeled samples, we compute the accuracy of the pseudo-labels from the teacher model during the second step of the training process. This can be done since the ground-truth labels of the unlabeled data can be extracted from the training dataset. We consider the comparison of our proposed method with the fully supervised CNN architecture. When both methods are trained on 500 and 1000 labeled samples, the accuracy of the pseudo-labels reaches more than 99%, but at the same time, the validation accuracy does not increase, as shown in Table III. This means that there is no more information in the unlabeled data to further improve the validation accuracy considering this particular dataset. We investigate this by training the supervised model with all the data in the training dataset, and it reached a validation accuracy of 91.57%.

1 https://github.com/sakh251/TSLP-SSL
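The metrics discussed in this subsection can be computed as in the following sketch. It assumes that "average" precision, recall, and F1-score mean the unweighted mean over the two classes (macro averaging); the function name and the toy labels are made up and do not reproduce any value from Tables III or IV.

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes=2):
    """Accuracy plus precision, recall, and F1-score averaged over the classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precision, recall, f1 = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    accuracy = float(np.mean(y_true == y_pred))
    return accuracy, float(np.mean(precision)), float(np.mean(recall)), float(np.mean(f1))

# Toy example: 0 = water, 1 = ice.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
acc, prec, rec, f1 = macro_metrics(y_true, y_pred)
```

The same function applied to the pseudo-labels of the "unlabeled" training samples gives the pseudo-label accuracy mentioned above, because their ground-truth labels are available in the first experiment.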
We also investigated the inference results of our proposed TSLP-SSL model on a single SAR scene from Danmarkshavn for 30, 60, and 100 labeled samples. The results of this experiment are reported in Fig. 3, where the first row shows the results of the supervised model and the second row shows the results of our proposed method. Blue indicates the water class and white indicates the ice class. As can be seen, our method improves on the supervised model, especially in the noisy areas.
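For reference, the quantitative metrics used in this section can be computed as in the sketch below. It assumes that "average" precision, recall, and F1-score mean the unweighted mean over the two classes, and that water and ice are encoded as 0 and 1; both are assumptions made for illustration, and the function name is hypothetical.

```python
def classification_report(y_true, y_pred, classes=(0, 1)):
    """Accuracy plus precision, recall, and F1 averaged over the classes.

    Assumption: classes are encoded as 0 (water) and 1 (ice), and the
    averages are unweighted means of the per-class scores.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1_scores = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    k = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1_scores) / k,
    }
```

Averaging per class rather than per sample keeps the minority class from being hidden by the majority class, which matters when water and ice patches are unevenly represented.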

D. Feature Separability of Our Proposed Method
TABLE IV. AVERAGE OF PRECISION, RECALL, AND F1-SCORE FOR DIFFERENT AMOUNTS OF LABELED DATA AND UNLABELED DATA FROM THE TRAINING DATASET
Fig. 3. Inference results. We present qualitative results of a single input image. The first row depicts the results considering supervised deep learning, and the second row depicts the results using our proposed TSLP-SSL model.
Fig. 5. Inference results. The first column shows input images, the second column shows the results obtained with supervised deep learning, and the third column shows results obtained with our TSLP-SSL model, which is trained by also taking into account unlabeled data from other images.
Furthermore, we illustrate the capability of the label propagation step that we use to generate the pseudo-labels for training the student model. Label propagation is characterized by consolidated feature separability, which helps generate meaningful pseudo-labels for training the student model. To explain this visually, we extract the feature vectors output by the last convolution layer. The dimension of the feature vectors is 128. We transform the feature vectors into three components using principal component analysis (PCA), considering both labeled and unlabeled data, to visually understand the feature space. These components are shown in Fig. 4. Fig. 4(a) and (c) show the feature space when training the teacher model in the first step considering 60 and 1000 labeled samples, respectively. Fig. 4(b) and (d) show the feature space representation after label propagation is applied in the second step. The yellow circles represent water and the purple circles represent the ice class. As can be seen, label propagation leads to more separable classes in the feature space, especially when 1000 labeled samples are considered.
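The reduction of feature vectors to three principal components can be sketched as follows. This is a dependency-free illustration only: the function name is hypothetical, and power iteration with deflation is our choice for the example, not necessarily how the PCA in the paper was computed. For real 128-dimensional features, a library routine such as scikit-learn's `PCA` would be the practical option.

```python
import math
import random


def pca_project(features, n_components=3, n_iter=200, seed=0):
    """Project feature vectors onto their leading principal components.

    The components are found by power iteration with deflation on the
    sample covariance matrix (an illustrative choice for this sketch).
    """
    n, d = len(features), len(features[0])
    mean = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in features]
    # Sample covariance matrix of shape d x d.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        # Random unit-length start vector.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # No variance left in the deflated matrix.
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue of the converged direction.
        eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
        components.append(v)
        # Deflation: remove the direction just found from the covariance.
        for a in range(d):
            for b in range(d):
                cov[a][b] -= eigval * v[a] * v[b]
    # Coordinates of every centered sample along each component.
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]
```

Fitting the projection on labeled and unlabeled features together, as the text describes, places both sets in the same three-dimensional space so that they can be compared in a single plot.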
Therefore, through label propagation, the unlabeled data help to build a more class-separable feature space and generate more meaningful and informative pseudo-labels to train the student model.
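To make the role of label propagation concrete, the sketch below shows a generic graph-based variant: a k-nearest-neighbor graph is built on the feature vectors, and the few known labels are diffused over it until every unlabeled sample receives a pseudo-label. The function names, the cosine similarity, and the values of `k`, `alpha`, and `n_iter` are illustrative assumptions and are not the settings used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def propagate_labels(features, labels, n_classes=2, k=3, alpha=0.9, n_iter=50):
    """Diffuse known labels over a kNN graph and return one label per sample.

    labels[i] is a class index for labeled samples and None for unlabeled
    ones. All hyperparameters are assumptions made for this sketch.
    """
    n = len(features)
    # Sparse affinity: keep only the k most similar neighbors of each node.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = sorted(
            ((max(cosine(features[i], features[j]), 0.0), j)
             for j in range(n) if j != i),
            reverse=True,
        )
        for s, j in sims[:k]:
            w[i][j] = s
    # Symmetrize, then normalize as D^(-1/2) W D^(-1/2).
    w = [[w[i][j] + w[j][i] for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in w]
    s_norm = [[w[i][j] / math.sqrt(deg[i] * deg[j])
               if deg[i] > 0 and deg[j] > 0 else 0.0
               for j in range(n)] for i in range(n)]
    # One-hot rows for labeled samples, all-zero rows for unlabeled ones.
    y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]
    f = [row[:] for row in y]
    for _ in range(n_iter):
        # Mix the neighbors' current scores with the original labels.
        f = [[alpha * sum(s_norm[i][j] * f[j][c] for j in range(n))
              + (1 - alpha) * y[i][c]
              for c in range(n_classes)] for i in range(n)]
    return [max(range(n_classes), key=lambda c: f[i][c]) for i in range(n)]
```

The better the classes are separated in the feature space, the fewer graph edges cross the class boundary, and the more reliable the resulting pseudo-labels become.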

E. Extended Unlabeled Data
To further examine the capability of our proposed method, we conduct another experiment. We evaluate the validation accuracy of the proposed method by using 1000 samples from the training dataset as labeled data (i.e., as elements of X_S) and adding unlabeled data that are not contained in the training dataset. For this purpose, we extract 5000 random patches from the Danmarkshavn data and add them as X_E to the training process in the second step. We present the performance of all the methods in Table V in terms of accuracy, average precision, average recall, and average F1-score. As can be seen, our method performs better than the fully supervised CNN method and the three semisupervised methods: semi-GANs [47], MixMatch [49], and LP-SSL [24]. These results demonstrate that our proposed method can extract and use relevant information from real unlabeled data and learn new information from unseen and unlabeled data. This capability is beneficial in sea ice classification, where the amount of available labeled training data is limited.
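Drawing random unlabeled patches from a scene can be sketched as below. The function name, the single-channel list-of-rows representation, and the patch size are hypothetical; the text only states that 5000 random patches were extracted from the Danmarkshavn data.

```python
import random


def extract_random_patches(scene, patch_size, n_patches, seed=0):
    """Cut square patches at uniformly random positions from a 2-D scene.

    Assumption: the scene is a single-channel image stored as a list of
    equally long rows; the patch size is a free parameter of this sketch.
    """
    height, width = len(scene), len(scene[0])
    if patch_size > height or patch_size > width:
        raise ValueError("patch_size exceeds the scene dimensions")
    rng = random.Random(seed)
    patches = []
    for _ in range(n_patches):
        # Top-left corner chosen so that the patch stays inside the scene.
        top = rng.randint(0, height - patch_size)
        left = rng.randint(0, width - patch_size)
        patches.append([row[left:left + patch_size]
                        for row in scene[top:top + patch_size]])
    return patches
```

The patches carry no labels and would enter the second training step only as part of X_E.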
We also present inference results for four different images from the Danmarkshavn data, obtained with the student model trained on 1000 labeled samples and extended with unlabeled data. In Fig. 5, the left column depicts the original SAR images, the middle column presents the inference results obtained with the supervised learning model, and the last column shows the results obtained with our proposed TSLP-SSL method. Water is shown in blue and ice in white. These inference results again show the capability of our proposed semisupervised method to exploit the information in unlabeled data.

V. CONCLUSION
In this article, we proposed a teacher-student-based label propagation method for sea ice classification. The teacher model and the student model were trained iteratively during the training stage. The teacher model produced features that were extracted before the final classification layer. These features were used during the label propagation process, in which the labels were propagated to the unlabeled data to produce pseudo-labels. Subsequently, the pseudo-labels from the teacher model were fed to the student model during training to find an unbiased decision boundary. We presented both qualitative and quantitative results for our proposed method and the reference methods, and our method outperformed the supervised CNN and the semisupervised LP-SSL models. Our proposed method trained the models efficiently with a very limited number of labeled samples, starting from 15 samples, together with unlabeled samples. It was characterized by the ability to learn useful information from both labeled and unlabeled data. Our method reduced the dependence on labeled samples, which are very time consuming and costly to collect for sea ice analysis. This property makes our method a good fit for the sea ice analysis community, where limited labeled data are available. We also showed that adding more unlabeled samples improved the inference results. Since our method copes with the bias and dependence issues related to the labeled samples, it can be extended to other problem areas where only a very limited number of labeled samples are available.
The dataset we collected consists of different ice types. However, the number of samples for each ice type is limited. Considering the promising performance of our proposed method for binary sea ice classification, in our future work, we will adapt and extend our method to ice type classification.