Unsupervised Single-Scene Semantic Segmentation for Earth Observation

Earth observation data have huge potential to enrich our knowledge about our planet. An important step in many Earth observation tasks is semantic segmentation. Generally, a large number of pixelwise labeled images are required to train deep models for supervised semantic segmentation. However, strong intersensor and geographic variations impede the availability of annotated training data in Earth observation. In practice, most Earth observation tasks use only the target scene, without assuming the availability of any additional scene, labeled or unlabeled. Keeping in mind such constraints, we propose a semantic segmentation method that learns to segment from a single scene, without using any annotation. Earth observation scenes are generally larger than those encountered in typical computer vision datasets. Exploiting this, the proposed method samples smaller unlabeled patches from the scene. For each patch, an alternate view is generated by simple transformations, e.g., addition of noise. Both views are then processed through a two-stream network, and the weights are iteratively refined using deep clustering, spatial consistency, and contrastive learning in the pixel space. The proposed model automatically segregates the major classes present in the scene and produces the segmentation map. Extensive experiments on four Earth observation datasets collected by different sensors show the effectiveness of the proposed method. The implementation is available at https://gitlab.lrz.de/ai4eo/cd/-/tree/main/unsupContrastiveSemanticSeg.


I. INTRODUCTION
Rapid development of remote sensing technologies has drastically increased the number of Earth observation sensors acquiring images with different spatial, spectral, and temporal resolutions [1], [2]. A large volume of unlabeled images is currently available for characterizing various objects on the Earth's surface. Automatic analysis of such images is useful in applications including urban monitoring [3], disaster management [4], [5], agricultural monitoring [6], and monitoring the exploitation of natural resources [7].
An important step in understanding images is semantic segmentation, which assigns each pixel in an image/scene to a meaningful category or class. This is true for both computer vision and Earth observation images [8]. Research on supervised image segmentation methods has received significant attention in the era of deep learning, which has outperformed previous methods [9]–[11]. The superior performance of deep learning, especially convolutional neural networks (CNNs), for semantic segmentation can be attributed to their capability to learn spatial features from large volumes of labeled data. Most computer vision problems can use crowdsourcing [12] to collect large volumes of labeled data. However, collecting labeled data in Earth observation is significantly more challenging due to several factors, including the domain expertise required, the variation among Earth observation sensors, and the disparity among applications. Moreover, active (e.g., synthetic aperture radar) and lower resolution optical images are visually unintelligible, making them difficult for volunteers on a crowdsourcing platform to label. Thus, the applicability of supervised segmentation to Earth observation images has been limited by the lack of labeled data [13]. Moreover, many Earth observation applications assume the presence of only the target scene and no additional scene [14], [15]. Analyzing only the target scene can be especially useful for quick disaster mapping when there is little time to collect additional unlabeled images.
Recently, unsupervised and self-supervised learning have gained significant attention in machine learning. Such approaches have been devised for different problems, e.g., image clustering [16], video analysis [17], and change detection in Earth observation images [18]. While most deep-learning-based semantic segmentation methods are supervised [19], [20], unsupervised semantic segmentation methods exploiting deep clustering have been proposed in the literature [21]. Deep-clustering-based approaches have also been extended to bitemporal Earth observation image analysis [3]. As such, self-supervised learning can potentially be used to learn from a single unlabeled scene.

[Fig. 1. A batch of patches is extracted from the training scene. The model is trained on this batch using deep clustering. Furthermore, this batch is simply transformed and shuffled to form two other batches, the first of which must be similar to the original batch in the feature space and the other dissimilar to it.]
Earth observation scenes generally capture a geographic area and are significantly large in comparison to images in a typical computer vision dataset. As an example, scenes in the International Society for Photogrammetry and Remote Sensing (ISPRS) semantic labeling dataset [22] are up to 6000 × 6000 pixels. Due to the repetitive nature of geographic objects, an Earth observation scene generally captures many instances of the same objects. Based on this, we propose to sample smaller patches from a large scene. When randomly sampled, many such patches essentially represent the same object category (e.g., buildings). By taking a batch of patches, an augmented version can be conveniently obtained by data transformation, e.g., noise addition. This allows us to process the patches using a two-stream network similar to contrastive learning [23] and other multiaugmentation methods [24]. By jointly using pixelwise deep clustering [25], similarity between multiple augmentations of the same input [24], and contrastive learning [23], we propose a self-supervised method to simultaneously train a network and assign pixelwise labels to an Earth observation scene. The conceptualization behind the proposed method is shown in Fig. 1. The key contributions of our work are as follows.
1) We propose a self-supervised segmentation method that does not require any annotated data and can be trained using a single unlabeled Earth observation scene, without requiring any additional pool of unlabeled data.
2) We use the concept of pixelwise deep clustering [25] to automatically discern different classes from a single remote sensing scene. We further use multiple augmentations of the same input [24] to ensure that similar inputs produce similar segmentation maps, and we use the concept of contrastive learning [23] to ensure that dissimilar inputs produce dissimilar outputs.
3) Through a set of experiments using inputs of different sensors and resolutions, we show that the proposed method is able to automatically discern important Earth observation classes. This implies that, irrespective of the exact application, our method can serve as a precursor to further analysis in most such applications.

II. RELATED WORK

A. Deep Segmentation for Earth Observation
Popular deep-learning-based segmentation architectures include fully convolutional networks (FCNs) [19], U-Net [26], SegNet [27], and dilated convolutional models such as DeepLab [28]. For Earth observation images, several supervised segmentation algorithms have been proposed using these architectures [8], [29]–[35]. However, these methods necessitate a large amount of training data for supervised learning. To deal with the lack of training data, Hua et al. [13] proposed a semantic segmentation approach that uses spatially sparse annotations to train the model. In [3], an unsupervised deep clustering algorithm is introduced for multitemporal Earth observation segmentation. To effectively capture domain knowledge, Li et al. [36] combined a deep learning module with knowledge-guided ontology reasoning.
Compared with optical images, SAR image segmentation is more challenging due to its sensitivity to noise [37]. Traditional SAR segmentation methods rely on superpixel merging [38], [39]. There are very few methods using deep learning for SAR image segmentation [40]. Wang et al. [40] noted that training an effective deep model for SAR semantic segmentation requires high-quality ground-truth data, which are not always available.

B. Unsupervised and Self-Supervised Learning
The practicality of supervised methods is limited by the difficulty of acquiring labeled data. Unsupervised learning focuses on alleviating these limitations by learning semantic representations from unlabeled images without relying on predefined annotations. Clustering is an extensively studied unsupervised learning topic. Extending this, deep clustering [16] jointly optimizes the parameters of a deep network and the cluster assignments of the data in the feature space. Deep clustering and its variants [41]–[44] divide a set of unlabeled training inputs into groups according to inherent latent semantics. Some self-supervised approaches use pretext tasks for learning semantic features [45], [46]. Popular pretext tasks include image rotation [45], jigsaw transformation [47], and rearranging time series [48]. Capitalizing on the availability of positive and negative pairs, contrastive methods aim to spread the representations of negative pairs apart while bringing the representations of positive pairs closer [23], [49]. Bootstrap Your Own Latent (BYOL) [24] further eliminates the necessity of negative pairs by using augmented instances of the input. Several works have shown that self-supervised learning can produce good representations even when available data are scarce [50]. Weakly supervised [48], [51], [52], unsupervised [53], and self-supervised learning [54], [55] have also been used in many remote sensing applications, e.g., cloud detection [52], change detection [53], and scene classification [54]. Tao et al. [54] used self-supervised learning for classification with limited labels. Yue et al. [56] used self-supervised learning for hyperspectral scene classification.

C. Unsupervised Deep Segmentation
Aligned with the increased interest in unsupervised methods, efforts toward reducing supervision have gained traction in semantic segmentation [21], [57]. A simple yet effective approach is deep clustering in the pixel space [21], [58]. In [21], a lightweight architecture is used for single-image segmentation, and the output/label is obtained by arg-max classification of the final layer. Predicted pixel labels and the network representation are adjusted iteratively. Pixel-level feature clustering using invariance and equivariance (PiCIE) [25] further exploits geometric consistency in addition to deep clustering for unsupervised segmentation.
Our work is closely related to the above-mentioned unsupervised methods. Like [21] and [25], it exploits pixelwise deep clustering. Our method relies on multiple augmentations of the same input, similar to BYOL [24]. Similar to [23], the method uses contrastive learning. The method focuses on a single scene, further showing the potential of deep self-supervised learning in data-constrained situations, similar to [50]. While works on self-supervised remote sensing classification [54] and self-supervised hyperspectral scene classification [56] still use some labeled samples, our method does not use any labeled samples.

III. METHODOLOGY
To describe the proposed idea, let us denote the available unlabeled scene/image as X and its transformed version as X̄, having the same spatial dimensions of R × C. Although any transformation T_X could be useful, we use simple transformations such as the addition of Gaussian noise. The transformed version can be taken as an alternative view of the same scene. This allows us to formulate the semantic segmentation task at hand as a self-supervised problem, which typically exploits the idea of iteratively reducing the gap among the feature representations of multiple views of the same image without using any labeled data. Both X and X̄ are then processed through a two-stream network, and the weights are iteratively refined using pseudo labels generated via deep clustering [16] and a contrastive learning strategy [23] in the pixel space to automatically segregate the major classes present in the scene. The proposed method produces the segmentation map without using any explicit labels, as detailed in Sections III-A–III-F.
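To make the view generation concrete, the following is a minimal sketch assuming additive Gaussian noise as the transformation T_X; the noise level sigma is an illustrative choice, not a value prescribed by this work.

```python
import torch

def make_alternate_view(scene: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Generate the transformed view X-bar by adding Gaussian noise.

    scene: tensor of shape (channels, R, C), assumed normalized to [0, 1].
    sigma: noise standard deviation (illustrative value; tune per sensor).
    """
    noise = sigma * torch.randn_like(scene)
    return (scene + noise).clamp(0.0, 1.0)
```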

A. Proposed Network Architecture
To enable feature learning, a Siamese-like two-stream network architecture is proposed that takes as input patches of size R′ × C′ (R′ < R and C′ < C) extracted from X and X̄. Each training batch is formed by drawing B patches from X, denoted as X = {x_1, . . . , x_B}, and the spatially corresponding patches from X̄, symbolized as X̄ = {x̄_1, . . . , x̄_B}. Since the bispatial patches can be seen as multiple views of the same location, semantic information can be inferred from them using the proposed Siamese-like architecture [59]. The two branches have projection modules f_X and f_X̄ that produce learned feature representations for the original and transformed images, respectively. These feature representations are then fed to the subsequent prediction modules h_X and h_X̄ to obtain the respective activation volumes. It is important to note that the projection modules do not share weights (hence, two-stream), while the prediction modules do share weights (i.e., h_X = h_X̄) and are therefore denoted simply as h. The projection and prediction modules consist of L_1 and L_2 (in our case L_2 = 1) convolutional layers, respectively, and the total number of layers L is the sum of L_1 and L_2. Each convolutional layer is followed by a rectified linear unit (ReLU) activation and a batch normalization layer. The input spatial size is preserved in the output, as no pooling or striding is used. The projection module uses convolution filters of size 3 × 3, whereas the prediction module uses 1 × 1 filters. The K kernels in the final layer group or cluster the input pixels into K classes.
The simplified network architecture for a five-layer network (L = 5) is shown in Table I. The reasoning behind using such an ad hoc lightweight architecture is as follows.
1) Given that the training mechanism is unsupervised and training patches are sampled from a single scene, we have a limited number of patches. Thus, using a large network is ineffective in such a case. This is further supported by previous work on single-image segmentation [21] that also used a lightweight network.
2) Given that most Earth observation images have much coarser resolution compared with those in computer vision, small networks using only a few convolution layers can still capture the required spatial context.
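For concreteness, a minimal PyTorch sketch of such a two-stream network is given below. Only the structure follows the description above (unshared projection modules f_X and f_X̄ of 3 × 3 convolutions, each followed by ReLU and batch normalization, and a shared 1 × 1 prediction module h with K kernels, without pooling or stride); the channel width of 64 is an assumed value, and with depth = 4 the sketch corresponds to the five-layer case (L = 5).

```python
import torch
import torch.nn as nn

def projection_module(in_channels: int, width: int = 64, depth: int = 4) -> nn.Sequential:
    """L_1 = depth 3x3 conv layers, each followed by ReLU and batch norm."""
    layers, c = [], in_channels
    for _ in range(depth):
        layers += [nn.Conv2d(c, width, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True),
                   nn.BatchNorm2d(width)]
        c = width
    return nn.Sequential(*layers)

class TwoStreamNet(nn.Module):
    def __init__(self, in_channels: int, k_classes: int):
        super().__init__()
        self.f_x = projection_module(in_channels)     # stream for X (weights not shared)
        self.f_xbar = projection_module(in_channels)  # stream for the transformed view
        self.h = nn.Conv2d(64, k_classes, kernel_size=1)  # shared 1x1 prediction module

    def forward(self, x: torch.Tensor, x_bar: torch.Tensor):
        # Spatial size is preserved: no pooling or stride is used.
        return self.h(self.f_x(x)), self.h(self.f_xbar(x_bar))
```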

B. Pseudo Label Activations
The patches x_b and x̄_b refer to patches extracted from the same location in X and X̄, respectively. The outcome of the network for x_b and x̄_b can be represented as

y_b = h(f_X(x_b)) and ȳ_b = h(f_X̄(x̄_b)).

Here, each pixel in this tensor can be viewed as a K-dimensional vector of activations. If we denote any generic ith pixel in y_b as y_b^i, then we can obtain the prediction of the semantic label by simply selecting the kernel in y_b^i that has the maximum value. Based on this simple intuition, we formulate the pseudo label assignment as the process of computing

c_b^i = argmax_k y_b^i(k),

i.e., finding the feature having the highest value in the K-dimensional pixel activation vector y_b^i.

C. Pseudo Label Loss Objective
The computed pseudo label c_b^i is thus considered as the label of the prediction y_b^i. This enables us to quantify the per-pixel cross-entropy loss ℓ_b^i between y_b^i and c_b^i. ℓ_b^i is aggregated (by computing the mean) over the pixels in x_b and the patches in the batch to obtain the loss L_p. L_p is used to adjust the weights of h and f_X. Similarly, L̄_p is computed from x̄_b (b = 1, . . . , B) and used to adjust the weights of h and f_X̄. Using L_p and L̄_p to iteratively adjust the weights of the network, the proposed method simulates deep clustering in the pixel space.
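The pseudo-label assignment and loss of Sections III-B and III-C can be sketched as follows, assuming activation volumes of shape (B, K, R′, C′); the argmax labels are detached, reflecting that they act as fixed targets within an update.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(y: torch.Tensor) -> torch.Tensor:
    """y: activation volume of shape (batch, K, H, W).

    c_b^i = argmax over the K activations of each pixel; the loss is the
    mean per-pixel cross-entropy between y and its own argmax labels.
    """
    c = y.argmax(dim=1).detach()   # (batch, H, W) pseudo labels
    return F.cross_entropy(y, c)   # mean over pixels and patches
```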

D. Spatial Consistency
The bispatial patches x_b and x̄_b refer to the same location, and hence to the same objects; therefore, the features computed for such a bispatial patch pair should be similar. To ensure this, we compute the per-pixel absolute error loss ℓ_b^i as the absolute difference between y_b^i and ȳ_b^i. The mean of ℓ_b^i over all pixels of all patches in the batch gives the loss term L_s, which encourages the pixels in the bispatial patches x_b and x̄_b to have the same label. We note that the spatial consistency criterion is conceptually similar to bringing closer multiple views of an input, as in some self-supervised learning methods [24]. However, differently from them, the spatial consistency loss aims to reduce the representation gap at the pixel level instead of the image level.
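In code, the spatial consistency term reduces to a mean absolute error between the two activation volumes; a minimal sketch under the same assumptions as above:

```python
import torch

def spatial_consistency_loss(y: torch.Tensor, y_bar: torch.Tensor) -> torch.Tensor:
    """L_s: mean absolute difference between the pixelwise activations
    of the two views of the same patches."""
    return (y - y_bar).abs().mean()
```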
A pitfall of the spatial consistency loss is that merely trying to reduce the representation gap between x_b and x̄_b may yield a trivial solution that simply produces the same output for all pixels.

E. Representation Learning From Disparity
The spatial consistency loss encourages the features computed for a paired bispatial patch to be similar. To balance the overall training procedure, we also use a strategy similar to contrastive learning to ensure that the network learns different feature representations for dissimilar patches. To create dissimilar patch pairs, we randomly shuffle the batch of patches X̄ to produce X̄′. This ensures that the paired patches in X and X̄′ are indeed dissimilar. These dissimilar bispatial patches are then used to make the model learn disparate features for x_b and x̄′_b. Specifically, ℓ_b^i is computed as the negative absolute error between y_b^i and ȳ′_b^i, and its mean over all pixels and patches gives the loss term L_c.

[Algorithm 1 (training procedure) appears here; its recoverable steps are: obtain X̄′ by randomly shuffling X̄, then for j ← 1 to J and for each b ∈ B, update the network weights using the losses defined above.]
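A sketch of the contrastive term under the same assumptions is given below. For brevity, the shuffle is applied to the activation volume of the transformed batch rather than to the patches themselves; since the network processes each patch of a batch identically, the effect is the same.

```python
import torch

def contrastive_shuffle_loss(y: torch.Tensor, y_bar: torch.Tensor) -> torch.Tensor:
    """L_c: negative mean absolute error between y and a batch-shuffled y_bar.

    Shuffling the batch dimension breaks the spatial pairing, so the pairs
    are (almost surely) dissimilar; minimizing the negative error spreads
    their representations apart.
    """
    perm = torch.randperm(y_bar.size(0))   # random shuffle of the batch
    return -(y - y_bar[perm]).abs().mean()
```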

F. Progressive Network Training
The proposed mechanism for network training is shown in Fig. 2 and Algorithm 1. Initially, all the trainable weights W_1, . . . , W_L corresponding to the L layers in the network are initialized using the He initialization strategy proposed in [60]. Alternatively, a pretrained network could have been used to initialize the weights. However, we note that Earth observation deals with a variety of sensors with different characteristics, and a suitable pretrained network is not always available. This motivates us to exclude importing weights from pretrained networks.
For each batch of data, training is performed for J iterations, during which the weights are iteratively optimized using stochastic gradient descent with momentum [61]. Sampling all possible patches from the training scene is equivalent to one epoch, and the training process is performed for a total of I epochs. Since the pseudo label losses (L_p and L̄_p) and the other two losses (L_s and L_c) have values in different ranges, the first epoch is optimized with the sum of L_p and L̄_p only, while from the second epoch onward the sum of all four losses (L_p, L̄_p, L_s, and L_c) is used. This yields a balanced training process that takes into account coherent cluster formation, spatial feature consistency, and feature dissimilarity for unpaired patches.
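Combining the loss sketches above, a condensed sketch of the progressive training loop follows; the batch sampler is a hypothetical helper, and the momentum value is an assumed setting, as the text specifies only the use of SGD with momentum.

```python
import torch

def train(model, sample_batches, epochs=2, iters=50, lr=0.001, momentum=0.9):
    """Progressive training: the first epoch uses only the pseudo-label
    losses; later epochs add the spatial consistency and contrastive terms."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for epoch in range(epochs):                    # I epochs in total
        for x, x_bar in sample_batches():          # paired patch batches
            for _ in range(iters):                 # J iterations per batch
                y, y_bar = model(x, x_bar)
                loss = pseudo_label_loss(y) + pseudo_label_loss(y_bar)
                if epoch > 0:                      # from the second epoch onward
                    loss = loss + spatial_consistency_loss(y, y_bar) \
                                + contrastive_shuffle_loss(y, y_bar)
                opt.zero_grad()
                loss.backward()
                opt.step()
```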

IV. EXPERIMENTAL RESULTS

A. Datasets
We use the following datasets for experimental validation.
1) The Vaihingen dataset, an urban semantic segmentation benchmark [22], [62] acquired over Vaihingen, Germany, with 9 cm/pixel resolution. The images in the dataset are composed of three bands, near infrared (NIR), red (R), and green (G), and each image covers approximately 1.38 km². The images show six land-cover classes: building, impervious surface, low vegetation, tree, car, and background. Following previous works [13], we use image IDs 11, 15, 28, 30, and 34 (i.e., five test scenes) for testing. We train our unsupervised model on a single scene, image ID 1.
2) The Zurich Summer dataset [63], acquired using the QuickBird sensor over Zurich, Switzerland. The images have a spatial resolution of 0.62 m/pixel. Following previous works [13], we use the NIR, R, and G bands in our experiments. Eight urban classes are present: roads, buildings, trees, grass, bare soil, water, railways, and swimming pools. Image IDs 16–20 (i.e., five test scenes) are used for testing, while we train our unsupervised single-scene model on image ID 1.
3) A polarimetric synthetic aperture radar (PolSAR) scene [64] showing an area in Germany comprising four classes [65]. Being characterized by speckle noise and complex backscattering mechanisms at the junctions of different land covers, PolSAR images are significantly different from optical images. Thus, the experiment on this dataset illustrates the application of the proposed method beyond typical optical images. Furthermore, due to their low visual saliency, PolSAR scenes are challenging to label, and there are not many labeled PolSAR datasets. This further demonstrates the application of the proposed single-scene unsupervised method in a case where labels are actually scarce. This dataset [65] was acquired by the ESAR L-band sensor. ESAR is an airborne SAR system of the German Aerospace Center (DLR). It captures a semiurban area in Germany (Oberpfaffenhofen, Bavaria) covering 1300 × 1200 pixels. The reference information for the area was obtained by manual labeling based on aerial images of the same area in Google Earth. The entire image is classified into four categories: built-up areas (in blue), woodland (in green), open areas (in yellow), and others (in dark blue). The classes are unbalanced, with far more open areas than other classes. Besides, built-up areas and woodland appear similar in the PolSAR image. Thus, segmentation of this scene is a challenging task for unsupervised methods.
4) Fire disturbance is recognized as an essential climate variable (ECV), and burned area is its primary descriptive variable [66]. Here, we show the segmentation result produced by the proposed method on a burned area in an Alpine region in northern Italy [67]. The fire event took place on February 27, 2019. We applied our segmentation method on a postfire image acquired on March 3, 2019, by the Sentinel-2 sensor (10 m/pixel spatial resolution and 13 spectral bands), part of the Copernicus program of the European Space Agency. The goal of this study is to investigate whether the proposed method can identify the burned area as a separate cluster from the postevent image.

The proposed unsupervised training can be performed either on a scene different from the test scenes (as in the first two cases above) or on the same scene as the test scene (as in the third and fourth cases above).

B. Compared Methods
Our work is one of the first attempts toward obtaining multiclass segmentation in an unsupervised way by training on a single Earth observation scene. Thus, we exclude entirely supervised methods from the comparison and choose the following unsupervised/weakly supervised methods.
1) FEature and Spatial relaTional regulArization (FESTA) [13], a weakly supervised method proposed in the context of semantic segmentation of high-resolution Earth observation images. The same training scene is used for FESTA as for our method; however, our method assumes no annotated points, while FESTA assumes the presence of some annotated points. We design two variants of FESTA: "FESTA 5 points," which considers five labeled points in the training scene, and similarly "FESTA 10 points."
2) An unsupervised deep-clustering-based approach obtained by adopting [16] in the pixel space. The same training scene is used as for the proposed method, and this method assumes no annotated data, as in the proposed approach.
3) Deep clustering combined with image reconstruction as an additional pretext task. This model uses two outputs: one optimized for clustering and the other optimized to reconstruct the input image [68].
4) Online deep clustering (ODC), derived from [41].
5) An unsupervised method that simply extracts pixelwise features from the second convolutional layer of VGG16 [69] and applies k-means clustering on the extracted features. This particular layer is chosen since, beyond it, the spatial size reduces and pixelwise feature extraction is therefore not possible.

Since FESTA assumes the presence of labeled pixels in the training scene and we use the same scene for training/testing in the case of the PolSAR scene, we exclude the comparison to FESTA for that scene. The burned area scene is evaluated for change detection and hence compared with the relevant methods in [67].

C. Settings
The training process of the proposed method is performed using I = 2 and J = 50. The number of kernels in the final layer (K) is set slightly larger than the number of target classes in the dataset, e.g., K = 8 for the Vaihingen dataset and K = 12 for the Zurich dataset. R′ = C′ = 224 is used to sample patches from the training scene. A learning rate of 0.001 is used for training.
In an unsupervised clustering setting, it is not possible to automatically discern the names of the classes. Hence, each class in the obtained segmentation is assigned to the reference class with which it has the highest overlap: iterating over the reference classes, each is greedily matched to the unassigned predicted class with the highest intersection/overlap, and any predicted class left unmatched is assigned to background. This procedure is shown in Algorithm 2.
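A sketch of this greedy matching is given below; the function and argument names are illustrative, not those of the released implementation.

```python
import numpy as np

def match_clusters(pred, ref, ref_classes, background=0):
    """Greedily match predicted clusters to reference classes by overlap.

    For each reference class, the unassigned predicted cluster with the
    highest pixel overlap is taken as its match; any predicted cluster
    left unmatched is mapped to the background class.
    """
    remaining = set(np.unique(pred).tolist())
    mapping = {}
    for s in ref_classes:
        if not remaining:
            break
        overlaps = {t: np.sum((pred == t) & (ref == s)) for t in remaining}
        best = max(overlaps, key=overlaps.get)
        mapping[best] = s
        remaining.discard(best)
    for t in remaining:                    # leftover clusters -> background
        mapping[t] = background
    return mapping
```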
The results are reported as the F1 score and intersection over union (IoU). These indices are computed for each target class, and the mean is computed over all classes. We also report accuracy; note, however, that accuracy may be misleading, as the constituent classes are imbalanced and merely learning a single class can lead to seemingly good accuracy.
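For completeness, a minimal sketch of the per-class F1 and IoU computation (applied after the cluster-to-class matching above); this is a straightforward implementation of the standard definitions, not the exact evaluation code.

```python
import numpy as np

def mean_f1_iou(pred, ref, classes):
    """Per-class F1 and IoU from pixelwise label maps, averaged over classes."""
    f1s, ious = [], []
    for c in classes:
        tp = np.sum((pred == c) & (ref == c))
        fp = np.sum((pred == c) & (ref != c))
        fn = np.sum((pred != c) & (ref == c))
        f1s.append(2 * tp / max(2 * tp + fp + fn, 1))
        ious.append(tp / max(tp + fp + fn, 1))
    return float(np.mean(f1s)), float(np.mean(ious))
```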
For the proposed method, the segmentation results are reported as an average of ten runs.

D. Results on the Vaihingen Dataset

1) Result Variation With Respect to Parameters: Table II shows the performance variation of the proposed method as the number of epochs I is varied while fixing the other parameters. We observe that the performance improvement beyond I = 2 is not significant. Hence, we use I = 2 in our subsequent experiments. To further understand this, we visualize the evolution of the losses in Fig. 3. L_s + L_c keeps decreasing slightly beyond I = 2; however, it shows an oscillatory behavior beyond that, which further indicates why the optimum result is already reached at I = 2. Table III shows the performance variation of the proposed method as the number of kernels in the final layer (K) is varied. We recall that the value of K implies the number of classes into which we want to cluster the data. The best performance is obtained for K = 8, which is slightly larger than the actual number of classes in the Vaihingen dataset (six). Table IV shows the performance variation of the proposed method as the number of layers (L) is varied. The result confirms that only a few layers are sufficient for the proposed method, and further increasing the number of layers may not improve the performance.

2) Ablation Study of the Loss Function: Table V tabulates the ablation study of the loss terms.

3) Comparison to Existing Methods: The quantitative result is shown in Table VI. The proposed method outperforms FESTA 5 points, deep clustering, deep clustering with image reconstruction, ODC, and VGG16 + kMeans with respect to all three indices, and outperforms FESTA 10 points with respect to two out of the three indices. We recall that FESTA is a semisupervised method that uses a few annotated points; the proposed method still outperforms it, which shows its efficacy. The segmentation map corresponding to image ID 11 is visualized in Fig. 4. The three columns show the input image, the reference segmentation, and the obtained segmentation, in that order. We observe that dominant classes like buildings (blue) and impervious surfaces (white) are clearly detected by the proposed method. However, it assigns the spectrally similar low vegetation and tree classes to the same cluster. The classwise F1 scores are 0.66, 0.48, 0.40, 0.64, and 0.08 for impervious surface, buildings, low vegetation, trees, and cars, respectively. This shows that the proposed unsupervised method is capable of identifying the major classes, while its scope is limited for visually inconspicuous classes like cars.

E. Result on Zurich Dataset
The quantitative result of the proposed method versus the compared methods is shown in Table VII. The proposed method outperforms all the compared methods in terms of mean F1 and mean IoU, again showing its superiority even against the semisupervised FESTA. The segmentation map for image ID 17 is visualized in Fig. 5. Similar to the observation for Vaihingen, we observe that the dominant classes are clearly detected by the proposed method. However, the performance deteriorates for the nondominant classes.

F. Result on PolSAR Scene
The Pauli-color-coded input, the reference segmentation map, and the segmentation produced by the proposed method are visualized in Fig. 6. Despite the different nature of PolSAR data, the proposed method is able to identify the major classes in the target scene. The quantitative result is tabulated in Table VIII, which shows the superiority of the proposed method over the other unsupervised methods.

G. Result on Sentinel-2 Burned Area Scene
Our segmentation method is applied on the postchange image (acquired on March 3, 2019). The target area is significantly complex, showing mountains, some snow, and forest in addition to the burned area. Taking the cluster that best matches the burned area as the positive class and the rest as the negative class, we obtain a binary segmentation map, visualized in Fig. 7. It is evident that the proposed method can segregate the target burned area as one class with few false alarms. The method obtains an accuracy of 97.19%.
The result obtained by the proposed method is superior or comparable to the change detection methods compared in [67] (worst accuracy: 76.16%; best accuracy: 99.0%), even though the change detection methods use both pre- and postchange images, while the proposed method uses only the postchange image.

H. Comments on Computation Time
The proposed unsupervised training on a single scene can be completed in reasonable time; e.g., it takes approximately 195 s to train on Vaihingen image ID 1 using a machine equipped with a GeForce RTX 3090. Using the same hardware and the same scene, deep clustering [16] takes 280 s and ODC [41] takes 295 s. VGG + kMeans does not involve a training phase. FESTA takes considerably more time than the proposed method (approximately 10 min).

I. Summary of Observations
The proposed method is inexpensive, both in terms of annotation (none needed) and computation time. In addition to clustering in the pixel space, the proposed method effectively exploits spatial consistency and a contrastive loss, which is evident from the fact that it outperforms deep clustering. While the proposed method's effectiveness in automatically segmenting small classes is limited, it can effectively segregate the major classes, as seen on all the datasets. This suits most Earth observation applications where the task is to quickly find one or two classes of interest, e.g., buildings during earthquake disaster management or burned area during postfire operations.

V. CONCLUSION
We proposed an unsupervised single-scene segmentation method that combines several recently popular ideas from unsupervised and self-supervised learning, e.g., deep clustering in the pixel space, multiple views/augmentations, and contrastive learning. The experimental results on four different Earth observation datasets show that the method can effectively learn dominant classes, e.g., buildings in the Vaihingen dataset. On the other hand, the effectiveness of the method is limited for inconspicuous classes. However, given the strong constraints under which the method works (only a single unlabeled scene for training), learning such classes is certainly challenging. A potential direction for extending this work is training a weakly supervised model given a few labeled pixels from only such inconspicuous classes. The proposed method complements supervised models by providing a quick unsupervised way of creating a reasonable segmentation map. In the future, we will experiment on images acquired by other popular sensors in Earth observation, e.g., light detection and ranging (LiDAR).