Semantic Segmentation of Remote Sensing Images With Self-Supervised Multitask Representation Learning

Wenyuan Li, Hao Chen, and Zhenwei Shi, Member, IEEE

Abstract-Existing deep learning-based semantic segmentation methods for remote sensing images require large-scale labeled datasets. However, the annotation of segmentation datasets is often too time-consuming and expensive. To ease the burden of data annotation, self-supervised representation learning methods have emerged recently. However, semantic segmentation methods need to learn both high-level and low-level features, while most existing self-supervised representation learning methods focus on only one level, which limits the performance of semantic segmentation for remote sensing images. To solve this problem, we propose a self-supervised multitask representation learning method to capture effective visual representations of remote sensing images. We design three different pretext tasks and a triplet Siamese network to learn high-level and low-level image features at the same time. The network can be trained without any labeled data, and the trained model can be fine-tuned with an annotated segmentation dataset. We conduct experiments on the Potsdam and Vaihingen datasets and the cloud/snow detection dataset Levir_CS to verify the effectiveness of our method. Experimental results show that our proposed method can effectively reduce the demand for labeled datasets and improve the performance of remote sensing semantic segmentation. Compared with recent state-of-the-art self-supervised representation learning methods and the most commonly used initialization methods (such as random initialization and ImageNet pretraining), our proposed method achieves the best results in most experiments, especially with few training data. With only 10% to 50% of the labeled data, our method can achieve performance comparable with random initialization. Codes are available at https://github.com/flyakon/SSLRemoteSensing.

Index Terms-Cloud detection, remote sensing images, self-supervised representation learning, semantic segmentation.



I. INTRODUCTION
THE rapid development of remote sensing technology has greatly widened the scope of exploring the earth. Satellite images have been widely used in resource exploration, land census, natural disaster monitoring, etc. Semantic segmentation [1]-[3] (also called pixel-level classification) plays a key role in the analysis of remote sensing images, and methods based on fully convolutional networks (FCN) [4]-[6] have brought a great breakthrough to the semantic segmentation [7]-[12] of remote sensing images.
Despite the great success, recent FCN-based semantic segmentation methods for remote sensing images still rely on training with a large number of manually annotated data. Although some annotated datasets are available, most remote sensing data from the Internet are not labeled in a way that suits the semantic segmentation task. These unlabeled data therefore contribute nothing to improving the semantic segmentation of remote sensing images. The purpose of this article is to design an effective pretraining method with unlabeled data to improve the performance of remote sensing semantic segmentation.
The most commonly used pretraining paradigm is ImageNet [13] pretraining. However, it is time-consuming and laborious to construct a large-scale remote sensing dataset like ImageNet to pretrain networks, as the annotation of remote sensing data may rely heavily on professional domain knowledge. In addition, considering that remote sensing images increasingly show multisource and multiresolution characteristics, even a large-scale remote sensing dataset could not meet the requirements of downstream tasks for all remote sensing images obtained from various satellites.
Self-supervised representation learning is another recently emerged research topic that learns effective visual representations of images by taking advantage of self-supervised learning ideas [14]-[16]. It is an elegant subset of unsupervised learning, which obtains supervision information from the data itself during training. Therefore, it does not need any labeled data for training and can possibly learn from any scale of unlabeled data. In self-supervised representation learning, a set of pretext tasks is usually designed to explore the relationships between image patches or image transformations. Through the pretext tasks, the networks can be trained with unlabeled data and a pretrained model can be obtained. Then downstream tasks such as semantic segmentation can be fine-tuned on this pretrained model to obtain better results. According to the type of supervision acquired, the pretext tasks in previous self-supervised representation learning methods can be divided into three categories: 1) image level pretext tasks [17]-[35], 2) patch level pretext tasks [36]-[40], and 3) pixel level pretext tasks [41]-[47]. However, remote sensing images have random viewing angles and no specific salient areas, but a more complex hierarchical structure and more abundant background information. The above self-supervised methods designed for natural images do not consider these characteristics of remote sensing images and may not work properly.
Moreover, considering that remote sensing images usually show more high-frequency details and hierarchical structure compared to natural images, the pretraining method should extract both high-level and low-level features. Semantic segmentation in particular requires that the networks take these two aspects into account at the same time. However, current methods for natural images and remote sensing images only consider one of them. To solve this problem, this article proposes a novel self-supervised representation learning method for remote sensing semantic segmentation. Our method is designed to focus on both high-level and low-level features. In our method, we design three different pretext tasks for pretraining, including an image inpainting task, an augmentation transform prediction (ATP) task, and a contrastive learning task. We design a triplet Siamese network with three output branches. The backbone network shares the same set of image features and network parameters across branches. Each output branch corresponds to a different pretext task and is trained with its own loss function. The total loss for training the whole network is a multitask loss function that combines the losses of the three pretext tasks.
By designing the image inpainting task, we aim to help networks learn low-level representations. We propose a moderate approach to construct occluded areas by randomly transforming the in-box areas with image rotation, flip, color transformation, etc. By designing the ATP task and the contrastive learning task, we aim to help networks learn high-level representations. For the ATP task, since remote sensing images have no obvious imaging perspective, we build Siamese networks and take an image and its transformation as the input to predict the type of transformation.
After pretraining, the networks can be easily applied to semantic segmentation by fine-tuning on labeled datasets. In the experimental part, we use the Potsdam dataset and Vaihingen dataset [48] to verify the effectiveness of our method. In addition, cloud/snow detection can also be regarded as a semantic segmentation task, so we select the Levir_CS dataset [49] to verify our method on the cloud/snow detection task. The results show that our method outperforms other recent self-supervised representation learning methods [24], [32], [33] on the semantic segmentation task. Our method achieves better results than ImageNet pretrained models, and the best results with limited training data. In addition, with only 10% to 50% of the labeled data, our method can achieve performance comparable with random initialization, which shows that our method can effectively reduce the demand for annotated data.
The contributions of this article are summarized as follows.

1) We propose a self-supervised representation learning method for remote sensing semantic segmentation. A multitask loss function is designed to guide the networks to learn both high-level and low-level features at the same time. A large number of unlabeled remote sensing images can thus be effectively used to train the networks and improve the performance of the semantic segmentation task.

2) In the remote sensing semantic segmentation task, we achieve better results than models with ImageNet pretraining and other recent self-supervised pretraining methods.

3) Compared with random initialization, our method achieves comparable performance with only 50% of the labeled data on the Vaihingen dataset and 20% on the Potsdam dataset, while only 20% of the labeled data for cloud detection and 10% for snow detection are needed to achieve comparable performance.

The rest of this article is organized as follows. Section II reviews related work on self-supervised representation learning. In Section III, we give a detailed introduction of our proposed method, including the network configuration, the multitask loss function, and implementation details. In Section IV, the experimental datasets and experimental results are introduced. Discussion and conclusions are drawn in Sections V and VI.

II. RELATED WORK

A. Self-Supervised Representation Learning
Self-supervised representation learning is a recently emerged research topic that learns effective visual representations of images by taking advantage of self-supervised learning ideas [14]-[16]. It obtains supervision information from the data itself and trains the networks without manual annotations by designing a series of pretext tasks. The trained models can then be used to improve the performance of downstream tasks. According to the type of supervision acquired, the pretext tasks in previous self-supervised representation learning methods can be divided into three categories: 1) image level pretext tasks [17]-[35], 2) patch level pretext tasks [36]-[40], and 3) pixel level pretext tasks [41]-[47].

1) Image Level Pretext Task
The image level pretext tasks in self-supervised representation learning explore the intrinsic properties of images or the relationships between them. For example, data augmentation can be used to generate transformed images and corresponding labels from the original image [17], [18], [22], [24], [26], [29], [35], and their correspondence can thus be explored during training. Such methods can also benefit from adversarial training [50], where the training of networks is guided at the image level by building adversarial losses [19], [31]. In addition to augmentation methods, clustering methods can be used to roughly divide images into different groups. The inputs of the clustering methods are features extracted by neural networks, and the clustering results are integrated into the loss function to guide network training [20], [21], [23], [27], [30], [34].
Another popular way to define image level pretext tasks is to design contrastive loss functions. These methods encourage the networks to learn similar representations from similar images (typically an image and its random transformations) and different representations from different ones. The contrastive loss functions can also help networks obtain higher robustness to image rotation and scaling, and therefore improve the generalization of their feature representations [25], [28], [32], [33].

2) Patch Level Pretext Task
The key to patch level pretext tasks is that, if we divide an image into several patches, we can construct self-supervised loss functions by simply exploring the location or semantic relationship between them. The networks can thus be trained to learn from the patches and their surroundings. Doersch et al. [36] propose to evenly divide an image into nine patches and train the networks to predict the position of a certain patch relative to the center patch. However, this method can only learn the relationship between adjacent patches, and it is hard to learn the overall arrangement of the content in an image. Pretext tasks based on the "jigsaw" solve this problem [39], [40]. The jigsaw-based methods divide an image into patches and shuffle them. Networks are then trained to predict the order of the patches to recover the input image. During this process, the networks need to fully understand the content of each patch and the relationships between patches, and they show better performance in downstream tasks [37], [38].

3) Pixel Level Pretext Task
The goal of pixel level pretext tasks is to make networks understand semantic information. Compared with image level tasks, pixel level pretext tasks focus more on the learning of semantic level information. However, they may also force the networks to learn too many details or shortcuts between the input and output, which is sometimes meaningless for downstream tasks [43]. Therefore, pixel level pretext tasks usually need some techniques to prevent overfitting during training. Autoencoders are a commonly used group of unsupervised/self-supervised representation learning methods [41], [42], [47]. However, if an autoencoder is directly used to reconstruct the input images, it will easily overfit to raw pixels rather than fully "understand" the image. Pathak et al. [42] propose to combine the autoencoder with an image inpainting task to alleviate this problem. In their method, they train an autoencoder to recover the input image, but at the same time a region of the input image is randomly discarded. They then train a discriminator with the recovered images and some real ones to further improve the features. Zhang et al. [41] combine the ideas of autoencoders and contrastive learning by switching the input of the autoencoder to transformed images. The split-brain autoencoder [47] modifies the one-way calculation of an autoencoder and proposes a two-way reversible autoencoder structure. Through a group of reversible operations, the self-supervised model can be trained with pair-wise losses. Besides, the tasks of image colorization [43]-[45] and image inpainting [46] are also widely used in self-supervised representation learning, where a network has to learn to recognize objects and fully understand the details of the images (e.g., the sky is blue and trees are green) before achieving such goals. Although self-supervised representation learning has made great progress in recent years, it still falls behind ImageNet pretraining in most downstream tasks. Self-supervised representation learning outperforms ImageNet pretrained models only on a few tasks such as object detection [51]-[53]. In addition, most of the above methods are designed for natural image tasks without considering the characteristics of remote sensing images.

B. Self-Supervised Representation Learning for Remote Sensing Images
Although research on self-supervised representation learning for natural images is developing rapidly, methods for remote sensing images are relatively scarce. Compared with natural images, remote sensing images usually consist of more than three bands. Vincenzi et al. [54] propose to use the high-dimensional data to reconstruct image color for pretraining, which can help networks learn image representations. But for hyperspectral image processing tasks [55]-[58], this method may not work well. Some self-supervised learning methods [59]-[61] have been proposed for hyperspectral images and have achieved good results.
In addition, the longitude and latitude information of remote sensing images and multitemporal data [62] can also be used for self-supervised learning. The SauMoCo [63] method utilizes the spatial information of remote sensing images and achieves good results. Other studies [64], [65] incorporate spatial information and multitemporal images into contrastive learning and improve the performance of downstream tasks.
However, the above methods still follow the ideas developed for natural images, trying to extract supervision information from remote sensing images by designing pretext tasks in the same way, without taking advantage of the characteristics of remote sensing images. In addition to self-supervised learning methods, some early research focuses on representation learning for specific tasks. In [66] and [67], a feature learning method for the scene classification task of remote sensing images is designed, which can effectively improve the performance of scene classification. To address the problem of multiple remote sensing data sources, Neumann et al. [68] propose a feature learning method across different datasets. However, these methods are designed for only a single task and lack generality.

III. METHODS
Given a backbone convolutional network (e.g., VGG16 [69], ResNet50 [70]), we start by designing a triplet Siamese network on top of the backbone. The triplet Siamese network consists of three input branches, three output branches, and the backbone as a feature extraction network. In this section, we introduce the detailed configuration of our network architecture and pretext tasks.

A. Overview of the Proposed Method
Fig. 1 shows an overall architecture of our method. For different tasks, weights are shared among the input branches and the backbone networks. Different pretext tasks are implemented by adding different heads and loss functions on top of the backbone networks.

Fig. 1. Overview of the proposed method. Given a backbone convolutional neural network, we build three branches on its output: an inpainting branch, an ATP branch, and a contrastive learning branch (from top to bottom in the figure). The inpainting branch takes in a randomly occluded image and is trained to repair the occluded area. The ATP branch and the contrastive learning branch share the same pair of input images, where the former is trained to predict the transformation type and the latter ensures that the backbone produces similar features for similar inputs and vice versa.
For the inpainting branch, the input is a randomly occluded image. The branch repairs the occluded area by adding several transposed convolution layers on top of the backbone networks. To increase the details of texture and edges, we also fuse global and local information by introducing skip connections between the convolution layers and the transposed convolution layers. For the ATP branch and the contrastive learning branch, the inputs are the images before and after a random transformation. Their features produced by the backbone networks are concatenated along the channel dimension. We then construct two fully connected networks: one takes in the features and predicts the transformation type as the output of the ATP branch, and the other maps the features to a latent space to calculate the contrastive loss function.
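To make this layout concrete, the following is a minimal PyTorch-style sketch of the triplet Siamese network described above. It is not the authors' released implementation: the layer widths, the pooled 512-dimensional feature, the 6-way ATP head, and the 128-dimensional projection head are illustrative assumptions, and the skip connections of the inpainting branch are omitted for brevity.

# Illustrative sketch of the triplet Siamese network (not the released code).
import torch
import torch.nn as nn
import torchvision

class MultiTaskSSLNet(nn.Module):
    def __init__(self, feat_dim=512, num_transforms=6, proj_dim=128):
        super().__init__()
        # Shared backbone (VGG16 with BN here); all three branches reuse its weights.
        self.backbone = torchvision.models.vgg16_bn().features
        # Inpainting head: transposed convolutions that upsample back to image size.
        self.inpaint_head = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.ConvTranspose2d(32, 3, 4, 2, 1))
        # ATP head: classifies the applied transform from the concatenated features.
        self.atp_head = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(True), nn.Linear(512, num_transforms))
        # Projection head for the contrastive branch.
        self.proj_head = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(True), nn.Linear(512, proj_dim))

    def encode(self, x):
        f = self.backbone(x)            # B x 512 x H/32 x W/32
        return f, f.mean(dim=(2, 3))    # spatial map and pooled feature vector

    def forward(self, occluded, img, img_t):
        f_occ, _ = self.encode(occluded)
        repaired = self.inpaint_head(f_occ)                      # inpainting branch
        _, v = self.encode(img)
        _, v_t = self.encode(img_t)
        atp_logits = self.atp_head(torch.cat([v, v_t], dim=1))   # ATP branch
        z, z_t = self.proj_head(v), self.proj_head(v_t)          # contrastive branch
        return repaired, atp_logits, z, z_t

With 256 × 256 inputs, the VGG16 feature map is 8 × 8 with 512 channels, and five stride-2 transposed convolutions restore the original resolution.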

B. Pretext Tasks and Loss Functions
We define three pretext tasks for self-supervised training: an inpainting task that helps the backbone networks learn low-level features, and an ATP task and a contrastive learning task that are responsible for learning high-level features.

1) Inpainting Task
The inpainting branch helps the networks learn useful features by repairing occluded areas of the input image. Suppose I represents an original input image. We randomly occlude I with an S × S pixel square box B_p, and let I′ represent the occluded image.
In conventional image inpainting tasks, the pixel values of the occluded area are set to 0 or 255. However, filling with 0 or 255 causes a loss of information in the occluded area, thereby increasing the difficulty and instability of network training. Recent works [43], [46] usually utilize generative adversarial networks to alleviate this problem, but the use of generative adversarial networks also increases the difficulty of network design and training. Therefore, we use a more moderate approach and construct occluded areas by randomly transforming the in-box areas with image rotation, flip, color transformation, etc. In this way, the backbone can use the information both inside and outside the occluded area for the restoration. To further improve the generalization ability of the pretrained model, we also perform random cropping and color jittering on the input image I.
We define the following loss function to train the backbone and the inpainting branch:

$$L_{p} = \left\| \beta \odot \left( \hat{I} - I \right) \right\|_{1}$$

where $\hat{I}$ is the output of the networks, $\beta = |I' - I|$ is a predefined weighting map that guides the networks to pay more attention to the areas with bigger changes, $\odot$ denotes element-wise multiplication, and $\|\cdot\|_{1}$ is the pixel-wise $\ell_1$ norm.
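The short sketch below shows one way to realize this loss and the transformation-based occlusion in PyTorch. The helper names (occlude, inpainting_loss), the 64-pixel box size, and the use of a vertical flip as the in-box transformation are illustrative assumptions, not the authors' exact choices.

# Occlusion by in-box transformation and the weighted L1 inpainting loss (sketch).
import torch

def occlude(img, s=64):
    # Pick a random S x S box and replace it with a transformed copy of itself
    # (here a vertical flip) rather than zeros, as described above.
    _, h, w = img.shape
    y = torch.randint(0, h - s, (1,)).item()
    x = torch.randint(0, w - s, (1,)).item()
    out = img.clone()
    out[:, y:y + s, x:x + s] = torch.flip(img[:, y:y + s, x:x + s], dims=[1])
    return out

def inpainting_loss(repaired, original, occluded):
    # beta = |I' - I| weights the repaired pixels, emphasizing the changed box area.
    beta = (occluded - original).abs()
    return (beta * (repaired - original).abs()).mean()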

2) Augmentation Transform Prediction (ATP) Task
Given an original input image I, in the ATP branch we define a series of image transformation operations (image rotation, flip, etc.) T = {t_1, t_2, ..., t_M} and transform I by a randomly selected operation t from T. Let I′ represent the transformed image in the ATP branch. We feed the image pair to the networks and train them to recognize which type of transformation was applied. The ATP task can thus be formulated as a standard classification problem. The loss function of the ATP branch is defined as

$$L_{a} = -\sum_{m=1}^{M} \hat{A}(m) \log P(m)$$

where $\hat{A}(m) \in \{0, 1\}$ is the one-hot encoding of the ground-truth class label and $P(m)$ is the predicted probability of the $m$th transformation. The number of categories M is six: rotation by 90 degrees, rotation by 180 degrees, rotation by 270 degrees, horizontal flip, vertical flip, and no augmentation.
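A compact sketch of the ATP task follows: sample one of the six transforms, keep its index as the label, and apply a cross-entropy loss to the predicted logits. The function names and the use of torch.rot90/torch.flip are illustrative assumptions.

# ATP as six-way classification over geometric transforms (illustrative sketch).
import torch
import torch.nn.functional as F

TRANSFORMS = [
    lambda x: x,                                  # no augmentation
    lambda x: torch.rot90(x, 1, dims=(-2, -1)),   # rotate 90 degrees
    lambda x: torch.rot90(x, 2, dims=(-2, -1)),   # rotate 180 degrees
    lambda x: torch.rot90(x, 3, dims=(-2, -1)),   # rotate 270 degrees
    lambda x: torch.flip(x, dims=[-1]),           # horizontal flip
    lambda x: torch.flip(x, dims=[-2]),           # vertical flip
]

def atp_sample(img):
    # Return the transformed image and the index of the applied transform as label.
    label = torch.randint(len(TRANSFORMS), (1,)).item()
    return TRANSFORMS[label](img), label

def atp_loss(logits, labels):
    # Standard cross-entropy over the M = 6 transform classes.
    return F.cross_entropy(logits, labels)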
3) Contrastive Learning Task

We follow [28] to build a contrastive loss function that guides the networks to learn high-level features of remote sensing images. The contrastive branch and the ATP branch share the same group of input images.
Given a pair consisting of an input image I and its transformation I′, let φ(I) and φ(I′) denote their image features produced by the backbone networks. We calculate the similarity between φ(I) and φ(I′) as

$$s(I, I') = \frac{\phi(I)^{\top}\phi(I')}{\|\phi(I)\|_{2}\,\|\phi(I')\|_{2}}$$

where $\|\cdot\|_{2}$ denotes the $\ell_2$ norm.
We further assume that a mini-batch during training consists of N image pairs, with (I_i, I′_i) denoting the ith pair. Following [28], the contrastive loss ℓ(I_i, I′_i) of the ith input pair is defined as

$$\ell(I_i, I'_i) = -\log \frac{\exp\left(s(I_i, I'_i)/\tau\right)}{\exp\left(s(I_i, I'_i)/\tau\right) + \sum_{k \neq i} \exp\left(s(I_i, I_k)/\tau\right)}$$

where (I_i, I′_i) is a positive pair, (I_i, I_k) with k ≠ i are negative pairs, and τ is a temperature parameter.
Minimizing the above contrastive loss ensures that the feature similarity of a positive image pair is larger than that of any negative combination. For all image pairs in a training batch, the total contrastive loss of this branch is written as

$$L_{c} = \frac{1}{N}\sum_{i=1}^{N} \ell(I_i, I'_i).$$

The final loss function of the three pretext tasks is defined as the linear combination of their losses

$$L = \gamma_{p} L_{p} + \gamma_{a} L_{a} + \gamma_{c} L_{c}$$

where γ_p, γ_a, and γ_c are positive coefficients that balance the losses of the above three tasks.
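A minimal sketch of the contrastive loss over a batch and the multitask combination is shown below, assuming PyTorch. The temperature value of 0.1 and the treatment of all off-diagonal pairs in the batch as negatives are assumptions; the γ coefficients are the values reported in the implementation details.

# Batch contrastive loss and the multitask combination (illustrative sketch).
import torch
import torch.nn.functional as F

def contrastive_loss(z, z_t, tau=0.1):
    # z, z_t: N x D projected features of the originals and their transforms.
    z = F.normalize(z, dim=1)
    z_t = F.normalize(z_t, dim=1)
    sim = z @ z_t.t() / tau                    # pairwise cosine similarities / tau
    targets = torch.arange(z.size(0), device=z.device)
    # Diagonal entries are the positive pairs; all other entries act as negatives.
    return F.cross_entropy(sim, targets)

def total_loss(l_inpaint, l_atp, l_contrast, gamma_p=20.0, gamma_a=1.0, gamma_c=1.0):
    # Linear combination of the three pretext losses, as in the final objective.
    return gamma_p * l_inpaint + gamma_a * l_atp + gamma_c * l_contrast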

C. Implementation Details
We experiment on two widely used convolutional neural network architectures, VGG16 [69] and ResNet50 [70], and use them as our backbone networks. To improve training stability, we add batch normalization (BN) [71] layers to the three prediction branches after each convolution and transposed convolution layer. We also use data augmentation on the input images to avoid overfitting. We augment the input images with random image rotation (0, 90, 180, or 270 degrees), horizontal flip, and vertical flip. We add color jittering and random cropping to the transformed image, which makes the pretext tasks more difficult. The input image is converted into a gray image with a certain probability to avoid learning too much color information.
To balance the loss functions of the three tasks numerically, especially at the beginning of training, we set γ_p = 20.0, γ_a = 1.0, and γ_c = 1.0. For the self-supervised training stage, we set the batch size to 8. The network training lasts 13 epochs in total. In Table I, we take VGG16 [69] as an example to show the detailed structure of our prediction branches. The columns "Ker," "S," and "#Ker" denote the kernel size, stride, and channel number of the convolution layers, respectively. "conv" and "deconv" denote convolution and transposed convolution operations, respectively. The pseudocode of the training process is shown in Algorithm 1.
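For reference, the following is a sketch of the multitask pretraining loop in the spirit of Algorithm 1, reusing the illustrative helpers defined in the earlier sketches (occlude, atp_sample, inpainting_loss, atp_loss, contrastive_loss, total_loss). The Adam optimizer and the 1e-3 learning rate are assumptions; the 13 epochs and batch size follow the text above.

# Sketch of the self-supervised pretraining loop (cf. Algorithm 1), using the
# illustrative helpers from the previous sketches; optimizer settings are assumed.
import torch

def pretrain(model, loader, epochs=13, lr=1e-3, device="cuda"):
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for imgs in loader:                       # batches of unlabeled images
            imgs = imgs.to(device)
            occluded = torch.stack([occlude(im) for im in imgs])
            pairs = [atp_sample(im) for im in imgs]
            imgs_t = torch.stack([p[0] for p in pairs])
            labels = torch.tensor([p[1] for p in pairs], device=device)
            repaired, atp_logits, z, z_t = model(occluded, imgs, imgs_t)
            loss = total_loss(inpainting_loss(repaired, imgs, occluded),
                              atp_loss(atp_logits, labels),
                              contrastive_loss(z, z_t))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model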

IV. EXPERIMENTS

A. Datasets

1) Dataset for Pretraining
Since there is still little research in remote sensing focusing on self-supervised pretraining, no such benchmark dataset is publicly available. Therefore, we construct a large unlabeled dataset by combining several well-known remote sensing datasets, including DIOR [72], DOTA [73], and Levir [74]. To increase the versatility of the dataset, images with different resolutions are selected. We use the method in [75] to remove some low-contrast images, and the final number of images for pretraining is 186 486.

2) Datasets for Semantic Segmentation
We use three datasets, Levir_CS [49], Potsdam, and Vaihingen [48], to verify the effectiveness of our method for semantic segmentation of remote sensing images. The Potsdam and Vaihingen datasets are commonly used for remote sensing semantic segmentation. The Levir_CS dataset is a large-scale dataset for the cloud/snow detection task, which is in essence a pixel classification task. The detailed information of all the above datasets is shown in Table II. The Potsdam and Vaihingen datasets both have six categories. We crop the images into patches of size 256 × 256 and randomly divide them into a training set (60%), a validation set (20%), and a testing set (20%). The Levir_CS dataset consists of two categories, cloud and snow. We also crop its images into patches of size 256 × 256 and randomly divide them into a training set (60%), a validation set (20%), and a testing set (20%).

B. Experimental Setup
In the pretraining stage, we do not use a validation set but use all the data for network training. This is because even if we set aside a validation set, we cannot reasonably infer the performance of the pretrained model on the semantic segmentation task from it. Instead, we judge whether the pretraining process is complete according to the loss function during training.
We compare our method with three state-of-the-art self-supervised representation learning methods: NPID [24], MoCo [32], and MoCo v2 [33]. All methods are trained and evaluated using the datasets described above.
For the semantic segmentation task, we add five transposed convolution layers on top of the backbone model, each followed by a BN layer (the same architecture as our inpainting branch). We compute the accuracy on the validation set every 20 epochs and save the model with the highest accuracy. We stop training after 200 epochs. The learning rate is set to 0.005 and is decayed to 90% of its value every 10 epochs. For the cloud/snow detection task, we adopt the same network structure and training strategy as those for semantic segmentation, except that the learning rate is 0.001.
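A sketch of this fine-tuning head and learning-rate schedule is given below, assuming PyTorch. The intermediate channel widths and the choice of SGD with momentum are assumptions; the five transposed convolution layers with BN, the 0.005 learning rate, and the 0.9 decay every 10 epochs follow the text above.

# Fine-tuning sketch: five transposed-conv layers with BN as the segmentation
# head, plus the step learning-rate schedule described above (optimizer assumed).
import torch.nn as nn
import torch.optim as optim

def build_seg_head(num_classes, in_ch=512, widths=(256, 128, 64, 32)):
    layers, ch = [], in_ch
    for out_ch in widths:
        layers += [nn.ConvTranspose2d(ch, out_ch, 4, 2, 1),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        ch = out_ch
    layers += [nn.ConvTranspose2d(ch, num_classes, 4, 2, 1)]   # fifth upsampling layer
    return nn.Sequential(*layers)

def build_optimizer(model, lr=0.005):
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    # Decay the learning rate to 90% of its value every 10 epochs.
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)
    return optimizer, scheduler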
We use the intersection-over-union (IoU) as the evaluation metric. The IoU is computed as

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

where TP is the number of true positive pixels, and FP and FN are the numbers of false positive and false negative pixels.
All of these values are calculated from the confusion matrix of the categories. Finally, after obtaining the IoU of each category, we compute mIoU, the average IoU over all categories, as the final evaluation metric.
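As a small illustration of this evaluation, the per-class IoU and mIoU can be obtained directly from a confusion matrix; the sketch below assumes NumPy, with rows indexing ground-truth classes and columns indexing predictions.

# Per-class IoU and mIoU from a confusion matrix (illustrative sketch).
import numpy as np

def iou_per_class(confusion):
    # confusion[i, j]: number of pixels of ground-truth class i predicted as class j.
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    return tp / (tp + fp + fn + 1e-10)

def mean_iou(confusion):
    return iou_per_class(confusion).mean()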

C. Semantic Segmentation Results
We verify the performance of our method on the remote sensing image semantic segmentation task on the Potsdam and Vaihingen datasets [48]. The results are shown in Tables III and IV and Fig. 2. Considering that humans are able to recognize novel instances with very few training examples, we also show the performance of our method with very limited training data. In addition, we have counted the number of training samples for each class under different proportions to ensure that, even at the ratios of 0.25% and 0.33%, each category has corresponding training data. It can be seen that our method achieves the best results on the above two datasets and obtains the best segmentation results at almost every scale of training set. As semantic segmentation usually requires a large number of effective low-level features to supplement the details of the outputs, the results suggest that our method can extract better low-level features than other methods. Therefore, we can use a sufficiently large unlabeled dataset, which can be obtained easily, to pretrain any segmentation model before fine-tuning on target datasets, even with very limited labels. The results show that our method is qualified to be an alternative or even a better replacement for ImageNet pretraining on standard remote sensing image segmentation tasks.
In addition, it can be seen from Fig. 2 that the segmentation performance has a positive correlation with the scale of training data for every method. As the training data increase, the segmentation performance improves. The performance improvement of our method is more obvious when the amount of training data is limited. When the training data increase to 100%, the advantage of our method over other methods begins to decrease. We can reasonably speculate that if the training data were large enough, the advantage of our method would eventually be wiped out. However, due to the difficulty of segmentation data annotation in reality, we can hardly get enough training data, and we cannot know what scale of labeled data is enough for network training. Therefore, our method can effectively improve the accuracy of segmentation and reduce the workload of data annotation.
Fig. 3 shows some examples of the semantic segmentation results of the comparison methods on the Potsdam (the first three rows) and Vaihingen [48] (the last three rows) datasets. The first column shows the input images, and the second column shows the label images. The third to seventh columns are the results of the comparison methods. The last column shows the results of our method. All the models are trained with 100% of the training data. It can be seen that our method can effectively improve the performance of semantic segmentation and reduce false alarms.

TABLE VI
SNOW DETECTION RESULTS

IoU is used as the metric. The highest scores are marked in bold. Ours represents ImageNet pretraining + self-supervised pretraining of our method.

D. Cloud/Snow Detection
Cloud/snow detection is in essence a pixel classification task, so it can be regarded as a special semantic segmentation problem. It consists of two subtasks: cloud detection and snow detection. The difficulty of cloud/snow detection lies in the high similarity between cloud and snow. In addition, cloud samples are widely distributed and easy to obtain, but snow samples are limited by terrain and season, so they are relatively rare. The cloud/snow detection results are shown in Tables V and VI and Fig. 4. Considering that the snow samples are relatively rare, we start from 0.5% of the training data to verify the effect of our method at different scales of training data, rather than from 0.25% as in the semantic segmentation experiments. It can be seen that our method achieves the best results on both cloud and snow detection.
When the scale of training data is limited, almost all the methods can achieve good cloud detection performance, but at the same scale of training data, the snow detection performance is quite low. This is because, even if some clouds are difficult to distinguish, cloud detection is still a relatively easy task: most clouds have similar texture information, and a small amount of annotated data is enough. For snow detection, however, most snow samples are similar to cloud samples and the number of snow samples is usually small, which leads the networks to tend to label snow as cloud and degrades the snow detection performance. As can be seen from Tables V and VI and Fig. 4, our method is superior to other methods in cloud detection, but the advantage is not particularly large. For snow detection, our method is significantly better than the other methods, especially when labeled data are scarce.
Fig. 5 shows some examples of the cloud detection results of the comparison methods on the Levir_CS [49] dataset. The first column shows the input cloud images, and the second column shows the label images. The third to seventh columns are the results of the comparison methods. The last column is the predicted result of our method. The parts marked in gray correspond to the cloud in the input image, and the parts marked in black and white correspond to the background and snow, respectively. For cloud detection, the performance of our method is comparable with the other methods, but for snow detection, we can clearly see that our method achieves better results.

E. Ablation Studies
We design the following ablation analysis to evaluate the importance of each pretext task in our method, including 1) the inpainting task, 2) the ATP task, and 3) the contrastive learning task. We first start from a baseline approach in which we directly train the networks on the downstream task from scratch. Then the above pretext tasks are added one by one to pretrain the networks with self-supervised losses. Finally, we fine-tune the pretrained networks on the Vaihingen dataset and record their mIoUs. Results are shown in Table VII. The results show that each task achieves a noticeable improvement on the semantic segmentation task: the "ATP" and "Contrastive" tasks improve the segmentation accuracy by about 2%, while the "Inpainting" task further improves the segmentation accuracy by 4%. As the inpainting task pays more attention to low-level features, it improves semantic segmentation more significantly. In addition, although the ATP and contrastive tasks both target high-level features, the results show that they continue to improve segmentation accuracy; the features they focus on and the effects they produce are not exactly the same. The ATP task may make the network pay more attention to changes in the texture and position of objects, while the contrastive learning task may help the network attend to the semantic information of images.

F. Experimental Results Analysis
In this part, we analyze the performance of our method on tasks of various difficulty and with different scales of training data. In terms of the scale of training data, our method can significantly reduce the demand for training data, which is manifested in two aspects. On the one hand, it can be seen from Tables III and IV that the performance of our method is limited when the training data are extremely small (0.25%, 0.33%), but as the training data increase, our method is the first to show a leap in performance. Compared with the most commonly used ImageNet pretraining, our method can save almost half of the training data, i.e., with only half of the training data our method can achieve the performance that ImageNet pretraining achieves with all the data. Compared with random initialization, our method can achieve comparable performance with only 50% of the labeled data on the Vaihingen dataset and 20% on the Potsdam dataset. On the other hand, when using all the training data, our method can still improve the segmentation mIoU by 4%. Without changing the network structure, the most effective way to improve performance is to increase the training data; but since segmentation data annotation is very time-consuming and laborious, our method provides a new way to continue to improve performance.
In addition, the cloud/snow detection experiments show the performance of different methods on segmentation tasks of various difficulty. From Fig. 4, we can see that the curve of cloud detection performance is relatively flat. Although our method is still better than the other methods in most cases, the improvement in cloud detection is not particularly large. For the snow detection task, however, our method brings a great performance improvement. The reason for the difference is that cloud detection is a relatively simple task: in most cases, cloud and ground objects are easy to distinguish. Snow and cloud, however, have similar characteristics, and usually the number of cloud samples is more than twice that of snow samples, which leads the networks to tend to label snow as cloud and makes it difficult to improve snow detection performance. The experimental results on snow detection show that our method is more effective for complex tasks. Compared with random initialization, only 20% of the labeled data for cloud detection and 10% for snow detection are needed to achieve comparable performance.

V. DISCUSSION AND FUTURE WORK
The self-supervised representation learning method provides an effective way to utilize large amounts of unlabeled data. Up to now, most deep learning methods rely on a large number of labeled data, but for remote sensing images, the vast majority of available data are not labeled. How to use these remote sensing images effectively is a great challenge to be solved. In order to improve the utilization efficiency of large-scale unlabeled remote sensing data via self-supervised representation learning, the following three issues need to be considered in the future.
1) Since self-supervised representation learning needs large-scale datasets to give full play to its advantages, future work will consider building a large-scale remote sensing representation learning dataset. The dataset needs to fully consider the multisource and multiresolution characteristics of remote sensing images, and should try to cover the main data sources of remote sensing images.

2) Given the great difference between remote sensing images and natural images, a method that performs well on natural images may not be effective for remote sensing images. Therefore, we will systematically study and compare the behavior of different methods on these two types of images in the future, so as to provide a reference that helps self-supervised representation learning play a greater role in the field of remote sensing.
3) In addition to the image content itself, remote sensing images also contain a lot of geographic information. We will consider how to incorporate this geographic information into the self-supervised representation learning method for remote sensing images, so as to further improve the performance of networks for remote sensing images.

VI. CONCLUSION
This article proposes a self-supervised representation learning method for remote sensing semantic segmentation. Considering the characteristics of remote sensing images, we design multiple pretext tasks (inpainting, augmentation transform prediction, and contrastive learning) to guide networks to learn both low-level and high-level features at the same time. The pretrained models can be applied to various downstream tasks as an alternative to ImageNet pretrained models. The experimental results show that our method outperforms random initialization, ImageNet pretraining, and other self-supervised methods on the remote sensing semantic segmentation task. Our method achieves better results especially with limited training data. This shows that the models trained by our method can be considered an effective initialization for various remote sensing image semantic segmentation tasks and can also be used to improve the performance of semantic segmentation for remote sensing images.

Fig. 2. Semantic segmentation results. (a) Segmentation IoU on the Vaihingen dataset. (b) Segmentation IoU on the Potsdam dataset. The dotted line shows the result of our method. Ours represents ImageNet pretraining + self-supervised pretraining of our method.

Fig. 3. (Better viewed in color) Some examples of the semantic segmentation results of the comparison methods on the Potsdam (the first three rows) and Vaihingen [48] (the last two rows) datasets. The first column shows the input images, and the second column shows the label images. The third to seventh columns are the results of the comparison methods. The last column is the result of our method (VGG16). (a) Image. (b) Label. (c) Random. (d) ImageNet. (e) NPID. (f) MoCo. (g) MoCo v2. (h) Ours.

Fig. 4. Cloud/snow detection results. (a) Cloud detection results. (b) Snow detection results. The dotted line shows the result of our method. Ours represents ImageNet pretraining + self-supervised pretraining of our method.

Fig. 5. (Better viewed in color) Some examples of the cloud/snow detection results of the comparison methods on the Levir_CS [49] dataset. The first column shows the input cloud images, and the second column shows the label images. The third to seventh columns are the results of the comparison methods. The last column is the predicted cloud result of our method (VGG16). The parts marked in gray correspond to the cloud in the input image, and the parts marked in black and white correspond to the background and snow, respectively. (a) Image. (b) Label. (c) Random. (d) ImageNet. (e) NPID. (f) MoCo. (g) MoCo v2. (h) Ours.
Algorithm 1: Training process of our method. Input: training data X, backbone f, pretext-task heads g_p, g_a, g_c, and transforms for the tasks t_p, t_a, t_c.

TABLE II
DATASETS FOR DOWNSTREAM TASKS AND THEIR STATISTICS

TABLE III
SEMANTIC SEGMENTATION RESULTS ON VAIHINGEN DATASET

IoU is used as the metric. The highest scores are marked in bold. Ours represents ImageNet pretraining + self-supervised pretraining of our method.

TABLE IV
SEMANTIC SEGMENTATION RESULTS ON POTSDAM DATASET

IoU is used as the metric. The highest scores are marked in bold. Ours represents ImageNet pretraining + self-supervised pretraining of our method.

TABLE V
CLOUD DETECTION RESULTS

IoU is used as the metric. The highest scores are marked in bold. Ours represents ImageNet pretraining + self-supervised pretraining of our method.