Improving Land Cover Segmentation Across Satellites Using Domain Adaptation

Land use and land cover mapping is essential to various fields of study, such as forestry, agriculture, and urban management. Generally, earth observation satellites facilitate and accelerate the mapping process. Subsequently, deep learning methods have been proven to be excellent in automating the mapping via semantic image segmentation. However, because deep neural networks require large amounts of labeled data, it is not easy to exploit the full potential of satellite imagery. Additionally, land cover tends to differ in appearance from one region to another; therefore, having labeled data from one location does not necessarily help map others. Furthermore, satellite images come in various multispectral bands, which range from RGB to over 12 bands. In this study, our aim is to use domain adaptation (DA) to solve the aforementioned problems. We applied a well-performing DA approach on the DeepGlobe land cover dataset as well as datasets that we built using RGB images from Sentinel-2, WorldView-2, and Pleiades-1B satellites with CORINE Land Cover as ground truth (GT) labels. The experiments revealed significant improvements over the results obtained without using DA. In some cases, an improvement of over 20% mean intersection over union was obtained. Sometimes, our model manages to correct errors in the GT labels.


I. INTRODUCTION
Over the last few years, Remote Sensing (RS) data has become easily obtainable thanks to the surge of open data from some Earth Observation (EO) satellites. Data such as the Sentinel-2 data from the European Copernicus program [1] and Landsat data from the U.S. Geological Survey (USGS) [2] are available for free. These satellites provide multispectral high resolution imagery (up to 10m). This has made it possible to use big data in RS applications, which in turn has facilitated the use of Deep Learning (DL) tools to process that data. Land Use and Land Cover (LULC) classification and segmentation is an example of such applications. It is an important tool in RS, providing a way to monitor forests, agriculture, oceans, etc.
Lately, classical Machine Learning (ML) tools have fallen out of favor for many machine vision tasks, notably LULC segmentation. Otávio et al. [3] showed that Convolutional Neural Networks (CNN) vastly outperform classical ML methods in land cover classification. In the land cover segmentation track of the DeepGlobe challenge [4], the leaderboards are completely dominated by Deep Neural Networks (DNN). The leading examples in land cover segmentation, [5], [6], and [7], rely on DNNs such as ResNet and DenseNet.
DNNs perform well, but they require a lot of data to show their true potential. The availability of free imagery from a few satellites does not translate into available ground truth LULC annotation. Corine Land Cover (CLC) is an example of annotation covering all of Europe, but its pixel resolution is quite low at 100m/px. Higher resolution counterparts exist, such as the 20m/px version covering Finland provided by the Finnish Environment Institute [8] or the 10m/px version covering Germany by Geodatenzentrum [9]. Although these versions are limited in coverage and not foolproof, they can be used to train a DNN to adapt from one location to another or from one satellite to another.
Removing the domain shift between datasets caused by differences in distribution, lighting, and other factors is called domain adaptation. In the context of semantic segmentation, domain adaptation is used to segment images from a target dataset using a source dataset. There are two types of domain adaptation approaches: supervised, where some or all of the target data is labeled, and unsupervised, where the target data is unlabeled. Domain adaptation for semantic segmentation is usually applied to street-view images or basic character recognition images. It has the potential to improve upon results achieved with simple transfer learning.
Our contributions are: 1) To the best of our knowledge, this is the first paper that applies domain adaptation to LULC mapping. 2) Building various labeled datasets using CLC. 3) Adapting an existing domain adaptation method for satellite imagery.

II. RELATED WORK
1) LULC segmentation: Land use and land cover image segmentation is a field that has not been explored much, especially using deep learning. Usually, classical machine learning methods such as support vector machines and decision trees are applied [10] [11]. This is due to the lack of labeled data to train the DNNs required for the task. A small amount of research has tackled land cover segmentation using deep learning. For example, state of the art DNN based semantic segmentation methods were transferred to satellite imagery; however, the results were not as good. This is caused by the difference between images of regular objects and satellite images: unlike objects with regular shapes, land cover types such as forest stands and bodies of water have irregular shapes. Kuo et al. [7] present a method that is among the leading results in the DeepGlobe challenge. It relies on a variation of Deeplabv3+ [12] where the fully connected layers of the ResNet backbone are replaced with ASPP (Atrous Spatial Pyramid Pooling). In addition, an encoder-decoder architecture is deployed to reduce the effect of resolution loss due to pooling and strided convolution. The encoder-decoder design recurs among the leading methods in land cover segmentation. This is because land cover segmentation needs high resolution: since land cover is mostly irregular, not only structure but also texture and color play a big role in the segmentation. Hasan et al. [13] used a state of the art semantic segmentation method and added LIDAR¹ data, which improved the results slightly compared to other methods.
2) Domain adaptation: Domain adaptation takes two sets (source and target) with different distributions and aims at reducing the domain shift between them by changing and aligning the distribution of one of the sets to match the other. This can also be done by aligning both sets to a common space different from either. There are mainly three forms of domain adaptation approaches [14] [15]. The first is the classical form of minimizing a distance between the source and the target data. Maximum Mean Discrepancy (MMD) is an example of such a measure; minimizing it yields a domain-invariant feature representation that performs well in both source and target domains. The second is adversarial domain adaptation, which relies on Generative Adversarial Networks (GANs) [16] to make one of the sets appear similar to the other. Eric et al. [17] is an example of the adversarial methods, where the target data is encoded into the source domain, with a discriminator trained to distinguish between the two. The third creates a shared representation for both domains, so as to translate not just one domain to the other but both domains to a common space, or learns transfer functions capable of translating each domain to the other and back to the original. CycleGAN [18] is an example of the third method, where two discriminators are used to map images from the source data to the target data and vice versa.
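To make the distance-based family concrete, the following is a minimal sketch of the squared MMD under a linear kernel, which reduces to the squared Euclidean distance between the mean feature vectors of the two domains. The function names and the choice of a linear kernel are assumptions made for illustration; none of the cited methods is reproduced here.

```python
def mean_vector(samples):
    """Column-wise mean of a list of equally sized feature vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def linear_mmd2(source, target):
    """Squared MMD with a linear kernel: ||mean(source) - mean(target)||^2."""
    mu_s = mean_vector(source)
    mu_t = mean_vector(target)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Two toy domains whose feature means differ by (3, 4).
source_features = [[0.0, 0.0], [2.0, 2.0]]
target_features = [[3.0, 4.0], [5.0, 6.0]]
print(linear_mmd2(source_features, target_features))
```

A feature extractor trained to drive this quantity toward zero produces features whose first-order statistics match across domains; kernels other than the linear one also compare higher-order statistics.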
3) Domain adaptation for semantic segmentation: Domain adaptation can be used for multiple applications, including semantic segmentation. All the methods mentioned previously have been tested in semantic segmentation with varying levels of success. Domain adaptation is particularly useful for semantic segmentation because pixel-level annotation of images is expensive and requires a lot of resources. Generally, domain adaptation methods designed for classification do not transfer well to semantic segmentation [19]. Therefore, adversarial and reconstruction methods have been preferred. Architectures such as [20], [21] are examples of adversarial domain adaptation that use a GAN to generate source-like images and then segment them using a network trained on the source data. The reconstruction approach has been tested by many methods, with several variations [22] [23] [24] [25]. The datasets used are almost exclusively street view image datasets, including Cityscapes [26], GTA5 [27], and Synthia [28]. Chang et al. [29] proposed a Domain Invariant Structure Extraction (DISE) framework to disentangle images into domain-invariant structure and domain-specific texture representations. Another recent method, which adds an extra step, is bidirectional learning (BDL), proposed by Li et al. [30]. The principle behind it is to add a bidirectional learning system that alternates between learning the segmentation adaptation model and the image translation model, and to add a loss that supervises the translation using the segmentation adaptation model. This prevents the translation model from converging to a point where the discriminator sees the images as being from the same distribution while the classes are not aligned correctly, which causes the segmentation to fail.
¹ LIDAR uses light in the form of a pulsed laser to measure ranges.

III. MATERIALS

A. Satellite data
The satellite data used in the study comes from four EO satellites. The first, which provides the bulk of the data, is the Sentinel-2 constellation. The second is the Worldview-2 satellite, which provides much less data. The third is Worldview-3, through an already available dataset with images taken from Worldview-3 Vivid data [31]. Finally, the Pleiades satellite provides the smallest number of images.
Sentinel-2 is composed of two polar-orbiting satellites placed in the same sun-synchronous orbit, phased at 180° to each other. The satellites are named Sentinel-2A and Sentinel-2B. Each carries a multispectral sensor with 12 bands ranging from 60m/px resolution up to 10m/px, of which the four 10m/px bands were chosen. These bands, shown in Table I, include the Near Infra Red (NIR) band, which is extremely useful in detecting vegetation.
Through the European Space Agency's (ESA) third party missions, data from the Worldview-2 satellite is available for free as WorldView-2 European Cities [32]. Worldview-2 is a Very High Resolution (VHR) satellite that carries an 8-band multispectral sensor with 1.8m/px resolution, shown in Table II, and a 0.46m/px panchromatic sensor.
Pleiades-1 is part of the Pleiades constellation of satellites, whose data is not freely available to the public. Its properties are similar to those of Worldview-2 (Table III), so it is interesting to see the results of using Worldview-2 to train a model that can do land mapping on the Pleiades data.
The data obtained from satellite imagery comes in the form of raster images in floating point format. Both Sentinel-2 and Worldview have a 12bit radiometric resolution encoded in 32bit floating point. These had to be encoded in 8bit unsigned integer format so that they could be handled as image files using Python libraries. This process required the use of QGIS, a Geographical Information System (GIS) that allows the handling of high resolution multi-channel rasters. The main steps consisted of merging the rasters, translating the format to 8bit images, and extracting the RGB channels before normalizing them to have a unified illumination.
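As an illustration of the bit-depth reduction, the sketch below linearly rescales 12bit samples stored as floats to 8bit integers. The conversion in this study was done in QGIS and the exact scaling used there is not stated, so the linear mapping, the clipping, and the function names are assumptions.

```python
def to_uint8(value, bits=12):
    """Linearly rescale one sample with `bits` of radiometric depth to 0..255."""
    max_in = (1 << bits) - 1                        # 4095 for 12bit data
    clipped = min(max(value, 0.0), float(max_in))   # guard against out-of-range samples
    return int(round(clipped * 255.0 / max_in))


def band_to_uint8(band, bits=12):
    """Convert one band, given as rows of float samples, to 8bit integers."""
    return [[to_uint8(sample, bits) for sample in row] for row in band]


# A tiny 2 x 2 band with 12bit values stored as floats.
print(band_to_uint8([[0.0, 1024.0], [2048.0, 4095.0]]))
```

A per-band percentile stretch is a common alternative when the imagery uses only a small part of the 12bit range.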
The bands mentioned in III-A were extracted to form a 3 channel 8bit RGB raster. This in turn was divided into patches of PNG images with 224 × 224 resolution for Sentinel-2, 512 × 512 for Worldview-2, and 448 × 448 for Pleiades-1. This yields 37706 RGB images from Sentinel-2. The number of Worldview-2 images is much lower, at 3570 RGB images. The Pleiades-1 dataset is the smallest, with only 500 RGB images. The DeepGlobe land cover segmentation dataset [4] is used for comparison with available methods. It contains 1146 images, of which only 803 are labeled. The dataset is built from Worldview-3's Vivid images [31] and covers India, Indonesia, and Thailand. This dataset is readily available without the need for preparation. After cropping, it contains 12847 images with a size of 612 × 612 pixels, which we cropped from the full size of 2000×2000 pixels since using the full resolution images would require more GPU memory. The format of the images is 8bit JPG with PNG RGB labels.
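The division of a raster into fixed-size patches can be sketched as follows. This is a simplified illustration that assumes non-overlapping patches and discards partial patches at the right and bottom edges; how edges were handled in practice is not specified, and the helper names are hypothetical.

```python
def tile_origins(height, width, patch):
    """Yield the (row, col) top-left corner of every full patch, row by row."""
    for row in range(0, height - patch + 1, patch):
        for col in range(0, width - patch + 1, patch):
            yield row, col


def extract_patch(raster, row, col, patch):
    """Cut a patch x patch window out of a raster stored as a list of rows."""
    return [line[col:col + patch] for line in raster[row:row + patch]]


# A 448 x 672 raster split into 224 x 224 patches (the Sentinel-2 patch size).
origins = list(tile_origins(448, 672, 224))
print(len(origins), origins[:3])
```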
For all the datasets, data augmentation was applied during training as a mixture of random rotation and cropping.
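A minimal sketch of such an augmentation is given below. The important detail for segmentation is that the image and its label map receive the same random transform. Restricting rotations to multiples of 90° and the function names are assumptions; the actual rotation angles and crop sizes are not stated here.

```python
import random


def rotate90(grid, k):
    """Rotate a 2-D grid counterclockwise by k quarter turns."""
    for _ in range(k % 4):
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


def augment(image, label, crop, rng):
    """Apply one random rotation and one random crop to image and label alike."""
    k = rng.randrange(4)
    image, label = rotate90(image, k), rotate90(label, k)
    top = rng.randint(0, len(image) - crop)       # inclusive bounds
    left = rng.randint(0, len(image[0]) - crop)

    def cut(grid):
        return [row[left:left + crop] for row in grid[top:top + crop]]

    return cut(image), cut(label)


rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patch, patch_label = augment(image, image, 2, rng)
print(patch == patch_label)
```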

B. Label data
The label data chosen is from Corine Land Cover by Copernicus (CLC) [33]. CLC is a manually annotated map of Europe based on 44 classes, ranging from natural covers such as forests and water surfaces to man-made covers such as buildings and crops. Since 2000, a new version of CLC has been released roughly every 6 years. Its pixel resolution is 100m/px; however, there exists a 20m/px version covering the whole area of Finland [8] and a 10m/px version covering Germany. This modified version of CLC2018 is the one used in the study, with further modifications involving the classes. These modifications consist of merging some classes together to reduce them to 7 classes instead of the original 44. Table IV shows the details of the classes used. The lack of label data in the field of LULC is a serious problem that feeds itself. Without good data it is difficult to build reliable models, thus limiting the ability to use them to

C. Study area
The study area varies between Sentinel-2 and Worldview-2, both covering parts of Finland and Germany, with none of the areas overlapping between the satellites. Worldview-2 covers 1520.17 km² in Finland and 1310.18 km² in Germany (Fig 2). The rasters were carefully chosen to avoid any cloud coverage that might compromise the training. Sentinel-2 covers a far larger area in Finland, around 128320.21 km² from all over the country. The area covered in Germany by Sentinel-2 is also larger, at around 74361.98 km² (Fig 1). More data is available from Sentinel-2 because all of its data is free, whereas only a limited amount of the Worldview-2 data is freely available. Finally, the Pleiades-1 data covers a small area of Finland, around 519.67 km², with no overlap with the Worldview-2 data. Some of the area overlaps with Sentinel-2.

IV. METHOD
The method applied in this study relies on domain adaptation to semantically segment satellite images from a dataset different from the one for which ground truth labels are available.

A. Network architecture
The networks used are based on the ones used by [30]. We also explore the effect of newer semantic segmentation architectures on the results.
The architecture proposed by Li et al. [30] is shown in Fig 3 and contains two subnetworks. The first network (F) is an image-to-image translation network based on GANs that takes images from the source dataset and learns to translate them into the target dataset's distribution. The architecture of F is CycleGAN with a 9 block residual network as generator and a 3 layer discriminator. The second network (M) is the segmentation network, which learns to segment images using ground truth labels. The segmentation network is based on Deeplabv2, a network specialized for semantic segmentation with ResNet as its backbone, in this case ResNet101 pretrained on the ImageNet dataset. M is connected to a discriminator that learns to distinguish between segmentations generated from target images and from translated source images.

B. Training
The training was performed on different Nvidia GPUs (Tesla V100, Tesla P100, Tesla T4) [34] with 16GB of video memory, for about 250,000 iterations with a batch size of 4. The batches were randomized for every iteration in the epoch.
The training of the BDL network uses the loss function l_M to train the segmentation network and the loss function l_F to train the translation network, as defined for BDL [30]. In these losses, S is the source data, T is the target data, S' is the data translated from source to target, T' is the data translated from target to source, l_adv is the adversarial loss from the discriminator attached to the segmentation network, l_seg is the cross entropy loss between images and labels, and l_per is the perceptual loss, which connects the segmentation network back to the translation network. The coefficients λ_adv, λ_GAN, λ_recon, λ_per, and λ_perrecon are used to emphasize some losses more than others. These coefficients help guide the translation network using the segmentation network. Fig 4 and Fig 5 show the difference between setting λ_per = 0.1 and λ_per = 1, respectively. Setting λ_per = 0.1, which works with street-view images, causes forest to be replaced with desert: both have irregular shapes, and since both classes exist the discriminator is not triggered to reject the result. Therefore, it is important to find the best coefficients for each task. A multitude of combinations were tested based on previous results and intuition. In addition, the loss function was modified. The main change is the separation of λ_per into λ_perA and λ_perB, which were changed from the default value of 0.1 to higher values to avoid an issue where the classes are not matched correctly. In the Worldview-2 to DeepGlobe translation the resolution is similar, so texture information can be used by the generator to match classes from source to target. In the Sentinel to DeepGlobe translation the resolution difference is much larger, and not much information is shared between the classes of the two sets apart from color, which can also differ.
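The role of the coefficients can be illustrated with a short sketch. It stores the values of λ_D, λ_perA, and λ_perB reported for the two DeepGlobe translations and shows the segmentation loss as a weighted sum of its two terms. The class and function names are hypothetical, the unit weight on l_seg is an assumption, and the way λ_D, λ_perA, and λ_perB enter the modified translation loss is deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationWeights:
    """Coefficients tuned per experiment (field names are hypothetical)."""
    lambda_d: float
    lambda_per_a: float
    lambda_per_b: float


# Values reported for the two translations into the DeepGlobe domain.
WEIGHTS = {
    "WV2 to DG": TranslationWeights(lambda_d=1.5, lambda_per_a=0.5, lambda_per_b=0.1),
    "Sen to DG": TranslationWeights(lambda_d=100.0, lambda_per_a=2.0, lambda_per_b=0.5),
}


def segmentation_loss(l_seg, l_adv, lambda_adv):
    """l_M as a weighted sum: cross entropy plus the scaled adversarial term.

    Assumption: l_seg carries a weight of 1 and only l_adv is scaled.
    """
    return l_seg + lambda_adv * l_adv


print(WEIGHTS["Sen to DG"])
print(segmentation_loss(l_seg=1.0, l_adv=2.0, lambda_adv=0.5))
```

Keeping the coefficients in one immutable record per experiment makes it easier to trace which setting produced which translation result.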

C. Metrics
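Semantic segmentation uses a varying set of metrics to measure how accurate a result is compared to the GT data. These include Mean Intersection over Union (MIoU) and Average Precision. MIoU is a widely used metric for the evaluation of semantic segmentation. It computes the mean of the rate of overlap between the GT segments and the resulting segmentation. MIoU is obtained using the following equation:
$$\mathrm{MIoU} = \frac{1}{n} \sum_{i=1}^{n} \frac{|GT_i \cap P_i|}{|GT_i \cup P_i|}$$
where n is the number of classes, GT_i is the set of pixels labeled as class i in the GT, and P_i is the set of pixels predicted as class i. Another way to write the formula is:
$$\mathrm{MIoU} = \frac{1}{n} \sum_{i=1}^{n} \frac{TP_i}{TP_i + FP_i + FN_i}$$
where TP_i, FP_i, and FN_i are the numbers of true positive, false positive, and false negative pixels for class i.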
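A minimal sketch of the MIoU computation on flattened label maps is shown below. The function name is hypothetical, and skipping classes that appear in neither the GT nor the prediction is an assumption; evaluation scripts differ on how such classes are treated.

```python
def mean_iou(ground_truth, prediction, n_classes):
    """Mean IoU over the classes present in the GT or in the prediction."""
    intersection = [0] * n_classes
    gt_count = [0] * n_classes
    pred_count = [0] * n_classes
    for gt, pred in zip(ground_truth, prediction):
        gt_count[gt] += 1
        pred_count[pred] += 1
        if gt == pred:
            intersection[gt] += 1
    ious = []
    for c in range(n_classes):
        union = gt_count[c] + pred_count[c] - intersection[c]
        if union > 0:                 # skip classes absent from both maps
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four pixels, two classes: IoU is 1/2 for class 0 and 2/3 for class 1.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2))
```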

V. RESULTS AND DISCUSSION
The DeepGlobe dataset is the only available standard dataset for land cover segmentation, so evaluating on it is a good way to compare with other methods. The Worldview-2 dataset contains images somewhat similar to the ones in the DeepGlobe dataset since both come from the Worldview constellation, although DeepGlobe uses images from Worldview-3 with different post processing. The test of domain adaptation from Worldview-2 (Finland and Germany) to DeepGlobe is referred to as "WV2 to DG". Similarly, the test that adapts from Sentinel-2 to DeepGlobe is referred to as "Sen to DG". Unlike WV2 to DG, where Worldview-2 has somewhat similar properties and the big difference is the location, Sen to DG attempts domain adaptation between two very different satellites. In addition, DeepGlobe images were captured over Asian countries close to the equator, whereas the Sentinel-2 dataset covers Finland and Germany only. A test between two similar satellites from the same location, such as Worldview-2 and Pleiades-1, helps determine whether domain adaptation is necessary and how useful it is. This test is referred to as "WV2FI to PLFI". To test how well domain adaptation performs between satellites when the location is similar, we implemented a test between Sentinel-2 and Worldview-2 referred to as "Sen to WV2". Both satellites cover non-overlapping areas of Finland and Germany. The final test aims at seeing how well domain adaptation works across locations when the sensor is the same. Therefore, we implemented Worldview-2 Germany to Worldview-2 Finland, referred to as "WV2GR to WV2FI". Finland and Germany do not share much of their land cover distribution. The tree species, for one, are quite different, and Finland has many more lakes and forests, whereas Germany has more urban areas and agriculture.

A. No domain adaptation results
To measure the improvement over no domain adaptation, we implemented a separate run with only the segmentation network. We then tested the model on the target dataset's validation subset.
The results of the WV2 to DG test without domain adaptation can be seen in Table V. The results are quite poor because images from Worldview-2 differ considerably from DeepGlobe images in both sensor and location on the planet, with many classes looking different, as seen in Fig 8. The network labels everything as agriculture, which makes it very unreliable. Similarly, the Sen to DG results are not reliable, although better than WV2 to DG since Sentinel-2 has far more data, which results in slightly more variety. This, however, is not enough to produce acceptable results, as seen in Fig 9. The results of WV2FI to PLFI without domain adaptation are not as good as expected. Even though the differences between the sensors of the two satellites are small, the results do not transfer well. Finally, the results of WV2GR to WV2FI are surprisingly low considering that the satellite is the same and both sets of labels are based on CLC, although the latter point is less relevant since the accuracy of the labels differs between the two countries. Table V shows that the results are quite low, with just 9.25 MIoU, the second lowest score among the tests run.

B. BDL results
The results of using domain adaptation for WV2 to DG can be seen in Table VI. Although the result is not very impressive numerically, there is a big difference compared to the result without domain adaptation: MIoU rises from 8.31 to 29.8. Fig 13 shows a few test images with the ground truth and the model output after domain adaptation, showing labeling very similar to the ground truth. In some cases the results could even be better than the ground truth, although since the logic behind the labeling of the classes is not clear, the errors are debatable. For example, it is unclear what is considered forest and what is considered rangeland. In addition, some small villages have been completely ignored in the ground truth or only partly labeled, as seen in Fig 14.
Using domain adaptation between Sentinel-2 and Worldview-2 is another successful example. The difference between the results with and without domain adaptation is quite big, almost double, as seen in Table VI. Samples from the test set can be seen in Fig 15. As mentioned in Section III-B, the ground truth labels for Germany lack precision, which would limit the results even when training on that specific dataset. However, having good data covering Finland helps correct those errors and leads to results that are better than human annotation in some cases. The Sen to DG score may not look like much, especially since the result with Deeplabv2 on the DeepGlobe dataset is 52.24 [35], but visually it translates to good results, as seen in Fig 17. The results are still not as good as with WV2 to DG, which could be explained by the huge difference in pixel resolution between Sentinel-2 and Worldview-3, where the DeepGlobe dataset comes from. This is still a good step, as Sentinel data is free whereas Worldview-3 data is very expensive.
With a limited amount of data, as is the case with the Pleiades-1 dataset, training a DNN is quite difficult if not impossible. Therefore, this is a good test of how well domain adaptation enables land mapping on a very limited dataset. WV2FI to PLFI shows good results, with almost an extra 10% MIoU as seen in Fig 18, which is quite good considering that it would be costly to train on Pleiades-1 data.
The results obtained from WV2GR to WV2FI using domain adaptation improved significantly compared to the results without domain adaptation. As seen in Table VI, the MIoU score rose to 27 from 9 without domain adaptation.

C. Other tests
In addition to the previous experiments, Deeplabv3+ was tested as the segmentation network. Replacing Deeplabv2 with Deeplabv3+ did not bring much improvement, as shown in Table VII. The reason is the imperfection of the dataset. In fact, the results start to diverge after a while because the network becomes too good at mimicking the errors in the ground truth images. To make full use of the potential of deep learning methods in land cover mapping, it is therefore important to have a very good and precise dataset.

VI. CONCLUSION
In this paper, we experimented with applying domain adaptation to semantically segment satellite images for LULC. We chose to test the state of the art method available for street view images, as they are the most popular in the field of semantic segmentation. The experimental results show that domain adaptation could greatly improve the ability to label LULC without the need for many labeled datasets, which the field lacks. The results suggest that it is even possible to segment satellite images from areas completely different from the ones used to train the model.

Fig. 3. Bidirectional learning architecture, containing two subnetworks: the translation network (F) and the segmentation network (M). S represents the source data, T is the target data, and S' is the translated source data.
Fig 4 is an example of such an occurrence, where trees from Worldview-2 are replaced by the barren class in the DeepGlobe domain. Another change is the addition of λ_D, needed because with higher values of λ_perA and λ_perB the generator no longer focuses on producing good enough images. This causes the generated images to look a lot like the source dataset; λ_D therefore forces the discriminator to put the generator back on track while preserving the class alignment. Fig 5 shows the previous error fixed by changing the values of λ_D, λ_perA, and λ_perB. The values depend on the experiment. While translating Worldview-2 to DeepGlobe requires λ_D = 1.5, λ_perA = 0.5, and λ_perB = 0.1, translating Sentinel-2 to DeepGlobe requires λ_D = 100, λ_perA = 2, and λ_perB = 0.5. The reason for that is the difference in resolution between the datasets.

Fig. 7. Results of training without domain adaptation from Worldview-2 to DeepGlobe. Left: ground truth images from the DeepGlobe dataset. Middle: test images from the DeepGlobe dataset. Right: output of the model.
Fig 7 shows an example of a few test images with the ground truth and the model output without domain adaptation.

Fig. 9. Results of training without domain adaptation from Sentinel-2 to DeepGlobe. Left: ground truth images from the DeepGlobe dataset. Middle: test images from the DeepGlobe dataset. Right: output of the model.

Fig. 10. Results of training without domain adaptation from Sentinel to Worldview-2. Left: ground truth images from the Worldview-2 dataset. Middle: test images from the Worldview-2 dataset. Right: output of the model.
Fig 11 shows an example of results of WV2FI to PLFI without domain adaptation.
Fig 11  shows an example of the results from the run.

Fig. 13. Results of using domain adaptation from Worldview-2 to DeepGlobe. Left: ground truth images from the DeepGlobe dataset. Middle: test images from the DeepGlobe dataset. Right: output of the model.

Fig. 14. Improving the ground truth using domain adaptation. Left: ground truth images from the DeepGlobe dataset. Middle: test images from the DeepGlobe dataset. Right: output of the model.

Fig. 16. Results of training with domain adaptation from Sentinel-2 to Worldview-2. Left: ground truth images from the Worldview-2 dataset. Middle: test images from the Worldview-2 dataset. Right: output of the model.
Fig 19 shows examples from the test images using this domain adaptation. Although the results are not very precise, especially on small details, they are much better than those in Fig 12.

Fig. 17. Results of training with domain adaptation from Sentinel-2 to DeepGlobe. Left: ground truth images from the DeepGlobe dataset. Middle: test images from the DeepGlobe dataset. Right: output of the model.
While the data covering Finland is accurate, the data covering Germany is not. Several errors can be spotted where small objects, such as individual houses on farms, are ignored, which defeats the purpose of having a high resolution raster. The label data was aligned to the same Coordinate Reference System (CRS) as the corresponding satellite rasters. Furthermore, the CLC raster was upsampled to match the pixel resolution of the satellite raster it covers, then divided into patches. It was then converted into a single channel 8bit PNG with pixel values ranging from 0 to 6, representing the class at each pixel. The DeepGlobe dataset comes with its own labels, made by human annotators, which the Corine labels were made to match for easier comparison. The labels are the same ones shown in Table IV.
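The label preparation can be sketched as two small steps: merging CLC codes into class indices and nearest-neighbor upsampling by an integer factor, so that no new label values are created by interpolation. The mapping below is purely hypothetical and only illustrates the mechanism (the actual grouping is the one in Table IV); the function names and the integer scale factor are likewise assumptions.

```python
# Hypothetical mapping from CLC level-3 codes to merged class indices (0..6).
# The real grouping is given in Table IV and is not reproduced here.
EXAMPLE_MAPPING = {111: 0, 112: 0, 311: 3, 312: 3, 512: 4}


def merge_classes(labels, mapping):
    """Replace every CLC code in a label raster with its merged class index."""
    return [[mapping[code] for code in row] for row in labels]


def upsample_nearest(labels, factor):
    """Repeat each label pixel `factor` times along both axes."""
    out = []
    for row in labels:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


clc_patch = [[111, 312], [512, 311]]
merged = merge_classes(clc_patch, EXAMPLE_MAPPING)
print(upsample_nearest(merged, 2))   # e.g. 20m/px labels matched to 10m/px imagery
```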

TABLE V. RESULTS WITHOUT DOMAIN ADAPTATION AND DEEPLABV2

TABLE VI. RESULTS WITH BDL AND DEEPLABV2