Semantic Segmentation of Remote Sensing Images with Sparse Annotations


Yuansheng Hua, Diego Marcos, Lichao Mou, Xiao Xiang Zhu, Fellow, IEEE, Devis Tuia, Senior Member, IEEE

Abstract—This is the preprint version. To read the final version, please go to IEEE Geoscience and Remote Sensing Letters. Training Convolutional Neural Networks (CNNs) for very high resolution images requires a large quantity of high-quality pixel-level annotations, which is extremely labor- and time-consuming to produce. Moreover, professional photo interpreters might have to be involved for guaranteeing the correctness of annotations. To alleviate such a burden, we propose a framework for semantic segmentation of aerial images based on incomplete annotations, where annotators are asked to label a few pixels with easy-to-draw scribbles. To exploit these sparse scribbled annotations, we propose the FEature and Spatial relaTional regulArization (FESTA) method to complement the supervised task with an unsupervised learning signal that accounts for neighbourhood structures both in spatial and feature terms. For the evaluation of our framework, we perform experiments on two remote sensing image segmentation datasets involving aerial and satellite imagery, respectively. Experimental results demonstrate that the exploitation of sparse annotations can significantly reduce labeling costs, while the proposed method can help improve the performance of semantic segmentation when training on such annotations. The sparse labels and code are publicly available for reproducibility purposes¹.

Index Terms—Semantic segmentation, aerial image, sparse scribbled annotation, convolutional neural networks, semi-supervised learning.

I. INTRODUCTION
Semantic segmentation of remote sensing imagery aims at identifying the land-cover or land-use category of each pixel in an image. As one of the fundamental visual tasks, semantic segmentation has been attracting wide attention in the remote sensing community and has proven to be beneficial to a variety of applications, such as land cover mapping, traffic monitoring, and urban management. Recently, many studies [1] resort to learning deep Convolutional Neural Networks (CNNs) with full supervision for semantic segmentation and have obtained enormous achievements. However, training a fully supervised segmentation CNN requires a huge volume of dense pixel-level ground truths, which are labor- and time-consuming to generate. Moreover, expert annotators might be needed for correctly identifying pixels located at object boundaries and in ambiguous regions (e.g., shadows in Fig. 1).
YH, LM and XXZ are with Data Science in Earth Observation, Technical University of Munich, Germany, and the Remote Sensing Technology Institute, German Aerospace Center, Germany (e-mails: yuansheng.hua@dlr.de; lichao.mou@dlr.de; xiaoxiang.zhu@dlr.de). DM is with the Laboratory of GeoInformation Science and Remote Sensing, Wageningen University, the Netherlands (e-mail: diego.marcos@wur.nl). DT was with the same laboratory and is now with the Ecole Polytechnique Fédérale de Lausanne, Sion, Switzerland (e-mail: devis.tuia@epfl.ch). (Correspondence: Xiao Xiang Zhu, Devis Tuia.) The work is supported by the German Federal Ministry of Education and Research - AI future lab "AI4EO" (Grant number: 01DD20001).
¹https://github.com/Hua-YS/Semantic-Segmentation-with-Sparse-Labels

To alleviate the requirement of dense pixel-wise annotations, semi-supervised learning approaches have been proposed to make use of additional information, such as spatial relations (e.g., neighboring pixels are likely to belong to the same class) or feature-level relations (e.g., pixels with similar CNN feature representations are likely to belong to the same class), for semantic segmentation. These methods aim to utilize low-cost annotations, such as points [2], scribbles [3], [4], or image-level labels [5], [6]. As the first attempt, Bearman et al. [2] proposed to learn semantic segmentation models with point-level supervision, where only one point is labeled for each instance. In scribble-supervised algorithms, annotations are provided in the form of hand-drawn scribbles. Wu et al. [3] propose to learn aerial building footprint segmentation models from scribbles. Maggiolo et al. [4] argue that a network directly trained on scribbled ground truths fails to accurately predict object boundaries and propose to employ a fully connected Conditional Random Field (CRF) to refine the shapes of objects. Compared to fully annotated ground truths, scribbled annotations (cf. Fig. 1(c)) are easier to generate in a user-friendly way. In comparison with point-level annotations (e.g., Fig. 1(b)), scribbles can provide stronger supervisory signals. However, point- and scribble-supervised segmentation methods remain under-explored in the remote sensing community. To this end, we propose a simple yet effective framework for semantic segmentation of remote sensing imagery with low-cost annotations. In this framework, we manually create point- or scribble-level annotations and train networks on them. Besides, we also evaluate polygon-level annotations (see Fig. 1(d)), which can be easily produced and cover more pixels than the other types of annotations. Since these annotations are sparsely distributed across the images, we call them sparse annotations in the following sections. In order to better exploit sparse annotations, we propose a semi-supervised learning method which encodes and regularizes feature and spatial relations. To demonstrate the effectiveness of our learning framework, extensive experiments are conducted on two VHR datasets, Vaihingen and Zurich Summer.

II. METHODOLOGY

A. Supervision with Sparse Annotations
In contrast to conventional dense annotations, sparse annotations have two characteristics: 1) only a very small proportion of pixels are assigned semantic classes, and 2) objects do not need to be entirely annotated (see (b), (c), (d) in Fig. 1). This greatly reduces the effort required from the annotators, as complex boundaries and ambiguous pixels can be avoided.
Here we consider three levels of sparse annotations: point-, scribble-, and polygon-level. Specifically, point-level annotations indicate that, for an annotator interaction, only one single pixel is labeled. Scribble-level annotations, also called line-level annotations, are yielded by drawing a scribble line within an object and assigning all pixels along this line the same class label. Similarly, polygon-level annotations can be generated by drawing a polygon within an object and classifying pixels located in the polygon into the same semantic class. Examples of these three levels of annotations are shown in Fig. 1.

B. Feature and Spatial Relational Regularization
When using sparse annotations, the vast majority of pixels in the training images are left unlabelled. In order to exploit both labeled and unlabeled pixels, we develop a semi-supervised methodology, named FEature and Spatial relaTional regulArization (FESTA), to enable a semantic segmentation CNN to learn discriminative features while leveraging the unlabelled image pixels. An assumption shared by many unsupervised learning algorithms [7] is that nearby entities often belong to the same class. Based on this assumption, a recent work [8] achieves success in representation learning by encoding neighborhood relations in the feature space. Inspired by this work, we propose to encode and regularize relations between pixels in both the feature and spatial domains, as shown in Fig. 2, so that the learned features become more useful for semantic segmentation.
Specifically, given a sample x_i (i.e., a CNN feature vector extracted from location i in an image), we first encode its relations to all other samples by measuring the spatial distance and feature similarity with respect to all other features in the image. The sample with the smallest similarity is considered as the far-away sample in the feature space, x_i^{ff}, while that with the highest similarity is defined as the neighboring sample in feature space, x_i^{nf}. According to the aforementioned proximity assumption, it is highly probable that x_i and x_i^{nf} belong to the same class, and thus the distance between them should be as small as possible. In order to prevent a trivial solution in which all features collapse to the same point, x_i and x_i^{ff} are encouraged to further increase their dissimilarity. We apply a similar reasoning in the spatial domain, since images are smooth in spatial terms. Thus, we take the 8 spatial neighbors of x_i into consideration and choose the one most similar in feature space as the spatial neighbor, x_i^{ns}. This operation is intended to prevent pairing x_i with a spatial neighbor that belongs to the object boundary.
These priors can be incorporated into the learning objective by using the following loss function:

L_FESTA = (1/N) Σ_{i=1}^{N} [ α D(x_i, x_i^{nf}) + β D(x_i, x_i^{ns}) + γ S(x_i, x_i^{ff}) ],   (1)

where D denotes the Euclidean distance and S represents the cosine similarity. α, β, and γ are trade-off parameters representing the significance of the respective terms, and N represents the number of pixels in a given image. By minimizing L_FESTA, x_i^{nf} and x_i^{ns} are forced to move closer to x_i, while x_i^{ff} is pushed far from x_i. In order to jointly exploit the sparse scribbled annotations and FESTA for the network training, the final loss is defined as:

L = L_ce + λ L_FESTA,   (2)

where L_ce indicates the categorical cross-entropy loss calculated from the pixels with annotations.
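The neighbor selection and regularizer described above can be prototyped as the following NumPy sketch. This is an illustrative re-implementation, not the authors' released code: the function names are ours, the computation is per-image with a dense pairwise similarity matrix, and borders are handled by wrap-around as a simplification.

```python
import numpy as np

def spatial_neighbor(feats):
    """For each pixel, pick the most feature-similar of its 8 spatial
    neighbors (x_i^ns). Image borders wrap around via np.roll, which is
    a simplification of proper border handling."""
    norm = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    best_sim = np.full(feats.shape[:2], -np.inf)
    best = np.zeros_like(feats)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(feats, (dy, dx), axis=(0, 1))
            sim = (norm * np.roll(norm, (dy, dx), axis=(0, 1))).sum(-1)
            mask = sim > best_sim
            best_sim = np.where(mask, sim, best_sim)
            best[mask] = shifted[mask]
    return best

def festa_loss(feats, alpha=0.5, beta=1.5, gamma=1.0):
    """FESTA regularizer for one (H, W, C) feature map, following Eq. 1:
    pull the feature-space neighbor x_i^nf and spatial neighbor x_i^ns
    toward x_i, push the far-away sample x_i^ff from x_i."""
    H, W, C = feats.shape
    flat = feats.reshape(-1, C)
    unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T                    # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)
    x_nf = flat[sim.argmax(axis=1)]        # most similar other sample
    np.fill_diagonal(sim, np.inf)
    x_ff = flat[sim.argmin(axis=1)]        # least similar sample
    x_ns = spatial_neighbor(feats).reshape(-1, C)

    dist = lambda a, b: np.linalg.norm(a - b, axis=1)          # D
    cos = lambda a, b: (a * b).sum(1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)  # S
    return float(np.mean(alpha * dist(flat, x_nf)
                         + beta * dist(flat, x_ns)
                         + gamma * cos(flat, x_ff)))
```

In an actual training loop this term would be computed on mini-batch feature maps with automatic differentiation and added to the cross-entropy loss with weight λ, as in Eq. 2.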

C. CRF for Boundary Refinement
To further refine the predictions of networks trained on scribbled annotations, we integrate a fully connected CRF [9] into our system. The energy function of the CRF model is

E(x) = Σ_i θ_u(x_i) + Σ_{i<j} θ_p(x_i, x_j),   (3)

where θ_u(x_i) is the unary potential, calculated as θ_u(x_i) = −log P(x_i). Here i ranges over the pixels in the image, and P(x_i) is the label probability of pixel i. The pairwise potential θ_p(x_i, x_j) measures the compatibility between pixels i and j. We tested two Gaussian kernels,

k_1 = exp(−||p_i − p_j||² / (2θ_1²) − ||I_i − I_j||² / (2θ_2²)),   k_2 = exp(−||p_i − p_j||² / (2θ_3²)),   (4)

where p_i and I_i indicate the position and color intensity of pixel i, and θ_1, θ_2, and θ_3 are hyperparameters that control the kernel "scale". In Eq. 4, k_1, known as the appearance kernel, tends to classify neighboring pixels with similar appearance [10], i.e., color intensity, into the same class, while k_2, the so-called smoothness kernel, penalizes nearby pixels being assigned different labels. This step is expected to make the class map smoother within homogeneous areas.
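The two kernels in Eq. 4 can be sketched as below. The function names are ours, and the linear combination weights and mean-field inference that a full dense-CRF implementation also requires are omitted here:

```python
import numpy as np

def appearance_kernel(p_i, p_j, I_i, I_j, theta1=30.0, theta2=10.0):
    """k1: large when two pixels are spatially close AND have similar
    color intensity, encouraging such pairs to share a label."""
    pos = np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
    col = np.sum((np.asarray(I_i, float) - np.asarray(I_j, float)) ** 2)
    return np.exp(-pos / (2 * theta1 ** 2) - col / (2 * theta2 ** 2))

def smoothness_kernel(p_i, p_j, theta3=10.0):
    """k2: depends on position only; penalizes nearby pixels taking
    different labels regardless of color."""
    pos = np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
    return np.exp(-pos / (2 * theta3 ** 2))
```

The default θ values match the settings reported in Section III-C (30, 10, and 10).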

III. EXPERIMENTS

A. Dataset Description
The Vaihingen dataset² is a benchmark dataset for semantic segmentation provided by the International Society for Photogrammetry and Remote Sensing (ISPRS). 33 aerial images with a spatial resolution of 9 cm were collected over the city of Vaihingen, and each image covers an average area of 1.38 km². For each aerial image, three bands are available: near infrared (NIR), red (R), and green (G). Besides, co-registered digital surface models (DSMs) are provided for all images. 16 images are fully annotated. In total, six land-cover classes are considered: impervious surface, building, low vegetation, tree, car, and clutter/background. In this paper, we follow the train-test split scheme of most existing works [11], [12] and select five images (image IDs: 11, 15, 28, 30, 34) as the test set. The remaining ones are utilized to train our models.

The Zurich Summer dataset [13] is composed of 20 images taken over the city of Zurich in August 2002 by the QuickBird satellite. The spatial resolution is 0.62 m, and the average image size is 1,000 × 1,150 pixels. The images consist of four channels: near infrared (NIR), red (R), green (G), and blue (B). Following previous works [14], [15], we only utilize NIR, R, and G in our experiments and train our models on 15 images; the others (image IDs: 16, 17, 18, 19, 20) are utilized for testing. In total, there are 8 urban classes: road, building, tree, grass, bare soil, water, railway, and swimming pool. Uncategorized pixels are labeled as background.
It is noteworthy that although full pixel-wise annotations are provided for all images in the Vaihingen and Zurich Summer datasets, we only use them in the test phase to calculate evaluation metrics. The training of all models is done with the scribbled annotations described below.

B. Scribbled Annotation Generation
To annotate large-scale images, we employ an online labeling platform, LabelMe³, and ask annotators to draw by following these rules: 1) for each class, annotations are supposed to cover diverse appearances (see regions a, b, and c in Fig. 3, where cars of different colors are annotated) and be located in different positions of the image; 2) polygon- and line-level annotations are not required to delineate object boundaries precisely, see the annotations of trees in Fig. 1(c) and 1(d). In order to make the time spent on each level of scribbled annotation comparable, we ask 4 annotators (including 2 non-experts) to label 7, 5, and 3 objects per class for point-, line-, and polygon-level annotations in each aerial image. As a consequence, sparse but accurate annotations can be produced rapidly with little effort. Since a point- or line-level annotation is often located in the central area of an object and distant from its boundary, we perform morphological dilation on all point- and line-level annotations with a disk of radius 3. Afterwards, pixels involved in the dilated annotations are assigned the same class labels as their central points or lines. For polygon-level annotations, pixels within each polygon are assigned the corresponding classes.
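The dilation step above can be sketched as follows. The function name and the label encoding (255 marking unlabeled pixels) are our assumptions, and where dilated regions of two annotations overlap, the later one in scan order simply wins:

```python
import numpy as np

def dilate_annotations(labels, radius=3, ignore=255):
    """Grow each annotated point/line pixel into a disk of the given
    radius, copying its class label. `labels` is an (H, W) int array in
    which `ignore` marks unlabeled pixels (assumed convention)."""
    H, W = labels.shape
    out = np.full_like(labels, ignore)
    # Precompute the (dy, dx) offsets of a disk structuring element.
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    offsets = np.argwhere(yy ** 2 + xx ** 2 <= radius ** 2) - radius
    for y, x in zip(*np.nonzero(labels != ignore)):
        for dy, dx in offsets:
            ny, nx = y + dy, x + dx
            if 0 <= ny < H and 0 <= nx < W:  # clip at image borders
                out[ny, nx] = labels[y, x]
    return out
```

A single labeled point then becomes a disk of 29 pixels for radius 3, which matches the per-pixel counts reported in Table I being far below those of dense annotations.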
Table I shows the average number of pixels with sparse and dense annotations in both datasets. It can be seen that the sparse annotations are several orders of magnitude fewer than the dense ones. As to labeling time, it took on average 133, 126, and 161 seconds per image to produce point-, line-, and polygon-level annotations, respectively, for the Vaihingen dataset, and 177, 162, and 238 seconds per image for the Zurich Summer dataset. In Section III-D, we demonstrate that the proposed method improves semantic segmentation results using these sparse annotations, and in Section III-E we discuss the differences observed among the tested annotation types.

C. Training Details
We segment the images with a standard FCN (i.e., FCN-16s [17]) and initialize convolutional layers with Glorot uniform initializers [18]. Specifically, VGG-16 is taken as the backbone, and the outputs of the last two convolutional blocks are upsampled to the original resolution and fused with an element-wise addition. The fused feature maps are finally fed into a convolutional layer, where the number of filters equals the number of classes. In the training phase, all weights are trainable and updated with Nesterov Adam [19], using β1 = 0.9, β2 = 0.999, and ε = 1e−08 as recommended. We initialize the learning rate as 2e−04 and let it decay by a factor of 10 when the validation loss saturates. To train the network, we define the loss as in Eq. 2, and λ is set experimentally to 0.1 and 0.01 for the Vaihingen and Zurich Summer datasets, respectively. The trade-off parameters α, β, and γ are set to 0.5, 1.5, and 1 to ensure that 1) the regularizers governing feature and spatial relations are balanced, and 2) neighboring pixels in the image space receive more attention. The network, as well as FESTA, is implemented in TensorFlow and trained on one NVIDIA Tesla P100 16GB GPU for 100k iterations. The mini-batch size is set to 5 during training. In the training phase, we use a sliding window to crop training images into 256×256 patches, with a stride of 64 pixels. Besides, no class-dependent configurations are considered. In the test phase, we employ the dense CRF to refine predictions before calculating metrics. We tuned the parameters of the dense CRF (θ1, θ2, and θ3 in Eq. 4) on validation images and find that satisfactory results can be achieved for both FCN and FCN-FESTA when setting them to 30, 10, and 10, respectively. In the case of large homogeneous areas of an image belonging to the same class, α should be set to a small value, which encourages the network to focus more on geographically nearby samples. Besides, a large batch size and sliding window can also help alleviate the influence of such a scenario.

D. Comparing with Existing Methods
We compare a Fully Convolutional Network (FCN) [17] learned using the proposed FESTA (FCN-FESTA) against an FCN learned with a weighted loss function (FCN-WL) [16] on sparse annotations. We also report segmentation results of the baseline FCN trained on dense labels. In addition, we study the influence of the fully connected CRF by comparing FCN-FESTA+dCRF and FCN+dCRF [4]. Each model is trained and validated on sparse annotations independently. Per-class F1 scores, mean F1 scores, and the overall accuracy (OA) are calculated on test images with dense annotations. Considering that each model is learned on the labels of each of the four annotators, we average the metrics over annotators and report them in the form of mean ± standard deviation.
Table II exhibits numerical results on the Vaihingen dataset. FCN-FESTA+dCRF achieves the highest mean F1 scores when training with all kinds of scribbled annotations, which demonstrates its effectiveness. To be more specific, with point- and polygon-level supervision, FCN-FESTA improves the mean F1 score by 3.95% and 3.33% compared to FCN-WL, respectively. By refining predictions with the dense CRF, FCN-FESTA+dCRF achieves improvements of 2.38% and 4.03% in comparison with FCN+dCRF. It is interesting to observe that line-level scribbles improve the segmentation performance the most, and FCN-FESTA+dCRF learned with such annotations obtains the highest mean F1 score, 70.58%. Moreover, we note that FESTA can enhance the network's capability of recognizing small objects, i.e., cars, in high resolution aerial images. Example segmentation results of networks trained on line annotations are visualized in Fig. 4.
Numerical results on the Zurich Summer dataset are shown in Table III. As can be seen, FESTA contributes gains of 3.92%, 1.37%, and 1.31% in the mean F1 score when training with point-, line-, and polygon-level annotations, respectively. By utilizing line annotations and the dense CRF, FCN-FESTA+dCRF obtains the highest mean F1 score, 69.43%. Besides, we note that the dense CRF plays a significant role in improving the results of networks trained on point-level scribbles. Example visual results of networks trained on line annotations are shown in Fig. 5. In our experiments, we also train networks with a multi-class Dice loss and find that the results are comparable to those obtained with the cross-entropy loss.

E. Discussion on annotation type
To further study the influence of the annotation type, we also train baseline FCNs on dense annotations and report numerical results in Tables II and III. As shown in Tables II and III, line-level annotations lead to the best performance on both datasets, even though the number of labeled pixels is an order of magnitude smaller than for polygon annotations (see Table I). Although it was expected that line annotations would outperform point annotations, due to their ability to capture within-object variations, we were surprised to see that they also outperformed polygon annotations. We suspect that this is linked to the fact that the number of pixels per object grows quadratically for polygons and linearly for lines. This would lead to a more balanced weighting of differently sized objects in the case of line annotations and an under-weighting of smaller objects in the case of polygon annotations, which could harm the model's performance. Another reason could be that, since drawing a line is faster than drawing a polygon, annotators for the line features provided more scribbles in the same time budget.
In spite of the mean F1 performance boost provided by FESTA, there is still a large gap with respect to the FCN model trained with dense ground truths: 13% in Vaihingen and 8% in Zurich. This gap is, however, not evenly distributed across the classes. It is smaller or non-existent for classes such as water, tree, grass, or soil, which are often homogeneous in terms of materials. On the contrary, it is larger for classes with more diverse materials (and therefore observed spectral values), such as building and car (in the Vaihingen dataset). It is noteworthy that the class railway, in the Zurich dataset, is systematically missed in all cases, including by the densely supervised FCN.

IV. CONCLUSION
In this paper, we propose a simple yet efficient framework for semantic segmentation of aerial images using sparse annotations and a semi-supervised learning objective. In order to validate the effectiveness of our approach, we conduct experiments on the Vaihingen and Zurich Summer datasets. Numerical and visual results suggest that the proposed method contributes to the improvement of semantic segmentation results using several kinds of sparse annotations. Although models learned on sparse annotations achieve lower accuracies than those using dense annotations, we show that a semi-supervised deep learning approach can help close this performance gap while leveraging sparse annotations that significantly reduce the cost of label generation. As future work, the proposed framework can be further improved by introducing graph-based models and prior knowledge learned from label semantics.
Fig. 1. Comparison of different levels of scribbled annotations. Trees (marked in green) are taken as an example. Images from left to right are (a) an aerial image, (b) point-, (c) line-, and (d) polygon-level scribbled annotations, and (e) dense pixel-wise labels.

Fig. 2. Illustration of the proposed FESTA. A sample x_i belonging to building (filled with black) is taken as an example.

Fig. 3. Example polygon-level annotations of an image (ID: 13) from the Vaihingen dataset. Annotations of cars are zoomed in to illustrate that annotations should cover varied visual appearances within one class. Legend: white: impervious surfaces; blue: buildings; cyan: low vegetation; green: trees; yellow: cars.
Fig. 4. Examples of segmentation results on the Vaihingen dataset: (a) image, (b) dense GT, (c) FCN-WL, (d) FCN+dCRF, (e) ours. All models are trained on line annotations. The legend is the same as that in Fig. 3.

Fig. 5. Examples of segmentation results on the Zurich Summer dataset: (a) image, (b) dense GT, (c) FCN-WL, (d) FCN+dCRF, (e) ours. All models are trained on line annotations. Legend: black: road; brown: soil; green: grass; dark green: tree; gray: building; white: background.

TABLE I
THE TOTAL NUMBERS OF PIXELS LABELED WITH SPARSE POINT-, LINE-, AND POLYGON-LEVEL ANNOTATIONS (MIDDLE THREE COLUMNS) AND DENSE ANNOTATIONS (RIGHT COLUMN) IN THE VAIHINGEN AND ZURICH SUMMER DATASETS.
*Background/Clutter is not considered.

TABLE II
NUMERICAL RESULTS ON THE VAIHINGEN DATASET (%): WE SHOW THE PER-CLASS F1 SCORE, MEAN F1 SCORE, AND OVERALL ACCURACY ON THE TEST SET. MEAN AND STANDARD DEVIATION OF EACH METRIC ARE CALCULATED FROM RESULTS ON SPARSE ANNOTATIONS PRODUCED BY 4 ANNOTATORS. RESULTS ON DENSE ANNOTATIONS ARE PROVIDED AS REFERENCE.