Semi-Supervised Building Footprint Generation with Feature and Output Consistency Training

Accurate and reliable building footprint maps are vital to urban planning and monitoring, and most existing approaches rely on convolutional neural networks (CNNs) for building footprint generation. However, one limitation of these methods is that they require strong supervisory information from massive annotated samples for network learning. State-of-the-art semi-supervised semantic segmentation networks with consistency training can help to deal with this issue by leveraging a large amount of unlabeled data and encouraging the model output to be consistent under data perturbation. Considering that rich information is also encoded in feature maps, we propose to integrate the consistency of both features and outputs in the end-to-end network training of unlabeled samples, which imposes additional constraints. Prior semi-supervised semantic segmentation networks build on the cluster assumption, in which the decision boundary should lie in the vicinity of low sample density. In this work, we observe that for building footprint generation, the low-density regions are more apparent at the intermediate feature representations within the encoder than at the encoder's input or output. Therefore, we propose an instruction to assign the perturbation to the intermediate feature representations within the encoder, which considers the spatial resolution of the input remote sensing imagery and the mean size of individual buildings in the study area. The proposed method is evaluated on three datasets with different resolutions: Planet dataset (3 m/pixel), Massachusetts dataset (1 m/pixel), and Inria dataset (0.3 m/pixel). Experimental results show that the proposed approach extracts more complete building structures and alleviates omission errors.


I. INTRODUCTION
Building footprint generation is a hot topic in the remote sensing community and supports numerous applications, such as identifying undocumented buildings and assessing building damage after natural disasters. Remote sensing imagery, which offers potential for meaningful geospatial target extraction on a large scale, has become a fundamental data source for building footprint generation. However, obtaining accurate and reliable building footprint maps from remote sensing imagery is still challenging for several reasons. On the one hand, the complex and heterogeneous appearance of buildings leads to internal variability. On the other hand, mixed backgrounds and other objects with similar spectral signatures further limit the class separability.
Nowadays, convolutional neural networks (CNNs) have been widely used for remote sensing tasks [1] [2] [3], as they surpass conventional methods in terms of accuracy and efficiency. CNNs are capable of directly learning hierarchical contextual features from the original input, which gives them greater generalization capabilities for building footprint generation from remote sensing imagery. Although existing CNNs are able to deliver very promising results [2] [4] [5] [6], extracting building footprints on a large scale remains a challenge. This challenge arises because CNNs require massive annotated data to obtain strong supervisory information, and manual annotation of reference data is a time-consuming and costly process.
To address this issue, a straightforward idea is to utilize semi-supervised learning, which can leverage a large amount of unlabeled data and alleviate the need for labeled examples. In general, semi-supervised semantic segmentation methods fall into three types: weakly-supervised training-based, adversarial training-based, and consistency training-based. Weakly-supervised training-based methods need additional annotations, e.g., image-level labels or region-level labels. Adversarial training-based methods are able to make use of the unlabeled data but are difficult to train. Consistency training-based approaches, by contrast, are not only simple to implement but also require no additional weakly labeled examples. The core idea of consistency training-based methods is to encourage the network to give consistent outputs for unlabeled inputs that are perturbed in various ways, thus improving the generalization of the network [7].
The state-of-the-art consistency training-based methods exploit the teacher-student framework [8]. Specifically, a student model is applied to the unlabeled sample, while a teacher model is applied to a perturbed version of the same sample. Afterward, consistency is imposed between the outputs of the two models to improve the performance of the student model [8]. However, there is still a certain gap in performance between these two models when the outputs are not completely correct during training. Inspired by the finding in [9] that feature maps can capture more discriminative contextual information, we further improve the performance of consistency training by proposing a new consistency loss that measures the discrepancy between both the feature maps and the outputs of the student model and those of the teacher model. This offers a strong constraint to regularize the learning of the network.
The effectiveness of consistency training-based approaches depends heavily on the behavior of the data distribution, i.e., the cluster assumption, where the classes must be separated by low-density regions. However, the low-density regions separating the classes are not within the inputs, which offers an explanation for why semi-supervised learning is a challenging problem for semantic segmentation [10]. The authors of [11] observe that, for natural images, low-density regions separating the classes are present at the encoder's output, and thus propose to assign the perturbation at this position. However, for remote sensing imagery with low spatial resolution, we observe that the low-density regions separating the classes are present within the intermediate feature representations in the encoder rather than at the encoder's input or output. Motivated by this observation, we propose to enforce consistency over the perturbation applied to feature representations at a certain depth within the encoder, where this depth should be in line with the spatial resolution of the remote sensing imagery and the mean size of individual buildings in the study area.
Specifically, we consider a shared encoder and a main decoder that are trained together using the labeled examples. To leverage unlabeled data, we then consider an auxiliary decoder whose inputs are perturbed versions of the shared encoder's output. Consistency is imposed between the outputs and feature maps of the main decoder and those of the auxiliary decoder. By doing so, the shared encoder's representation is enhanced by the additional training signal extracted from the unlabeled data.
This work's contributions are threefold.
(1) We propose a semi-supervised network for building footprint generation, which has not been adequately addressed in the current literature. When the annotated samples are insufficient, the proposed method can leverage a large amount of unlabeled data to improve the performance of a model.
(2) Our proposed method integrates the consistency training of features and outputs into a unified objective function, which formulates an efficient end-to-end training framework. Compared with other competitors, our approach gains significant improvements.
(3) Observing that the low-density regions separating the classes are within the intermediate feature representations in the encoder, we propose an instruction, in which the perturbation is applied on the feature representations at a certain depth within the encoder according to the spatial resolution of input remote sensing imagery and the mean size of individual buildings in the study area.
The remainder of the paper is organized as follows. Related work is reviewed in Section II. Section III details the proposed network for building footprint generation. The experiments are described in Section IV. Results and discussions are provided in Sections V and VI, respectively. Finally, Section VII summarizes this work.

II. RELATED WORK

A. Building Footprint Generation
A tremendous amount of remote sensing imagery can be collected with recent technological advances, providing huge potential for mapping buildings. A variety of methods have been proposed to generate building footprints from remote sensing imagery.
Early studies can be categorized into four types: geometrical primitive-based, index-based, segmentation-based, and classification-based methods. The geometrical primitive-based methods [12] first extract geometric primitives (e.g., building edges and corners) and then group them to form building hypotheses. In the index-based methods [13], an index is designed to discriminate buildings from other objects, and buildings are then extracted by selecting an empirical threshold. By utilizing over-segmentation algorithms, the segmentation-based methods [14] aim at partitioning an image into different segments, so-called superpixels, and identifying those belonging to buildings. In the classification-based methods [15], spectral and/or spatial features of each pixel are taken as the input of classifiers to differentiate buildings from other classes. Nonetheless, a general limitation of these methods is that they rely heavily on manually defined rules and handcrafted features, resulting in a decrease in accuracy and efficiency.
In the past few years, deep learning-based methods have shown remarkable performance on this task, as discriminative features can be automatically and adaptively learned from raw images. Early methods [16] [17] employ a patch-wise classification framework and assign the label to each pixel according to the class of its enclosing patch. However, the large overlap among patches leads to redundant operations and low efficiency. Therefore, semantic segmentation networks, which can efficiently perform pixel-wise segmentation, have become more popular in the task of building footprint generation [18] [25]. The commonly used network architectures involve fully convolutional networks (FCNs) [26] and encoder-decoder based architectures (e.g., DeepLabv3+ [27], Efficient-UNet [28], FC-DenseNet [29]). In order to take the characteristics of buildings in remote sensing imagery into account, some methods (e.g., ESFNet [30], MA-FCN [31], HA U-Net [32], and Multi-task [33]) have made specific adaptations to these network architectures, e.g., attention blocks and multi-scale feature aggregation. More recently, instance segmentation networks have been exploited to delineate individual building instances in several novel studies [34] [35]. Instance segmentation networks can not only assign a semantic label to each pixel with the class of its enclosing object but also distinguish different instances. The commonly used instance segmentation architecture for this task is Mask R-CNN [36].

B. Semi-Supervised Semantic Segmentation
Deep learning methods require strong supervisory information for network training. However, the collection of large volumes of annotated data is time-consuming and costly. Especially for the task of semantic segmentation, the acquisition of pixel-level labels is more expensive and laborious. Therefore, semi-supervised learning is favored in this task, as it can leverage a large amount of unlabeled data to compensate for limited supervisory information. In general, semi-supervised semantic segmentation methods fall into three types: weakly-supervised training-based, adversarial training-based, and consistency training-based.
Weakly-supervised training-based methods [37] [38] [39] [40] integrate weakly-supervised learning in their approaches. Apart from the limited pixel-level labels, they still require weaker labels that can be regarded as supervisory information for network training. For the application of building footprint generation, weaker labels include image-level labels, bounding boxes, and point labels. The image-level label has two classes, where "building" refers to images in which building pixels account for more than a certain share of the total pixels, and "non-building" corresponds to images without building pixels [41] [42]. In [43], bounding box annotations are utilized to generate probabilistic masks using a bivariate Gaussian distribution for every image. Point labels (two points inside and outside each small building, respectively) are employed in [44], which is helpful for detecting small buildings. Nevertheless, weakly-supervised training-based methods fail to take advantage of massive unlabeled data.
Adversarial training-based methods [45] [46] are able to exploit unlabeled samples by adapting generative adversarial networks (GANs) [47] for semi-supervised semantic segmentation. Both the generator and the discriminator are first trained with labeled samples. Afterward, the generator outputs the segmentation masks of unlabeled images, while the discriminator distinguishes trustworthy regions in the predicted results to provide additional supervisory signals. Considering that the adversarial training strategy may be insufficient to guide network training, pseudo labels are generated by selecting high-confidence segmentation predictions for unlabeled images [48]. Afterward, the pseudo-building masks are incorporated to expand the training data and the generator is retrained. However, adversarial training-based methods are very hard to train due to the instability of GANs [49].
By contrast, consistency training-based methods not only can leverage unlabeled images to improve the performance of the segmentation network but also are simple and efficient to implement. The goal of consistency training is to enforce the consistency of the model's predictions for unlabeled inputs to which small perturbations are applied. By doing so, the robustness of the learned model is enhanced. Recently, several consistency training-based methods have been proposed for the task of semi-supervised semantic segmentation, e.g., CutMix [10] and CCT [11]. CutMix [10] applies the perturbations to the raw input and uses MixUp [50] to enforce the consistency between the mixed outputs and the outputs from the mixed inputs. CCT [11] imposes an invariance of the model's outputs over small perturbations applied to the encoder's output. In the remote sensing community, two consistency training-based methods have been proposed for the application of building footprint generation, i.e., CR [51] and PiCoCo [52]. Color jitter and random noise are chosen as the perturbations for CR [51] and are applied to the raw input. Then, the consistency of the corresponding outputs is enforced. PiCoCo [52] is also an input perturbation method, which augments the input images randomly and imposes the consistency constraint between the predictions of the augmented images. In addition, it implements contrastive learning on labeled images, which can regularize the compactness of the intra- and inter-class latent representation space [52].
However, these consistency training-based methods still have two limitations. On the one hand, these methods ignore the rich information encoded in feature maps and generally impose consistency only over the outputs of the models. On the other hand, they add perturbations over the raw input or the encoder's output for all types of data, failing to take the characteristics of target objects into consideration when selecting the optimal position to apply perturbations.

III. METHODOLOGY
In this section, consistency training-based methods are first introduced. Afterward, the proposed framework in the end-to-end network learning procedure is described. Finally, we propose an instruction to assign perturbation for the task of building footprint generation, which is based on our observation and analysis of the cluster assumption.

A. Consistency Training-based Methods
Given a small set of $n$ input-target pairs $S_l = \{(x^l_1, y_1), ..., (x^l_n, y_n)\}$ sampled from an unknown joint distribution $\beta(x, y)$, the goal of supervised learning is to derive a prediction function $f_\theta(x)$, parametrized by $\theta$, that is able to assign the correct target $y$ to an unseen sample from $\beta(x)$. In semi-supervised learning, a larger set of $m$ unlabeled examples $S_u = \{x^u_1, ..., x^u_m\}$ is additionally provided. Semi-supervised learning aims to derive a more accurate prediction function than the one obtained by using $S_l$ alone. For instance, additional structure of the input distribution $\beta(x)$ can be learned from $S_u$ to produce an estimate of the decision boundary that better separates the samples into different classes [53].
Consistency training-based methods follow an intuitive goal to perform semi-supervised learning: when a perturbation is applied to a data point $x \in S_u$ to obtain $\tilde{x}$, the output $f_\theta(\tilde{x})$ should not differ significantly from $f_\theta(x)$. Therefore, the objective of consistency training-based methods is to minimize the following loss function:
$$L = L_s + \lambda_u L_{cons} \quad (1)$$
where $L_s$ is a supervised loss on labeled data, and $\lambda_u$ is a weighting function that controls the importance of the consistency loss term $L_{cons}$, which is formalized as:
$$L_{cons} = T(f_\theta(x), f_\theta(\tilde{x})), \quad x \in S_u$$
where $T(\cdot, \cdot)$ measures the discrepancy between the outputs of the prediction function for the original and the perturbed input. In this regard, the unlabeled data can be leveraged to find a smooth manifold on which the dataset lies [54].
Different settings in assigning the perturbation or minimizing $L_{cons}$ lead to a wide variety of approaches for semi-supervised classification, e.g., Virtual Adversarial Training (VAT) [55] and Interpolation Consistency Training (ICT) [56], and for semi-supervised semantic segmentation, e.g., CutMix [10], CCT [11], CR [51], and PiCoCo [52]. These methods are conducted in teacher-student frameworks, where a teacher model is first constructed from data perturbation, and then the output of the teacher model on unlabeled data is utilized to supervise a student model [8]. However, they have not fully leveraged the information of the teacher model, because they do not use the intermediate feature maps of the teacher model, which can also be regarded as knowledge to guide the learning of the student model. Therefore, a more precise consistency towards the underlying invariance of both features and outputs between the student model and the teacher model is preferable in our research.

B. Proposed Framework in End-to-End Network Learning
Recently, the perceptual mechanism has achieved promising results for image reconstruction [9] by making use of extracted high-level feature maps to improve network performance. Inspired by this, we propose to impose consistency on both features and predictions for the training of unlabeled data, which is capable of fully harnessing the information in deep features and output predictions. As a consequence, our network encourages the deep feature maps to be consistent, alleviating the loss of detailed information during network training.
As shown in Fig. 1, the proposed framework is composed of a shared encoder $E$, a main decoder $D$, and an auxiliary decoder $G$. The segmentation network $F$ is constituted as $F = E \circ D$ and is trained on the labeled set in a fully supervised manner. The auxiliary network $A = E \circ G$ is trained on the unlabeled examples by enforcing the consistency of both features and outputs between $D$ and $G$. $D$ takes as input the encoder's output $z_{out}$, whereas $G$ is fed with its perturbed version $\tilde{z}_{out}$, in which the perturbation $p$ is applied to the feature representations $z_{in}$ at a certain depth within $E$. By doing so, the representation learning of $E$ can be further improved by unlabeled examples, and subsequently, that of the segmentation network $F$.
For each iteration of training, a labeled input image $x^l$ and its label $y$ are sampled together with an unlabeled image $x^u$. Both $x^l$ and $x^u$ are passed through $E$ and $D$, obtaining two main predictions $\hat{y}^l$ and $\hat{y}^u$, respectively. The supervised loss $L_s$ is computed with $y$ and $\hat{y}^l$. For $x^u$, the perturbation $p$ is applied to $z_{in}$, its feature representation within $E$, and the resulting output of $E$ is $\tilde{z}_{out}$. Afterward, an auxiliary prediction $\hat{y}^u_a$ is generated from $G$ using $\tilde{z}_{out}$. The consistency loss $L_{cons}$ consists of two parts, $L_{uf}$ and $L_{up}$, where $L_{uf}$ is computed between the features of $G$ and those of $D$, and $L_{up}$ is computed between the outputs of $G$ and those of $D$.
In the proposed approach, the network is trained jointly on $S_l$ and $S_u$ by minimizing the global loss function $L$ in Eq. 1. Following [57], $\lambda_u$ is set to ramp up starting from zero along a Gaussian curve up to a fixed weight $\alpha$, which can avoid the use of the initial noisy output from the main encoder. The total loss $L$ is derived and back-propagated to train the segmentation network $F$ and the auxiliary network $A$. Note that $L_{cons}$ is not back-propagated through $D$, so $D$ is trained only on labeled examples and only on original (unperturbed) input data. This is helpful in two respects. On the one hand, it avoids collapsing solutions: if $L_{cons}$ were back-propagated through both the main decoder $D$ and the auxiliary decoder $G$, $D$ would collapse, since $L_{cons}$ is minimized when the predictions of both $D$ and $G$ are zeros. On the other hand, the method is better adapted to the test stage, since no perturbation is applied to test images.
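The ramp-up of the consistency weight can be sketched as follows. This is a minimal illustration, assuming the exponential form exp(-5(1 - t)^2) that is commonly used for Gaussian ramp-up schedules; the source only states that the weight follows a Gaussian curve up to a fixed weight, so the constant 5, the step-based progress `t`, and the function names `gaussian_rampup` and `total_loss` are assumptions. With this form the weight starts close to (not exactly at) zero.

```python
import math


def gaussian_rampup(step, rampup_steps, alpha=0.6):
    """Consistency weight lambda_u at a given training step.

    Assumed schedule: alpha * exp(-5 * (1 - t)^2), with t the training
    progress clipped to [0, 1]. alpha is the final fixed weight.
    """
    if rampup_steps <= 0:
        return alpha
    t = min(max(step / rampup_steps, 0.0), 1.0)
    return alpha * math.exp(-5.0 * (1.0 - t) ** 2)


def total_loss(l_s, l_cons, step, rampup_steps, alpha=0.6):
    """Global objective L = L_s + lambda_u * L_cons."""
    return l_s + gaussian_rampup(step, rampup_steps, alpha) * l_cons


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, round(gaussian_rampup(step, rampup_steps=100), 4))
```

Early in training the consistency term contributes almost nothing, so the noisy initial predictions have little influence on the unlabeled branch.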
For the labeled set, a supervised loss $L_s$ is exploited to train the segmentation network $F$. In order to avoid overfitting, an annealed version of the bootstrapped Cross-Entropy loss [11] is chosen to compute the supervised loss $L_s$, and it is denoted as:
$$L_s = \mathbb{1}\{F(x_i) < \eta\}\, H(y_i, F(x_i))$$
where $F(x_i)$ is the output probability from $F$ for a labeled example $x_i$, $y_i$ is its ground reference label, and $H(\cdot, \cdot)$ is the cross entropy-based loss. In semi-supervised learning, the model is often overfitted to the limited amount of labeled data while being under-fitted to the unlabeled data. To address this issue, a labeled example is utilized only if the model's confidence in it is lower than a predefined threshold $\eta$. In other words, $L_s$ is computed only over the pixels with a probability less than the threshold $\eta$, which serves as a ceiling to prevent over-training on easy labeled data [58]. Following [11], we gradually increase $\eta$ from 0.5 to 0.9 during the beginning of training.
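The thresholded supervised loss described above can be sketched in a few lines. This is an illustrative sketch for the binary (building / non-building) case, not the authors' implementation: it assumes that the confidence compared with the threshold is the probability assigned to the reference class, that the loss is averaged over the retained pixels, and that the threshold grows linearly (the source only says "gradually"). The names `bootstrapped_ce` and `eta_schedule` are hypothetical.

```python
import math


def bootstrapped_ce(probs, labels, eta):
    """Cross-entropy over pixels whose confidence is below eta.

    probs  : predicted probability of the class "building" per pixel
    labels : reference labels per pixel (1 = building, 0 = non-building)
    eta    : confidence ceiling; pixels at or above it are ignored
    """
    losses = []
    for p, y in zip(probs, labels):
        p_ref = p if y == 1 else 1.0 - p  # probability of the reference class
        if p_ref < eta:
            losses.append(-math.log(max(p_ref, 1e-12)))
    # Assumption: average over the retained pixels only.
    return sum(losses) / len(losses) if losses else 0.0


def eta_schedule(step, rampup_steps, start=0.5, end=0.9):
    """Assumed linear increase of the threshold from start to end."""
    t = min(max(step / rampup_steps, 0.0), 1.0)
    return start + (end - start) * t


if __name__ == "__main__":
    probs = [0.95, 0.60, 0.20]
    labels = [1, 1, 0]
    print(bootstrapped_ce(probs, labels, eta_schedule(0, 10)))
    print(bootstrapped_ce(probs, labels, eta_schedule(10, 10)))
```

With a low threshold, confidently classified pixels drop out of the loss, so the labeled branch concentrates on the pixels the model has not yet fitted.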
For an unlabeled example $x^u_i$, $z_{out}$ is derived as the output from the shared encoder $E$. One contribution of our approach is to apply the perturbation to the feature representation $z_{in}$ of $x^u$ within the encoder $E$ according to our proposed instruction. Afterward, the perturbed feature representations $\tilde{z}_{in}$ are fed to the subsequent layers in the encoder to generate the perturbed encoder's output $\tilde{z}_{out}$. Finally, $z_{out}$ and $\tilde{z}_{out}$ are taken as input for $D$ and $G$, respectively.
The training objective of the unlabeled set is to minimize a consistency loss $L_{cons}$, which is defined as:
$$L_{cons} = L_{up} + \omega_u L_{uf}$$
where $L_{uf}$ and $L_{up}$ measure the discrepancy between the features and the outputs of $D$ and those of $G$, respectively, and $\omega_u$ is a hyperparameter that weights the relative importance of the two losses. More specifically, $L_{up}$ is defined as:
$$L_{up} = T(D(z_{out}), G(\tilde{z}_{out}))$$
with $T(\cdot, \cdot)$ as a mean squared error-based loss.
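A small numerical sketch of the combined consistency loss may help here. It assumes the form $L_{cons} = L_{up} + \omega_u L_{uf}$, with $\omega_u$ weighting the feature term, and further assumes that the feature term (introduced in the next paragraph) is a plain sum of per-depth mean squared errors; any normalization over depths is not stated in the text available here. Flat Python lists stand in for tensors, and the names `mse` and `consistency_loss` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def consistency_loss(main_out, aux_out, main_feats, aux_feats, omega_u=0.2):
    """L_cons = L_up + omega_u * L_uf (assumed form).

    main_out, aux_out     : outputs of the main / auxiliary decoder
    main_feats, aux_feats : one flat feature list per decoder depth j
    The main decoder's values are treated as fixed targets, mirroring the
    statement that L_cons is not back-propagated through D.
    """
    l_up = mse(main_out, aux_out)
    # Assumption: sum of per-depth MSE terms, no averaging over depths.
    l_uf = sum(mse(d_j, g_j) for d_j, g_j in zip(main_feats, aux_feats))
    return l_up + omega_u * l_uf


if __name__ == "__main__":
    main_out, aux_out = [0.9, 0.1, 0.8], [0.7, 0.2, 0.8]
    main_feats = [[1.0, 2.0], [0.5, 0.5, 0.5, 0.5]]
    aux_feats = [[1.1, 1.8], [0.5, 0.4, 0.6, 0.5]]
    print(consistency_loss(main_out, aux_out, main_feats, aux_feats))
```

In a deep learning framework the same structure would typically be expressed with a tensor MSE loss and with gradients blocked on the main decoder's outputs and features.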
Note that a contribution of our approach is the loss term $L_{uf}$, which imposes consistency on the features between the main decoder and the auxiliary decoder and is thus able to harness the detailed information in the feature maps. Let $\phi_j(q)$ be the activations of the $j$th layer of a network $\phi$ when processing the input $q$. For $D$ and $G$, $D_j(z_{out})$ and $G_j(\tilde{z}_{out})$ are the corresponding feature maps at the $j$th depth in the decoder. Here, $j$ indexes the positions where upsampling operations are applied in the decoder. Then, $L_{uf}$ is denoted as:
$$L_{uf} = \sum_{j=1}^{J} T(D_j(z_{out}), G_j(\tilde{z}_{out}))$$
where $J$ is the total number of depths in the decoder, i.e., the number of upsampling operations applied in the decoder. The proposed semi-supervised method is summarized in Algorithm 1.

Algorithm 1 Algorithm for Feature and Output Consistency Training
Input: Labeled image $x^l$ and pixel-level label $y$, as well as unlabeled image $x^u$
Require: Shared encoder $E$, main decoder $D$ with the total depth number $J$, and auxiliary decoder $G$
1: Forward $x^l$ through $E$ and $D$: $\hat{y}^l = D(E(x^l))$
2: Forward $x^u$ through $E$: $z_{out} = E(x^u)$
3: Generate the main decoder's feature maps for $z_{out}$:
4: for $j = 1$ to $J$ do
5:   Derive $D_j(z_{out})$
6: end for
7: Generate the main decoder's output for $z_{out}$: Derive $D(z_{out})$
8: Forward $x^u$ through $E$ and apply a noise perturbation $N$ to feature representations $z_{in}$: $\tilde{z}_{in} = (z_{in} \odot N) + z_{in}$
9: Forward $\tilde{z}_{in}$ through the subsequent layers in $E$ to generate the perturbed encoder's output $\tilde{z}_{out}$
10: Generate the auxiliary decoder's feature maps for $\tilde{z}_{out}$:
11: for $j = 1$ to $J$ do
12:   Derive $G_j(\tilde{z}_{out})$
13: end for
14: Generate the auxiliary decoder's output for $\tilde{z}_{out}$: Derive $G(\tilde{z}_{out})$
15: Training the network.
    $L_s = \mathbb{1}\{\hat{y}^l < \eta\}\, H(y, \hat{y}^l)$

C. An Instruction to Assign Perturbation for the Task of Building Footprint Generation
The effectiveness of consistency training-based methods relies on the cluster assumption, i.e., two samples belonging to the same cluster in the input distribution are likely to have the same label [59]. In this case, the decision boundary should lie in the low-density regions [60]. In other words, if a decision boundary crosses a high-density region, it will divide a cluster into two different classes, which violates the cluster assumption. From the formal analysis, the expected value of $L_{cons}$ is proportional to the squared magnitude of the Jacobian of the network's outputs with respect to its inputs [7]. Therefore, minimizing $L_{cons}$ flattens the decision function in the regions of unsupervised samples and moves the decision boundary into the vicinity of low sample density [10].
The cluster assumption has inspired many recent consistency training-based methods for semi-supervised semantic segmentation [10] [11], which propose to assign the perturbation to the raw input or the encoder's output. However, they are not suitable for the task of building footprint generation, as the characteristics of both building objects and remote sensing imagery have not been taken into account. Therefore, we propose an instruction to assign perturbation for this task, which is inspired by the observation and analysis of the cluster assumption in building footprint generation from remote sensing imagery. In order to examine the cluster assumption, the local variations at an encoder depth $d$ are measured between the value of each pixel and its local neighbors, and local variations with high values depict the presence of low-density regions [10]. Here, $d$ denotes the number of downsampling operations that have been applied in the encoder. For instance, when $d = 1$, the spatial size (i.e., height and width) of the feature representation is half of that of the raw input. Similarly, when $d = 2$, the spatial size of the feature representation is 1/4 of that of the raw input. Following [11], the average Euclidean distance between each spatial location and its 8 immediate neighbors is computed for the encoder's input ($d = 0$), and for the feature representations of both an intermediate layer ($d = 2$) and the encoder's output ($d = 5$). Both feature representations are first resampled to the input size, and then the average distance between the neighboring activations is calculated. Fig. 2 illustrates the example results for Planet satellite imagery (3 m/pixel). The feature representations from the intermediate layer and the encoder's output are 24-dimensional and 1280-dimensional feature vectors learned from Efficient-UNet [28], respectively. It can be observed that the low-density regions are not aligned with the class boundaries at the encoder's input or the encoder's output, where the cluster assumption is violated. By contrast, the cluster assumption is maintained at the intermediate layer, given that the class boundaries with high average distance coincide with low-density regions. This observation may be related to the receptive field of the network. The receptive field is enlarged as the depth increases within the encoder, but when the receptive field exceeds a certain value that is much beyond the size of target objects, it might introduce more noise for network learning [61]. Furthermore, for remote sensing imagery with varying resolutions, the receptive fields of the network at the same depth within the encoder differ when measured in meters.
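The local-variation measure used to examine the cluster assumption can be illustrated with a short sketch. It computes, for every spatial location of a feature map, the average Euclidean distance to its (up to) 8 immediate neighbors. The resampling of deeper feature maps to the input size is omitted, border pixels are assumed to average over the neighbors that exist, and the name `neighbor_distance_map` is hypothetical.

```python
import math


def neighbor_distance_map(feat):
    """Average Euclidean distance of each location to its 8-neighborhood.

    feat[y][x] is the feature vector (list of floats) at location (y, x).
    Returns an H x W list of floats; high values indicate strong local
    variation, which the text interprets as low-density regions.
    """
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dists = []
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    is_self = dy == 0 and dx == 0
                    if not is_self and 0 <= ny < h and 0 <= nx < w:
                        dists.append(math.dist(feat[y][x], feat[ny][nx]))
            # Assumption: border pixels average over existing neighbors only.
            out[y][x] = sum(dists) / len(dists) if dists else 0.0
    return out


if __name__ == "__main__":
    # Toy 4 x 4 single-channel map with a vertical class boundary.
    toy = [[[0.0], [0.0], [1.0], [1.0]] for _ in range(4)]
    for row in neighbor_distance_map(toy):
        print([round(v, 2) for v in row])
```

On the toy map the largest values appear in the two columns adjacent to the boundary, which is the pattern one would look for when checking whether class boundaries coincide with low-density regions at a given depth.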
Based on the above observation and analysis, we propose an instruction to assign the perturbation. The perturbation should be added to the feature representations at depth $d$ within the encoder according to the spatial resolution of the remote sensing imagery and the mean size of individual buildings in the study area. More specifically, $d$ is computed as: where $r$ is the spatial resolution of the remote sensing imagery, $l_{min}$ and $l_{max}$ are the mean values of the minimum and maximum lengths, respectively, derived from the ground reference of individual buildings in the study area, and $\lfloor \cdot \rfloor$ is the rounding-down (floor) function, which returns the largest integer that does not exceed the original value.
A noise tensor $N \sim \mathcal{U}(-0.3, 0.3)$ of the same size as the feature representations $z_{in}$ is uniformly sampled as the perturbation $p$. It is first multiplied with $z_{in}$ to adjust its amplitude, and then injected into $z_{in}$ to get the perturbed feature maps $\tilde{z}_{in}$:
$$\tilde{z}_{in} = (z_{in} \odot N) + z_{in}$$
where $\odot$ denotes element-wise multiplication. Afterward, $\tilde{z}_{in}$ is fed to the subsequent layers in the encoder to generate the perturbed intermediate representation $\tilde{z}_{out}$ of the unlabeled input sample $x^u$.
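The perturbation step can be written out directly from its description: uniform noise in [-0.3, 0.3] is scaled element-wise by the features and added back to them. The sketch below uses a flat list in place of a feature tensor; the function name `perturb_features` and the optional seeded generator are illustrative assumptions.

```python
import random


def perturb_features(z_in, low=-0.3, high=0.3, rng=None):
    """Return the perturbed features (z_in * N) + z_in, N ~ U(low, high).

    z_in : flat list of feature activations (stand-in for a tensor)
    rng  : optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    perturbed = []
    for z in z_in:
        n = rng.uniform(low, high)   # one noise value per activation
        perturbed.append(z * n + z)  # amplitude-scaled noise, then injection
    return perturbed


if __name__ == "__main__":
    features = [1.0, -2.0, 0.0, 4.0]
    print(perturb_features(features, rng=random.Random(0)))
```

Because the noise is multiplied by the activation before being added, each value changes by at most 30% of its own magnitude, and zero activations stay zero.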

IV. EXPERIMENTS

A. Dataset
The effectiveness of the proposed method is validated on three datasets with different spatial resolutions, i.e., Planet dataset [62], Massachusetts dataset [16], and Inria dataset [18].
1) Planet dataset: In this research, PlanetScope satellite imagery is collected from 8 European cities (Amsterdam, Berlin, Lisbon, Madrid, London, Paris, Milan, and Zurich) to create a Planet dataset. The PlanetScope satellite images have three bands (i.e., red, green, blue) at a spatial resolution of 3 m/pixel. The corresponding building footprints, which are stored as vector files, are acquired from OpenStreetMap. Fig. 3 presents example imagery of Lisbon.
2) Massachusetts dataset: The Massachusetts dataset is composed of 151 tiles of aerial imagery over the city of Boston. Each aerial image has three bands (i.e., red, green, blue) at a spatial resolution of 1 m/pixel, and its size is 1500 × 1500 pixels. A sample aerial image is illustrated in Fig. 4. The corresponding ground reference building masks are also included in this benchmark dataset.
3) Inria dataset: The Inria dataset is a benchmark dataset consisting of 360 large-scale aerial images, in which each image has a size of 5000 × 5000 pixels and three bands (i.e., red, green, blue) at a spatial resolution of 0.3 m/pixel. A sample aerial image is shown in Fig. 5. The ground reference building masks of this dataset are only publicly released for five cities (Austin, Chicago, Kitsap County, Western Tyrol, and Vienna).
For all three datasets, all remote sensing images and ground-truth building masks are cut into small patches with the size of 256 × 256 pixels. For the Planet dataset, we have manually selected 1100 pairs of proper patches for each of the eight European cities. The selected pairs are then separated

B. Experiment Setup
Since the semantic segmentation network is an essential part of our approach, we first investigate which CNN model (i.e., Efficient-UNet [28], FC-DenseNet [29], DeepLabv3+ [27], ESFNet [30], MA-FCN [31], HA U-Net [32], and Multi-task [33]) has better performance for the task of building footprint generation. The CNN model achieving the best results under the fully supervised setting is selected as the backbone. Afterward, for each dataset, we randomly split the training data into two parts, a labeled set and an unlabeled set, and the pixel-level annotations are excluded from the unlabeled set. Under the semi-supervised setting, three different ratios of labeled data to unlabeled data are considered (1:2, 1:5, and 1:10). To validate the superiority of the proposed method, we make a comparison with other competitors, including Supervised Learning (SL), Supervised Learning + Data Augmentation (SL+DA), ICT [56], VAT [55], CutMix [10], CCT [11], CR [51], and PiCoCo [52]. The settings of $\lambda_u$ and $\omega_u$, the weights of the consistency loss term and the feature consistency loss term, as well as the position of the assigned perturbation in the different methods are shown in Table II for a better understanding of their differences. Furthermore, the effectiveness of our proposed feature and output consistency, imposed between the main decoder and the auxiliary decoder, is analyzed. The position within the encoder at which the perturbation is applied is also carefully investigated for the different datasets. Finally, we explore whether the auxiliary decoder is able to improve the performance of the proposed method.
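The random split of the training data into a labeled and an unlabeled set at a given ratio can be sketched as follows. This is only an illustration of the procedure described above; the function name `split_labeled_unlabeled`, the use of patch identifiers, the rounding rule, and the fixed seed are assumptions.

```python
import random


def split_labeled_unlabeled(patch_ids, labeled_parts=1, unlabeled_parts=5, seed=0):
    """Randomly split patch identifiers at the ratio labeled:unlabeled.

    Annotations would be kept only for the returned labeled identifiers;
    the unlabeled identifiers are used without their masks.
    """
    ids = list(patch_ids)
    random.Random(seed).shuffle(ids)
    share = labeled_parts / (labeled_parts + unlabeled_parts)
    n_labeled = round(len(ids) * share)
    return ids[:n_labeled], ids[n_labeled:]


if __name__ == "__main__":
    for ratio in ((1, 2), (1, 5), (1, 10)):
        labeled, unlabeled = split_labeled_unlabeled(range(660), *ratio)
        print(ratio, len(labeled), len(unlabeled))
```

Fixing the seed keeps the split identical across the compared methods, so that differences in accuracy are not caused by different labeled subsets.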

C. Training Details
Our experiments are conducted within the PyTorch framework on an NVIDIA Tesla with 16 GB of memory. For all methods, the optimizer is stochastic gradient descent (SGD) with a learning rate of 0.1 and a momentum of 0.9, and the training batch size is set to 4. Detailed configurations of all methods included in our experiments are listed as follows:
(1) Efficient-UNet [28]: EfficientNet [63] is adopted as the encoder to learn feature maps. The decoder is composed of five transposed convolutional layers that upsample the convolved image to predict segmentation masks.
(3) FC-DenseNet [29]: Both the encoder and decoder in FC-DenseNet are composed of five dense blocks, and each dense block has five convolutional layers.
(4) ESFNet [30]: This method employs the Separable Factorized Residual Block (SFRB) as the core module. The encoder is composed of 16 blocks, of which 3 are downsampling blocks and 13 are SFRBs. The decoder consists of 7 blocks for transposed convolutions and SFRBs.
(5) MA-FCN [31]: This approach proposes a feature fusion structure to aggregate multi-scale feature maps. It utilizes a Feature Pyramid Network (FPN) [65]-based structure as the backbone, where the encoder is a four-layer VGG-16 [66] architecture and a corresponding decoder implements lateral connections between them.
(6) HA U-Net [32]: The encoder of this network adopts ResNet34 [67]. The decoder is composed of four modules: an up-sampling module, an attention module, an overall nesting module, and an auxiliary loss module.
(7) Multi-task [33]: This method is based on SegNet [68]. It first adds one convolutional layer after the decoder to learn the distance to the border of buildings. Afterward, this learned distance mask and the feature maps produced by the decoder are concatenated and fed into another convolutional layer to learn the final building masks.
(8) Proposed method: The hyperparameter α in the unsupervised loss weighting function λu is set to 0.6. The loss weighting parameter of feature consistency, ωu, is chosen as 0.2. The network architectures of F and A are the same as that of the backbone.
(9) SL: The backbone is learned from labeled samples. Note that unlabeled samples are not considered during training.
(10) SL+DA: Following [69], data augmentation is first performed by randomly flipping the image patches horizontally or vertically, or rotating them, before training. Afterward, the backbone is trained on labeled samples.
(11) ICT [56] and VAT [55]: Following [10], we adapt these two semi-supervised classification methods for the task of semantic segmentation. The CNN model is the same as the backbone in our proposed method.
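As an illustration of the augmentation used for SL+DA in item (10), the following sketch applies one randomly chosen flip or rotation to an image patch and its mask. It is a simplified example under stated assumptions: patches are represented as nested lists, the rotation is assumed to be by 90 degrees (the angle is not specified above), and the helper names `hflip`, `vflip`, `rot90`, and `augment` are hypothetical.

```python
import random


def hflip(patch):
    """Flip a 2D patch horizontally (mirror each row)."""
    return [row[::-1] for row in patch]


def vflip(patch):
    """Flip a 2D patch vertically (reverse the row order)."""
    return [row[:] for row in patch[::-1]]


def rot90(patch):
    """Rotate a 2D patch by 90 degrees clockwise (assumed rotation angle)."""
    return [list(row) for row in zip(*patch[::-1])]


def augment(image, mask, rng=random):
    """Apply one randomly chosen transform to both the image and its mask."""
    op = rng.choice([hflip, vflip, rot90])
    # The same transform must be applied to both so that labels stay aligned.
    return op(image), op(mask)


image = [[1, 2], [3, 4]]
mask = [[0, 1], [1, 0]]
aug_image, aug_mask = augment(image, mask, random.Random(0))
print(aug_image, aug_mask)
```

Applying the identical transform to the image and the mask is the essential point; otherwise the pixel-level annotation would no longer match the augmented patch.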

D. Evaluation Metrics
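The performance of models is evaluated by two metrics: F1 score and intersection over union (IoU). They can be computed as follows:

$$\mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad \mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

where TP indicates the number of true positives, FN is the number of false negatives, and FP is the number of false positives. The F1 score is the harmonic mean of precision and recall.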
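For concreteness, the two metrics can be computed from a predicted and a reference binary mask as in the sketch below. It uses the standard definitions in terms of TP, FP, and FN; the function names and the convention of returning 0.0 when both masks are empty are assumptions made for this example.

```python
def confusion_counts(pred, ref):
    """Count true positives, false positives, and false negatives over two binary masks."""
    tp = fp = fn = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                tp += 1
            elif p and not r:
                fp += 1
            elif r and not p:
                fn += 1
    return tp, fp, fn


def f1_and_iou(pred, ref):
    """Return (F1, IoU) for the building class."""
    tp, fp, fn = confusion_counts(pred, ref)
    # 2TP / (2TP + FP + FN) equals the harmonic mean of precision and recall.
    f1_den = 2 * tp + fp + fn
    iou_den = tp + fp + fn
    # Assumed convention: both metrics are 0.0 when there are no positives at all.
    f1 = 2 * tp / f1_den if f1_den else 0.0
    iou = tp / iou_den if iou_den else 0.0
    return f1, iou


pred = [[1, 1, 0], [0, 1, 0]]
ref = [[1, 0, 0], [0, 1, 1]]
print(f1_and_iou(pred, ref))
```

In this toy example there are two true positives, one false positive, and one false negative, so the F1 score is 2/3 and the IoU is 0.5.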
V. RESULTS

A. Results of Different Semantic Segmentation Networks for Supervised Learning
The comparisons among different semantic segmentation networks for supervised learning are presented in this section. Their performance is evaluated according to both quantitative (cf. Table III) and qualitative results (cf. Figs. 6, 7, and 8) on the three datasets. The goal of this comparison is to select the best semantic segmentation network as the backbone for different learning methods in further experiments. In this way, we can avoid potential impacts due to convolutional layers and architectural differences.
Among these semantic segmentation networks, Efficient-UNet [28] performs better than DeepLabv3+ [27], FC-DenseNet [29], ESFNet [30], HA U-Net [32], and Multi-task [33] on all three datasets. Especially for the Planet dataset, which has a relatively low spatial resolution, Efficient-UNet [28] obtains increments of 13.04% and 12.01% in F1 score and IoU, respectively, when compared with DeepLabv3+ [27]. Although MA-FCN [31] is superior to Efficient-UNet [28] on the Massachusetts dataset, Efficient-UNet surpasses it by about 0.5% in IoU on both the Planet and Inria datasets. Fig. 8 presents a visual comparison among different methods on three datasets. For the Inria dataset with relatively high spatial resolution, some non-building objects are wrongly identified as buildings by other methods. In contrast, Efficient-UNet [28] is able to avoid such false alarms. The superiority of Efficient-UNet [28] on data of different resolutions can be attributed to its capability of systematically improving performance by balancing all compound coefficients of the architecture (width, depth, and image resolution) [28]. Thus, we take Efficient-UNet [28] as the backbone in both supervised learning and semi-supervised learning approaches for further comparisons.

B. Comparison with Other Competitors
Furthermore, we make comparisons among the proposed method, SL, SL+DA, ICT [56], VAT [55], CutMix [10], CCT [11], CR [51], and PiCoCo [52]. Here, the ratios of labeled data to unlabeled data are set to 1:2, 1:5, and 1:10, respectively. SL is regarded as the baseline method that is trained only with labeled data, while SL+DA is trained on labeled data that have already been augmented. Labeled and unlabeled data are jointly used for training the proposed method, ICT [56], VAT [55], CutMix [10], CCT [11], CR [51], and PiCoCo [52]. Their performance is evaluated quantitatively (cf. Tables IV, V, and VI). As an example, experiments are carried out for five runs on the Massachusetts dataset where the ratio of labeled data to unlabeled data is 1:2. This provides a fair comparison, and the corresponding F1 score and IoU are reported as mean and variance. Figs. 9, 10, and 11 illustrate visual results obtained by different methods for the ratio of 1:10.
It can be seen from the statistics of the three datasets that the proposed approach significantly boosts performance in F1 score and IoU when compared with other methods. The challenge induced by the ratio of 1:10 is the limited data representation for buildings; however, we notice that the proposed method still manages to perform better on the three datasets than its competitors. Our method gains improvements of 5.18%, 10.40%, and 7.91% in IoU over SL for the Planet, Massachusetts, and Inria datasets, respectively. In particular, on the Massachusetts dataset, the IoU of the proposed approach is improved by more than 7% when compared to other methods. When the ratio of labeled data to unlabeled data is 1:2, the number of labeled samples is already sufficient for SL, but our method still provides advantages over it. Note that the proposed approach performs even better than the other semantic segmentation networks (cf. Table III) that are trained on the full labeled sets. This demonstrates the effectiveness and robustness of the proposed approach for the task of building footprint generation.
The IoU obtained by our method for the ratio of 1:2 is higher than that for the ratio of 1:10. This suggests that using more labeled samples increases the overall performance (42.20% vs. 36.78% on the Planet dataset, 54.15 ± 0.68% vs. 51.16% on the Massachusetts dataset, and 75.22% vs. 72.03% on the Inria dataset). It should be mentioned that the proposed approach is capable of reducing the gap between the different ratios. For instance, Table V shows that the IoU produced by our method when trained on the data with the ratio of 1:10 is only 1% lower than that with the ratio of 1:5. This demonstrates that our method can obtain reliable segmentation results even when there is only a small number of annotated samples.
The visual results on the Planet dataset are illustrated in Fig. 9. There are many missed detections in the results provided by SL, VAT [55], CCT [11], CR [51], and PiCoCo [52], as the number of labeled samples is insufficient. In contrast, our method can extract more building structures. Fig. 11 presents results on the Inria dataset. It can be clearly seen that our method is able to avoid more false alarms than its competitors. This suggests that the proposed method has a better capability of utilizing unlabeled data to improve network performance.
VI. DISCUSSION

As shown in the results on three datasets under the semi-supervised setting, our proposed method delivers the best results with the labeled-to-unlabeled ratio of 1:2. Therefore, in this section, we carry out ablation studies of the proposed method under this data split.

A. Ablation Study of the Imposed Consistency
A notable contribution of our approach is a novel objective function that imposes consistency on both features and outputs between the main decoder and the auxiliary decoder.
The statistical results of different types of the imposed consistency are reported in Table VII. Experimental results show that implementing feature and output consistency for this task helps to improve the network performance, with gains of nearly 1% in IoU on all datasets when compared to output consistency alone. This may be because more abstract and invariant information is included in the feature representations [70], and the network is able to learn more knowledge when feature consistency is additionally imposed.
Fig. 12 illustrates a visual comparison between different types of the imposed consistency. Some buildings are omitted in the results provided by output consistency alone in the example areas of the Inria dataset. The reason is that output consistency alone ignores the rich information in feature representations. In contrast, building masks obtained by the feature and output consistency are much closer to real building shapes. This suggests that our method can capture information in both feature representations and outputs, which enhances the semantic information of buildings.
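To make the difference between output consistency alone and feature and output consistency concrete, the sketch below computes an unsupervised loss from the features and outputs of the two decoders. It is only an illustrative approximation: the use of mean squared error as the distance and the way the two terms are combined (the feature term scaled by ωu and the sum scaled by λu) are assumptions made for this example, and the function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally long, flattened sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def unsupervised_loss(main_out, aux_out, main_feat, aux_feat, lambda_u, omega_u=0.2):
    """Assumed form: lambda_u * (output consistency + omega_u * feature consistency)."""
    output_term = mse(main_out, aux_out)      # consistency of predictions
    feature_term = mse(main_feat, aux_feat)   # consistency of feature maps
    return lambda_u * (output_term + omega_u * feature_term)


# Toy flattened predictions and features from the main and auxiliary decoders.
loss_both = unsupervised_loss([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [2.0, 2.0], lambda_u=0.5)
loss_output_only = unsupervised_loss([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [2.0, 2.0],
                                     lambda_u=0.5, omega_u=0.0)
print(loss_both, loss_output_only)
```

Setting `omega_u` to zero recovers the output-only variant in this sketch, which is the comparison made in the ablation above.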

B. Ablation Study of the Assigned Perturbation
For the perturbation assigned to the feature representations within the encoder, we propose an instruction to select the optimal position, i.e., the encoder depth d. To verify this instruction, we apply the perturbation to five different positions within the encoder. Specifically, d is set to 1, 2, 3, 4, and 5 in turn to investigate its impact on the final results. The spatial sizes of the corresponding feature maps are 128 × 128, 64 × 64, 32 × 32, 16 × 16, and 8 × 8, respectively.
The statistical results of the perturbation applied to different depths within the encoder are shown in Table VIII. We can see that the best position to assign the perturbation varies across datasets. Moreover, increasing the depth improves the results on the higher resolution dataset (Inria dataset). However, we note that a large value of d leads to a reduction in accuracy metrics on the relatively low-resolution dataset (Planet dataset). The best results are obtained when d = 2 for the Planet dataset, d = 4 for the Massachusetts dataset, and d = 5 for the Inria dataset. This is consistent with our proposed instruction for applying the perturbation.
Taking the spatial resolution of remote sensing imagery into consideration, the receptive fields at these positions correspond to $3 \times 2^{2} = 12$ m (Planet dataset), $1 \times 2^{4} = 16$ m (Massachusetts dataset), and $0.3 \times 2^{5} = 9.6$ m (Inria dataset), which are close to the size of a building, whose length usually lies within the range from 10 m to 20 m. Afterward, we calculate the statistics of individual buildings in all three datasets, i.e., max length and min length (cf. Fig. 13). We find that the mean values of the max length of individual buildings are 19 m for the Planet dataset, 17 m for the Massachusetts dataset, and 16 m for the Inria dataset. The mean values of the min length of individual buildings are 17 m for the Planet dataset, 14 m for the Massachusetts dataset, and 12 m for the Inria dataset. That is to say, the mean values of the max length and min length of individual buildings also range from 10 m to 20 m in all datasets. This indicates that the geometrical characteristics of buildings are related to the effective receptive field of the network, which may inform the selection of the optimal position to assign the perturbation in the whole framework. Therefore, we infer that the perturbation should be assigned to different positions within the encoder according to the spatial resolution of the remote sensing imagery and the mean size of the individual buildings in the study area.
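The arithmetic behind this instruction can be reproduced with a few lines of code. The sketch below assumes that each encoder stage halves the spatial size of a 256 × 256 input patch, which matches the feature map sizes listed above; the function names are hypothetical and the snippet only tabulates the quantities discussed in the text.

```python
def feature_map_size(patch_size, depth):
    """Side length of the feature map after `depth` halvings of the input patch."""
    return patch_size // 2 ** depth


def ground_extent_m(resolution_m, depth):
    """Ground extent in meters covered by one feature map cell at encoder depth `depth`."""
    return resolution_m * 2 ** depth


# Spatial resolution (m/pixel) and best-performing depth reported for each dataset.
REPORTED = {"Planet": (3.0, 2), "Massachusetts": (1.0, 4), "Inria": (0.3, 5)}

print([feature_map_size(256, d) for d in range(1, 6)])
for name, (resolution, depth) in REPORTED.items():
    extent = ground_extent_m(resolution, depth)
    print(f"{name}: d = {depth}, extent = {extent:.1f} m")
```

For a new study area, the same calculation can be run over all candidate depths and compared against the mean building size to shortlist positions for the perturbation.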

C. Ablation Study of the Auxiliary Decoder
In our approach, an auxiliary decoder is employed to train on the unlabeled set, and additional training signals can be extracted by enforcing the consistency of features and predictions between the main decoder and the auxiliary decoder. In order to validate the effectiveness of the auxiliary decoder, we perform an ablation study with another competitor, i.e., the proposed method without the auxiliary decoder. That is to say, the auxiliary decoder is removed, and the main decoder takes as input both an uncorrupted and a perturbed version of the encoder's output to impose consistency on their features and outputs.
The ablation study is carried out on the Planet, Massachusetts, and Inria datasets. Numerical results are shown in Table IX. As can be seen from the statistical results on all three datasets, the auxiliary decoder brings an improvement of nearly 1% in IoU, which has a positive influence on the performance of our network. Fig. 14 shows a visual comparison of segmentation results, which demonstrates that the performance of our approach can be boosted by leveraging an auxiliary decoder. In Fig. 14 (e) and (h), the method without the auxiliary decoder wrongly identifies cars as buildings on both the Massachusetts and Inria datasets. This is because the colors of cars are similar to those of buildings, which leads to a misjudgment. The use of an auxiliary decoder is able to avoid such false alarms. The main reason is that supervision from the same decoder might guide the network to better approximate the features and outputs of the perturbed inputs, causing the network to converge in the wrong direction. In contrast, supervision by the features and predictions from the other decoder is able to avoid over-fitting in the wrong direction.

VII. CONCLUSION
Considering that the performance of semantic segmentation networks is limited when the annotated training samples are insufficient, a novel semi-supervised building footprint generation method with feature and output consistency training is proposed in this paper. The proposed model comprises three modules: a shared encoder, a main decoder, and an auxiliary decoder. More specifically, the shared encoder and the main decoder are designed to learn from labeled data in a fully supervised manner. Afterward, we assign the perturbation to the intermediate feature representations within the encoder and aim to encourage the auxiliary decoder to give predictions for unlabeled inputs that are consistent with those of the main decoder. The consistency is imposed between the outputs and features of the main decoder and those of the auxiliary decoder. The performance of the proposed end-to-end network is assessed on three datasets with different resolutions: Planet dataset (3 m/pixel), Massachusetts dataset (1 m/pixel), and Inria dataset (0.3 m/pixel). Experimental results suggest that the incorporation of both feature and output consistency in our method can offer more satisfactory building footprints, where omission errors can be alleviated to a large extent. Therefore, we believe that our method is a robust solution for building footprint generation when dealing with scarce training samples. Furthermore, the best position to assign the perturbation has been investigated, and the results indicate that the perturbation should be applied to different depths within the encoder according to the spatial resolution of the input remote sensing imagery and the mean size of the individual buildings in the study area. This practical strategy is beneficial to other semi-supervised building footprint generation works that use remote sensing imagery. A subsequent study will investigate the potential of feature and output consistency training in the instance segmentation of buildings.

Fig. 1. Overview of the proposed semi-supervised building footprint generation network.

Fig. 2. The cluster assumption in consistency training-based methods for building footprint generation. Examples from (a) Planet satellite imagery (3 m/pixel), (b) pixel-level labels, as well as local variations at (c) encoder's input, (d) intermediate layer in the encoder, and (e) encoder's output. Bright regions indicate large variation.

Fig. 3. The satellite imagery of Lisbon in the Planet dataset (spatial resolution: 3 m/pixel) and three zoomed-in areas.

Fig. 4. An aerial image in the Massachusetts dataset (spatial resolution: 1 m/pixel) and three zoomed-in areas.

Fig. 5. An aerial image in the Inria dataset (spatial resolution: 0.3 m/pixel) and three zoomed-in areas.

Fig. 13. Summarized statistics of (a) max length and (b) min length of individual buildings on three datasets.

TABLE I
THE STATISTICS OF THE SELECTED DATASETS UTILIZED IN THIS

TABLE II
THE SETTINGS OF ALL METHODS UTILIZED IN THIS RESEARCH. λu AND ωu REPRESENT THE WEIGHTS OF CONSISTENCY LOSS TERM AND FEATURE CONSISTENCY LOSS TERM, RESPECTIVELY.

TABLE III
ACCURACIES OF DIFFERENT SEMANTIC SEGMENTATION NETWORKS FOR SUPERVISED LEARNING ON THREE DATASETS. (%)

TABLE IV
ACCURACIES OF DIFFERENT METHODS ON PLANET DATASET (3 M/PIXEL). (%)

TABLE VII
ABLATION STUDY OF THE IMPOSED CONSISTENCY ON THREE DATASETS. (%)

TABLE VIII
ABLATION STUDY OF THE ASSIGNED PERTURBATION ON THREE DATASETS. (%)

TABLE IX
ABLATION STUDY OF THE AUXILIARY DECODER ON THREE DATASETS. (%)