Building Footprint Generation Through Convolutional Neural Networks With Attraction Field Representation

Building footprint generation is a vital task in a wide range of applications, including, to name a few, land use management, urban planning and monitoring, and geographical database updating. Most existing approaches addressing this problem fall back on convolutional neural networks (CNNs) to learn semantic masks of buildings. However, one limitation of their results is blurred building boundaries. To address this, we propose to learn an attraction field representation for building boundaries, which provides enhanced representation power. Our method comprises two elemental modules: an Img2AFM module and an AFM2Mask module. More specifically, the former aims at learning an attraction field representation conditioned on an input image, which is capable of enhancing building boundaries and suppressing the background. The latter module predicts segmentation masks of buildings using the learned attraction field map. The proposed method is evaluated on three datasets with different spatial resolutions: the ISPRS dataset, the INRIA dataset, and the Planet dataset. From experimental results, we find that the proposed framework can well preserve geometric shapes and sharp boundaries of buildings, which brings significant improvements over other competitors. The trained model and code are available at https://github.com/lqycrystal/AFM_building.


I. INTRODUCTION
Automatic building footprint generation from remote sensing data has been of great interest in the community for a range of applications, such as land use management, urban planning and monitoring, and disaster management. However, accurate and reliable building footprint generation remains particularly challenging for two reasons. On the one hand, different materials and structures lead to large variations of buildings in terms of color, shape, size, and texture. On the other hand, buildings and other man-made objects (e.g., roads and sidewalks) share similar spectral signatures, which can result in a low between-class variability.
Early efforts went into designing hand-crafted features that effectively exploit spectral, structural, and context information. For example, Huang et al. [1] propose a framework for automatic building extraction, which utilizes spectral, geometrical, and contextual features extracted from imagery. Nonetheless, these methods still fail to satisfy accuracy requirements because they rely on a heuristic feature design procedure and usually have poor generalization capabilities.
More recently, convolutional neural networks (CNNs) have surpassed traditional methods in many remote sensing tasks [2]-[10]. CNNs can directly learn feature representations from the raw data; thus, they provide an end-to-end solution to generate building footprints from remote sensing data. Most of the studies in this field assign a label "building" or "nonbuilding" to every pixel in the image, thus yielding semantic masks of buildings. The existing CNNs seem to be able to deliver very promising segmentation results for the purpose of building footprint generation at a large scale (cf. Fig. 1). However, when we zoom in on some segmentation masks (see results from U-Net [11] in Fig. 1), it can be clearly seen that such results are far from perfect, and the boundaries of some buildings are blurred.
We have observed that buildings usually have clear patterns (e.g., corners and straight lines). Therefore, geometric primitives of buildings can be exploited as the most distinguishable features for extraction purposes. There have been several works based on this idea [12]-[15]. In this work, we want to exploit building boundaries as a primary visual cue to achieve our task.

[Fig. 1: results from U-Net [11] and our proposed method (U-Net with attraction field representation) at large scale and in two zoomed-in areas.]

Recently, the attraction field representation has been used for the task of line segment detection in computer vision [16], where it seeks the most attractive line segment for each pixel. Our observation is that, when building boundaries in remote sensing images are represented by the attraction field, they can be greatly enhanced, while background clutter (e.g., cars, courtyards, and roads) is suppressed. Fig. 2 shows an example. Motivated by this observation, in this work, we make use of the attraction field to represent buildings and propose an end-to-end trainable network for automatic building footprint generation. This network consists of two modules: Img2AFM and AFM2Mask. The former takes an image as input and is responsible for learning a corresponding attraction field map (AFM) using a CNN. By doing so, fine-grained building boundaries can be preserved, and the impact of background clutter can be alleviated. The latter module learns another subnetwork to obtain semantic masks of buildings from the augmented building edges in the learned AFM. Note that the two modules are jointly optimized. In addition, the AFM2Mask module is flexible enough to use different semantic segmentation network architectures.
This work's contributions are threefold.
1) We propose to use the boundary-aware attraction field to represent building footprints in remote sensing images. This helps to enhance building boundaries while suppressing the impact of background clutter. To the best of our knowledge, it is the first work that utilizes the attraction field for the task of building footprint generation.
2) We propose a novel network that first learns an AFM by a subnetwork, termed Img2AFM, and then uses another subnetwork called AFM2Mask to reconstruct segmentation masks of buildings. These two modules are trained in an end-to-end fashion.
3) The proposed framework obtains satisfactory performance on three datasets with different spatial resolutions, including ISPRS, INRIA, and Planet datasets. Compared with naive semantic segmentation networks and networks with other visual cues (e.g., building boundary maps), our method can significantly improve accuracies in terms of both semantic mask and boundary.
The remainder of this article is organized as follows. Related work is reviewed in Section II. Section III details the proposed framework for building footprint generation. The experiments are described in Section IV. Results and discussion are provided in Section V. Finally, Section VI summarizes this work.

II. RELATED WORK
A significant number of studies address building footprint generation from remote sensing imagery. According to the visual cues they use, these studies can be categorized into three classes: those based on the semantic mask, the corner, and the boundary of the building.

A. Building Footprint Generation Based on the Semantic Mask
Most methods for building footprint generation involve learning semantic masks of buildings from remote sensing imagery. Early efforts include segmentation-, classification-, and index-based methods. The segmentation-based methods extract buildings using image segmentation algorithms. For example, based on a two-level graph theory, Ok [17] proposes a segmentation approach to identify building regions. For classification-based methods, building masks are extracted by machine-learning classifiers that take spectral information and/or spatial features as input to make a prediction for each pixel. For instance, Turker and Koc-San [18] utilize a support vector machine (SVM) to identify building regions based on spectral bands and the normalized difference vegetation index (NDVI). The objective of index-based approaches is to design a feature index that can be directly applied to obtain building regions without any classification or segmentation process. The morphological building index (MBI) [19] is a widely used one, and this index integrates multiscale and multidirectional morphological operators. However, a general limitation of these early works is the use of handcrafted features and complex feature engineering, which leads to poor generalization.
Instead of relying on a heuristic design of features, CNNs can offer a better generalization capability. Driven by recent advances in semantic segmentation networks, results of building footprint generation have been significantly improved. These networks are usually fully convolutional networks (FCNs) [20] or encoder-decoder architectures, such as U-Net [11], SegNet [21], and FC-DenseNet [22]. In [23], FCN has been demonstrated to be effective in processing large amounts of remote sensing data and providing reliable building segmentation results. SegNet is used in [24] to generate the first seamless building footprint map for the United States. In order to improve the accuracy of segmenting large buildings, a U-Net-based architecture is proposed in [25], where original images and their downsampled counterparts are taken as inputs of two branches sharing the same weights. In [26], an adversarial training strategy is proposed for building extraction from remote sensing imagery, and FC-DenseNet is exploited as a base semantic segmentation network to generate accurate building footprints.
However, many experiments show that the semantic masks of buildings predicted by CNNs are still unsatisfactory because building boundaries are blurred. In this regard, the signed-distance transform (SDT) [27] is proposed to represent building footprints. The signed-distance function value is derived as the distance from a pixel to its closest point on a building boundary; positive values indicate the interior of a building and negative values the exterior. Then, the learning problem of the SDT representation can be regarded as a multiclass classification problem, which categorizes signed-distance values into a certain number of classes [24]. Compared to the widely used binary building mask, SDT can encode more fine-grained information for network learning.
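The signed-distance representation described above can be illustrated with a minimal brute-force sketch. It assumes a small binary mask stored as nested lists that contains at least one building pixel; the function names, the 4-neighbour boundary rule, and the clipping threshold used for quantization are illustrative assumptions and are not taken from [24] or [27].

```python
import math

def boundary_pixels(mask):
    """Building pixels with at least one 4-neighbour that is background or off-image."""
    h, w = len(mask), len(mask[0])
    pts = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] != 1:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or mask[rr][cc] == 0:
                    pts.append((r, c))
                    break
    return pts

def signed_distance(mask):
    """Distance to the nearest boundary pixel: positive inside buildings, negative outside."""
    pts = boundary_pixels(mask)
    h, w = len(mask), len(mask[0])
    sdt = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            d = min(math.hypot(r - br, c - bc) for br, bc in pts)
            sdt[r][c] = d if mask[r][c] == 1 else -d
    return sdt

def to_class(value, t=3):
    """Assumed quantization: clip to [-t, t] and round, giving 2*t + 1 class labels."""
    return int(round(max(-t, min(t, value)))) + t

# A 3 x 3 building centred in a 5 x 5 image.
mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]
sdt = signed_distance(mask)
```

Turning the continuous values into class labels, as `to_class` does, is what allows the SDT to be learned as a multiclass classification problem.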

B. Building Footprint Generation Based on the Corner
Some algorithms generate building footprints based on geometric primitives, such as building corners. In these methods, geometric primitives are first detected and then grouped together to reconstruct individual building hypotheses. A building corner refers to a point whose local neighborhood lies along two different line segment directions; it is invariant to translation, rotation, and illumination [28]. Early studies extract building corners with the help of point feature operators, such as the Harris corner detector [29] and the scale-invariant feature transform (SIFT) operator [30]. Cote and Saeedi [12] and Zangrandi et al. [31] employ a Harris corner detector to extract corner points of buildings. Afterward, these detected corner points are connected in the order of their polar angles with respect to building central markers. By doing so, polygonal representations of buildings can be constructed. In [32], SIFT is exploited to extract corners that are regarded as seed points to estimate rectangular shapes of buildings with a region growing method.
With the development of keypoint detection networks, several novel studies propose to delineate building footprints by detecting corner points using CNNs. PolyMapper [33] extracts corner points with a CNN in the first stage and then connects them by a recurrent neural network (RNN) to realize closed polygon representations of individual buildings. Another study [34] utilizes the same pipeline as PolyMapper [33] and integrates various blocks to enhance the feature extraction and object detection modules. Another method [13] also exploits a CNN to detect corners but adopts a fully geometric grouping strategy without any deep feature learning. Recently, Girard's method [35] proposes to learn a frame field output instead of building corners. The frame field is regarded as a geometric feature that can help to improve the segmentation of buildings.
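The polar-angle ordering step mentioned above can be sketched as follows. This is an illustrative example only, assuming corners and the building central marker are given as (x, y) tuples; `order_corners` and `polygon_area` are hypothetical helper names and do not come from [12] or [31].

```python
import math

def order_corners(corners, center):
    """Sort corner points by their polar angle around a building central marker."""
    cx, cy = center
    return sorted(corners, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))

def polygon_area(polygon):
    """Shoelace formula; used here only to check that the ordering gives a sensible polygon."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(polygon, polygon[1:] + polygon[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0

# Unordered corners of a 2 x 2 square around the marker (1, 1).
polygon = order_corners([(2, 2), (0, 0), (2, 0), (0, 2)], center=(1, 1))
```

Ordering by angle yields a non-self-intersecting polygon only when the shape is star-shaped with respect to the marker, which is one reason such grouping rules struggle with complex buildings.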

C. Building Footprint Generation Based on the Boundary
The building boundary is another commonly used geometric primitive and can be taken as a primary visual cue to generate building footprints. Early works extract building boundaries from remote sensing data in two steps. Given that lines are strongly relevant to building boundaries, the first step is to detect line segments. Afterward, the extracted lines are grouped to form closed boundaries for individual buildings. A commonly used line detection algorithm is the Hough transformation [36], which utilizes a voting procedure to find straight lines in parameter space. Compared with the Hough transformation, the Burns algorithm [37] only uses gradient orientations and, therefore, has a relatively lower computational cost. In [14] and [38], line segment sets are extracted with the Hough transformation and the Burns algorithm. Then, intersection nodes of the two line segment sets are employed to build a structural graph. Finally, building boundaries are identified with a graph search algorithm. However, both the Hough transformation and the Burns algorithm depend heavily on parameter settings and have a very high false alarm rate. In this regard, EDLines [39] is proposed to avoid parameter tuning. Moreover, it has a faster computation speed and a lower false alarm rate. In [40] and [41], EDLines is, therefore, adopted for the automatic extraction of line segments, but the two works make use of different strategies to group these line segments.
These early works still encounter issues when dealing with more complex building shapes and large-scale applications. Considering that CNNs are nowadays the de facto leading approach for building footprint generation tasks, two novel works, [15] and [42], propose to learn building boundaries in their end-to-end CNNs. Marcos et al. [15] present a method termed deep structured active contours (DSACs), which learns active contour model (ACM) [43] parameterizations per instance using a CNN. Although DSAC improves geometric correctness, its results are still unsatisfactory, e.g., there exist blob-like shapes and some self-intersections of buildings. Besides, the representation of boundary points in DSAC adopts Euclidean coordinates, which leads to extra computational overhead during energy minimization. On this point, another study [42] proposes to use polar coordinates, as this can not only simplify the energy function but also prevent self-intersection. However, these two methods still have two limitations. On the one hand, their initialization relies on external methods that are not included in an end-to-end learning process. On the other hand, their results are promising only in very high-resolution remote sensing images where strong geometric priors exist.
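The voting procedure of the Hough transformation can be illustrated with a minimal sketch. It assumes edge points given as integer (x, y) tuples and the normal-form line parameterization rho = x cos(theta) + y sin(theta) with rho rounded to whole pixels; the function name and the discretization are assumptions made for illustration, not the settings used in [14], [36], or [38].

```python
import math
from collections import Counter

def hough_votes(points, n_theta=180):
    """Accumulate votes over (rho, t), where t indexes the angle pi * t / n_theta."""
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(rho, t)] += 1
    return votes

# Five collinear edge points on the vertical line x = 3.
votes = hough_votes([(3, y) for y in range(5)])
```

Cells that collect many votes correspond to candidate lines. The bin sizes and the vote threshold are exactly the kind of parameters the text refers to when it notes the sensitivity of this approach to parameter settings.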

III. METHODOLOGY
In this work, we explicitly take building boundaries as a primary visual cue. By doing so, building footprint generation can benefit from the precise delineation of building boundaries. In this section, an overview of the proposed approach is first presented. Then, the two key modules, Img2AFM and AFM2Mask, are introduced in detail. Finally, the method of integrating and jointly optimizing the two modules in an end-to-end architecture is described.

A. Overview
As shown in Fig. 3, the proposed method consists of two modules. The Img2AFM module exploits a U-Net architecture to learn the attraction field representation, which can enhance building boundaries and suppress background clutter. It takes an image as input and outputs two AFMs, one in the x-direction and one in the y-direction. Afterward, the output is fed into the AFM2Mask module along with the input image to generate a building mask. Moreover, the AFM2Mask module is flexible enough to use different semantic segmentation networks. Note that these two modules can be integrated into an end-to-end framework and optimized jointly. In this way, the optimal output can be obtained by the coadaptation of these two modules.

B. Img2AFM Module
1) Definition of Attraction Field Map: An image $I$ can be regarded as a lattice. Let $E = \{e_1, e_2, \ldots, e_n\}$ be the set of building line segments in the image lattice, with $n$ being the number of building line segments. A building line segment $e_i$ is represented by its two endpoints $p_i^a$ and $p_i^b$. For the sake of simplicity, the set $E$ is called the boundary map in our case. The boundary map characterizing all building boundaries in the ground reference is shown in Fig. 4(c).
For each pixel, we try to find its most "attractive" building line segment, i.e., the one that is closest to it. Following this criterion, a region partition map $R$ is formed by partitioning all pixels into $n$ regions and assigning each pixel $x \in I$ to its closest building line segment. $R_i$ denotes the region for the building line segment $e_i$ in $E$. Specifically, in order to derive the distance between a pixel $x$ and a building line segment $e_i$, the pixel $x$ is first projected onto the straight line passing through $e_i$. If the projection point is not on $e_i$, the nearest endpoint is used as the projection point. The projection point $p$ is defined as

$$p = p_i^a + c_x \cdot (p_i^b - p_i^a) \tag{1}$$

where $c_x$ is the normalized projection coefficient $\langle x - p_i^a,\ p_i^b - p_i^a \rangle / \| p_i^b - p_i^a \|_2^2$ clipped to $[0, 1]$. When $c_x \in (0, 1)$, $p$ is the original point-to-line projection, and if $c_x = 0$ or $1$, $p$ is the nearest endpoint of $e_i$.
Then, the distance $d(x, e_i)$ between $x$ and $e_i$ is defined as the Euclidean distance between the pixel and its projection point, and the region $R_i$ in the image lattice for $e_i$ is defined as

$$R_i = \{x \mid x \in I;\ d(x, e_i) < d(x, e_j),\ \forall j \neq i,\ e_j \in E\}. \tag{2}$$

It should be noted that $R_i \cap R_j = \emptyset$ for $i \neq j$ and $\cup_{i=1}^{n} R_i = R$. Fig. 4 shows an example in which the region corresponding to the green building line segment is highlighted in green.
Afterward, the geometric property of a building line segment can be characterized by the 2-D attraction vectors of all individual pixels in $R_i$. The attraction function of a pixel $x$ in $R_i$ is defined as

$$a(x) = p - x. \tag{3}$$

When $c_x \in (0, 1)$, the attraction vector is perpendicular to the line segment. Fig. 4(d) shows the attraction vectors of the green line segment. Finally, by enumerating (3) over all pixels in $I$, the AFM $A$ with respect to $E$ can be obtained as follows:

$$A = \{a(x) \mid x \in I\}. \tag{4}$$

The AFM has two advantages over the boundary map used in previous studies (see [15] and [42]). One is that the geometry of boundaries can be depicted more precisely by the AFM, while the boundary map is only characterized by a few pixels. Thus, directly learning boundary maps can lead to a zig-zag effect that results from the extreme imbalance between the number of boundary pixels and that of nonboundary pixels. The other benefit is that the AFM associates each line segment with a support region, which avoids the blurring effect.
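The definitions in this subsection can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: pixels are addressed as (x, y) = (column, row), segments are non-degenerate pairs of endpoints, ties between equally close segments go to the lower index, and the function names are hypothetical.

```python
import math

def project_to_segment(x, seg):
    """Return (p, c_x): the projection point of pixel x on segment seg and its coefficient."""
    (ax, ay), (bx, by) = seg
    dx, dy = bx - ax, by - ay
    c = ((x[0] - ax) * dx + (x[1] - ay) * dy) / (dx * dx + dy * dy)
    c = max(0.0, min(1.0, c))  # c_x = 0 or 1 selects the nearest endpoint
    return (ax + c * dx, ay + c * dy), c

def attraction_field(segments, height, width):
    """Return the region partition map and the attraction vectors a(x) = p - x."""
    region = [[0] * width for _ in range(height)]
    afm = [[(0.0, 0.0)] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            best_d, best_i, best_p = math.inf, -1, None
            for i, seg in enumerate(segments):
                p, _ = project_to_segment((col, row), seg)
                d = math.hypot(p[0] - col, p[1] - row)
                if d < best_d:  # strict comparison: ties keep the lower index
                    best_d, best_i, best_p = d, i, p
            region[row][col] = best_i
            afm[row][col] = (best_p[0] - col, best_p[1] - row)
    return region, afm

# Two horizontal segments at the top and bottom of a 5 x 5 lattice.
segments = [((0, 0), (4, 0)), ((0, 4), (4, 4))]
region, afm = attraction_field(segments, height=5, width=5)
```

Each pixel stores a two-component vector, so the result splits naturally into the x- and y-direction maps that the Img2AFM module is trained to predict.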
2) Learning Attraction Field Map: Each pixel in the attraction field representation has two components (in the x- and y-directions), which are given by the attraction vector from the pixel to its projection point. In this respect, an attraction field representation can be regarded as a two-channel feature map, which can feasibly be learned by a network. Hence, in this article, we view the learning of the AFM as a dense prediction problem and solve it using a semantic segmentation network architecture. Among all semantic segmentation networks, U-Net is more favorable than others for this task because learning the attraction field representation relies heavily on low-level visual cues (e.g., object edges) that exist in lower layers, and the multiscale skip connections of U-Net are able to use such information effectively. In fact, in our experiments, we found that taking other network architectures as the Img2AFM module fails.

[Fig. 4 (caption fragment): ... [16]. (e) Recovered boundary map obtained by the heuristic algorithm in [16].]

C. AFM2Mask Module
By learning the AFM, a representation encoding building boundaries can be obtained. Then, we need to remap the learned AFM into building masks. In [16], a heuristic algorithm has been proposed to recover line segments from the AFM. In this heuristic algorithm, attraction vectors are rearranged mathematically to generate a proposal map of line segments, and final line segments are then extracted with a greedy grouping strategy. However, we found that, in our building footprint generation task, the boundary map recovered by this algorithm is not satisfactory [cf. Fig. 4(e)] since there is a relatively high false alarm rate [see short line segments in Fig. 4(e)]. The reason is that the attraction vectors predicted by CNNs are not mathematically precise enough. In this case, some outliers are passed on to the subsequent heuristic steps, which leads to inaccurate line segment detections. Another reason is that this heuristic algorithm is not robust to imprecise estimates of the AFM. Furthermore, it requires a set of heuristics and makes the whole process inefficient. Therefore, in this work, we propose to learn this process, i.e., recovering building masks from the learned AFM, using a network. By doing so, the whole process can be trained in an end-to-end manner, which makes it more efficient and robust.
In the AFM2Mask module, the input image and the attraction field representation learned by the previous module are concatenated as the input to this module. Afterward, the network can directly generate building masks without using mathematical heuristics (which do not work well in our case). It is noteworthy that this module can flexibly use different semantic segmentation network architectures, which makes full use of the power of state-of-the-art networks to generate building footprint maps.

D. End-to-End Network Learning
We propose an end-to-end training pipeline for the supervised learning of our network. More specifically, the Img2AFM module is placed before the AFM2Mask module, and the two modules are jointly trained by minimizing a global loss function. The global loss function $L$ is defined as follows:

$$L = L_{Img2AFM} + \lambda \cdot L_{AFM2Mask} \tag{5}$$

where $L_{Img2AFM}$ and $L_{AFM2Mask}$ are the two loss functions for optimizing the Img2AFM and AFM2Mask modules, respectively. $\lambda$ is a hyperparameter that weights the second loss and models the relative importance of the two modules.
For the first term, an $L_1$ loss function is utilized, and it is defined as follows:

$$L_{Img2AFM} = \sum_{x \in I} \| a(x) - \hat{a}(x) \|_1 \tag{6}$$

where $\hat{a}(x)$ is the predicted AFM and $a(x)$ is the ground reference AFM for the input image.
For the AFM2Mask module, we make use of a cross-entropy loss function to guide the learning. $L_{AFM2Mask}$ is defined as

$$L_{AFM2Mask} = -\sum_{x \in I} \left[ y \log f(x) + (1 - y) \log (1 - f(x)) \right] \tag{7}$$

where $y$ is the ground truth label of pixel $x$, $y = 1$ denotes building, and $y = 0$ represents non-building. $f(x) \in [0, 1]$ is the output probability value of $x$.
In the backward propagation, $\lambda \cdot L_{AFM2Mask}$ is first backpropagated through the AFM2Mask module and then, together with $L_{Img2AFM}$, propagated backward through the Img2AFM module.
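The structure of the global loss can be illustrated with a small numeric sketch in plain Python. It follows the statement that λ weights the AFM2Mask loss; summing over pixels rather than averaging, the clamping constant `eps`, and all function names are assumptions made for illustration and may differ from the released code.

```python
import math

def img2afm_loss(pred_afm, true_afm):
    """L1 loss over per-pixel attraction vectors given as (a_x, a_y) tuples."""
    return sum(abs(px - tx) + abs(py - ty)
               for (px, py), (tx, ty) in zip(pred_afm, true_afm))

def afm2mask_loss(probs, labels, eps=1e-7):
    """Binary cross entropy over per-pixel building probabilities."""
    total = 0.0
    for f, y in zip(probs, labels):
        f = min(max(f, eps), 1.0 - eps)  # guard against log(0)
        total -= y * math.log(f) + (1 - y) * math.log(1.0 - f)
    return total

def global_loss(pred_afm, true_afm, probs, labels, lam=1.0):
    """Weighted sum of the two module losses."""
    return img2afm_loss(pred_afm, true_afm) + lam * afm2mask_loss(probs, labels)

# Two pixels: AFM errors of 1 each, and maximally uncertain mask predictions.
total = global_loss([(1.0, 0.0), (0.0, 2.0)], [(0.0, 0.0), (0.0, 1.0)],
                    probs=[0.5, 0.5], labels=[1, 0], lam=0.5)
```

In this toy case the L1 term contributes 2 and the cross-entropy term contributes 2 ln 2, which λ = 0.5 scales to ln 2.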

IV. EXPERIMENTS

A. Dataset
We validate the proposed method on three datasets with different spatial resolutions, i.e., the ISPRS dataset, the INRIA dataset, and the Planet dataset.
2) INRIA Dataset: The INRIA dataset [46] is composed of 360 large-scale aerial images that are collected over ten different cities. Each image has a size of 5000 × 5000 pixels and consists of three bands (RGB) at a spatial resolution of 30 cm/pixel. A sample aerial image is shown in Fig. 5(b). The ground reference data of this dataset provide building masks but are only publicly available for five cities (Austin, Chicago, Kitsap County, Western Tyrol, and Vienna). In this article, data are split into training and test sets according to the setup in [46] and [47]. For each city, images with ids 1-5 are used for validation, and the remaining 31 images are used for training. The statistics are derived from the validation set.
3) Planet Dataset: In addition to the aforementioned two public datasets, we create a Planet dataset by collecting PlanetScope satellite images and their corresponding building footprints from OpenStreetMap. The PlanetScope satellite images are gathered from eight European cities (Munich, Berlin, Amsterdam, Paris, Cologne, Milan, Rome, and Zurich) with three bands (RGB) at 3-m spatial resolution. Compared to the former two datasets, the Planet dataset is more challenging due to its coarser spatial resolution. Fig. 5(c) shows an example of Munich. In our experiment, the image of Munich is used as the test set to evaluate the performance of models. The remaining seven cities are utilized as training and validation sets. Specifically, for each city, 80% of samples are used for training, while 20% of samples are used for validation.

B. Experiment Setup
Our proposed model consists of two modules in an end-to-end framework, where the Img2AFM module utilizes a U-Net to learn the attraction field representation of an image with respect to building edges, and the AFM2Mask module learns building masks from this representation using different semantic segmentation networks. To explore the flexibility of the AFM2Mask module, we select four state-of-the-art semantic segmentation networks: FCN-8s [20], SegNet [21], U-Net [11], and FC-DenseNet [22]. The attraction field representation encodes the geometric relation between pixels and building boundaries in an image, and it can be considered a variant of the distance transform, such as SDT [27], which measures the distance from a pixel to the boundary. Hence, we compare our model with existing works [24], [27] that learn SDT representations of buildings. On the other hand, the AFMs learned by the Img2AFM module clearly enhance building boundaries. In this respect, the function of the attraction field representation is similar to that of other visual cues, such as building boundaries and SDT masks. Thus, we also compare our network with two methods, SDT-recursive and boundary-recursive, in which we incorporate SDT or edge learning into the proposed framework (cf. Fig. 3). Comparing the proposed approach with these two models verifies whether the attraction field representation is effective. Besides, the sensitivity to the hyperparameter λ, the coefficient of the loss of the AFM2Mask module, is investigated.

C. Training Details
Our experiments are conducted with the PyTorch framework on an NVIDIA Tesla P100 GPU with 16 GB of memory. For the model training, remote sensing images and their corresponding ground reference building masks are cropped into small patches with a size of 256 × 256 pixels. Afterward, the boundaries, SDT, and AFMs are generated from the ground-truth building masks and serve as the ground reference in the training set for further experiments. All models are trained for 100 epochs, and the optimizer is stochastic gradient descent (SGD) with a learning rate of 0.00001. The training batch size of all models is set as 4. The cross-entropy function is used as the loss function for other competitors.
The configurations of competitors included in experiments are listed as follows.
2) The encoder in SegNet is based on VGG16, and the decoder utilizes a reversed VGG16 architecture.
3) U-Net is composed of five blocks in both the encoder and the decoder. Each encoder block has two convolution layers, and each decoder block has one transposed convolution layer.
4) Both the encoder and the decoder in FC-DenseNet consist of five dense blocks, and each dense block has five convolutional layers.
5) The SDT-based networks, which directly learn the SDT representations of buildings, utilize the aforementioned four semantic segmentation networks and, finally, convert the learned SDT representations of buildings to semantic masks by definition [24], [27].
6) The SDT-recursive model or boundary-recursive model first utilizes a U-Net to learn the SDT representation or boundaries of buildings. Afterward, it also utilizes the aforementioned four semantic segmentation networks to reconstruct semantic masks of buildings. It should be noted that the whole method is trained in an end-to-end fashion.
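The patch cropping step can be sketched as a simple tiling. The text specifies only the 256 × 256 patch size; the choice of non-overlapping tiles and the decision to drop incomplete tiles at the right and bottom borders are assumptions made here for illustration, and `patch_origins` is a hypothetical helper name.

```python
def patch_origins(height, width, patch=256):
    """Top-left (row, col) corners of non-overlapping tiles that fit fully in the image."""
    return [(r, c)
            for r in range(0, height - patch + 1, patch)
            for c in range(0, width - patch + 1, patch)]

def crop(image, origin, patch=256):
    """Cut one patch out of a 2-D image stored as nested lists."""
    r0, c0 = origin
    return [row[c0:c0 + patch] for row in image[r0:r0 + patch]]

# Under these assumptions, a 5000 x 5000 INRIA image gives 19 x 19 full tiles.
origins = patch_origins(5000, 5000)
```

The same origins would be applied to an image and to its ground reference mask so that each patch stays aligned with its labels.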

D. Evaluation Metrics
The performance of models is evaluated from two aspects. Mask metrics focus on building masks, while boundary metrics measure the quality of the boundaries of the predicted building masks.
1) Mask Metrics: In our experiments, the F1 score and intersection over union (IoU) are selected as two mask metrics. They can be computed as follows:

$$F1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \tag{8}$$

$$IoU = \frac{TP}{TP + FP + FN} \tag{9}$$

$$precision = \frac{TP}{TP + FP} \tag{10}$$

$$recall = \frac{TP}{TP + FN} \tag{11}$$

where TP indicates the number of true positives, FN is the number of false negatives, and FP is the number of false positives. Note that these metrics are calculated based on building pixels rather than building objects. The F1 score is the harmonic mean of precision and recall.
2) Boundary Metrics: In order to assess building boundaries, the structural similarity index (SSIM) [49] and F-measure [50] are used as two evaluation criteria. SSIM measures the similarity between two images and can be used for the quality assessment of boundaries [51]. Before the calculation of the F-measure, building boundaries are first extracted from the predicted semantic masks by the Sobel edge operator [52]. The F-measure is used to score the boundary and is defined as the geometric mean of the precision and recall

$$precision = \frac{TP}{TP + FP} \tag{12}$$

$$recall = \frac{TP}{TP + FN} \tag{13}$$

where TP is the number of correctly identified boundary pixels, FN is the number of boundary pixels in the ground reference that are not detected, and FP is the number of nonboundary pixels mislabeled as "boundary."
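The pixel-based mask metrics can be computed directly from a predicted and a reference binary mask. The sketch below is a minimal illustration, assuming masks stored as nested lists of 0/1 values; returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the text, and `mask_metrics` is a hypothetical name.

```python
def mask_metrics(pred, truth):
    """Return precision, recall, F1, and IoU computed over building pixels."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}

# Two true positives, one false positive, and one false negative.
scores = mask_metrics([[1, 1, 0], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]])
```

With TP = 2, FP = 1, and FN = 1, precision and recall are both 2/3, so the F1 score is 2/3 while the IoU is 2/4, which shows that IoU penalizes errors more heavily than F1 on the same counts.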
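The boundary extraction and the boundary precision and recall can be sketched as follows. This is an illustrative version only: it applies the standard 3 × 3 Sobel kernels to a binary mask with replicated borders, marks every pixel with a nonzero gradient as boundary, and matches boundary pixels exactly without any distance tolerance. These choices and the function names are assumptions; the evaluation protocol of [50] may differ.

```python
def sobel_boundary(mask):
    """Binary boundary map: 1 where the Sobel gradient of the mask is nonzero."""
    h, w = len(mask), len(mask[0])

    def at(r, c):  # replicate border pixels instead of zero padding
        return mask[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            gx = (at(r - 1, c + 1) + 2 * at(r, c + 1) + at(r + 1, c + 1)
                  - at(r - 1, c - 1) - 2 * at(r, c - 1) - at(r + 1, c - 1))
            gy = (at(r + 1, c - 1) + 2 * at(r + 1, c) + at(r + 1, c + 1)
                  - at(r - 1, c - 1) - 2 * at(r - 1, c) - at(r - 1, c + 1))
            out[r][c] = 1 if gx * gx + gy * gy > 0 else 0
    return out

def boundary_precision_recall(pred_mask, true_mask):
    """Pixel-exact precision and recall between the two extracted boundary maps."""
    pb, tb = sobel_boundary(pred_mask), sobel_boundary(true_mask)
    pairs = [(p, t) for p_row, t_row in zip(pb, tb) for p, t in zip(p_row, t_row)]
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A single building pixel in the centre of a 5 x 5 mask.
single = [[1 if (r, c) == (2, 2) else 0 for c in range(5)] for r in range(5)]
edges = sobel_boundary(single)
```

The sketch stops at precision and recall, which are then combined into the F-measure.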

A. Comparison With Other Competitors
The comparisons among the proposed method, naive semantic segmentation networks, SDT-based networks, SDT-learning methods, and boundary-learning methods are presented in this section.Their respective performance is evaluated according to both quantitative (cf.Tables I-III) and qualitative results (see Figs. 6-8) on three datasets.
Naive semantic segmentation networks that are regarded as baseline methods are first compared with the proposed framework.Specifically, we implement four baseline models, i.e., FCN-8s, SegNet, U-Net, and FC-DenseNet.For a fair comparison, the AFM2Mask module is instantiated with these four networks separately.It can be seen from the statistics of three datasets that the proposed approach significantly boosts performance in both mask and boundary metrics compared with baseline networks.This indicates that the integration of learning attraction field representation is effective, and our framework can offer more robust results for the task of building footprint generation.For the ISPRS dataset (cf.Table I), our proposed FCN-8s-AFM obtains increments of 6.65% and 10.1% in F1 score and IoU, respectively.Moreover, the proposed U-Net-AFM reaches improvements of 4.65% and 4.18% in SSIM and F-measure, respectively.The increases in boundary metrics suggest that our method can better preserve geometric details.The spatial resolution and image quality of the Planet dataset are much lower than the other two datasets, which may lead to a negative effect on accurately extracting buildings.In this case, although improvements in both mask and boundary metrics on the Planet dataset are less significant than those on the other two datasets, the nearly 1% gain is still not trivial.
From qualitative results, we can observe that building boundaries obtained from naive semantic segmentation networks are blurred, which is also pointed out in [53]- [55].The visual comparisons (cf.Figs.6-8) demonstrate the effectiveness of the proposed method.As illustrated in Fig. 7, semantic masks provided by naive networks have blob-like shapes.Even with skip connections that help compensate spatial details in networks, U-Net and FC-DenseNet fail to achieve accurate building boundaries.Moreover, this scene is a residential area, and some consecutive buildings are identified as a large building by most of the baseline models.Note that building boundaries produced by our algorithm are more rectilinear and precise.Even for buildings with complex structures (cf.Fig. 6 and 8), building footprints generated from our framework are more adherent to the ground reference.These observations suggest that our model really benefits from learning attraction field representation, enabling us to gain more geometric details of buildings.
The attraction field representation can be considered as a type of distance transform, which represents the relationship between the pixel and the boundary.Therefore, we also take another type of distance transform: SDT as competitors.One competitor is an SDT-based network that utilizes variant backbones to learn the SDT representation of buildings and then convert this representation to semantic masks by definition [24], [27].Compared to baseline networks, the SDT-based network can contribute to the F-measure only on the ISPRS dataset.However, there are even decreases in mask metrics.This suggests that directly learning SDT labels as final output have the potential for the improvement of geometric details only in remote sensing data with very high resolution (e.g., 5 cm/pixel).The other competitor is the SDT-recursive model, which first learns the SDT representation of buildings with a U-Net and then reconstructs semantic masks by different backbones.Notable that the whole method is trained in an endto-end fashion.The SDT-recursive model that feeds the learned SDT representations into semantic segmentation networks is much superior to the SDT-based network, as we can see gains in both mask and boundary metrics.This may be because the SDT representation learned from the remote sensing imagery carries useful information to capture the global semantic context in semantic segmentation networks, which indicates the potential of SDT in a recursive learning way for building footprint generation.It is worthy to note that the performances of both SDT-based network and SDT-recursive model are more sensitive to the backbone semantic segmentation networks.For the ISPRS dataset (see Table III), when the backbone is FCN-8s, both SDT-based network and SDT-recursive model can boost the performance.However, the performance of SegNet-SDT and SegNet-SDT-recursive is worse than that of SegNet.
The geometric property of building boundaries can be significantly enhanced by AFMs (see Fig. 2). From Tables I-III, it can be observed that our framework improves results in terms of both mask and boundary metrics, which confirms that explicitly encoding geometric information is essential to building footprint generation tasks. In this regard, we investigate another competitor, the boundary-recursive model, to further validate the effectiveness of the attraction field representation. This method first learns building boundaries from remote sensing images with a U-Net and then uses them as auxiliary information to extract building masks with various semantic segmentation networks. Note that these two subnetworks are jointly optimized. Experimental results show that this model does not bring any benefit to this task in terms of building boundary quality, and we can see decreases in boundary metrics and more blurred boundaries compared to the naive semantic segmentation network. This may be because building boundaries are characterized by very few pixels, and this class imbalance leads to ambiguity in the network learning.
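To make the class-imbalance argument concrete, the following minimal sketch counts how few pixels belong to a building boundary in a synthetic mask. The 64 x 64 tile, the single 20 x 20 building, and the 4-neighbour definition of a boundary pixel are illustrative assumptions, not settings taken from the article.

```python
def count_boundary_pixels(mask):
    """Count building pixels that touch background in the 4-neighbourhood."""
    height, width = len(mask), len(mask[0])
    count = 0
    for r in range(height):
        for c in range(width):
            if mask[r][c] != 1:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                # Pixels outside the image are treated as background (assumption).
                outside = not (0 <= rr < height and 0 <= cc < width)
                if outside or mask[rr][cc] == 0:
                    count += 1
                    break
    return count


# Hypothetical 64 x 64 tile containing one 20 x 20 building.
size = 64
mask = [[1 if 10 <= r < 30 and 10 <= c < 30 else 0 for c in range(size)]
        for r in range(size)]
boundary = count_boundary_pixels(mask)
ratio = boundary / (size * size)
print(boundary, round(ratio, 4))
```

In this toy case only 76 of 4096 pixels (under 2%) are boundary pixels, which illustrates why a network trained to predict boundaries directly sees a heavily skewed label distribution.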
By contrast, our method can always provide significant gains, regardless of which semantic segmentation network architecture is chosen as the AFM2Mask module, and the proposed approach outperforms other competitors in most of the statistical metrics for the three datasets. This is due to two facts. One is that the attraction field representation can encode geometric properties in 2-D (x- and y-directions), but SDT only relies on the Euclidean distance and, thus, characterizes the information in 1-D. This indicates that the use of the information in different dimensions is more reliable and accurate. Fig. 9

B. Analysis of Hyperparameter Tuning
As the results show, taking U-Net as the AFM2Mask module delivers relatively satisfactory results on all three datasets. Therefore, in this section, we use U-Net-AFM for further studies. Moreover, the INRIA dataset is selected as an example dataset to carry out the following experiments.
In the proposed framework, the global loss function is utilized to guide the end-to-end learning of building masks from remote sensing data. This function is a sum of losses from two separate modules, where the hyperparameter λ is the coefficient of the AFM2Mask module. Here, λ is set as three different numbers, i.e., 0.1, 1, and 10, to investigate its impact on final results.
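As a reading aid, the weighted sum described above can be written as follows. The symbols for the total loss and the two module losses are notation assumed here for illustration; only λ and its role as the coefficient of the AFM2Mask term come from the text.

```latex
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{Img2AFM}} + \lambda \, \mathcal{L}_{\mathrm{AFM2Mask}},
\qquad \lambda \in \{0.1,\ 1,\ 10\}
```

Under this form, a smaller λ shifts the relative weight of the optimization toward the Img2AFM term.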
The statistical results with different values of λ are shown in Table IV. We can see that our model is insensitive to this parameter, and networks with all three λ values outperform the naive U-Net. Furthermore, increasing the value of λ leads to a slight reduction in both mask and boundary metrics. The best result is obtained when λ = 0.1. A small value of λ gives the Img2AFM module more weight than the AFM2Mask module, which places an emphasis on the attraction field representation learning in the whole framework. It can be clearly seen from Fig. 10 that gradually lowering λ can reduce false detections. This is mainly because the attraction field representation can alleviate the impact of background clutter.
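The false detections discussed above correspond to pixel-wise false positives. The sketch below shows one common way to count pixel-based true positives, false positives, and false negatives and to derive precision, recall, F1, and IoU from them. The function names and the toy masks are hypothetical, and these standard definitions may differ in detail from the evaluation code used in the article.

```python
def pixel_counts(pred, ref):
    """Count pixel-based TP, FP, and FN between two binary masks."""
    tp = fp = fn = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, g in zip(pred_row, ref_row):
            if p == 1 and g == 1:
                tp += 1
            elif p == 1 and g == 0:
                fp += 1  # background wrongly detected as building
            elif p == 0 and g == 1:
                fn += 1  # building pixel that was missed
    return tp, fp, fn


def mask_scores(tp, fp, fn):
    """Return (precision, recall, F1, IoU), using 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou


# Toy prediction and ground reference (illustrative values only).
pred = [[1, 1, 0],
        [0, 1, 0]]
ref = [[1, 0, 0],
       [0, 1, 1]]
counts = pixel_counts(pred, ref)
precision, recall, f1, iou = mask_scores(*counts)
print(counts, iou)
```

Fewer false positives raise precision and IoU while leaving recall unchanged, which is the effect attributed to lowering λ in the discussion of Fig. 10.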

C. Different Methods to Incorporate Attraction Field Representation
It is worth noting that building boundaries learned by the proposed method are significantly improved due to the exploitation of the attraction field representation. In order to further explore how best to leverage the attraction field representation,    separate decoders to jointly optimize two complementary tasks, namely, building semantic segmentation and attraction field representation learning. Note that the architectures of the encoder and decoders in this design are the same as those in U-Net. The statistical and visual results are reported in Table V and Fig. 11, respectively. In terms of both mask and boundary metrics in Table V, all methods show results superior to those of the naive U-Net, which again confirms the significance of the attraction field representation in our task. Among all design options, the proposed framework achieves the best performance. In particular, the F-measure achieved by our approach is more than 3% higher than that of the other methods. Besides, it can be seen that the building boundaries and corners learned by the proposed framework are more accurate than those of its competitors. This suggests that our approach is able to effectively leverage the information in the attraction field representation, which is attributed to our recursive learning strategy.

D. Comparison With State-of-the-Art Methods
To verify the superiority of our approach on datasets with different spatial resolutions, we make a comparison with other state-of-the-art methods on the ISPRS, INRIA, and Planet datasets. The statistical results of different algorithms on the three datasets are shown in Tables VI-VIII, respectively. On both the ISPRS and Planet datasets, the proposed method surpasses all other models in both mask and boundary metrics. For the INRIA dataset, our approach achieves the highest scores in boundary metrics and comparable performance in mask prediction. Compared to our method, Girard's method [35] gains a marginal improvement in mask metrics at the cost of additional ground-truth annotations (i.e., vector format of building footprints). For an intuitive comparison, the visual results of our method and Girard's method [35] are illustrated in Fig. 12. As we can see, Girard's method [35] fails to recover detailed structures of complicated buildings. On the contrary, our approach can accurately capture more geometric details, which again demonstrates the strength of the AFM for the task of building footprint generation.

VI. CONCLUSION
Considering that building boundaries are easily blurred when using semantic segmentation networks to directly learn building footprints, a new end-to-end building footprint generation method through learning the attraction field representation is proposed in this article. The proposed model comprises two modules: an Img2AFM module and an AFM2Mask module. More specifically, the former is designed to learn the attraction field representation, which enables not only the enhancement of building boundaries but also the suppression of background clutter. Afterward, the latter exploits the input remote sensing image and the learned AFM to reconstruct building masks. The performance of the proposed end-to-end network is assessed on three datasets with different spatial resolutions: the ISPRS dataset (5 cm/pixel), the INRIA dataset (30 cm/pixel), and the Planet dataset (3 m/pixel). Experimental results suggest that the incorporation of the attraction field representation in our framework can offer more satisfactory building footprint maps. On the one hand, sharp boundaries and geometric details of buildings can be better preserved. On the other hand, nonbuilding objects that are wrongly detected as buildings can be avoided to a large extent. Thus, we believe that our method has the potential to be a robust solution for building footprint generation at a large scale. Looking into the future, we intend to investigate the potential of the attraction field representation in other tasks, e.g., road extraction and vehicle detection.

Fig. 1. Building footprints generated by U-Net [11] and our proposed method (U-Net with attraction field representation) at large scale and two zoomed-in areas.

Fig. 2. (a) Satellite imagery and the AFMs in the (b) x- and (c) y-directions estimated by our method.

Fig. 3. Overview of the proposed framework. The Img2AFM module takes an image as input and outputs two AFMs in the x- and y-directions. The output is then fed into the AFM2Mask module along with the input image to generate a building mask. Note that these two modules are trained in an end-to-end fashion.

Fig. 4. (a) and (b) Semantic masks and boundaries of buildings in an image. (c) and (d) Region partition map and attraction vectors of the green building line segment according to the method in [16]. (e) Recovered boundary map obtained by the heuristic algorithm in [16].
(a) and (b) present the AFM learned by the proposed U-Net-AFM, and Fig. 9(c) shows the SDT representation learned by U-Net-SDT-recursive. It can be observed that the attraction field representation can better delineate sharp building boundaries. The other factor is that the attraction field representation takes the nonboundary pixels into account, which addresses the challenge of class imbalance in boundary-learning methods.
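The contrast between a 2-D attraction vector and a 1-D signed distance can be illustrated with a small sketch. It is a simplification under stated assumptions: each pixel is attracted to its nearest boundary pixel (the article follows the line-segment-based definition in [16]), boundary pixels are building pixels with a background 4-neighbour, and the SDT is taken as negative inside buildings and positive outside. The function names are hypothetical.

```python
import math


def boundary_pixels(mask):
    """Return (row, col) of building pixels that have a background 4-neighbour."""
    height, width = len(mask), len(mask[0])
    points = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < height and 0 <= cc < width
                if not inside or not mask[rr][cc]:
                    points.append((r, c))
                    break
    return points


def attraction_and_sdt(mask):
    """Return (afm_x, afm_y, sdt) grids for a binary building mask."""
    points = boundary_pixels(mask)
    height, width = len(mask), len(mask[0])
    afm_x = [[0] * width for _ in range(height)]
    afm_y = [[0] * width for _ in range(height)]
    sdt = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Brute-force nearest boundary pixel (fine for a toy example).
            br, bc = min(points, key=lambda p: (p[0] - r) ** 2 + (p[1] - c) ** 2)
            afm_x[r][c] = bc - c  # displacement along x (columns)
            afm_y[r][c] = br - r  # displacement along y (rows)
            dist = math.hypot(bc - c, br - r)
            sdt[r][c] = -dist if mask[r][c] else dist  # assumed sign convention
    return afm_x, afm_y, sdt


# Toy 7 x 7 mask with a 3 x 3 building in the centre.
mask = [[1 if 2 <= r <= 4 and 2 <= c <= 4 else 0 for c in range(7)]
        for r in range(7)]
afm_x, afm_y, sdt = attraction_and_sdt(mask)
left = (afm_x[3][0], afm_y[3][0], sdt[3][0])
right = (afm_x[3][6], afm_y[3][6], sdt[3][6])
print(left, right)
```

The pixels to the left and right of the building receive the same SDT value of 2.0, while their attraction vectors point in opposite directions, (2, 0) and (-2, 0). Every pixel also carries a nonzero target unless it lies on the boundary itself, so the supervision is dense rather than concentrated on a few boundary pixels.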

Fig. 10. Results obtained by the proposed method (U-Net-AFM) with coefficient λ = (a) 0.1, (b) 1, and (c) 10. (d) Result obtained by the naive U-Net. Pixel-based true positives, false positives, and false negatives are marked in white, green, and red, respectively. (e) and (f) Corresponding aerial imagery and ground reference from the INRIA dataset (spatial resolution: 30 cm/pixel).

FOOTPRINT GENERATION IN THE ISPRS DATASET (SPATIAL RESOLUTION: 5 cm/pixel)

learn semantic masks and attraction field representation, respectively.
2) Bischke et al. [47]: It takes a U-Net as the backbone and first adds one convolutional layer after the decoder to learn the attraction field representation. Afterward, this learned attraction field representation and feature maps produced by the decoder are concatenated and fed into another convolutional layer to learn final segmentation masks.
3) Mou and Zhu [57]: It utilizes an encoder and two

Fig. 12. Results obtained by (a) proposed U-Net-AFM and (b) Girard et al. [35].Pixel-based true positives, false positives, and false negatives are marked in white, green, and red, respectively.(c) and (d) Corresponding aerial imagery and ground reference from the INRIA dataset (spatial resolution: 30 cm/pixel).

TABLE I ACCURACIES (%) OF DIFFERENT NETWORKS FOR BUILDING FOOTPRINT GENERATION IN THE ISPRS DATASET (SPATIAL RESOLUTION: 5 cm/pixel)

TABLE II ACCURACIES (%) OF DIFFERENT NETWORKS FOR BUILDING FOOTPRINT GENERATION IN THE INRIA DATASET (SPATIAL RESOLUTION: 30 cm/pixel)

TABLE III ACCURACIES (%) OF DIFFERENT NETWORKS FOR BUILDING FOOTPRINT GENERATION IN THE PLANET DATASET (SPATIAL RESOLUTION: 3 m/pixel)

TABLE IV ACCURACIES (%) OF DIFFERENT COEFFICIENTS OF AFM2MASK LOSS (λ) FOR BUILDING FOOTPRINT GENERATION IN THE INRIA DATASET (SPATIAL RESOLUTION: 30 cm/pixel)

TABLE V ACCURACIES (%) OF DIFFERENT DESIGNS FOR THE INCORPORATION OF ATTRACTION FIELD REPRESENTATION IN THE INRIA DATASET (SPATIAL RESOLUTION: 30 cm/pixel)

TABLE VI ACCURACIES (%) OF DIFFERENT METHODS FOR BUILDING

TABLE VII ACCURACIES (%) OF DIFFERENT METHODS FOR BUILDING FOOTPRINT GENERATION IN THE INRIA DATASET (SPATIAL RESOLUTION: 30 cm/pixel)

TABLE VIII ACCURACIES (%) OF DIFFERENT METHODS FOR BUILDING FOOTPRINT GENERATION IN THE PLANET DATASET (SPATIAL RESOLUTION: 3 m/pixel)