ARC-Net: An Efficient Network for Building Extraction From High-Resolution Aerial Images

Automatic building extraction from high-resolution aerial images has important applications in urban planning and environmental management. In recent years, deep learning methods have brought substantial performance improvements to building extraction. However, existing models mainly pursue accuracy through very large numbers of parameters and complex structural designs, resulting in high computational costs during the learning phase and low inference speed. To address these issues, we propose a new, efficient end-to-end model called ARC-Net. The model includes residual blocks with asymmetric convolution (RBAC) to reduce the computational cost and to shrink the model size. In addition, dilated convolutions and multi-scale pyramid pooling modules are utilized to enlarge the receptive field and to enhance accuracy. We verify the performance and efficiency of the proposed ARC-Net on the INRIA Aerial Image Labeling dataset and the WHU building dataset. Compared to available deep learning models, the proposed ARC-Net demonstrates better segmentation performance at lower computational cost. This indicates that the proposed ARC-Net is both effective and efficient in automatic building extraction from high-resolution aerial images.


I. INTRODUCTION
Automatic extraction of buildings from aerial images is of great importance in a broad range of application fields, including urban planning, change detection, map services, and disaster management [1]-[5]. Recently, with the continuous advancement of satellite and sensor technology, high-resolution remote sensing products have become the preferred data source for building extraction due to their rich textural, semantic, and spatial details. However, the increasing resolution of aerial images results in an increasing degree of redundant interference information and internal differences. Moreover, the diversity of building characteristics (color, shape, size, etc.) remains a challenge for accurate building extraction. Thus, efficient and accurate automatic building extraction is still difficult to achieve and remains a challenging objective that attracts considerable research interest [6].
In the past few years, traditional methods including mathematical techniques and morphology approaches have been proposed to address this issue. Many mathematical descriptors have been introduced to extract the spatial and textural features of an image, such as Histogram of Oriented Gradients [7], Haar spaces [8], Grey Level Co-occurrence Matrix [9], and Local Binary Patterns [10]. Furthermore, several machine learning classifiers have been employed for a pixel-by-pixel analysis, including Random Forests [11], Support Vector Machines [12], K-Means [13], Adaptive Boosting [14], and Conditional Random Fields (CRFs) [15]. However, these methods rely heavily on prior knowledge and parameter selection, which leads to limitations as well as significant time and labor costs when applied in real-life scenarios.
The associate editor coordinating the review of this manuscript and approving it for publication was Stefania Bonafoni.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Recently, with the rapid increase of computational power and available data sources, deep learning technology [16], especially convolutional neural networks (CNNs), has emerged as a powerful tool in computer vision and semantic segmentation [17]. CNNs automatically learn semantic information from the input and generate the classification results through convolutional operations. In the early stages of CNN development, patch-based CNN models, such as VGGNet [18], GoogLeNet [19], ResNet [20], and DenseNet [21], outperformed traditional machine learning methods on classification applications. Some researchers also utilized patch-based CNN methods to segment buildings in remote sensing images and managed to greatly improve the performance [22]-[25]. However, as patch-based CNN models cannot guarantee spatial continuity and consistency, they are not the best solution for the task of building segmentation [26].
In the fully convolutional network (FCN), proposed by Long et al. in 2015 for semantic segmentation [27], fully connected layers are replaced by up-sampling layers so that the output preserves the spatial information of contextual features. Over the past five years, various FCN-based variants have been proposed to pursue more accurate segmentation results. SegNet [28] and U-Net [29] are two classic architectures with symmetric encoder-decoder structures, both of which are regarded as effective due to their capability of recovering semantic details [30]. Some novel FCN-based methods are mainly designed to improve performance by extending the receptive field and by learning multi-scale contextual information. For example, Yu and Koltun [31] utilized dilated convolutions to gather multi-scale contexts. The pyramid pooling module, proposed in PSPNet [32], is applied to capture multi-scale features with different kernel sizes. DeepLab_v2 [33] employs atrous convolution and atrous spatial pyramid pooling (ASPP) to enlarge the receptive field on different levels. Liu et al. [34] merged the spatial pyramid pooling module into the encoder-decoder architecture with a particular focus on building extraction. JointNet [35] introduced a new, dense atrous convolution block combining a dense connectivity block and atrous convolution to obtain multi-scale features. Ji et al. [36] proposed a scale-robust FCN and trained it with five outputs of two ASPP structures. SRI-Net [37] employed large kernel convolution and a spatial residual inception module to preserve details with large receptive fields. Zhang et al. [38] proposed the Web-Net with hierarchical dense connections to propagate feature maps among different levels. These novel FCNs have also been successfully applied to land-use detection and are regarded as state-of-the-art methods for semantic segmentation [39].
Some FCN-based models further apply post-processing to the prediction results to optimize the pixel-wise results and to preserve structural consistency. For example, Shrestha and Vanneschi [40] proposed a novel fully convolutional network using CRFs and exponential linear units for building extraction. Alshehhi et al. [41] proposed a post-processing method integrating low-level features of adjacent regions to enhance the performance. Wang et al. [42] improved the dense conditional random field (Dense CRF) using a superpixel algorithm in post-processing. However, post-processing methods are only able to improve results within a certain range [37]; they cannot fundamentally change the result of the semantic segmentation.
Although the networks presented above have greatly enhanced the performance of semantic segmentation, their computational cost is high and they require long training times, which places a heavy burden on the application of deep learning in remote sensing. Therefore, model complexity and computational cost should be treated as essential indicators when measuring the performance of a CNN architecture [43]. One practical way to decrease the number of model parameters is the use of efficient structures, such as residual blocks, kernel factorizations, and group convolutions. With these performance considerations in mind, while still maintaining high accuracy, a variety of FCN-based architectures have been designed, including ENet [44], ERFNet [45], EDANet [46], the MobileNet family [47], [48], and the ShuffleNet family [43], [49]. Recent networks such as ICNet [50] and BiSeNet [51] aim at a compromise between performance and efficiency, but these models still have complex structural designs and are difficult to deploy and apply. So, there is still room for further improvement.
To better balance accuracy and efficiency, we propose a new network for automatic building extraction, named ARC-Net. The basic architecture of the ARC-Net is an asymmetric encoder-decoder structure. We have designed the residual block with asymmetric convolution (RBAC) module, which incorporates depth-wise separable convolution and asymmetric convolution with the residual connection in order to reduce the computational cost. Dilated convolution is incorporated into the RBAC module to further enlarge the receptive field. Moreover, the atrous spatial pyramid pooling module is added as a connector between the encoder and decoder to aggregate multi-scale contextual information. Experiments on two public building datasets, the INRIA Aerial Image Labeling Dataset [52] and the WHU Building Dataset [53], demonstrate the remarkable performance of the proposed model. Compared to several other FCN-based models, such as SegNet, FCN, U-Net, and ERFNet, the new ARC-Net model achieves higher accuracy with lower computational complexity when applied to building extraction from high-resolution aerial images.
The main contributions of this study are summarized as follows: (1) we design a novel efficient network, called ARC-Net, together with a new residual block with asymmetric convolution module that incorporates depth-wise separable convolution to reduce the computational complexity while retaining sufficient accuracy, and (2) we conduct further experiments to justify some of the design decisions for ARC-Net.
The remainder of this article is organized as follows. The components of the proposed ARC-Net model are introduced in Section II. Section III describes the test datasets and experimental settings. Section IV provides the experimental results of the proposed ARC-Net model, including the quantitative and qualitative comparison with other established models. Finally, a discussion and some conclusions from this study are presented in Sections V and VI, respectively.

II. METHODS
The proposed ARC-Net model follows an asymmetric encoder-decoder architecture, which has already been applied successfully to semantic segmentation. Figure 1 presents the basic structure of the ARC-Net model. In the encoder part (blocks 1-12), several down-sampling blocks and residual blocks with asymmetric convolution (RBAC) modules are employed to extract the feature maps from the inputs while at the same time improving computational efficiency. The RBAC modules are also utilized in the decoder part (blocks 14-20), together with up-sampling operations, to recover the details of the images. The atrous spatial pyramid pooling (ASPP) module is employed as a connector in block 13 between the encoder and decoder to further collect multi-scale contextual information. The various components of the ARC-Net model are presented in Table 1. In the following, each component is discussed in detail.

A. ENCODER WITH DOWNSAMPLING BLOCK AND RBAC MODULE
The residual block with asymmetric convolution (RBAC) module is the fundamental element of the ARC-Net model. It mainly contains two parts: the separable convolution and the asymmetric convolution. At the same time, the residual connection is employed to reduce the complexity and to retain dimensions between the input and output. The depth-wise separable convolution is considered an efficient tool to reduce the computational cost and the number of parameters while achieving similar (or slightly better) performance [54], [55]. It splits the full convolution operation into two independent steps: depth-wise convolution and point-wise convolution [56]. In depth-wise convolution, each kernel has a single feature map in and a single feature map out. Because each kernel operates on one channel only, the depth-wise convolution requires fewer parameters than the standard version. Point-wise convolution is equivalent to a standard convolution with a kernel size of 1 × 1 and aims to combine the channel-wise independent features from the depth-wise convolution. Through such a two-step operation, the number of parameters to be fitted is reduced, speeding up the deep learning computations.
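As a rough illustration of the parameter saving described above, the following sketch counts the weights of a standard convolution and of its depth-wise separable counterpart. The channel numbers (64 in, 64 out) and the 3 × 3 kernel are hypothetical values chosen for the example, not the ARC-Net configuration, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Number of weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a k x k depth-wise step followed by a 1 x 1 point-wise step."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
ratio = sep / std  # equals 1/c_out + 1/k**2

print(std, sep, round(ratio, 3))
```

For these assumed sizes the separable version needs about one eighth of the weights, and the ratio 1/c_out + 1/k² shows that the saving is governed mainly by the kernel size once the number of output channels is large.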
Asymmetric convolutions are widely employed to approximate an existing square-kernel convolutional layer for compression and acceleration [57]. Prior research [58], [59] has shown that a standard d × d convolutional layer can be factorized as a sequence of two layers with d × 1 and 1 × d kernels. The result of a d × 1 convolution followed by a 1 × d convolution is consistent with the result of a direct d × d convolution, but the number of multiplications per output position is reduced from d × d to 2 × d, leading to substantial computational savings as d grows. This is the reason why asymmetric convolution performs well in reducing the model parameters and the computational work. In this research, we follow this approach and factorize a standard two-dimensional d × d convolution kernel into two one-dimensional d × 1 and 1 × d kernels. As presented in Figure 2 (a), a 1 × 1 point-wise convolution is employed at the head of the RBAC module. Each 3 × 1 convolution is then followed by a rectified linear unit (ReLU), while each 1 × 3 convolution is followed by batch normalization and a ReLU function.
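The factorization can be checked numerically with a minimal sketch in plain Python. The kernels and the toy image below are hypothetical; the two-step result matches the direct one exactly here because the 3 × 3 kernel is built as the outer product of the 3 × 1 and 1 × 3 kernels. For a general square kernel the factorized layer is an approximation rather than an identity.

```python
def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation on nested lists (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 1-D kernels and the 3 x 3 kernel they factorize.
col = [[1], [2], [1]]   # 3 x 1 kernel
row = [[1, 0, -1]]      # 1 x 3 kernel
full = [[c[0] * r for r in row[0]] for c in col]  # outer product, 3 x 3

# Small integer test image, so the comparison is exact.
image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(6)]

two_step = conv2d_valid(conv2d_valid(image, col), row)  # 3 x 1 then 1 x 3
direct = conv2d_valid(image, full)                      # single 3 x 3

mults_direct = 3 * 3    # multiplications per output value, d * d
mults_two_step = 3 + 3  # multiplications per output value, 2 * d
print(two_step == direct, mults_direct, mults_two_step)
```

With d = 3 the saving per output value is modest (9 versus 6 multiplications), but it grows linearly with d.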
The down-sampler block in the proposed ARC-Net is inspired by the initial block of ENet [44] and performs down-sampling by concatenating the parallel outputs of a single 3 × 3 convolution with stride 2 and a MaxPooling operation with stride 2. In contrast to ENet, which uses it only as the initial block to perform early down-sampling, we employ it in all down-sampling layers of the ARC-Net. The structure of the down-sampling block is presented in Figure 2.
To improve the accuracy of semantic segmentation for high-resolution aerial images, models usually need to enlarge the receptive field to gather sufficiently rich contextual information for each individual pixel [60]. The method used in the past is to combine stacked convolutional layers with down-sampling layers. However, these extra convolution layers substantially increase the computational effort during learning. Moreover, excessive down-sampling is harmful to dense pixel-level classification because it leads to an unrecoverable loss of spatial information [61]. Dilated convolution [60] introduces an additional parameter to the convolutional layers, named the dilation rate. This rate defines the spacing between the values in a kernel, delivering a wider field of view at the same computational cost. Therefore, dilated convolution helps to enlarge the receptive field and to enhance the segmentation performance [62], [63]. Following the suggestions in the literature, we set the dilation rate to the sequence 2, 4, 8, 16 in blocks 9-12, which incorporate the RBAC module, to obtain a wide receptive field. The dilated convolution is illustrated in Figure 3.
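To make the effect of the dilation rate concrete, the sketch below computes the effective extent of a dilated kernel and the receptive field of a stack of stride-1 layers. It assumes, purely for illustration, one 3-tap dilated convolution per block along each axis; the actual RBAC module may contain more layers, so the numbers are indicative only.

```python
def effective_kernel(k, rate):
    """Extent covered by a k-tap kernel whose taps are spaced `rate` apart."""
    return k + (k - 1) * (rate - 1)


def stacked_receptive_field(k, rates):
    """Receptive field of stride-1 layers applied one after another."""
    rf = 1
    for rate in rates:
        rf += effective_kernel(k, rate) - 1
    return rf


rates = [2, 4, 8, 16]                           # dilation rates of blocks 9-12
extents = [effective_kernel(3, r) for r in rates]
rf_dilated = stacked_receptive_field(3, rates)
rf_plain = stacked_receptive_field(3, [1, 1, 1, 1])  # same depth, no dilation

print(extents, rf_dilated, rf_plain)
```

Under this assumption, four dilated layers cover 61 pixels along each axis compared with 9 pixels for four undilated layers, while the number of weights per layer stays the same.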

B. ATROUS SPATIAL PYRAMID POOLING AS THE CONNECTING MODULE
ASPP, proposed in DeepLab_v2 [33], consists of several parallel atrous convolutions that maintain the same feature map size and whose outputs are fused at the end. Compared with the standard convolutional layer, atrous convolutions can effectively increase the receptive field of the network without extra down-sampling. In this work, we employ the ASPP module with a 1 × 1 convolution and three branches of atrous convolution with rates 6, 12, and 18 as a connector in block 13 after the encoder to effectively capture multi-scale contextual information. Figure 4 presents the detailed structure of the ASPP module in the ARC-Net.
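The requirement that all parallel branches keep the same feature map size can be verified with the usual convolution output-size formula. The sketch below assumes 3 × 3 atrous kernels with stride 1 and a padding equal to the dilation rate, which is a common choice but is not stated in the text; the feature map width of 32 is likewise a hypothetical value.

```python
def conv_output_size(n, k, rate=1, pad=0, stride=1):
    """Output width of a convolution on an input of width n."""
    return (n + 2 * pad - rate * (k - 1) - 1) // stride + 1


n = 32  # hypothetical width of the encoder output

# (kernel size, dilation rate, padding) for each branch; padding = rate is assumed.
branches = [(1, 1, 0), (3, 6, 6), (3, 12, 12), (3, 18, 18)]
sizes = [conv_output_size(n, k, rate, pad) for k, rate, pad in branches]
extents = [k + (k - 1) * (rate - 1) for k, rate, _ in branches]

print(sizes, extents)
```

Because every branch returns a map of the input width, the outputs can be concatenated channel-wise, while the covered extents (1, 13, 25, and 37 pixels) provide the multi-scale context.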

C. DECODER WITH SIMPLE DECONVOLUTIONS
The main task of the decoder phase is to up-sample the feature maps from the encoder phase and to recover the input resolution. Previous works have used heavy-weight decoders [42], [64], which increases the computational cost. Inspired by the idea of a light-weight, asymmetric decoder, we follow a strategy similar to ENet [44] and use a small decoder that up-samples the output of the encoder and fine-tunes the semantic information. The decoder used in this article is composed of blocks 14 to 20, including the up-sampler blocks and RBAC modules (see Table 1 and Figure 1). In contrast to SegNet and ENet, we utilize simple deconvolution layers with stride 2 as the up-sampling block to reduce computational costs. Two RBAC modules are employed to collect the contextual information after each up-sampling block. This operation is repeated twice (blocks 14-19). Finally, the up-sampling block in block 20 generates the output segmentation into two classes: buildings and non-buildings.
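How three stride-2 deconvolutions recover the input resolution can be sketched with the standard output-size formula of a transposed convolution. The kernel size (3), padding (1), and output padding (1) below are assumptions made for illustration, as is the encoder output width of 32, which corresponds to a 256-pixel input reduced by a factor of 8.

```python
def deconv_output_size(n, k=3, stride=2, pad=1, output_pad=1):
    """Output width of a transposed convolution (dilation 1)."""
    return (n - 1) * stride - 2 * pad + k + output_pad


size = 32            # hypothetical width of the encoder output
sizes = [size]
for _ in range(3):   # up-sampling blocks 14, 17, and 20
    size = deconv_output_size(size)
    sizes.append(size)

print(sizes)
```

With these assumed settings each up-sampling block exactly doubles the width, so the third block returns a map at the 256-pixel input resolution.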

III. EXPERIMENTAL DATASETS AND EVALUATION
In this section, we describe the experiments conducted on two building datasets: the INRIA Aerial Image Labeling Dataset and the WHU Building Dataset. Data processing methods and experimental settings are discussed in detail. Five standard metrics are applied to evaluate the performance and efficiency of the proposed ARC-Net. Other state-of-the-art deep learning models are introduced and their performances are compared to ARC-Net.

A. DATASETS
The first dataset used in this research is the INRIA Aerial Image Labeling Dataset [52]. This dataset covers different cities all over the world, including Austin, Chicago, Kitsap, Western/Eastern Tyrol, Vienna, Bellingham, and San Francisco. The spatial resolution of each image is 0.3 m, with a size of 5000 × 5000 pixels and a surface coverage of 1500 × 1500 m². Following previous investigations [29], [40], we selected the first five images of each city for validation and the rest for training. Only two semantic classes were considered as the ground truth: buildings and non-buildings. An example of an input image and its corresponding label is presented in Figure 5. The red color represents the buildings and the black color represents the background.
The WHU Building Dataset was proposed in [53] and covers a surface area of about 450 km² in Christchurch, New Zealand. The dataset contains 8189 images of 512 × 512 pixels with a spatial resolution of 0.3 m. This dataset was divided into a training set, a validation set, and a test set, consisting of 4736 images, 1036 images, and 2416 images, respectively. Figure 6 shows an original image and its corresponding label.

B. DATA PROCESSING
Data augmentation is an effective way to enlarge the datasets and to avoid overfitting [35]. In this study, windows were rotated by 90, 180, and 270 degrees. Moreover, horizontal and vertical flipping were randomly applied with a probability

C. EXPERIMENTAL SETTINGS
The building extraction experiments were implemented in the PyTorch deep learning framework. The experiments were conducted on computer servers with two NVIDIA GeForce GTX 1080 Ti GPUs (11 GB). Parallelization was utilized to make full use of the available graphics processing unit (GPU) capability and to accelerate computation. Due to the limitation in GPU memory, we randomly cropped all images in the two datasets to 256 × 256 pixels for model training and cross-validation in each epoch.
To determine the optimal model parameters, we conducted many comparative experiments. In the training phase, we adopted the ADAM stochastic optimizer [57] with an initial learning rate of 0.0001. To avoid over-fitting, an L2 regularization was introduced with a weight decay of 0.0001 [37]. Models were trained for 150 epochs on the INRIA dataset and for 100 epochs on the WHU dataset. To overcome the limitation of the GPU memory, the mini-batch size was set to 8. Figure 8 displays the accuracies and losses on the INRIA and WHU datasets with increasing epochs. The loss gradually decreases while the accuracy increases and then remains at a high and stable level.

D. EVALUATION METRICS
The quantitative experiments are based on five evaluation metrics: the 'Overall Accuracy' (OA), 'Precision', 'Recall', 'F1-score', and Intersection-over-Union ('IoU'). 'Overall Accuracy' refers to the number of correctly classified pixels divided by the total number of test pixels. 'Precision' is the fraction of correctly classified positive pixels amongst all predicted positive pixels, where 'positive pixel' refers to a building pixel. 'Recall' is the proportion of correctly classified positive pixels amongst all true target pixels. 'F1-score' is the harmonic mean of precision and recall. 'IoU' is the intersection of the prediction and ground-truth regions divided by their union. The five metrics are defined as follows:

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
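The five metrics described above can be computed directly from the four confusion counts. The following is a minimal sketch using the standard definitions; the function name and the pixel counts in the example are hypothetical, and no handling of empty classes (zero denominators) is included.

```python
def segmentation_metrics(tp, tn, fp, fn):
    """Compute OA, Precision, Recall, F1-score, and IoU from pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "OA": (tp + tn) / (tp + tn + fp + fn),
        "Precision": precision,
        "Recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "IoU": tp / (tp + fp + fn),
    }


# Hypothetical counts for a 1000-pixel tile.
m = segmentation_metrics(tp=80, tn=900, fp=10, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this toy example the overall accuracy is 0.98 although the IoU is only 0.80, which illustrates why IoU and F1-score are more informative than OA when background pixels dominate the scene.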

E. MODEL COMPARISONS
The performance of ARC-Net is compared with the following FCN-based models:
SegNet: Badrinarayanan et al. [28] proposed SegNet for semantic pixel-wise segmentation. SegNet employs an encoder-decoder structure in which the max-pooling indices of the encoder are reused in the decoder to up-sample the lower-resolution feature maps. Thus, SegNet is considered efficient in terms of memory and computational time [17].
U-Net: The U-Net architecture was proposed by Ronneberger et al. [29] for biomedical image segmentation. Contracting paths and symmetric expanding paths are used to aggregate contextual information. Multiple skip connections were introduced between the upper and lower layers. Due to its robustness and excellent performance, U-Net and its variants have been widely adopted for many semantic segmentation tasks in recent years.
ENet: ENet was proposed by Chaurasia et al. in 2017 [44], aiming at pixel-wise semantic segmentation with low-latency operation. The ENet model provides similar or, in some cases, even better accuracy with far fewer computations, achieving a good trade-off between the accuracy and the processing time of a network.
ERFNet: The ERFNet was proposed by Romera et al. in 2017 [45]. The core of the ERFNet architecture is a novel layer that uses residual connections and factorized convolutions to remain computationally efficient while delivering remarkable accuracy. The ERFNet model can be applied in real time while providing accurate semantic segmentation [45].
SRI-Net: Liu et al. [37] proposed a novel FCN-based network named SRI-Net in 2019. The spatial residual inception (SRI) module was introduced to capture and aggregate multi-scale contexts for a better semantic representation. Meanwhile, depth-wise separable convolutions were employed to further improve the accuracy and to decrease the number of model parameters.

A. EXPERIMENTAL RESULTS ON THE INRIA DATASET
We first compare the ARC-Net model with the well-known models SegNet, U-Net, ENet, and ERFNet on the INRIA dataset. The experiments are run on the test dataset with the same experimental settings. Figure 9 presents the qualitative segmentation results for all five models on the INRIA dataset. The green, red, blue, and black pixels of the maps represent true positive, false positive, false negative, and true negative predictions, respectively. SegNet and ENet return more false negatives (blue), while U-Net produces more false positives (red) than the other models. ERFNet yields more false positives (red) than ENet. By contrast, the proposed ARC-Net shows significantly fewer false positives (red) and false negatives (blue) than the other models and is able to maintain a high degree of completeness in building segmentation on the INRIA dataset. However, all models consistently misclassified parts of the built-up area in the top left corner of the first test image.
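The color coding used for the qualitative maps can be reproduced from a predicted mask and a ground-truth mask. The sketch below is only an illustration of that coding; the function name, the RGB values, and the 2 × 2 masks are hypothetical.

```python
# RGB colors following the legend: green TP, red FP, blue FN, black TN.
COLORS = {
    "TP": (0, 255, 0),
    "FP": (255, 0, 0),
    "FN": (0, 0, 255),
    "TN": (0, 0, 0),
}


def error_map(pred, truth):
    """Color each pixel according to its prediction outcome (1 = building)."""
    result = []
    for pred_row, truth_row in zip(pred, truth):
        row = []
        for p, t in zip(pred_row, truth_row):
            if p and t:
                key = "TP"
            elif p:
                key = "FP"
            elif t:
                key = "FN"
            else:
                key = "TN"
            row.append(COLORS[key])
        result.append(row)
    return result


# Tiny hypothetical masks.
pred = [[1, 1], [0, 0]]
truth = [[1, 0], [1, 0]]
colored = error_map(pred, truth)
print(colored)
```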
The quantitative comparison of the networks across the entire test dataset is displayed in Table 2.

B. EXPERIMENTAL RESULTS ON THE WHU DATASET
Building segmentation results of the different CNN models on the WHU dataset are displayed in Figure 10 for qualitative comparison. All models except SegNet show quite similar performance in building segmentation. As on the INRIA dataset, SegNet returns too many false positives (red) and false negatives (blue), indicating the worst performance on the WHU dataset among the five tested models. In contrast, the proposed ARC-Net (last row) performs best among all models, accurately delineating building edges. Moreover, in column 2, all deep learning models except the proposed ARC-Net wrongly classified a building at the bottom of the area.
For a quantitative evaluation of the performance, we calculated the individual evaluation metrics presented in Table 3. The proposed ARC-Net model achieves the highest scores relative to the established models except for Recall, where U-Net performs better. The performance differences across the deep learning models other than SegNet are small, especially for U-Net, ERFNet, and SRI-Net. Compared to ERFNet, the proposed model still yields an F1-score that is higher by 1.2% (0.957 vs. 0.946) and an IoU that is higher by 2.2% (0.918 vs. 0.898).
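The percentages quoted above are consistent with relative gains over the ERFNet scores rather than with absolute differences, as the following short check suggests. The helper name is hypothetical; the scores are those reported in the text.

```python
def relative_gain_percent(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


f1_gain = relative_gain_percent(0.957, 0.946)
iou_gain = relative_gain_percent(0.918, 0.898)

print(f"F1-score gain: {f1_gain:.1f}%")
print(f"IoU gain: {iou_gain:.1f}%")
```

Rounded to one decimal place, these evaluate to 1.2% and 2.2%, whereas the absolute differences are 1.1 and 2.0 percentage points.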

C. COMPUTATIONAL EFFICIENCY
Computational efficiency is an additional key performance indicator of deep learning models. It covers the cost and complexity of model training and testing. As stated in the introduction, the main motivation of ARC-Net is to achieve high prediction accuracy at lower computational cost when applied to building extraction. The training time per epoch and the testing time of the different deep learning models are presented in Table 4.  balance between computational performance and segmentation efficiency on the two building datasets.

A. GOING DEEPER OR NOT
Previous research has demonstrated that a deeper CNN structure with more convolution operations processes more semantic information during the training phase, which helps to improve the classification accuracy [20]. However, due to limited computational resources and the complexity of structure design, a deeper neural network requires fitting a larger number of parameters and can become unstable because of exploding and vanishing gradients. In this article, we developed the RBAC module and incorporated it into the ARC-Net in combination with dilated convolutions in order to seek a balance between accuracy and efficiency. As mentioned in Section II-A, the dilation rate in the RBAC modules of the encoder phase is set to the sequence 2, 4, 8, 16, which raises the question of whether a repeated application of dilated convolutions in the RBAC module would enhance the performance. To optimize the architecture of the ARC-Net model, we kept the other experimental settings unchanged and conducted three comparison experiments with different numbers of dilated convolution groups for the RBAC module: one group (as used in ARC-Net), two groups, and three groups. The results of the comparison on the INRIA dataset are presented in Table 5. They show that a single group of dilated convolutions (as applied in the proposed ARC-Net) in fact achieves the best score in all five metrics in comparison to the other two architecture designs.

B. THE EFFECT OF ATROUS SPATIAL PYRAMID POOLING
The ASPP module has demonstrated considerable performance in aggregating multi-scale contextual features, which improves the extraction accuracy for buildings of different sizes, especially medium-sized to over-sized buildings [65]. One crucial innovation of the ARC-Net in comparison to the ERFNet is that it employs the ASPP module as a connector between the encoder and the decoder. To test its contribution, we conducted a comparison experiment with and without the ASPP module of ARC-Net on the WHU dataset. As presented in Table 6, the model with ASPP shows an obvious improvement over the model without ASPP across all evaluation metrics. The comparison demonstrates the effectiveness and applicability of the ASPP module as a connector for building extraction from high-resolution aerial images [34].

C. ABOUT THE PROPOSED METHOD
Deep learning methods, especially FCN-based models, have been widely applied in automatic building extraction from high-resolution aerial images. Recently, several advanced FCN-based models have delivered improved feature representation capabilities to achieve better classification performance (e.g., USPP [34], EU-Net [65], and MC-FCN [66]). However, most of the existing models focus on improving the accuracy with very little consideration of computational efficiency, which suffers from the large numbers of weight parameters introduced in the model design and from high memory costs in the learning phase.
In this article, we designed a novel asymmetric encoder-decoder network with residual connections, named ARC-Net, to pursue good segmentation performance at lower computational cost. The proposed model focuses on three key innovations: (1) The residual block with asymmetric convolutions (RBAC) module is proposed to reduce the model parameters and to address the degradation problem. (2) Larger convolutional kernels and dilated convolutions are used in the backbone of the architecture to enlarge the receptive field and to obtain rich semantics when detecting objects in complex backgrounds. Moreover, depth-wise separable convolutions are introduced to improve computational efficiency without reducing prediction performance. (3) The ASPP module is utilized as a bridge between the encoder and decoder to further aggregate spatial context information. Through these three innovations, the proposed ARC-Net model achieves a good balance between performance and efficiency, delivering better predictions with fewer computational resources.
The training accuracy and loss presented in Section III-C show that the tested CNN models achieved better performance and stability on the WHU dataset than on the INRIA dataset. In contrast to the segmentation results on the INRIA dataset, the IoU of every tested CNN model on the WHU dataset is higher than 85%, indicating that the WHU dataset is of higher quality and that buildings and background are easier to distinguish. The INRIA dataset includes wrong labels, high buildings, and shadows, factors that may heavily influence the discriminative ability of CNN models, as presented in [37], [53]. For this reason, the differences in the experimental results between the two datasets shown in this article are reasonable. Moreover, these results indicate that the proposed ARC-Net is well suited to practical application scenarios.

D. LIMITATIONS
Despite the good performance and efficiency achieved, the application of the ARC-Net model is still limited. With the recent progress of remote sensing technology, it is becoming much easier to obtain remote sensing images at different scales and spectral bandwidths to meet different research requirements. However, the datasets used in this article do not contain images from different sensors or different sensor types, such as hyperspectral images and SAR images. Moreover, buildings have complex morphological characteristics, such as different heights, shapes, and orientations, and with the current approach these attributes cannot be determined directly through deep learning networks. In the future, we will extend the approach to multi-source training data and integrate multi-disciplinary knowledge to jointly extract buildings from remote sensing images.

VI. CONCLUSION
The main objective of this research is to propose an efficient FCN-based model for automatic building extraction from high-resolution aerial images that achieves outstanding accuracy with fewer computational resources. To this end, we proposed the ARC-Net model, which incorporates an asymmetric encoder-decoder structure with the ASPP module as a connector. The RBAC module, the core of the ARC-Net, is designed by incorporating residual connections with depth-wise separable and asymmetric convolutions to reduce the number of model parameters and to accelerate the calculations. In addition, dilated convolutions and the ASPP module are utilized to extend the receptive field and to deliver the desired segmentations. Experiments on two public building datasets, the INRIA and WHU datasets, have shown that the proposed ARC-Net outperforms other established FCN-based models with higher metric scores and less computational time. The buildings were extracted successfully by ARC-Net with fewer classification errors and sharper boundaries, which demonstrates that the proposed ARC-Net achieves high accuracy and efficiency in building extraction from high-resolution aerial images. In future studies, multi-source remote sensing data from different sensors will be combined to further improve automatic building extraction.
JIE ZHOU received the B.S. degree in geographic information science from Yunnan Normal University, in 2017, where she is currently pursuing the M.S. degree in geography teaching. Her research interests include geography-education informationization, GIS, and geography teaching.
WENHUA QI received the bachelor's degree in hydrology and water resources engineering from the School of Water Resources and Environment, China University of Geosciences, Beijing, China, in 2008, and the master's degree in tectonics from the Institute of Geology, China Earthquake Administration, Beijing, in 2011. His research interests include remote sensing and citizen science for natural disaster risk assessment and governance.
XIAOLI LI received the master's degree in structural geology from the Institute of Geology, China Earthquake Administration, Beijing, China, in 2008. She is currently a Senior Engineer with the China Earthquake Networks Center, Beijing. Her research interests include earthquake emergency response and management, earthquake disaster risk assessment techniques, and application of GPS, GIS, and RS to earthquake emergency and earthquake resistance and disaster relief.
LI NI, photograph and biography not available at the time of publication.
XIWEI FAN received the Ph.D. degree in cartography and geographical information system from the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, in 2015. He is currently an Associate Research Fellow with the Institute of Geology, China Earthquake Administration. His research interests include the retrieval and validation of land surface temperature/emissivity and earthquake damage estimation.
ZHIQIANG LI received the Ph.D. degree in geodynamics and tectonophysics from the Institute of Geology, China Earthquake Administration, Beijing, China, in 1997. He is currently a Professor with the China Earthquake Networks Center, Beijing. His research interests include earthquake emergency response and management, earthquake emergency basal database technology, earthquake disaster risk assessment techniques, and application of GPS, GIS, and RS to earthquake emergency and earthquake resistance and disaster relief.