LRAD-Net: An Improved Lightweight Network for Building Extraction From Remote Sensing Images

Building extraction from remote sensing images with deep learning algorithms can overcome the low efficiency and poor results of traditional methods during feature extraction. Although some recently proposed semantic segmentation networks achieve good segmentation performance in extracting buildings, their large numbers of parameters and heavy computation are great obstacles to practical application. Therefore, we propose a lightweight network (named LRAD-Net) for building extraction from remote sensing images. LRAD-Net can be divided into two stages: encoding and decoding. In the encoding stage, the lightweight RegNet network with 600 million FLOPs (600 MF) is selected as the feature extraction backbone after extensive experimental comparisons. Then, a multiscale depthwise separable atrous spatial pyramid pooling structure is proposed to extract more comprehensive and important details of buildings. In the decoding stage, the squeeze-and-excitation attention mechanism is applied innovatively to redistribute the channel weights before fusing feature maps with low-level details and high-level semantics, which enriches the local and global information of the buildings. Moreover, a lightweight residual block with polarized self-attention is proposed; it incorporates features extracted from the spatial and channel dimensions with a small number of parameters and improves the accuracy of recovering building boundaries. To verify the effectiveness and robustness of the proposed LRAD-Net, we conduct experiments on a self-annotated UAV dataset with higher resolution and three public datasets (the WHU aerial image dataset, the WHU satellite image dataset, and the Inria aerial image dataset). Compared with several representative networks, LRAD-Net extracts more building details and has fewer parameters, faster computing speed, and stronger generalization ability, which improves the training speed of the network without affecting the effect and accuracy of building extraction.


I. INTRODUCTION
BUILDINGS are primary spaces for human life and play an important role in the development of humans and society. Recently, building extraction from remote sensing images has been widely used in smart city construction, land use surveys, military target reconnaissance, and other fields of study. The distribution of buildings also has a high reference value when evaluating the population and development of a region and understanding the historical origin of a region [1]. Traditional building extraction methods first extract the statistical features of remote sensing images using specific feature extraction algorithms and then use hand-designed classifiers to extract buildings. Traditional methods can be roughly divided into two types: one is based on the cell values of remote sensing images, and the other is object-oriented classification [2], [3], [4]. For buildings in high-resolution remote sensing images, the traditional approach is to extract buildings from optical images using spectral, textural, geometric, and shading features [5], [6], [7]. In terms of principle, the current traditional methods of building extraction can be classified as extraction based on edges and corner points [8], [9], extraction based on area segmentation [10], and extraction based on building features and the integration of various methods, such as digital elevation models used as auxiliary information [11], [12], [13], [14], [15]. Traditional extraction methods have many limitations with complex images. For buildings with different shapes, sizes, and environments, it is difficult to obtain high accuracy and good generalizability.
With the development of deep learning technology, convolutional neural networks (CNNs) have been applied to the field of building extraction due to their powerful automatic feature extraction capability [16]. Compared with traditional methods, CNN methods are much more efficient and accurate. However, CNN-based segmentation methods still have some drawbacks [17], which hinder the wide application of CNNs. First, the storage overhead is large: the storage space required by sliding window-based CNN methods increases markedly with the number and size of the sliding windows. Computing the convolution for each pixel block one by one is also repetitive, because adjacent pixel blocks typically overlap, which markedly reduces efficiency. Second, as the depth of the network increases, the number of required parameters increases exponentially, which is not conducive to practical applications. These problems were partly solved by the fully convolutional network (FCN) proposed by Long et al. [18]. The FCN reduces the number of parameters and enlarges the receptive field of the neural network. Due to these advantages, FCN-based models have been widely used for other tasks [19], [20], [21]. However, the FCN does not consider global contextual information, and its excessive downsampling operations lose some low-level feature information and make it insensitive to details in images.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To solve these problems, segmentation models that use low-level features with rich spatial information have been developed. Typical model structures include the improved FCN-based model U-Net proposed by Ronneberger et al. [22] and SegNet proposed by Badrinarayanan et al. [23]. U-Net achieves good results in medical image segmentation, and the U-Net++ [24] and U-Net3+ [25] networks, which fuse multilevel features, have been derived from it. Since 2014, the Deeplab series of networks has been proposed [26], [27], [28], [29], [30]; the most effective of these, Deeplabv3+, was inspired by depthwise separable convolution [31], [32], [33] and the encoder-decoder structure. Deeplabv3+ uses a simple encoder-decoder structure and improves the Xception [34] and ASPP modules to increase extraction accuracy.
Recently, models based on the FCN and its variants have been widely used for building extraction in remote sensing images and have achieved good results. For example, Zhang et al. [35] proposed Gaussian expansion convolution and embedded it into a hierarchical dense fusion structure to form a dense hierarchical spatial Gaussian pool (dense HSGP). Dense HSGP has the advantages of the original expanded convolution and retains more contextual information while providing richer receptive fields and higher feature extraction capability. In [36], the authors proposed a dense residual neural network that combines the densely connected CNN and residual network structures, which enables full integration of the low-level features with the high-level features. To alleviate the influence of irrelevant background regions, a network with an attention block and multiple losses (AMUNet) was presented in [37]. Zhu et al. [38] developed E-D-Net, which introduces a cascaded network to address indistinct building boundary extraction. Meanwhile, a multiple attending path neural network was proposed by Zhu et al. [39]. This network learns multiscale features through multiple parallel paths and refines discontinuous building footprints by using an attention mechanism and a pyramid pooling module.
As more layers are stacked, the number of network parameters and the computational complexity become huge, which seriously affects practical applications. Therefore, more researchers have tried to reduce the number of parameters and simplify the computational complexity of the model, and have proposed lightweight models such as RFA-UNet [40], ARC-Net [41], and DAN-Net [42]. RFA-UNet introduced the attention mechanism to reweight the features at different stages before feature fusion, so as to make up for the semantic differences between the features. ARC-Net used residual blocks with asymmetric convolution and atrous convolution to reduce the number of parameters in the model and speed up the calculation. DAN-Net combined the lightweight network DenseNet with a spatial attention fusion module to effectively extract high-level feature information and suppress noise. In 2022, Huang et al. [43] proposed RSR-Net based on the U-Net architecture, which improves the RegNet basic units by incorporating an attention mechanism.
All of these networks achieved marked success in building extraction, but how to balance the accuracy and speed of building extraction still needs to be studied. Therefore, to reduce the number of network parameters and make the model more efficient in practical applications, we design a new lightweight network named LRAD-Net based on the encoder-decoder structure. LRAD-Net achieves good performance with a small number of parameters and little computation, and strikes a good balance between speed and accuracy. The main innovations and contributions are summarized as follows.
1) A lightweight residual block with polarized self-attention (LPA) is proposed. It extracts fused features from both the spatial and channel dimensions, with fewer parameters and higher accuracy in recovering building boundaries.
2) We present a new depthwise separable atrous spatial pyramid pooling (DSASPP) module. It makes full use of the context information of the original remote sensing images, enlarges the receptive field without changing the feature map shape, and improves the multiresolution feature extraction capability of the network.
3) To extract more details of the buildings, we fuse feature maps with low-level details and high-level semantics in the decoder. Before the fusion, the squeeze-and-excitation (SE) attention mechanism is used innovatively to increase the feature weight of the buildings, so as to improve building extraction performance.
4) Because the spatial resolution of the public datasets used is 0.3 m, a new building dataset with a higher spatial resolution of 0.1 m is labeled for building extraction evaluation and analysis in order to validate the robustness of LRAD-Net. We refer to it as the "self-annotated UAV dataset."

II. PROPOSED LRAD-NET
This section details the proposed LRAD-Net, whose structure is shown in Fig. 1. First, considering the computational efficiency of the network, we take the network structure searched by RegNet with a computational complexity of 600 million FLOPs (600 MF) as our feature extraction backbone, and denote it RegNet-600. Second, an SE [44] block is added before the feature fusion, which enhances the sensitivity of the network to the channels and effectively improves accuracy while adding only a few network parameters. Third, due to the inconsistency of building scales in remote sensing images, single-path extraction of semantic features fails to make full use of image context information. Therefore, we propose the DSASPP module to enrich semantic information. Finally, an LPA block is proposed and used in the decoder. Compared with the traditional decoder that uses two sets of 3×3 standard convolution layers [34], the LPA block reduces the number of parameters and floating-point operations (FLOPs) while extracting important features from the relevant spaces and channels to recover building boundaries with greater accuracy.
A. Encoder
1) RegNet-600: In 2020, RegNet [45] was proposed as a network that combines manual design with neural architecture search [46], [47], [48]. We select the network structure searched by RegNet under a computational complexity of 600 MF as our feature extraction backbone, and name it RegNet-600. The structure of RegNet-600 is shown in Fig. 2.
As shown in Fig. 2(a), RegNet-600 consists of a set of stem layers and four stages. As shown in Fig. 2(b), each stage consists of a series of stacked RegY blocks. After each stage, the height and width of the input feature map are halved. The RegY block primarily consists of a residual structure with group convolution, and an SE module is added between the convolution layers; its detailed structure is shown in Fig. 2(c). In the RegY block, the first 1×1 convolution layer reduces the feature dimension, thus reducing the network parameters and the amount of computation. The 3×3 group convolution layer is used to extract textural features, where S is the stride of the convolution, G is the number of groups, and g is the width of each group in the group convolution.
2) Depthwise Separable Atrous Spatial Pyramid Pooling (DSASPP): DSASPP combines the advantages of depthwise separable atrous convolution and spatial pyramid pooling [49]. It enlarges the receptive field without changing the feature map shape and enhances the multiscale feature extraction ability of the network. Because depthwise separable convolution greatly reduces the number of parameters and the computation of the model while maintaining similar performance, we use it to build our DSASPP module.
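To make the parameter saving of depthwise separable convolution concrete, the following minimal sketch counts the weights of a standard convolution and of its depthwise separable counterpart. The channel widths (256 in, 256 out) are hypothetical values chosen only for illustration, and biases and normalization layers are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a k x k standard convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical channel widths, chosen only for illustration.
c_in, c_out, k = 256, 256, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(f"standard: {std}, depthwise separable: {sep}, ratio: {std / sep:.2f}")
```

Under these assumptions the separable form needs roughly one-ninth of the weights; the atrous rate does not change the count, because dilation only spreads the same kernel taps over a larger area.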
As shown in Fig. 3, the DSASPP module consists of a 1×1 convolution; four groups of 3×3 depthwise separable atrous convolutions with atrous rates of {6, 10, 14, 18}; and a global average pooling layer. Thus, the receptive field of the network is enlarged without losing detailed information or increasing computational complexity, the output of each convolution covers a wide range of information, and multiscale features can be captured. As a result, targets of different sizes can be segmented well by the four groups of atrous convolution kernels. In the building extraction application, DSASPP extracts more multiscale features and better segments buildings of different sizes.
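As a rough illustration of how the atrous rates enlarge the receptive field while keeping the map shape, the sketch below computes the spatial extent covered by a 3×3 kernel at each rate and the padding that preserves the feature map size at stride 1. The helper names are hypothetical; the formulas are the usual ones for dilated convolution rather than anything specific to this paper.

```python
def effective_kernel_size(k: int, rate: int) -> int:
    """Spatial extent covered by a k x k kernel with the given atrous rate."""
    return k + (k - 1) * (rate - 1)


def same_padding(k: int, rate: int) -> int:
    """Padding that keeps the map size unchanged at stride 1 (odd k)."""
    return rate * (k - 1) // 2


def output_size(size: int, k: int, rate: int, padding: int) -> int:
    """Output length of a stride-1 atrous convolution along one axis."""
    return size + 2 * padding - rate * (k - 1)


for rate in (6, 10, 14, 18):
    pad = same_padding(3, rate)
    print(rate, effective_kernel_size(3, rate), output_size(32, 3, rate, pad))
```

Each branch still uses only nine taps per channel, so the coverage grows with the rate while the number of weights stays fixed.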
3) SE Attention: To improve the extraction accuracy of buildings from remote sensing images, SE blocks are added after the feature extraction backbone network and before the fusion of deep and shallow features. With the SE block, the convolution operations of the network can focus on extracting building features and suppress irrelevant features. The SE block adjusts the channel information of the input feature map and increases the weight of the building information, so that the network pays more attention to the buildings in the image and completes building extraction and reconstruction more efficiently. The SE block structure is shown in Fig. 4.
The procedure of the SE module can be divided into two steps. In the first step, we obtain a vector with a global receptive field through a global average pooling layer. The equation is as follows:
$$z = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_{in}(i, j) \tag{1}$$
where $f_{in}$ is the input feature, and $H$ and $W$ represent the height and width of the input feature, respectively. Equation (1) changes an input of size ($H \times W \times C$) into an output of size ($1 \times 1 \times C$); each real number in $z$ contains the global feature information of one channel.
In the second step, $z$ is used to generate a weight for each feature channel through two fully connected layers, as expressed in (2), where $L_1$ represents the first fully connected layer, $L_2$ represents the second fully connected layer, and $\sigma$ represents the activation function. According to (2), $s$ is obtained to express the correlation between feature channels.
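The two steps can be sketched in plain Python on nested lists. This is a minimal illustration, not the authors' implementation: the choice of ReLU after the first fully connected layer and sigmoid after the second follows the standard SE design and is an assumption here, and the toy weights `w1` and `w2` are made up.

```python
import math


def squeeze(feature):
    """Global average pooling: feature[c][h][w] -> one value per channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]


def linear(weights, x):
    """Fully connected layer without bias: weights[out][in] applied to x[in]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def excite(z, w1, w2):
    """Two fully connected layers; ReLU then sigmoid (standard SE, assumed)."""
    hidden = [max(0.0, v) for v in linear(w1, z)]
    return [1.0 / (1.0 + math.exp(-v)) for v in linear(w2, hidden)]


def se_block(feature, w1, w2):
    """Rescale every channel of the feature map by its channel weight."""
    s = excite(squeeze(feature), w1, w2)
    return [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, feature)]


# Toy input: 2 channels of size 2 x 2 with made-up weights.
feature = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[0.5, 0.5]]      # first layer: 2 -> 1
w2 = [[1.0], [-1.0]]   # second layer: 1 -> 2
z = squeeze(feature)
s = excite(z, w1, w2)
out = se_block(feature, w1, w2)
```

The output keeps the shape of the input; only the relative scale of the channels changes, which is why the block adds few parameters.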

B. Decoder
The structure of the decoder is shown in Fig. 1. Its input is composed of two parts: one is the low-level features output by the first stage of RegNet-600, which have 48 channels; the other is the high-level features obtained by the DSASPP module, which have 256 channels. The two parts are fused after passing through the SE module to form a new feature map with 304 channels. This new feature map contains meaningful semantic information and building boundary information. Then, we input the fused feature map into the LPA module for feature extraction and obtain a feature map with 256 channels. Finally, the segmentation result is obtained through a 1×1 convolution layer and upsampling.
1) Lightweight Residual Bottleneck With Polarized Self-Attention (LPA) Block: The proposed LPA block consists of three parts: the residual bottleneck structure, the PSA block, and group convolution. The structure of the LPA block is shown in Fig. 5. $C_{in}$, $C_{out}$, $H$, and $W$ represent the input channels, output channels, height, and width of the feature map, respectively. $b$ represents the bottleneck ratio, which means that the number of channels of the output feature map is reduced to $1/b$ of the number of input channels. $G$ is the number of groups in the group convolution. In this article, $b$ is 1 and $G$ is 8.
As shown in Fig. 5, the LPA module includes a main branch and a shortcut branch. The main branch first reduces the dimension through a 1×1 convolution layer to reduce the number of network parameters. Then, to improve the feature extraction capability, we use two 3×3 group convolution layers for feature extraction, followed by a 1×1 convolution layer. To reduce the information loss caused by dimension reduction, the PSA module is applied after the group convolution. At the end of the LPA block, the features extracted by the main branch and the shortcut branch are fused. Compared with the method using two sets of 3×3 standard convolution, the LPA module improves the feature extraction ability of the network with fewer parameters.
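A back-of-the-envelope weight count suggests why the main branch is cheaper than two 3×3 standard convolutions. The sketch below uses the channel numbers stated in the text (304 in, 256 out, b = 1, G = 8); the intermediate width, the omission of the PSA block, the shortcut branch, biases, and normalization layers are all simplifying assumptions, so the figures are indicative only.

```python
def conv_params(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Weights of a k x k convolution with the given number of groups (no bias)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k


C_IN, C_OUT = 304, 256   # fused decoder input and LPA output channels
B, G = 1, 8              # bottleneck ratio and number of groups
MID = C_OUT // B         # assumed width inside the bottleneck

# Reference decoder: two 3 x 3 standard convolutions.
standard = conv_params(C_IN, C_OUT, 3) + conv_params(C_OUT, C_OUT, 3)

# LPA main branch as described: 1x1 conv, two 3x3 group convs, 1x1 conv.
lpa_main = (
    conv_params(C_IN, MID, 1)
    + 2 * conv_params(MID, MID, 3, groups=G)
    + conv_params(MID, C_OUT, 1)
)
print(standard, lpa_main, round(standard / lpa_main, 2))
```

Most of the saving comes from the group convolutions, each of which holds one-eighth of the weights of a standard 3×3 layer of the same width.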
2) Polarized Self-Attention (PSA): PSA [50] block maintains a relatively high resolution in channel and spatial dimensions, which can reduce the information loss caused by dimension reduction. The PSA module is a lightweight plug and play module that can improve the performance of semantic segmentation tasks. The details of PSA can be seen from Fig. 6.
As shown in Fig. 6, the PSA block includes a channel-only branch and a spatial-only branch. Each branch is divided into two parts. The PSA module first uses 1×1 convolution to fully collapse the features in one dimension (such as the channel dimension) while maintaining high resolution in the orthogonal dimension (such as the spatial dimension). For the compressed dimension, PSA uses the softmax normalization function to enhance its information and improve the dynamic range of attention. Finally, the sigmoid function is used for dynamic mapping.
Compared with other attention mechanisms in CNNs (such as convolutional block attention module [51] and efficient channel attention [52]), PSA can maintain a higher resolution in attention calculation and capture long-distance dependencies at a lower computational overhead. In addition, in the channel and space branches, PSA can improve the performance of building extraction task by using softmax-sigmoid joint function to adjust and optimize the focus weights.
3) Group Convolution: The process of group convolution can be divided into three steps. We start by defining some common notation. $X$ is the input feature map with size ($C_1$, $H$, $W$), and $Y$ is the output feature map with size ($C_2$, $H$, $W$). $C_1$ represents the number of input channels, and $H$ and $W$ represent the height and width of the input, respectively. $G$ represents the number of groups, $C_2$ represents the number of output channels, and $k$ stands for the convolution kernel size.
In the first step, $X$ with size ($C_1$, $H$, $W$) is divided into $G$ parts, and $X_i$ denotes the feature map of the $i$th part. The size of $X_i$ is as follows:
$$\left(\frac{C_1}{G}, H, W\right)$$
The second step is to convolve $X_i$ with $W_i$ to get $Y_i$, where $W_i$ is the group convolution kernel with size ($\frac{C_2}{G}$, $k$, $k$) and $Y_i$ is the output feature map of the $i$th part. The size of $Y_i$ is as follows:
$$\left(\frac{C_2}{G}, H, W\right)$$
The third step is to concatenate the $G$ outputs $Y_i$ along the channel dimension to obtain $Y$.
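The shape bookkeeping of group convolution (split, per-group convolution, concatenation) can be traced with a short sketch. It assumes stride 1 and padding that preserves $H$ and $W$, consistent with the output size ($C_2$, $H$, $W$) given above; the function names and the example sizes are hypothetical.

```python
def group_conv_shapes(c1: int, c2: int, h: int, w: int, groups: int):
    """Return the per-group input shape, per-group output shape, and final shape."""
    if c1 % groups or c2 % groups:
        raise ValueError("channels must be divisible by the number of groups")
    x_i = (c1 // groups, h, w)   # step 1: split the input into G parts
    y_i = (c2 // groups, h, w)   # step 2: convolve each part separately
    y = (groups * y_i[0], h, w)  # step 3: concatenate along the channel axis
    return x_i, y_i, y


def group_conv_weights(c1: int, c2: int, k: int, groups: int) -> int:
    """Total weights: G groups, each mapping C1/G to C2/G channels (no bias)."""
    return groups * (c1 // groups) * (c2 // groups) * k * k


x_i, y_i, y = group_conv_shapes(256, 256, 128, 128, groups=8)
print(x_i, y_i, y)
print(group_conv_weights(256, 256, 3, groups=8), group_conv_weights(256, 256, 3, groups=1))
```

With G groups the weight count drops by a factor of G relative to a standard convolution of the same width, because each output channel only sees the input channels of its own group.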

A. Dataset Selection
To demonstrate the feasibility and generalizability of the proposed LRAD-Net in practical applications, we perform experiments on four datasets: the WHU aerial image dataset [51], WHU satellite image dataset [51], Inria aerial image dataset [52] and a custom GZHU UAV image dataset. Several sample cases from these four datasets are shown in Fig. 7.
As shown in Fig. 7, the varied colors and changeable environments in the WHU satellite image dataset and the Inria aerial image dataset make the task of extracting buildings more difficult than in the WHU aerial image dataset and the GZHU UAV image dataset.
The WHU aerial imagery dataset contains 8189 pictures with a resolution of 0.3 m, and each picture has a spatial coverage of 512 × 512 pixels. This dataset is divided into three sets: the training set, which contains 130 500 buildings (4736 pictures); the validation set, which contains 14 500 buildings (1036 pictures); and the test set, which contains 42 000 buildings (2416 pictures).
After preprocessing and random grouping, the WHU satellite dataset is divided into a training set (3135) and a test set (903).
The Inria remote sensing dataset includes areas with different landforms and building types, such as highly dense urban centers dominated by high-rise buildings, suburban areas dominated by low-rise buildings, and mountainous areas with sparsely distributed buildings, which can effectively test the accuracy and robustness of a model in building extraction.
The image quality and label quality of the above three public datasets are relatively high, but datasets in practical applications cannot always match the quality of the WHU dataset. Therefore, to test the performance of LRAD-Net in practical applications, we use UAV images to make a building dataset. For this dataset, UAV images of parts of Haizhu District, Guangzhou, acquired in 2012 were selected for building labeling. The UAV image products selected for this self-annotated dataset are red, green, and blue band images with a spatial resolution of 0.1 m. We select 13 high-quality drone images, each of which includes different buildings (e.g., single-story houses, multistory houses, schools, irregular buildings) as well as many nonbuilding areas. The size of each image is 10 401×10 401. After labeling, we crop the images and labels twice with a size of 512×512 so that they can be input into the network in batches. The completed dataset consists of 6854 samples, which are divided into a training set (4798 samples) and a test set (2056 samples).

B. Experimental Environment Configuration and Evaluation Metrics
The experimental environment is a Windows 10 system with an i5 9500 CPU, 64 GB of RAM, an NVIDIA GeForce RTX 3080 GPU with 10 GB of video memory, and the PyTorch 1.8.1 deep learning framework. To make the models converge faster, the backbone networks of all models in this article are initialized with parameters pretrained on ImageNet, and the remaining network parts are initialized with random parameters. To ensure a fair comparison, the hyperparameter settings of all networks in this article are consistent. The optimization algorithm is the Adam optimizer with default parameter settings, the loss function is Dice loss, the initial learning rate is set to 0.0001, and the batch size is set to 8. A total of 80 epochs of training are conducted. Each batch of images is randomly flipped horizontally and vertically, rotated by 90°, and scaled for data augmentation. The size of the images input into the network is 512 × 512.
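The text does not spell out the Dice loss formulation, so the following is only a sketch of one common soft Dice variant on flattened predictions. The smoothing term `smooth` is an assumption added to avoid division by zero and is not taken from the paper.

```python
def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss for flat lists of probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)


# A perfect prediction gives zero loss; a fully wrong one gives a large loss.
perfect = dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
wrong = dice_loss([0.0, 1.0, 0.0, 1.0], [1, 0, 1, 0])
print(perfect, wrong)
```

Because the loss is driven by the overlap with the building class rather than by per-pixel accuracy, it is a common choice when buildings occupy a small share of each image.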
We use four evaluation metrics to measure the effectiveness of LRAD-Net: intersection over union (IoU), precision, recall, and F-score. The formulae of these indicators are as follows:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{6}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$
$$F\text{-score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$
where TP refers to the number of positive samples (buildings) predicted to be positive; FP is the number of negative samples (nonbuildings) predicted to be positive; TN is the number of negative samples predicted to be negative; and FN denotes the number of positive samples predicted to be negative.
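As a small worked example of these metrics, the sketch below counts TP, FP, TN, and FN on flattened binary masks and derives the four scores. The function names and the toy masks are hypothetical, and returning 0.0 when a ratio is undefined is a convention chosen here, not one stated in the text.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, TN, FN for flat sequences of 0/1 labels (1 = building)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, tn, fn


def building_metrics(pred, truth):
    """Return IoU, precision, recall and F-score; 0.0 when a ratio is undefined."""
    tp, fp, _, fn = confusion_counts(pred, truth)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return iou, precision, recall, f_score


# Toy masks with 3 true positives, 1 false positive and 1 false negative.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
iou, precision, recall, f_score = building_metrics(pred, truth)
print(iou, precision, recall, f_score)
```

Note that TN does not enter any of the four scores, so large nonbuilding areas do not inflate the results.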

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we first design ablation experiments to test the impact of different modules on network performance with the WHU aerial image dataset. Second, to evaluate the feasibility and robustness of LRAD-Net in building extraction tasks, we conduct several experiments on the four datasets mentioned in Section III and compare LRAD-Net with several state-of-the-art networks.

A. Ablation Experiment
To explore the contribution of the different modules (i.e., the DSASPP module, SE block, and LPA block) to the performance of LRAD-Net, we conduct ablation experiments on the WHU aerial image dataset. We first build the baseline model from LRAD-Net. In the baseline, we remove the SE block and replace DSASPP with the ASPP module used in Deeplabv3+ (atrous convolution with atrous rates of {6, 12, 18}). Then we replace the LPA block with two sets of 3×3 standard convolution layers.
On the basis of the baseline, the DSASPP block is denoted (D), the SE attention module is denoted (S), and the LPA block is denoted (L). Performance is evaluated by IoU, precision, recall, F-score, FLOPs, and number of parameters. The results of the ablation study are given in Table I. They show that DSASPP increases the IoU and precision of the network by 0.39% and 0.25%, respectively. At the same time, compared with the ASPP module, the DSASPP module slightly reduces the number of network parameters and the computational complexity. This shows that depthwise separable convolution can achieve better performance while reducing the number of model parameters and the computation. Adding the SE attention module after the feature extraction of the backbone network and before the feature fusion improves building extraction performance and increases the IoU by 0.58%. Using the LPA module instead of two sets of 3×3 standard convolution increases the IoU by nearly 1% and nearly halves the amount of computation. Compared with the baseline, LRAD-Net increases IoU and precision by 1.26% and 1.25%, respectively, which indicates that the model achieves better performance.
To explore the influence of the sampling rate in DSASPP on model performance, three comparison experiments were designed based on LRAD-Net, in which the sampling rates of the atrous convolution in the DSASPP structure were set to {6, 12, 18}, {6, 12, 18, 24}, and {6, 10, 14, 18}. The test results are given in Table II. As can be seen from Table II, LRAD-Net with the sampling rates {6, 10, 14, 18} scores higher overall than with {6, 12, 18, 24}, which indicates that, compared with a DSASPP module with large sampling rates, appropriately reducing the sampling rates makes the features extracted by atrous convolution better fit the characteristics of buildings in remote sensing images. Compared with LRAD-Net with the sampling rates {6, 12, 18}, the combination {6, 10, 14, 18} improves the precision by 0.5% and the IoU by 0.32% while increasing the number of parameters by only 0.19M. Therefore, LRAD-Net uses the atrous rate combination {6, 10, 14, 18} to ensure the optimal performance of the model.

B. LRAD-Net Network Performance Experiment
To verify the segmentation performance of LRAD-Net, we conduct experiments on the four datasets mentioned in Section III. Here, four common classical networks (U-Net, Deeplabv3+, PANet [53], and PSPNet [54]) and two recently proposed lightweight networks (BiSeNet [55] and BiSeNetV2 [56]) are used for comparison. For the classical networks, we use ResNet50 as the backbone of U-Net, Deeplabv3+, PANet, and PSPNet. The lightweight network MobileNetv3 [57] is also used as an encoder in the experiments. The experiments show that LRAD-Net achieves high accuracy with fewer parameters. The experimental results are shown in Tables III-VI. As given in Table III, on the WHU aerial image dataset, although the IoU of Deeplabv3+(Res50) is higher than that of U-Net(Res50), PANet(Res50), and PSPNet(Res50) by 0.58%, 1.24%, and 2.50%, respectively, it is 0.79% lower than that of LRAD-Net. The precision and F-score of LRAD-Net are also significantly higher than those of the compared networks. Compared with the lightweight networks BiSeNet and BiSeNetV2, LRAD-Net is 2.37% and 2.27% higher in IoU and 1.14% and 0.76% higher in precision. It can also be seen from Table III that LRAD-Net has only 7.30M parameters; compared with U-Net, Deeplabv3+, PANet, and PSPNet, the parameters and FLOPs of LRAD-Net are 3 to 4 times lower. This shows that LRAD-Net not only improves model performance significantly, but also has a significant advantage in the number of parameters and FLOPs. Compared with BiSeNet and BiSeNetV2, the number of parameters of LRAD-Net is not significantly different, but its accuracy metrics, such as IoU, are significantly higher. Considering both the accuracy metrics and the application metrics, LRAD-Net outperforms the above seven networks in building extraction.
TABLE VI: Quantitative comparison of the proposed LRAD-Net with various models on the self-annotated UAV dataset.
As can be seen from Tables IV and V, the IoU of all networks is generally not high on the WHU satellite image dataset and the Inria aerial image dataset. On the WHU satellite dataset, lightweight models such as Deeplabv3+(MobileNetv3) and BiSeNet extract buildings better than larger models such as Deeplabv3+(Res50) and U-Net(Res50). This is because the WHU satellite dataset has a small amount of data and a sparse distribution of buildings, and it contains obvious mislabeled and omitted buildings, which affect the performance of deep network models. Compared with Deeplabv3+(MobileNetv3) and BiSeNetV2, the IoU of LRAD-Net is higher by 1.27% and 0.52%, respectively. This is because LRAD-Net integrates the attention mechanism in the feature fusion stage, so that the network pays more attention to the building information in the image, and the DSASPP and LPA modules extract more effective and accurate feature information.
Because the spatial resolution of the datasets used above is 0.3 m, some details of small buildings may be lost, and the advantages of our proposed network are difficult to highlight. To validate the robustness of LRAD-Net, we conduct further experiments on a self-annotated building dataset with a spatial resolution of 0.1 m. Table VI gives the quantitative comparison of the proposed LRAD-Net with other models on the self-annotated UAV dataset. The IoU of LRAD-Net is higher than that of BiSeNet and BiSeNetV2 by 3.99% and 4.32%, respectively. Compared with U-Net(Res50), PANet(Res50), Deeplabv3+(Res50), and PSPNet(Res50), it has a much smaller number of parameters.
It can be seen from Tables III-VI that the performance of the networks differs across datasets. In general, complex networks such as Deeplabv3+(Res50) and U-Net(Res50) perform better than lightweight networks such as BiSeNet on datasets with a large amount of data and good quality, but perform poorly on the WHU satellite image dataset, which has a small amount of data and inaccurate image labels. Compared with these networks, LRAD-Net achieves the best performance on the different datasets, which demonstrates the superiority of LRAD-Net.
To better observe the specific performance of the networks in building segmentation, Figs. 8-11 show segmentation results on the four datasets. It can be seen from scene 1 of Fig. 8 that Deeplabv3+(Res50), U-Net(Res50), and LRAD-Net can completely extract large buildings in the WHU aerial image dataset, while Deeplabv3+(MobileNetv3) and PSPNet(Res50) show obvious omissions. From the marks in scene 2, we can see that for irregular buildings, LRAD-Net accurately extracts the outline of the building without obvious holes. As can be seen from scenes 3, 4, and 5, for areas with dense buildings, LRAD-Net accurately extracts the boundaries of adjacent buildings, with relatively few misclassifications.
According to scenes 1 and 4 in Fig. 9, when extracting a single regular building from the WHU satellite image dataset, U-Net(MobileNetv3), BiSeNet, and LRAD-Net perform well, while the other five networks perform poorly. PANet(Res50), PSPNet(Res50), and BiSeNetV2 show serious omissions and misclassifications, because their simple network structures cannot fully extract the deep features of buildings. As can be seen from the red areas in scenes 2 and 3, the image contains closely adjacent buildings and shadow occlusion, which lead the networks to misjudge the boundaries of the buildings and result in poor segmentation. However, this phenomenon is not obvious for LRAD-Net.
In Fig. 10, LRAD-Net also segments better in scenes with no obvious color difference. It can be seen from scene 1 in Fig. 10 that LRAD-Net effectively extracts irregular buildings and reduces the occurrence of misclassification and holes.
As shown in scene 1 in Fig. 11, on the self-annotated UAV dataset with a higher spatial resolution, PANet(Res50), PSPNet(Res50), and BiSeNetV2 misclassify a large area of ground as buildings, which is caused by the high spatial resolution of the images and the similarity of spectral information between buildings and the ground. LRAD-Net extracts buildings accurately without obvious misclassification. It can also be seen from scenes 2, 3, and 4 in Fig. 11 that LRAD-Net extracts buildings well from images with different luminance.
In general, compared with other networks, LRAD-Net has obvious advantages in extracting medium and large buildings, and also has certain advantages in identifying small building groups.

C. LRAD-Net Network Time Efficiency Experiment
The inference speed of a model on specific hardware is affected not only by its parameters and FLOPs but also by several other factors, such as hardware features, software implementation, and system environment. Therefore, to verify the inference speed of LRAD-Net, we test the speed of LRAD-Net against seven other networks when predicting a single three-channel image (512×512) on an NVIDIA GeForce RTX 3080. The result is shown in Fig. 12.
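A latency test of this kind is usually run with a few warm-up calls followed by averaged timed calls. The sketch below shows that pattern with a stand-in workload; `mean_latency` and `dummy_model` are hypothetical names, and the warm-up and run counts are assumptions rather than the settings used in this article. On a GPU, asynchronous kernels would also need to be synchronized (for example with `torch.cuda.synchronize()` in PyTorch) before reading the clock.

```python
import time


def mean_latency(fn, warmup: int = 3, runs: int = 10) -> float:
    """Average wall-clock seconds per call of fn after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


def dummy_model():
    """Stand-in workload; a real test would run the network on a 512 x 512 image."""
    return sum(i * i for i in range(10_000))


latency = mean_latency(dummy_model)
fps = 1.0 / latency
print(f"{latency * 1000:.3f} ms per call, {fps:.1f} calls per second")
```

Reporting an average over several runs reduces the influence of one-off costs such as memory allocation and kernel selection on the first calls.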
As can be seen from Fig. 12, the speed of BiSeNet, BiSeNetV2, and LRAD-Net is almost the same. Compared with U-Net(Res50), Deeplabv3+(Res50), and other networks, they have obvious advantages in speed, which is of great help to practical applications. Combined with the accuracy of the four datasets, LRAD-Net can achieve a good balance between accuracy and speed.

V. CONCLUSION
Based on the encoder-decoder structure, this article proposes a lightweight building segmentation network, LRAD-Net. LRAD-Net makes the best use of the advantages of the RegNet-600 network, the SE module, and the proposed DSASPP and LPA modules. DSASPP expands the receptive field through parallel sampling with depthwise separable atrous convolutions at multiple sampling rates, enriches semantic information, and avoids segmentation errors caused by relying on local features only. The SE attention module increases the weight of building features and reduces the interference of noise during feature fusion. The LPA module in the decoder extracts more building information with fewer parameters and improves the accuracy of building extraction. Compared with commonly used semantic segmentation networks, LRAD-Net achieves a balance between accuracy and speed and is more helpful for practical applications. Although LRAD-Net performs better than common semantic segmentation networks in inference speed, it is still far from ideal; further improving its inference speed will be our future work.