ResAt-UNet: A U-Shaped Network Using ResNet and Attention Module for Image Segmentation of Urban Buildings

Architectural image segmentation refers to the extraction of architectural objects from remote sensing images. At present, most neural networks ignore the relationship between feature information, and there are problems such as model overfitting and gradient explosion. Thus, this article proposes an improved UNet based on ResNet34 and Attention Module (ResAt-UNet) to solve the related problems. The algorithm adds a two-layer residual structure (BasicBlock) and a regional enhancement attention mechanism (Space Enhancement Area Enhancement, SEAE) to the original framework of UNet, which enhances the network depth, improves the fitting performance, and extracts small objects more accurately. The experimental results show that the network has achieved MIOU of 78.81% in the Massachusetts dataset, and the newly developed model outperforms UNet in both quantitative and qualitative aspects.


I. INTRODUCTION
REMOTE sensing satellite imagery is an efficient means of obtaining geospatial information and data. Compared with traditional aerial photography, it has unique advantages: remote sensing satellites cover a large monitoring area and can transmit, process, and dynamically monitor data in real time.
A central problem in research on remote sensing images of urban areas is how to effectively segment and extract architectural objects in the images. The use of remote sensing satellite technology for urban image segmentation has become an important means of planning cities and studying urban areas, and how to obtain building information accurately and dynamically has aroused the interest of researchers. However, due to the wide variety of buildings together with complex and changeable image backgrounds, building image segmentation has always been a thorny problem.
Since the beginning of the 21st century, image segmentation has been a research hotspot and many segmentation methods have been proposed. These methods can generally be divided into two categories: traditional methods based on space and features, and semantic segmentation methods based on deep learning. Traditional image segmentation methods are usually based on the feature domain and spatial domain and use prior information about the segmentation target in the image, such as shape, brightness, and texture, to obtain the segmentation results. Kalyankar [1] used five different thresholding algorithms to segment remote sensing satellite images and compared their segmentation effects with each other. The best performing method was the histogram and edge maximization thresholding method. Adams et al. [2] developed a supervised learning method that generated label information for plant image segmentation, which outperformed traditional thresholding methods. Pratondo et al. [3] proposed a combination of a region-based dynamic contour model and machine learning for medical ultrasound image segmentation, and the newly proposed model achieved higher segmentation accuracy than Chan-Vese [4]. Liow and Pavlidis [5] used edge detection to locate building boundaries and then adopted target area growth to determine the location and area of buildings. Avudaiammal et al. [6] combined morphological, spectral, shape, and geometric characteristic information with a support vector machine (SVM) to classify remote sensing images into buildings and nonbuildings using the morphological building index. Yang et al. [7] developed a new method for local spectral angle thresholding, which segmented buildings better than the global thresholding method. However, with the continuous development of remote sensing technology, image resolution has become higher and background areas have become more complex, and the results of traditional segmentation are no longer satisfactory.
The past decade has witnessed the continuous development of neural networks, which are also used in cloud segmentation [8], [9], [10], [11], [12], power transmission systems [13], and building segmentation [14], [15], [16], [17], [18], [19]. In 2015, Long et al. [20] proposed the fully convolutional network (FCN), which was the first network successfully used for image segmentation. Researchers subsequently developed a series of FCN-based networks. Ronneberger et al. [21] developed UNet, a symmetric network with an encoder and a decoder, which has been widely used in medical image segmentation, but it may lead to an imbalance between decoding ability and encoding ability. Badrinarayanan et al. [22] proposed SegNet, which uses the max pooling indices in the up-sampling stage. Zhao et al. [23] proposed the pyramid scene parsing network (PSPNet), which used a pyramid pooling module to extract contextual information and stacked the contextual information with the extracted feature information to complete image segmentation. However, urban buildings are obscured by shadows in high-resolution remote sensing images, so the network's extraction of object information is incomplete. Sun et al. [16] proposed a Multi-Attention UNet based on the UNet network, adding a residual encoder with an attention mechanism and a self-attention mechanism, and achieved accurate segmentation on the WHDLD and DLRSD remote sensing image datasets. However, due to the complex background of the ground features and irregular boundaries, this method produces some misclassifications and omissions during extraction. Chen et al. [17] proposed a UNet architecture with a self-attention mechanism bias module, which strengthened the weight of the target area in the down-sampling part and used the bias parameter to enhance the model reshaping ability in the up-sampling part. They achieved good segmentation effects on the WHU and Massachusetts urban image datasets. Although the existing convolutional neural networks can achieve good results in the field of architectural image segmentation, most of them ignore the correlation of feature information between different channels. Soni et al. [18] proposed a two-scale input-based architecture, the Dual-scale CNN (Du-CNN). A Difference of Normals approach has also been used to isolate 3-D buildings from other objects in densely built-up areas; most of its building extraction results have a Precision > 0.9 and favorable Recall and F-score values [19]. Although the above algorithms have achieved good results in urban building segmentation, there are still some shortcomings. As the network deepens, the model is prone to over-fitting. In addition, the networks still extract building details insufficiently.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To solve the above problems, this article combines the Space Enhancement Area Enhancement (SEAE) attention mechanism and the two-layer residual module (BasicBlock) to propose ResAt-UNet, a UNet-based network using ResNet and an attention module. The main contributions are as follows.
1) An encoder-decoder network structure, ResAt-UNet, is designed and implemented based on UNet and supports end-to-end training. It uses ResNet34 fused with an attention mechanism as the feature extraction network to enhance feature extraction capability. Additionally, adding ResNet blocks to the network addresses the overfitting problem while increasing network depth.
2) The up-sampling layer of UNet reduces the optimization parameters of the network; the shallow feature map is spliced with the deep feature map, which facilitates information fusion, reduces the number of network parameters, and mitigates over-fitting.
3) The improved CBAM (SEAE) is added to the intermediate connection layer to capture information in the channel and spatial domains and enhance feature representation.
4) An improved loss function that combines DiceLoss and BceLoss is used instead of the cross-entropy loss function alone.
5) A series of model comparison and ablation experiments are conducted to verify the effectiveness of the proposed model, which achieves better segmentation scores than the other models.

II. METHODS
This article uses the UNet proposed by Ronneberger et al. [21] in 2015 as the basic model architecture. The UNet network has a typical encoder-decoder structure. Specifically, after continuous convolution and down-sampling of the image in the encoding stage, the obtained feature map has a small scale but it contains high-dimensional semantic feature information. Then in the decoding stage, the network restores the feature map to the original size through continuous convolution and up-sampling, and finally obtains the segmentation results of the image. The middle part of the network connects the feature map of the encoding and decoding stages through the Concat layer and combines the context information to obtain the final prediction results by continuous up-sampling and feature fusion. We replace the down-sampling part of the UNet network with the residual network ResNet, and add several attention modules to the intermediate connection layer to obtain the improved network model: ResAt-UNet.

A. ResAt-UNet
Based on the original UNet structure, the residual module BasicBlock is used to rebuild the encoder, and the attention mechanism SEAE is added to the middle Concat layer. Specifically, the down-sampling part of UNet is replaced by the first five parts of ResNet34. The down-sampling operation uses 3 × 3 convolution instead of maximum pooling to reduce the loss caused by the pooling operation. Meanwhile, the network retains the skip connections of UNet, adds the attention module (SEAE) in the middle connection layer, and enhances the last three down-sampling output feature maps in the space and channel dimensions. These maps are then merged with the corresponding up-sampling feature maps to complete the information fusion. The specific mechanisms of the residual module BasicBlock and the attention mechanism SEAE are introduced in the following sections.
Fig. 1 shows the detailed structure of the ResAt-UNet network model. Yellow, blue, pink, red, green, and gray represent 1 × 1 convolution, 3 × 3 convolution, 7 × 7 convolution, maximum pooling, up-sampling, and replication and connection, respectively. Black and purple represent the residual modules whose internal Conv1 has stride 1 and stride 2, respectively, which are introduced in Section II-B, and SEAE represents the attention module, which is introduced in Section II-C. In the encoding stage, feature extraction is conducted through four groups of 1 × 1 convolution, maximum pooling, and residual convolution operations. The convolution kernel has a size of 3 × 3 and the maximum pooling has a size of 3 × 3, which compresses the size of the feature map to 1/16. In the decoding stage, four sets of 2 × 2 up-sampling and 3 × 3 convolution operations are used to recover the size of the feature map. In the first three up-samplings, the up-sampled feature map is fused in the middle layer with the corresponding down-sampled feature map, which has been enhanced by the attention module SEAE. Finally, the prediction image with the same size as the input image is obtained by up-sampling and 1 × 1 convolution, and the image segmentation is completed. The specific structure parameters of the network are shown in Table I, where Conv represents convolution, Maxpool represents maximum pooling, Upsample represents up-sampling, and Stride represents the step size; the structure name of each layer and the size of the output feature map are also annotated.

B. Residual Module Basic Block
He et al. [24] proposed the Deep Residual Network in 2015, in which the residual means the difference between the observed value and the estimated value. Assuming that the network input is x and the expected mapping is H(x), the network mapping can be converted to the residual of the network and the corresponding relationship can be represented as

$$F(x) = H(x) - x \tag{1}$$

where x is the feature mapping of the upper layer network, F(x) is the residual of this layer network, and H(x) is the observation value of this layer network. The relationship among the three is

$$H(x) = F(x) + x. \tag{2}$$

Although H(x) and F(x) + x have the same effect, F(x) is much simpler to optimize than H(x). For a network with L layers, the relationships between different layers can be represented as

$$x_{k+1} = x_k + F(x_k) \tag{3}$$

$$x_L = x_k + \sum_{i=k}^{L-1} F(x_i). \tag{4}$$

Assume that, once the network reaches a certain depth, the network model has reached its best and the network loss error has reached its lowest. If the network depth continues to increase, it may lead to network degradation. In order to keep the network in the optimal state, we can use the residual network, so that the residual F(x) of the additional layers is 0 and these layers reduce to an identity mapping.
In the k-th layer, the gradient of the loss function loss with respect to x_k can be expressed as

$$\frac{\partial\, loss}{\partial x_k} = \frac{\partial\, loss}{\partial x_L} \cdot \frac{\partial x_L}{\partial x_k} = \frac{\partial\, loss}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_k} \sum_{i=k}^{L-1} F(x_i) \right). \tag{5}$$

The ResNet residual network is mainly composed of residual blocks [24], which can solve the problem of network degradation. When residual blocks are introduced, the network can be built deeper and the network segmentation effect becomes better. ResNet proposed by He et al. [24] has two residual structures, namely, the two-layer residual module (BasicBlock) and the three-layer residual module (BottleNeck). For networks with fewer layers such as ResNet18 and ResNet34, BasicBlock, composed of two 3 × 3 convolutions, is often used. The schematic diagram is shown in Fig. 2. The network input is x, the expected mapping is H(x), and the residual mapping of the network is F(x). Conv1 and Conv2 represent two different convolution layers. The specific parameters are annotated in detail in the figure, where BN represents BatchNorm, Relu represents activation by the Relu function, and shortcut is the shortcut channel.
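The effect of the shortcut can be illustrated with a toy example. The following minimal sketch is not the authors' implementation: it models the residual branch F as two small fully connected layers instead of 3 × 3 convolutions and omits batch normalization, and the helper names `linear`, `relu`, and `residual_block` are illustrative assumptions. With all residual weights set to zero, F(x) = 0 and the block reduces to an identity mapping for non-negative inputs.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(v, weights):
    """Matrix-vector product: output[i] = sum_j weights[i][j] * v[j]."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Toy residual block: y = relu(F(x) + x) with F(x) = W2 * relu(W1 * x)."""
    f = linear(relu(linear(x, w1)), w2)            # residual branch F(x)
    return relu([fi + xi for fi, xi in zip(f, x)])  # shortcut adds the input back


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With F(x) = 0 the block passes the (non-negative) input through unchanged.
print(residual_block(x, zeros, zeros))
```

Because the identity is available "for free" through the shortcut, extra blocks only need to learn a correction to x rather than the full mapping H(x).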
The black modules in Fig. 1 correspond to Fig. 2(a) and the purple modules correspond to Fig. 2(b). Because no dimension reduction is needed in (a), no convolution layer is added to the shortcut and the output is obtained directly as input plus residual. In (b), the size of the input feature map is assumed to be 64 × 64 × 64. In order to reduce the size and the amount of data, the stride of the first convolution is set to 2. Since input and residual need to be added at the end of the block, a convolution layer with the same stride of 2 is placed in the shortcut to unify the input and output dimensions.
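The shape bookkeeping of the two block variants can be checked with the standard convolution output-size formula, floor((n + 2p − k) / s) + 1. The sketch below is an illustration under assumptions: the function names are hypothetical, padding 1 for the 3 × 3 convolutions follows the usual ResNet layout, and the output channel count of 128 for the stride-2 block is an assumed value (the text only gives the 64 × 64 × 64 input).

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def basic_block_shape(in_shape, out_channels, stride):
    """Return (output shape, whether the shortcut needs its own convolution).

    in_shape is (channels, height, width). Conv1 is 3x3 with the given stride,
    Conv2 is 3x3 with stride 1; both use padding 1 (assumed, as in ResNet).
    """
    c, h, w = in_shape
    h1, w1 = conv_out(h, 3, stride, 1), conv_out(w, 3, stride, 1)
    h2, w2 = conv_out(h1, 3, 1, 1), conv_out(w1, 3, 1, 1)
    # The plain shortcut only works when input and residual have equal shapes.
    needs_projection = stride != 1 or c != out_channels
    return (out_channels, h2, w2), needs_projection


# Fig. 2(a)-style block: stride 1, shapes match, shortcut is the identity.
print(basic_block_shape((64, 64, 64), 64, 1))
# Fig. 2(b)-style block: stride 2 halves the map, so the shortcut needs a conv.
print(basic_block_shape((64, 64, 64), 128, 2))
```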
The ResNet34 structure used in this article is shown in Fig. 3; it is a 34-layer convolutional neural network. The parameters of each layer are shown in Table II, where Conv represents convolution and /2 represents down-sampling with a stride of 2. The residual block in the dotted box corresponds to Fig. 2(b), which reduces the dimension of the input, while the residual block in the solid box corresponds to Fig. 2(a). In this article, the down-sampling part of UNet is modified. Because UNet down-samples four times, only the first six parts of ResNet34 are kept, including Conv1, Maxpool, Conv2_x, and the residual stages that follow it. After removing the fully connected layer, the structure corresponds to the original network structure. In this case, the size of the input image must be a multiple of 32. The retained modules are shown in Fig. 4.
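The multiple-of-32 requirement follows from the strides of the retained stages. As a rough sketch, assuming the standard ResNet34 layout (the stage names after Conv2_x and all stride values below are taken from that standard layout, not from the text), the feature-map side length can be traced as follows; `encoder_sizes` is a hypothetical helper.

```python
# (stage name, stride) pairs, assuming the standard ResNet34 layout.
STAGES = [
    ("Conv1", 2),
    ("Maxpool", 2),
    ("Conv2_x", 1),
    ("Conv3_x", 2),
    ("Conv4_x", 2),
    ("Conv5_x", 2),
]


def encoder_sizes(size):
    """Trace the feature-map side length through the encoder stages."""
    if size % 32 != 0:
        raise ValueError("input side length must be a multiple of 32")
    sizes = []
    for name, stride in STAGES:
        size //= stride          # exact, because size is a multiple of 32
        sizes.append((name, size))
    return sizes


for name, side in encoder_sizes(256):
    print(f"{name}: {side} x {side}")
```

For the 256 × 256 crops used later in the experiments, the deepest map under these assumptions is 8 × 8, which is 1/32 of the input side.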

C. Space Enhancement Area Enhancement
Attention mechanisms are commonly used in the fields of natural image processing, knowledge graphs, and language processing. In 2018, Woo et al. [25] proposed a simple and efficient Convolutional Block Attention Module (CBAM), which can be seamlessly connected to the CNN architecture to complete the enhancement of feature map channels and spatial dimensions. In the same year, Park et al. [26] proposed the Bottleneck Attention Module suitable for deep neural networks, which completed the reinforcement of the feature map in the down-sampling stage. In 2020, Nanjun et al. [27] noticed the correlation between the intermediate features of the neural network, introduced the attention mechanism into the field of remote sensing image processing and inserted the newly developed hybrid first and second order attention (HFSA) into the UNet model. On this basis, the attention mechanism has been widely used in computer vision.
The SEAE module used in this article is composed of a channel attention module (CAM) and a spatial attention module (SAM), whose mechanisms are introduced below. With the attention module, the network can learn the relationships between channels and spatial positions, which improves the efficiency of information processing. The implementation is shown in Fig. 5. The attention module is a modified version of CBAM. First, in CAM, the depth of the MLP in the CBAM channel attention module is expanded from 2 to 6, so that the attention weights can be learned more thoroughly and computed more accurately. Second, in SAM, the spatially enhanced feature maps are added to the down-sampling feature maps of the corresponding size to alleviate the gradient degradation problem caused by increasing the depth of the neural network. Spatial features are then learned in both the mean pooling and maximum pooling channels, Concat is used to fuse the features of the two channels and enrich the spatial information of the module, and the result is squeezed by a 1 × 1 pointwise convolution. In this way, richer information can be obtained and a more effective attention map is formed, so that spatial information is captured better. Building targets in the images are small, and the improved CBAM can enhance spatial information and facilitate the extraction of tiny targets.
The feature map W_x produced by down-sampling in the Encode phase has the same size as the feature map W_g produced by up-sampling in the Decode phase. First, CAM is used to enhance the channel features of W_x, giving the enhanced feature map W_c. Subsequently, W_c and W_g are added in SAM and the enhancement of the spatial area of W_c is completed to obtain W_s. The skip connection operation is then performed in the Concat layer.
Fig. 6 shows the CAM module. First, maximum pooling and average pooling are used to compress the feature map W_x from the Encode stage to channel feature maps of size 1 × 1, namely P_avg(x) and P_max(x). New nonlinear elements are then introduced through successive convolutions activated by the linear rectification function (Relu) in the Multilayer Perceptron (MLP) [28]. The MLP structure diagram is shown in Fig. 7 and the specific operation of MLP is shown in (6) and (7)

$$\begin{cases} f_1(x) = \sigma_1\big(W_{1\times 1}(x)\big) \\ f_i(x) = \sigma_1\big(W_{1\times 1}(f_{i-1}(x))\big), & i = 2, \dots, 6 \end{cases} \tag{6}$$

$$F(x) = f_6(x) \tag{7}$$

where f_1(x), f_2(x), f_3(x), f_4(x), f_5(x), and f_6(x) are the results of the first, second, third, fourth, fifth, and sixth convolution and Relu activation, respectively. σ_1 is the activation function Relu and W_{1×1} is the 1 × 1 convolution. In (7), F(x) is the final output of the feature map through MLP.
Then the results of the two branches after MLP processing are combined and the Sigmoid function is used to map the weight of each channel to a value between 0 and 1. Finally, the weight matrix is multiplied by the original feature map to complete the feature enhancement on the channel. CAM can capture the relationship between spatial features and improve the network segmentation performance [29]. The whole process is shown in (8) and (9)

$$C_{weight} = \sigma_2\big(F(P_{avg}(W_x)) + F(P_{max}(W_x))\big) \tag{8}$$

$$W_c = C_{weight} \otimes W_x \tag{9}$$

In (8) and (9), σ_2 is the activation function Sigmoid, F is the MLP mapping, P_avg is the average pooling, P_max is the maximum pooling, W_x is the input feature map, C_weight is the channel weight matrix, and W_c is the feature map after channel feature enhancement.
Fig. 8 shows the SAM used in this article, which is an improvement based on Attention-Gate [30]. Adding SAM in the middle layer helps the network to focus on a specific part of the input.
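The channel attention steps described for CAM can be sketched in plain Python on a tiny feature map. This is a minimal illustration, not the authors' code: a 1 × 1 convolution applied to a 1 × 1 map is written as a matrix-vector product, the same six-layer MLP is shared by both pooling branches and the two branch outputs are summed before the Sigmoid (both as in CBAM, assumed here), and the identity weights and helper names are hypothetical.

```python
import math


def relu(v):
    return [max(0.0, a) for a in v]


def linear(v, weights):
    """1x1 convolution on a 1x1 map, i.e. a matrix-vector product."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def mlp(v, layers):
    """Apply each layer followed by ReLU (six layers in the modified CAM)."""
    for weights in layers:
        v = relu(linear(v, weights))
    return v


def channel_attention(w_x, layers):
    """w_x is a list of C channels, each an H x W nested list of floats."""
    flat = [[p for row in channel for p in row] for channel in w_x]
    p_avg = [sum(c) / len(c) for c in flat]   # global average pooling
    p_max = [max(c) for c in flat]            # global max pooling
    summed = [a + b for a, b in zip(mlp(p_avg, layers), mlp(p_max, layers))]
    c_weight = [1.0 / (1.0 + math.exp(-s)) for s in summed]  # Sigmoid
    # Re-scale every channel of the input by its attention weight.
    w_c = [[[p * w for p in row] for row in channel]
           for channel, w in zip(w_x, c_weight)]
    return w_c, c_weight


identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weights for the demo
w_x = [[[1.0, 1.0], [1.0, 1.0]],
       [[0.0, 0.0], [0.0, 4.0]]]
w_c, c_weight = channel_attention(w_x, [identity] * 6)
print(c_weight)
```

In this demo the second channel, which contains a strong peak, receives the larger weight.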
First, the feature map W_c enhanced by CAM and the feature map W_g obtained by up-sampling in the Decode stage are each compressed by a 1 × 1 convolution to the same number of channels, and the two feature maps are then added element by element. Next, through Relu activation and 1 × 1 convolution compression, a feature matrix with one channel is obtained. Max pooling and average pooling are then applied to obtain two different spatial feature maps, Cat concatenates the maps along dimension 1, a 1 × 1 convolution compresses the result to one channel, and finally the sigmoid function activates it to produce the weight map, whose values are standardized between 0 and 1. The weight map is multiplied with the feature map W_x to perform feature re-calibration, which completes the enhancement of the spatial area of the building image.
In the formulas describing this process, P_avg represents average pooling, P_max represents maximum pooling, σ_1 is the activation function Relu, σ_2 is the activation function Sigmoid, W_{1×1} is the 1 × 1 convolution, and Cat represents the splicing operation. S_a and S_b are the spatial feature maps of the upper and lower pooling channels, W_x and W_g are the input feature maps, and S_weight is the attention coefficient, which is obtained by addition. Although additive attention requires more computation than multiplicative attention, it yields a better segmentation effect. W_S is the feature map after spatial region feature enhancement.
In the skip connection stage of the network, different numbers of SEAE modules are added to enhance the building pixels of the feature map and suppress the feature information expression of the background pixels, with the aim of improving the anti-noise ability of the network model.
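One possible reading of the spatial attention steps is sketched below in plain Python. It is only an approximation under stated assumptions: the pooling is taken across the channel axis of the fused map (as in CBAM), the intermediate squeeze to a single channel before pooling is skipped, and the weight map is applied to the feature map passed in as `w_c`. The names `conv1x1`, `spatial_attention`, `theta_c`, `theta_g`, and `psi` are hypothetical.

```python
import math


def conv1x1(fmap, weights):
    """1x1 convolution: mixes channels at every pixel; weights[out][in]."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [[[sum(wt * fmap[c][i][j] for c, wt in enumerate(row))
              for j in range(w)] for i in range(h)] for row in weights]


def spatial_attention(w_c, w_g, theta_c, theta_g, psi):
    """Additive spatial gate on maps shaped [channels][height][width]."""
    a = conv1x1(w_c, theta_c)   # encoder branch
    b = conv1x1(w_g, theta_g)   # decoder branch
    # Element-wise addition followed by ReLU.
    fused = [[[max(0.0, p + q) for p, q in zip(ra, rb)]
              for ra, rb in zip(ca, cb)] for ca, cb in zip(a, b)]
    h, w = len(fused[0]), len(fused[0][0])
    # Average and max pooling across channels give two spatial maps.
    s_a = [[sum(ch[i][j] for ch in fused) / len(fused) for j in range(w)]
           for i in range(h)]
    s_b = [[max(ch[i][j] for ch in fused) for j in range(w)]
           for i in range(h)]
    # Concatenate the two maps and squeeze them to one channel.
    logits = conv1x1([s_a, s_b], psi)[0]
    s_weight = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits]
    # Re-calibrate every channel with the same spatial weight map.
    w_s = [[[p * s for p, s in zip(row, srow)]
            for row, srow in zip(channel, s_weight)] for channel in w_c]
    return w_s, s_weight


identity = [[1.0, 0.0], [0.0, 1.0]]
w_c = [[[1.0, 2.0]], [[3.0, 4.0]]]    # 2 channels, 1 x 2 pixels
w_g = [[[0.0, 0.0]], [[0.0, 0.0]]]
w_s, s_weight = spatial_attention(w_c, w_g, identity, identity, [[0.5, 0.5]])
print(s_weight)
```

Pixels where the fused response is larger receive weights closer to 1, so they are attenuated less than background pixels.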

III. EXPERIMENT
In this chapter, the above improved convolutional neural network model is used for experiments to complete the segmentation of building images. The detailed process of the algorithm is shown in Fig. 9, which mainly includes processing of image, division of test set and training set, training of the neural network and acquisition of test results.
Step 1: The original data of the Massachusetts dataset is large and the difference between the background and the building is not obvious, so it needs to be preprocessed, mainly including image cropping to reduce the image size and image rotation to expand the dataset.
Step 2: Before model training, the training set, test set and validation set are roughly divided into a ratio of 6:2:2 and each image must be preprocessed.
Step 3: In the model training stage, the process initializes the parameters of the model, inputs the dataset, trains and optimizes the network, and saves the trained model.
Step 4: After each test image is input to the prediction module, the output should be an image with the building and the background separated. The final prediction result is obtained and the effect is evaluated.
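The 6:2:2 division in Step 2 can be sketched as a seeded random split. This is an illustrative helper under assumptions: the name `split_dataset`, the fixed seed, and the use of integer indices in place of image files are not from the original.

```python
import random


def split_dataset(items, ratios=(6, 2, 2), seed=0):
    """Shuffle items and split them into train, test, and validation parts."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded for reproducibility
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]       # remainder goes to validation
    return train, test, val


train, test, val = split_dataset(range(12000))
print(len(train), len(test), len(val))
```

With 12000 preprocessed images this yields the 7200, 2400, and 2400 images reported for the dataset.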
In the following, we first introduce the experimental environment and then preprocess the dataset to illustrate the training methods and evaluation criteria. Finally, the experimental results are displayed and analyzed.

A. Experimental Environment
Hardware environment: This experiment uses a MacBook Air produced in 2022 running the Ubuntu operating system. PyTorch [31], one of the most widely used deep learning frameworks, is selected to build the neural network in the PyCharm 2021 integrated development environment, and the specific configuration is shown in Table III.
The training parameters are set as shown in Table IV. Training stops when the number of training iterations reaches 8400.

B. Dataset
In this article, data enhancement methods such as flipping, translation, scaling, and cropping are applied to the Massachusetts dataset to generate multiple similar images and increase the size of the dataset. As the dataset grows, it becomes harder for the model to overfit all samples, which pushes the model to generalize.
The dataset used in this article is the Massachusetts remote sensing city image dataset [32]. The original dataset contains 137 large images for training, 4 for validation, and 10 for testing, each of size 1500 × 1500. By rough estimation, the image resolution is 1.5 meters and the image quality is not very high. Because the images are large, they are not suitable for direct use in training, testing, and verification. After preprocessing, the image size is reduced to 256 × 256. The training set, test set, and verification set are randomly divided according to the proportion, giving 7200 images for training, 2400 images for testing, and 2400 images for verification.
The preprocessing method is described in detail below. The original image of Massachusetts is shown in Fig. 10 and the experimental image after clipping is shown in Fig. 11. As can be seen from Fig. 11, the image quality is mediocre, the background is complex, and the dataset contains many different categories, such as docks, farms, forests, hills, lakes, and roads. The sizes of different targets vary greatly: cars and pedestrians occupy only a few pixels, while lakes, rivers, and forests are large. Moreover, the boundaries of different targets are fuzzy and the pixel classification between adjacent targets is difficult.

C. Data Augmentation
1) Image Cropping: For the original remote sensing image with a size of 1500 × 1500, cropping is needed to construct the building image dataset. To facilitate processing, each image is cut into 25 images of 256 × 256. Fig. 12 shows the effect of cutting a single image into 25 images in proportion.
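A 5 × 5 cropping grid can be sketched as follows. The exact placement of the crops is not specified in the text, so this example assumes evenly spaced tiles over a 1500 × 1500 image (the size given in the dataset description); `crop_boxes` is a hypothetical helper that returns boxes as (left, top, right, bottom).

```python
def crop_boxes(image_size=1500, tile=256, per_side=5):
    """Return per_side * per_side evenly spaced square crop boxes."""
    step = (image_size - tile) / (per_side - 1)
    offsets = [round(i * step) for i in range(per_side)]
    return [(left, top, left + tile, top + tile)
            for top in offsets for left in offsets]


boxes = crop_boxes()
print(len(boxes))
print(boxes[0], boxes[-1])
```

With these defaults the tiles are 311 pixels apart, so neighbouring 256 × 256 crops do not overlap and the last tile ends exactly at the image border.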
2) Image Rotation: To expand the dataset and facilitate the subsequent training of the convolutional neural network, image rotation is used for preprocessing. Image rotation can be implemented with an affine transformation, which preserves the original "straightness" and "parallelism" of the image. The effect of expanding the dataset by image rotation is shown in Fig. 13 and the affine transformation matrix is shown in (14). After data augmentation, the model performance is improved and overfitting is mitigated. In (14), x and y are the input pixel coordinates, x′ and y′ are the output pixel coordinates, t_x is the displacement along the horizontal axis, and t_y is the displacement along the vertical axis.
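As an illustration of how an affine transformation maps pixel coordinates, the sketch below uses the standard homogeneous rotation-plus-translation matrix. This particular matrix form and the rotation angle parameter are assumptions (the text names only x, y, x′, y′, t_x, and t_y), and the function names are hypothetical.

```python
import math


def affine_matrix(angle_deg, t_x=0.0, t_y=0.0):
    """3x3 homogeneous matrix: rotation about the origin, then translation."""
    a = math.radians(angle_deg)
    return [[math.cos(a), -math.sin(a), t_x],
            [math.sin(a), math.cos(a), t_y],
            [0.0, 0.0, 1.0]]


def transform(matrix, x, y):
    """Map input coordinates (x, y) to output coordinates (x', y')."""
    x_out = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    y_out = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    return x_out, y_out


# Rotate the point (1, 0) by 90 degrees, then shift it by (2, 3).
print(transform(affine_matrix(90, t_x=2.0, t_y=3.0), 1.0, 0.0))
```

Because the mapping is linear plus a shift, straight lines stay straight and parallel lines stay parallel, which is the property the text relies on.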

D. Improvement of Loss Function
In this task, since the goal is to divide the input image into two categories, we combine the binary cross-entropy loss function BceLoss and the set similarity loss function DiceLoss to obtain the improved loss function. For data containing N samples, BceLoss and DiceLoss are shown in (15) and (16), while the improved loss function CombinedLoss is shown in (17). In these equations, x_n indicates the predicted target category and y_n indicates the actual target category. BceLoss applies to pixel-level prediction tasks and is particularly suited to learning small samples, but it is vulnerable to uneven sample distribution [33]. DiceLoss considers the loss more globally and tends to handle large samples, which makes it more suitable for binary segmentation [34]. In this article, the building objects to be segmented are very small and the numbers of object and background samples in the image are not balanced. DiceLoss can complete learning and training without being influenced by the background size. Combining the two introduces additional weights to BceLoss, alleviates the imbalance between the numbers of building and background samples, and is conducive to the training optimization and parameter updating of the network.
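A weighted combination of the two losses can be sketched in plain Python using the standard definitions of binary cross-entropy and Dice loss. This is an illustration rather than the authors' formulation: the clamping constant `eps`, the smoothing term, and the parameter names `w_bce` and `w_dice` are assumptions, and the sketch does not decide which of the paper's weights λ and μ multiplies which term.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; pred values are probabilities in [0, 1]."""
    total = 0.0
    for x, y in zip(pred, target):
        x = min(max(x, eps), 1.0 - eps)   # avoid log(0)
        total += y * math.log(x) + (1.0 - y) * math.log(1.0 - x)
    return -total / len(pred)


def dice_loss(pred, target, smooth=1e-7):
    """1 - Dice coefficient; smooth avoids division by zero."""
    inter = sum(x * y for x, y in zip(pred, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(pred) + sum(target) + smooth)


def combined_loss(pred, target, w_bce, w_dice):
    """Weighted sum of the two losses."""
    return w_bce * bce_loss(pred, target) + w_dice * dice_loss(pred, target)


pred = [0.9, 0.1, 0.8, 0.2]     # predicted building probabilities
target = [1.0, 0.0, 1.0, 0.0]   # ground-truth labels
print(combined_loss(pred, target, w_bce=0.2, w_dice=0.8))
```

The Dice term depends only on the overlap between prediction and label, so it is unaffected by how many background pixels the image contains, while the cross-entropy term supplies a per-pixel training signal.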

E. Evaluation Criteria
To comprehensively evaluate the performance of the model, we use six widely used evaluation indexes for remote sensing building image segmentation, namely Precision (P), Mean Pixel Accuracy (MPA), Mean Intersection over Union (MIOU), Frequency Weighted Intersection over Union (FWIOU), Recall, and F1-score (F1). In the formulas for these indexes, BB is the number of pixels correctly predicted as the building target; BN is the number of building pixels falsely detected as background; NB is the number of pixels wrongly detected as building targets; and NN is the number of pixels correctly detected as the background target.
As shown in Table V, when λ is set to 0.2 and μ is set to 0.8, the resulting MPA, MIOU, and FWIOU have obvious advantages over the alternatives. Therefore, λ is set to 0.2 and μ is set to 0.8 in the CombinedLoss loss function.
2) Ablation Experiment: The original meaning of ablation is surgical resection of body tissue [35]. Long et al. [36] defined the ablation experiment as deleting some network structures from a relatively complex neural network and testing the network performance, in order to understand the role of the internal structure of the network. In the field of deep learning, using ablation experiments to remove parts of the network contributes to a better understanding of the network behavior, which is very important for deep learning research and can help to study the causal relationships of the system [36]. In this section, in order to select the number of attention modules that gives the best performance, four ablation experiments are conducted to compare the segmentation results of the network with zero, one, two, and three SEAE attention modules, namely ResAt-UNet, ResAt-UNet(1SEAE), ResAt-UNet(2SEAE), and ResAt-UNet(3SEAE). In order to evaluate ResNet backbones of different depths, several networks are compared, namely UNet, ResAt-UNet(2SEAE), UNet-ResNet50(2SEAE), and UNet-ResNet101(2SEAE). To evaluate each trained network, the results of P, MPA, MIOU, and FWIOU are shown in Tables VI and VII. As the network becomes deeper, the model performs increasingly poorly, so ResAt-UNet(2SEAE) is selected for the next comparative experiment. Fig. 14 compares the visual effect of remote sensing building image segmentation across the networks of the ablation experiment, using four groups of images from different scenes. In each image, the white part stands for the building and the black part is the background. The existence of roads and vegetation interferes with image segmentation [29].
The first image in Fig. 14(a) is remote sensing images of urban suburbs. The first image in Fig. 14(b) is the ResAt-UNet segmentation result, which is rough; the first image in Fig. 14(c) is the result of ResAt-UNet(1SEAE) segmentation, which can segment some large buildings but cannot completely separate them from the background; and the first image in Fig. 14(d) is the ResAt-UNet(2SEAE) segmentation result, which can describe the building contour as the segmentation is more accurate and there is no false detection. The first image in Fig. 14(e) is the segmentation result of ResAt-UNet(3SEAE). The extracted targets are relatively complete, but the buildings and roads cannot be well separated and there is a situation where some backgrounds are mistakenly regarded as building targets.
The second image in Fig. 14(a) is the remote sensing image of university town areas. The second image in Fig. 14(b) is the ResAt-UNet segmentation result, which does not segment small targets; the second image in Fig. 14(c) is the result of ResAt-UNet(1SEAE) segmentation, which can separate the building and the background, but there are missing and false detection; the second image in Fig. 14(d) is the segmentation result of ResAt-UNet(2SEAE), which segments most building targets completely and accurately and there are fewer missed buildings; and the second image in Fig. 14(e) is the result of ResAt-UNet(3SEAE) segmentation, which can segment most buildings, but the contour of the building target is rough and there is a problem of false detection.
The third image in Fig. 14(a) is the remote sensing image of residential areas. The third image in Fig. 14(b) is the ResAt-UNet segmentation result, which misses a small number of targets; the third image in Fig. 14(c) is the ResAt-UNet(1SEAE) segmentation result, which has more missed targets and larger false detection area. The third image in Fig. 14(d): ResAt-UNet(2SEAE) and the third image in Fig. 14(e): ResAt-UNet(3SEAE) can segment the vast majority of building targets whereas ResAt-UNet(2SEAE) for building target extraction is more detailed.
The fourth image in Fig. 14(a) is the remote sensing image of hilly areas. The fourth image in Fig. 14(b) is the ResAt-UNet segmentation result, which misses individual targets. The fourth image in Fig. 14(c) is the ResAt-UNet(1SEAE) segmentation result, which mistakes land and road for building targets and misses some buildings. The fourth image in Fig. 14(e) is the ResAt-UNet(3SEAE) segmentation result, which can extract most of the building targets, but there is a false detection in which a small amount of rock is segmented as buildings. The fourth image in Fig. 14(d) is the ResAt-UNet(2SEAE) segmentation result. The segmentation is clearly improved, missed and false detections are rare, and the buildings are better separated from the background.
Fig. 15 compares the details of the visual effects of remote sensing building image segmentation in the ablation experiment; the parts with poor segmentation are selected for analysis and comparison. The yellow frame in the following diagram marks missed detections, where a building is mistaken for background. The red frame marks areas where the segmentation is poor. The blue frame marks false detections, where background is mistaken for a building.
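The two error types marked by the yellow and blue frames can be made concrete at the pixel level. The following is a minimal sketch, not the authors' code: it assumes binary masks stored as nested lists with 1 for building and 0 for background, and the function name `error_maps` and the toy masks are hypothetical.

```python
def error_maps(pred, truth):
    """Return (missed, false_alarm) 0/1 maps for two same-sized binary masks.

    missed:      building pixels predicted as background (yellow-frame errors)
    false_alarm: background pixels predicted as building (blue-frame errors)
    """
    missed = [[int(t == 1 and p == 0) for p, t in zip(p_row, t_row)]
              for p_row, t_row in zip(pred, truth)]
    false_alarm = [[int(t == 0 and p == 1) for p, t in zip(p_row, t_row)]
                   for p_row, t_row in zip(pred, truth)]
    return missed, false_alarm


# Toy 2 x 3 example (hypothetical data, not taken from the paper).
truth = [[1, 1, 0],
         [0, 0, 0]]
pred = [[1, 0, 0],
        [0, 1, 0]]
missed, false_alarm = error_maps(pred, truth)
```

Summing each map gives the number of missed and falsely detected pixels, which is one simple way to back the visual comparison with counts.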
The first image in Fig. 15(a) is the remote sensing image of some suburban areas. The first image in Fig. 15(b) is the ResAt-UNet segmentation result, which is rough; the first image in Fig. 15(c) is the ResAt-UNet(1SEAE) segmentation result, which misses many building targets and is not accurate enough; the first image in Fig. 15(d) is the ResAt-UNet(2SEAE) segmentation result, which is more accurate and rarely shows missed or false detections; and the first image in Fig. 15(e) is the ResAt-UNet(3SEAE) segmentation result, which separates most of the targets, but there are also many inaccurately segmented regions.
The second image in Fig. 15(a) is the remote sensing image of the university town area. The second image in Fig. 15(b) is the ResAt-UNet segmentation result, which misses a target; the second image in Fig. 15(c) is the ResAt-UNet(1SEAE) segmentation result, which misdetects many building targets and does not process the edges of building targets accurately. The second image in Fig. 15(d) is the ResAt-UNet(2SEAE) segmentation result, which is more accurate both in segmentation and in the extraction of building feature information. Missed and false detections almost never appear, and all building targets are segmented. The second image in Fig. 15(e) is the ResAt-UNet(3SEAE) segmentation result, which separates most of the targets, but there are errors.
The third image in Fig. 15(a) is the remote sensing image of residential areas. The third image in Fig. 15(b) is the segmentation result of ResAt-UNet. The third image in Fig. 15(c) is the segmentation result of ResAt-UNet(1SEAE). It misdetects many building targets and the processing of building target edge is not accurate enough, so the segmentation effect is not ideal. The third image in Fig. 15(d) is ResAt-UNet(2SEAE) segmentation result, the segmentation being more accurate and the processing of details being in place. The third image in Fig. 15(e) is the result of ResAt-UNet(3SEAE) network segmentation, which detects most of the building targets, but the edge information of small targets is not accurate enough and the building contour is not clear enough.
The fourth image in Fig. 15(a) is the remote sensing image of hilly areas. The fourth image in Fig. 15(b) is the ResAt-UNet segmentation result. The fourth image in Fig. 15(c) is the ResAt-UNet(1SEAE) segmentation result, which misses many building targets. The fourth image in Fig. 15(d) is the ResAt-UNet(2SEAE) segmentation result, which segments all the targets without missed or false detections. The fourth image in Fig. 15(e) is the ResAt-UNet(3SEAE) segmentation result, which has a false detection and mistakes some rocks for buildings.
Combining the subjective observations of the experiments in Figs. 14 and 15 with the objective quantitative data, ResAt-UNet(2SEAE) extracts small building targets more accurately, handles image detail more completely, and separates buildings from the background better, so it is selected for the following comparative experiments.
3) Comparative Experiment: To test the effectiveness of the improved network in this article, namely ResAt-UNet(2SEAE), which adds the residual module and the attention module, MLP [28], SVM [37], FCN8s [20], Bilateral Segmentation Network (BisNet) [38], Dual Attention Network (DANet) [39], SegNet [22], PSPNet-101 [23], DRNet [40], Deep Feature Aggregation Network (DFA-Net) [41], Spatial Residual Inception Convolutional Neural Network (SRI-Net) [17], DeepLabV3+ [42], UNet [21], Du-CNN [18], MA-FCN [45], BRRNet [44], and the ResAt-UNet network with only residual modules are selected for building image segmentation, and their segmentation results are compared with those of ResAt-UNet(2SEAE). Table VIII compares the quantitative results of the improved network with those of other classical building segmentation networks. The quantitative results obtained by the traditional methods and networks MLP, SVM, FCN8s, SegNet, BisNet, and DANet are not ideal. The results obtained by PSPNet, DRNet, and SRI-Net are improved, but they are slightly worse than those of UNet, DeepLabV3+, DFA-Net, and ResAt-UNet(2SEAE). ResAt-UNet(2SEAE) obtains the best P, MPA, MIOU, and FWIOU. Compared with UNet, these metrics increase by 0.0554, 0.0513, 0.0639, and 0.0422, respectively, which demonstrates the accuracy of the algorithm as well as the reliability and effectiveness of the network improvement. Fig. 16 is the visual effect comparison chart of eight remote sensing building image segmentation methods with good quantitative indicators, and eight groups of segmentation images of different scenes are selected. In the images, the white part represents the buildings and the black part is the background. The presence of roads and vegetation interferes with image segmentation.
The first image in Fig. 16(a) is the remote sensing image of dense residential areas. The first image in Fig. 16(b) is the MLP segmentation result. The segmentation effect is not good and the building contour cannot be described. The first image of Fig. 16(c) is the FCN8s segmentation result and the building contour cannot be described; the first image of Fig. 16(d) is the BisNet segmentation result; the first image of Fig. 16(e) is the PSPNet-101 segmentation result; and the first image of Fig. 16(f) is the UNet segmentation result, which is better in large building areas. The first image of Fig. 16(g) is the ResAt-UNet segmentation result. Most of the residential buildings are segmented. The first image in Fig. 16(h) is the segmentation result of ResAt-UNet(2SEAE), which can well describe the building and the segmentation is relatively accurate without false detection.
The second image in Fig. 16(a) is the remote sensing image of residential areas. The second image in Fig. 16(b) is the MLP segmentation result, and the second image in Fig. 16(c) is the FCN8s segmentation result, which cannot describe the building contour. The second image in Fig. 16(d) is the BisNet segmentation result, the second image in Fig. 16(f) is the UNet segmentation result, and the second image in Fig. 16(g) is the ResAt-UNet segmentation result, which performs well in the building area. The second image in Fig. 16(e) is the PSPNet-101 segmentation result, which missed the large building. The second image in Fig. 16(h) is the result of ResAt-UNet(2SEAE) segmentation. The vast majority of building targets are segmented completely and accurately, and there are almost no missed targets. The third image in Fig. 16(a) is the remote sensing image of dense residential areas. The third image in Fig. 16(b) is the MLP segmentation result, and the third image in Fig. 16(c) is the FCN8s segmentation result, which separates some large buildings. The third image in Fig. 16(d) is the BisNet segmentation result, the third image in Fig. 16(e) is the PSPNet-101 segmentation result, the third image in Fig. 16(f) is the UNet segmentation result, and the third image in Fig. 16(g) is the ResAt-UNet segmentation result. These four methods perform well in the building area, but the individual small buildings are missed. The third image of Fig. 16(h) is the result of ResAt-UNet(2SEAE) segmentation, which is more detailed for extraction of the building target.
The fourth image in Fig. 16(a) is the remote sensing image of industrial areas. The fourth image of Fig. 16(b) is the MLP segmentation result, which is not ideal, and the building segmentation is incomplete. The fourth image of Fig. 16(c) is the FCN8s segmentation result, the fourth image of Fig. 16(d) is the BisNet segmentation result, the fourth image of Fig. 16(e) is the PSPNet-101 segmentation result, and the fourth image of Fig. 16(f) is the UNet segmentation result. These four methods are incomplete for the segmentation of large warehouse targets, and the fourth image of Fig. 16(g) is the ResAt-UNet segmentation result. The fourth image in Fig. 16(h) is the ResAt-UNet(2SEAE) segmentation result, which is more accurate.
The fifth image in Fig. 16(a) is the remote sensing image of dense residential areas. The fifth image of Fig. 16(b) is the MLP segmentation result, and the segmentation effect is not good. The fifth image of Fig. 16(c) is the FCN8s segmentation result, which cannot describe the building contour. The fifth image of Fig. 16(d) is the BisNet segmentation result; the fifth image of Fig. 16(e) is the PSPNet-101 segmentation result; the fifth image of Fig. 16(f) is the UNet segmentation result; and the fifth image of Fig. 16(g) is the ResAt-UNet segmentation result, which segments most of the residential buildings. The fifth image in Fig. 16(h) is the result of ResAt-UNet(2SEAE) segmentation, which completely separates small objects and can better separate buildings and backgrounds.
The sixth image in Fig. 16(a) is the remote sensing image of commercial areas. The sixth image of Fig. 16(b) is the MLP segmentation result, which misses many targets. The sixth image of Fig. 16(c) is the FCN8s segmentation result, and the sixth image of Fig. 16(d) is the BisNet segmentation result. These two methods miss some small targets. The sixth image of Fig. 16(e) is the PSPNet-101 segmentation result; the sixth image of Fig. 16(f) is the UNet segmentation result; the sixth image of Fig. 16(g) is the ResAt-UNet segmentation result; and the sixth image in Fig. 16(h) is the ResAt-UNet(2SEAE) segmentation result. These four methods can complete the accurate segmentation of small targets.
The seventh image in Fig. 16(a) is the remote sensing image of industrial areas. The seventh image in Fig. 16(b) is the MLP segmentation result, and the seventh image in Fig. 16(c) is the FCN8s segmentation result, which does not extract the warehouse target. The seventh image of Fig. 16(f) is the result of UNet segmentation; the seventh image of Fig. 16(g) is the result of ResAt-UNet segmentation; and the seventh image of Fig. 16(h) is the result of ResAt-UNet(2SEAE) segmentation. These three methods can accurately segment large and small targets, and ResAt-UNet(2SEAE) has the best segmentation effect.
The eighth image in Fig. 16(a) is the remote sensing image of industrial areas. The eighth image in Fig. 16(b) is the MLP segmentation result, and its segmentation effect is poor. The eighth image in Fig. 16(c) is the FCN8s segmentation result; the eighth image in Fig. 16(d) is the BisNet segmentation result; the eighth image in Fig. 16(e) is the PSPNet-101 segmentation result; and the eighth image in Fig. 16(f) is the UNet segmentation result. The segmentation of large targets by the above methods is incomplete. The eighth image in Fig. 16(g) is the ResAt-UNet segmentation result. The eighth image in Fig. 16(h) is the ResAt-UNet(2SEAE) segmentation result. The two large warehouse targets are completely segmented without missed or false detections.
Fig. 17 compares the details of the visual effects of remote sensing building image segmentation in the comparative experiment; the parts with poorer segmentation are cropped for analysis and comparison. The yellow frame in the following diagram marks missed detections, where a building is mistaken for background. The red frame marks areas where the segmentation is poor. The blue frame marks false detections, where background is mistaken for a building.
The first image in Fig. 17(a) is the remote sensing image of dense residential areas. The first image in Fig. 17(b) is the MLP segmentation result; the first image in Fig. 17(c) is the FCN8s segmentation result; the first image in Fig. 17(d) is the BisNet segmentation result; the first image in Fig. 17(e) is the PSPNet-101 segmentation result; the first image in Fig. 17(f) is the UNet segmentation result, which misses many buildings; the first image in Fig. 17(g) is ResAt-UNet segmentation result, which segments most residential buildings; the first image of Fig. 17(h) is the result of ResAt-UNet(2SEAE) segmentation, which detects all targets.
The second image in Fig. 17(a) is the remote sensing image of residential areas. The second image of Fig. 17(b) is the MLP segmentation result; the second image of Fig. 17(c) is the FCN8s segmentation result; the second image of Fig. 17(d) is the BisNet segmentation result; the second image of Fig. 17(f) is the UNet segmentation result, which misses the large targets; the second image of Fig. 17(g) is the ResAt-UNet segmentation result; the second image of Fig. 17(h) is the ResAt-UNet(2SEAE) segmentation result, which detects all the targets.
The third image in Fig. 17(a) is the remote sensing image of dense residential areas. The third image of Fig. 17(b) is the MLP segmentation result, and the segmentation result is poor. The third image of Fig. 17(c) is the FCN8s segmentation result; the third image of Fig. 17(d) is the BisNet segmentation result; the third image of Fig. 17(e) is the PSPNet-101 segmentation result; the third image of Fig. 17(f) is the UNet segmentation result; the third image of Fig. 17(g) is the ResAt-UNet segmentation result. These five methods miss some small buildings. The third image in Fig. 17(h) is the segmentation result of ResAt-UNet (2SEAE), which separates all building targets and backgrounds.
The fourth image in Fig. 17(a) is the remote sensing image of industrial areas. The fourth image of Fig. 17(b) is the MLP segmentation result, which is not ideal, and the building segmentation is incomplete. The fourth image of Fig. 17(c) is the FCN8s segmentation result; the fourth image of Fig. 17(d) is the BisNet segmentation result; the fourth image of Fig. 17(e) is the PSPNet-101 segmentation result; and the fourth image of Fig. 17(f) is the UNet segmentation result. These four methods segment the large warehouse targets incompletely. The fourth image of Fig. 17(g) is the ResAt-UNet segmentation result. The fourth image of Fig. 17(h) is the ResAt-UNet(2SEAE) segmentation result, which is more accurate.
The fifth image in Fig. 17(a) is the remote sensing image of dense residential areas. The fifth image of Fig. 17(b) is the MLP segmentation result, and its segmentation is relatively rough. The fifth image of Fig. 17(c) is the FCN8s segmentation result; the fifth image of Fig. 17(d) is the BisNet segmentation result; the fifth image of Fig. 17(e) is the PSPNet-101 segmentation result; the fifth image of Fig. 17(f) is the UNet segmentation result; and the fifth image of Fig. 17(g) is the ResAt-UNet segmentation result. The above methods segment most of the residential buildings and miss out on very small buildings. The fifth image of Fig. 17(h) is the ResAt-UNet(2SEAE) segmentation result, which is accurate and complete.
The sixth image in Fig. 17(a) is the remote sensing image of commercial areas. The sixth image in Fig. 17(b) is the MLP segmentation result, which is not ideal. The sixth image in Fig. 17(c) is the FCN8s segmentation result. The sixth image in Fig. 17(d) is the BisNet segmentation result, which misses individual buildings. The sixth image in Fig. 17(e) is the PSPNet-101 segmentation result; the sixth image in Fig. 17(f) is the UNet segmentation result; the sixth image in Fig. 17(g) is the ResAt-UNet segmentation result; and the sixth image in Fig. 17(h) is the ResAt-UNet(2SEAE) segmentation result. These four methods segment all targets.
The seventh image in Fig. 17(a) is the remote sensing image of industrial areas. The seventh image in Fig. 17(b) is the MLP segmentation result; the seventh image in Fig. 17(c) is the FCN8s segmentation result; the seventh image in Fig. 17(d) is the BisNet segmentation result; and the seventh image in Fig. 17(f) is the UNet segmentation result. The above methods are not ideal for the detection and extraction of large targets. The seventh image in Fig. 17(g) is the ResAt-UNet segmentation result, and the seventh image in Fig. 17(h) is the ResAt-UNet(2SEAE) segmentation result. These two methods can detect two large targets.
The eighth image in Fig. 17(a) is the remote sensing image of industrial areas. The eighth image of Fig. 17(b) is the MLP segmentation result, and its segmentation effect is poor. The eighth image of Fig. 17(c) is the FCN8s segmentation result; the eighth image of Fig. 17(d) is the BisNet segmentation result; the eighth image of Fig. 17(e) is the PSPNet-101 segmentation result; and the eighth image of Fig. 17(f) is the UNet segmentation result. The above methods have a poor extraction effect on large buildings. The eighth image of Fig. 17(g) is the ResAt-UNet segmentation result and the eighth image in Fig. 17(h) is the segmentation result of ResAt-UNet (2SEAE). These two methods can segment large buildings and small targets.

4) Generalization Experiment: To verify the robustness and generalization ability of ResAt-UNet(2SEAE), we use the model to carry out multilabel segmentation on the WHDLD dataset [45]. The WHDLD dataset is an open-source dataset for remote sensing image segmentation, published by Wuhan University. Each image is 256 × 256 × 3, and the pixels are labeled with six classes: bare soil, buildings, pavement, roads, vehicles, and water. The dataset contains 4940 images in total. We randomly divided the data into training and validation sets at a ratio of 0.8:0.2 for each category, so that 3952 images were used for training and 988 were used for testing. The images and labels of the WHDLD dataset are shown in Fig. 18. We use four widely used evaluation indexes, MPA, MP, MIOU, and MRecall, which are computed from the following quantities: TP stands for true positives, FP stands for false positives, TN stands for true negatives, FN stands for false negatives, and k stands for the number of categories. As shown in Table IX, ResAt-UNet(2SEAE) misclassifies less, although DFA-Net achieves a better MPA value. Overall, ResAt-UNet(2SEAE) has better segmentation results. It achieves finer segmentation by focusing on different dimensions and enhancing the semantic representation of features between different categories, which demonstrates the robustness of the model. ResAt-UNet(2SEAE) not only enhances the feature extraction ability through residuals and attention in the encoder but also strengthens feature enhancement through SAM and CAM in the feature fusion stage, which allows it to achieve the highest score on both datasets.
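The defining equations of the four indexes are not reproduced in this text. As a hedged illustration only, the sketch below uses the common one-vs-rest definitions per class, averaged over the k categories: accuracy (TP + TN) / (TP + TN + FP + FN), precision TP / (TP + FP), IoU TP / (TP + FP + FN), and recall TP / (TP + FN). These definitions are an assumption and may differ from the ones the authors used; the function names and the toy labels are hypothetical.

```python
def confusion_matrix(truth, pred, k):
    """cm[t][p] counts pixels with true class t that were predicted as class p."""
    cm = [[0] * k for _ in range(k)]
    for t, p in zip(truth, pred):
        cm[t][p] += 1
    return cm


def _ratio(num, den):
    # Assumption: a class with an empty denominator contributes 0.0.
    return num / den if den else 0.0


def mean_metrics(cm):
    """Return one-vs-rest MPA, MP, MIOU, and MRecall averaged over the k classes."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    acc = prec = iou = rec = 0.0
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # class c pixels predicted as other classes
        fp = sum(cm[r][c] for r in range(k)) - tp   # other pixels predicted as class c
        tn = total - tp - fn - fp
        acc += _ratio(tp + tn, total)
        prec += _ratio(tp, tp + fp)
        iou += _ratio(tp, tp + fp + fn)
        rec += _ratio(tp, tp + fn)
    return {"MPA": acc / k, "MP": prec / k, "MIOU": iou / k, "MRecall": rec / k}


# Toy flattened label maps with k = 3 classes (hypothetical data).
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
scores = mean_metrics(confusion_matrix(truth, pred, k=3))
```

For real label maps, the flattened pixel labels of all test images would be accumulated into a single confusion matrix before averaging, so that classes absent from one image still contribute to the mean.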

IV. SUMMARY
Research on remote sensing images of urban areas is becoming increasingly important, and efficient and accurate building image segmentation algorithms have gradually attracted attention. The combination of deep learning and remote sensing image segmentation has become an inevitable trend. This article introduces a new segmentation algorithm based on the UNet network: ResAt-UNet(2SEAE). UNet is selected as the basic framework because the amount of urban remote sensing image data used in this article is small and UNet requires few training samples, which meets the task requirements. In addition, because the down-sampling of UNet tends to lose context and detail information, an attention mechanism and a residual module are added. In the encoding stage, the residual module of the ResNet34 network is used for down-sampling. Meanwhile, SEAE is added to the intermediate connection layer, which makes full use of the correlation of intermediate features and improves the segmentation accuracy of the network [43].