ReA-Net: A Multiscale Region Attention Network With Neighborhood Consistency Supervision for Building Extraction From Remote Sensing Image

To address the low accuracy of building segmentation caused by poor regional continuity and blurred boundaries in remote sensing images, a remote sensing building semantic segmentation algorithm based on multiscale regional consistent attention supervision is proposed. First, based on the Unet encoder–decoder architecture, the proposed algorithm constructs the region attention network (ReA-Net), which employs a multiscale receptive-field guidance model to simultaneously focus on the regional features and edge details of remote sensing image objects. Second, the self-attention mechanism is employed to establish a correlation representation of region-level features of remote sensing images, and multiscale regional attention features are obtained through weighted region-level correlation mapping. Finally, to address the lack of spatial correlation constraints on the prediction of remote sensing image segmentation, a loss function with multiscale neighborhood consistency supervision is proposed to constrain the consistency of pixel label assignment within a local region. Experimental results on the WHU building dataset show that intersection over union (IoU) reached 91.6%, precision 95.61%, recall 95.68%, and F1-score 95.64%; on the Massachusetts building dataset, IoU reached 74.77%, precision 83.93%, recall 87.53%, and F1-score 85.69%. Therefore, the proposed algorithm not only achieves a good segmentation effect but is also strongly robust for remote sensing building image segmentation.

The extraction of buildings from remote sensing images is of great significance for monitoring the change of urban areas, urban planning, and population estimation. The semantic segmentation of high-resolution remote sensing buildings is a major part of remote sensing earth observation technology. Its main task is to use the collected remote sensing images to extract the relevant characteristic information of buildings and to classify the target represented by each pixel [1], thereby extracting the buildings from the images. However, compared with natural objects in remote sensing images, such as water bodies and forests, buildings are often affected by severe disturbances, such as illumination, season, unclear angles and boundaries, and complex background information. These disturbances pose great challenges to the accurate segmentation of buildings in remote sensing images.
With the development of computer vision technology, more and more researchers have undertaken extensive studies on the semantic segmentation of high-resolution remote sensing buildings [2], [3], [4], [5], [6] and proposed numerous remote sensing image semantic segmentation methods. These methods fall mainly into two categories: traditional machine-learning-based segmentation and deep-learning-based segmentation.
Segmentation methods based on traditional machine learning mainly utilize handcrafted features, such as shape, texture, color, spectrum, and spatial details, to train a classifier. These methods mainly include local binary patterns (LBPs) [7], random forests (RFs) [8], K-nearest neighbors (KNN) [9], the Gaussian maximum likelihood classifier [10], logistic regression [11], and Markov random fields [12]. LBPs [7] are a local texture feature extraction algorithm; in remote sensing image segmentation, the relationship between a single pixel and its neighboring pixels is used to describe the local texture structure. However, the dimension of the local features after coding is too large, so it is difficult to use LBPs to express the texture features of remote sensing images in practical applications. RF [8] is an ensemble learning model composed of multiple decision trees as base classifiers. When a remote sensing image to be segmented is input, the final classification result of each pixel is determined by the votes of the decision trees. KNN [9] is a supervised classification method that uses a distance metric to compute the distance between unknown and known samples, determines the similarity of pixel semantic features from these distances, and assigns the class of an unknown sample according to its K nearest neighbors. Markov random fields [12] combine graph theory with Bayesian probability theory and use the spatial features of neighboring pixels as prior knowledge to establish a semantic segmentation model for remote sensing buildings.
Although this method can effectively improve the continuity of the segmented regions, the traditional Markov random field struggles to describe the complex statistical features of remote sensing images, and its inference remains an NP-hard problem. Although segmentation methods based on traditional machine learning have achieved certain results, their effectiveness strongly depends on an accurate description of the prior knowledge of the specific scene. This often leads to poor generalization ability and complex design, making it challenging to solve the semantic segmentation of remote sensing images in real, complex environments.
In recent years, deep learning has achieved satisfactory results in remote sensing image semantic segmentation owing to its strong generalization ability and self-learning characteristics [13], [14], [15], [16], [17], [18]. Deep-learning-based semantic segmentation networks primarily utilize an encoder-decoder structure to extract the semantic features of remote sensing images and restore the resolution [19], so as to segment the target objects. Shao et al. [20] suggested a remote sensing building semantic segmentation network based on multiscale feature fusion and a residual refinement structure, in which the fusion of deep semantic features and the refinement of segmentation results are realized by introducing an atrous convolution module with distinct receptive fields and a residual refinement module. Chen et al. [21] introduced a dense Xception module based on DeepLabV3+ and utilized a multiscale fusion approach to fuse shallow and deep features, which effectively enhanced the network's ability to extract contextual information. Zeng et al. [22] employed PCA to reduce the dimensionality of the raw input remote sensing data to address problems such as missing edge information, and performed segmentation and contour extraction of buildings based on multitask learning, thus improving the segmentation of building contours. To enhance the ability of deep learning networks to characterize the targets to be extracted from different remote sensing images, adding an attention mechanism to the encoder-decoder module is an effective strategy [23], as it enriches the network's ability to capture long-range dependencies with contextual information. Yang et al.
[24] proposed a semantic segmentation method for high-resolution remote sensing images based on an attention fusion network, which obtains satisfactory segmentation results by fusing the high-level and low-level semantic features of the encoder through a multipath attention fusion block and uses the decoder to recover the resolution of the fused features. Deng et al. [25] designed an attention-gate structure on top of the encoder-decoder architecture: it builds on a spatial pyramid pooling structure of atrous convolutions, uses atrous convolution to further expand the receptive field while fusing the multilayer features of the decoder, and enhances the pixel-level feature expression of the relevant regions through the attention gate, thus improving the semantic segmentation of remote sensing building images. Li et al. [26] proposed a multikernel attention mechanism to extract global contextual information while utilizing a hierarchical aggregation strategy to fuse contextual information from different levels, which effectively exploits the feature information at each stage and enhances the network's feature representation. To solve the problem of unclear image-level global semantics in remote sensing images, Ding et al. [27] proposed a local attention module that uses the clear semantics of local image patches to achieve region-level attention weighting, which effectively enhanced the network's ability to express region-level semantics.
In summary, the spatial attention mechanism has enabled impressive advances in deep-learning-based semantic segmentation of remote sensing buildings. However, segmentation methods based on the classical spatial attention mechanism focus primarily on the pixel-level spatial correlation of images. Although this mechanism enhances strongly correlated features and suppresses weakly correlated ones, remote sensing building images have a large overall resolution while the areas to be segmented are small. Directly applying the classical spatial attention mechanism to the semantic segmentation of remote sensing building images therefore leads to the following two problems: 1) the classical spatial attention mechanism aggregates the global statistical information of the image, which frequently leads to an exponential increase in the computational complexity of the network for high-resolution remote sensing images, resulting in excessive consumption of computational resources; and 2) a pixel-level attention mechanism struggles to accurately capture the region-level spatial correlation of remote sensing images at large scales and therefore lacks learning and supervision of building regions and the coherence of building edges during training and segmentation, so that interference such as illumination, season, angle, and background strongly affects the accurate segmentation of buildings.
To solve the abovementioned problems, a multiscale region attention (MRA) network with neighborhood consistency supervision for remote sensing buildings is proposed. The proposed network constructs a multiscale regional attention module based on the Unet encoder-decoder structure and employs the multiscale neighborhood information of remote sensing images to build multiscale region-level feature descriptors. The self-attention (SA) mechanism is employed to capture the intraregional correlation and the correlation between different local regions in remote sensing images, enhancing the ability of semantic features to attend to the correlated features of target regions at different scales. Based on the assumption of spatial consistency of pixel label assignment in local regions, a loss function for multiscale neighborhood consistency supervision is proposed. The main contributions of this article are as follows.
1) To solve the problem that a pixel-level attention mechanism struggles to effectively characterize the global semantic features of salient targets in remote sensing images, a multiscale regional attention module is devised to intensify the regional focusing capability of semantic features on building regions.

2) Based on the spatial consistency assumption that pixels in local regions of remote sensing images tend to take the same labels, a loss function for multiscale neighborhood consistency supervision is proposed, and a neighborhood consistency supervision mechanism is established to constrain the regional consistency of pixel segmentation labels, which enhances the robustness of the network to noise, texture mutation, and other disturbing information.

3) Fusing MRA and the neighborhood consistency supervision mechanism, ReA-Net, based on multiscale region consistency attention supervision, is proposed, and a comparative study on two remote sensing building datasets validates the effectiveness of the proposed method.

The rest of this article is organized as follows. Section II presents the details of the proposed building extraction network architecture, including the feature extraction module, the resolution recovery module, the MRA module, and the proposed network loss function. Experimental results are reported and analyzed in Section III. Finally, Section IV concludes this article.

II. PROPOSED METHODS
Many remote sensing building extraction methods based on deep convolutional neural networks have been presented since the emergence and development of convolutional neural networks. Semantic segmentation algorithms based on the Unet encoder-decoder structure frequently exploit residual and dense connectivity and have achieved numerous research results in the semantic segmentation of high-resolution remote sensing buildings [28], [29], [30]. To enhance the feature representation capability of deep convolutional neural networks for remote sensing image segmentation, researchers commonly adopt an attention mechanism in the feature extraction stage to obtain more information about the target and to suppress background, noise, and other interference features.
However, most of these methods are based on pixel-level attention mechanisms, which cannot sufficiently express the spatial correlation between local regions of remote sensing images or the spatial coherence of adjacent pixels within a region, often leading to boundary discontinuities and missegmented intraregional patches in the segmentation results. To address these challenges, we propose a multiscale attention network with neighborhood consistency supervision for remote sensing buildings, based on the Unet encoder-decoder structure. The overall structure of the ReA-Net network is shown in Fig. 1. An MRA module is constructed to enhance the regional concentration ability of semantic features on building regions. Then, a neighborhood consistency supervision module is proposed, which effectively enhances the smoothness of the labels the network assigns to pixels in local areas.

A. Encoder-Decoder Segmentation Network Based on ReA-Net
The proposed network is mainly composed of the following three parts: the remote sensing image feature extraction module based on the Unet encoder, the resolution recovery module based on the decoder, and the MRA module. The feature extraction module extracts the texture, boundary, and deep semantic features of buildings in the remote sensing image and feeds the extracted features into the resolution recovery module. To enhance the network's ability to represent the correlation between regions of remote sensing images, an MRA module is introduced into the resolution recovery module: region-level features at different scales are constructed by atrous convolution and pooling operations and input into the SA module to obtain a region-level correlation enhancement map of the feature map. Finally, the local weighting (LW) module fuses the regional correlation enhancement map with the input features to represent the regional correlation of remote sensing images. Each module is detailed as follows.
1) Remote Sensing Image Feature Extraction Module: The remote sensing image feature extraction module consists of two stages. In the first stage, a convolution block (Conv1) extracts the low-level texture features of remote sensing images. Conv1 contains two convolution layers with a kernel size of 3 × 3 and a padding of 1, each followed by a batch normalization (BN) layer and a leaky rectified linear unit (LeakyReLU). The second stage consists of four residual blocks (RESConv1-RESConv4), each of which contains a max pooling layer with a kernel size of 2 × 2 for down-sampling and two convolution layers with a 3 × 3 kernel and a padding of 1, each followed by a BN layer and a LeakyReLU, which increases the network's ability to extract deeper semantic features from remote sensing images. The parameters of the proposed encoder are shown in Table I, where k denotes the kernel size; IC and OC the numbers of input and output channels; Pool the details of the pooling layer; and Output the size of the outputs.
2) Resolution Recovery Module Based on the Decoder: The decoder-based resolution recovery module focuses on recovering the resolution of the extracted feature maps using a progressive up-sampling strategy and realizes the preclassification of dense pixels. The structure consists of four up-sampling blocks (Upsample1-Upsample4), four multiscale region attention modules (MRA1-MRA4), and one convolution block (Outconv), divided into five stages. From stage 1 to stage 4, each stage consists of an up-sampling block and an MRA module. The up-sampling block completes the progressive spatial resolution recovery of the feature map; to eliminate the checkerboard artifacts caused by deconvolution [31], each up-sampling block contains a bilinear up-sampling layer and two convolutional layers with a kernel size of 3 × 3 and a padding of 1, each followed by a BN layer and a LeakyReLU. Meanwhile, in the feature fusion phase of up-sampling, the MRA module fuses distinct multiscale region-level correlation enhancement maps, which improves the network's ability to reflect larger-scale spatial correlations between regions. The MRA block is mainly composed of average pooling layers and convolutional layers; the multiscale regions are extracted by atrous convolution layers with a kernel size of 3 × 3 and atrous rates of [1, 3, 5, 7]. The fifth stage is a convolutional block that classifies the output feature map. Table II shows the parameters of the proposed decoder structure, where k denotes the kernel size; IC and OC the numbers of input and output channels; Scale the scale factor of the up-sampling layer; and Output the size of the outputs.
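As a quick check on the decoder's atrous rates, the standard effective-kernel-size formula k_eff = k + (k − 1)(d − 1) (general dilation arithmetic, not specific to this paper) shows how the four rates enlarge the 3 × 3 kernel's footprint without adding parameters:

```python
def effective_kernel(k: int, d: int) -> int:
    """Effective receptive field of a k x k convolution with dilation rate d:
    k_eff = k + (k - 1) * (d - 1)."""
    return k + (k - 1) * (d - 1)

# The four atrous rates used in the MRA blocks (kernel size 3):
footprints = [effective_kernel(3, d) for d in (1, 3, 5, 7)]  # [3, 7, 11, 15]
```

So the rates [1, 3, 5, 7] yield effective footprints of 3, 7, 11, and 15 pixels, and a padding equal to d keeps the spatial size of the output unchanged.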
3) MRA Module: The attention mechanism can effectively use global statistics to enhance the salient features of the target to be segmented in remote sensing images while suppressing nonsalient features, such as noise and background, without reducing the spatial resolution of the image [32], [33]. Therefore, numerous remote sensing image semantic segmentation methods based on pixel-level spatial attention mechanisms have been proposed. However, high-resolution remote sensing images are characterized by a large resolution and relatively small targets to be segmented, which often makes the image-level global semantic information extracted by the network unclear, whereas local image patches of high-resolution remote sensing images have clear semantic references [27]. For such segmentation problems, a pixel-level attention mechanism often struggles to effectively describe the global semantic properties of salient objects. Additionally, since it focuses solely on short-range spatial correlation at the pixel level, it is challenging to establish robust high-order spatial correlation among the pixels of local regions. Meanwhile, in natural remote sensing scenes, severe interference factors, such as illumination, season, angle, and unclear building boundaries, often result in uneven segmentation regions and unclear building edges. To alleviate these issues, we propose an MRA module, whose overall structure is illustrated in Fig. 2.
Let F_in ∈ R^(C×H_f×W_f) denote the input feature map of the MRA module and F_out ∈ R^(C×H_f×W_f) its output, where H_f, W_f, and C are the height, width, and number of channels of the input feature map, respectively. The proposed MRA is mainly composed of the multiscale neighborhood extraction module, the region embedding (RE) module, the SA module, and the LW module. The MRA process is described as follows. First, to obtain the multiscale neighborhood feature A ∈ R^(4C×H_f×W_f), the multiscale neighborhood information of the input building feature map F_in is extracted by atrous convolutions with different atrous rates, and the extracted neighborhood features are concatenated:

A = Concat( LeakyReLU( BN( Conv_{d,k,pad}(F_in) ) ) ), d ∈ {1, 3, 5, 7}

where Conv_{d,k,pad}(·) denotes the atrous convolution layer with atrous rate d ∈ {1, 3, 5, 7}, kernel size k = 3, and padding pad = d; BN(·) denotes the batch normalization layer; LeakyReLU(·) the leaky rectified linear unit; and Concat(·) the concatenation operation. Second, the feature map A, which incorporates the multiscale neighborhood information, is input into the RE module. To eliminate the redundancy of the multiscale features and improve the representation of remote sensing image features, A is dimensionally reduced by a convolution layer with a kernel size of one to obtain the feature map B ∈ R^(C×H_f×W_f).
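A minimal NumPy sketch of the multiscale neighborhood extraction step, assuming "same"-size 3 × 3 dilated convolutions with padding equal to the atrous rate (BN is omitted for brevity; this is an illustration, not the authors' implementation):

```python
import numpy as np

def leaky_relu(y, slope=0.01):
    return np.where(y > 0, y, slope * y)

def dilated_conv3x3(x, w, d):
    """Naive 'same'-size 3x3 dilated convolution.
    x: (C_in, H, W); w: (C_out, C_in, 3, 3); dilation rate d, padding d."""
    c_in, H, W = x.shape
    c_out = w.shape[0]
    xp = np.pad(x, ((0, 0), (d, d), (d, d)))
    out = np.zeros((c_out, H, W))
    for i in range(3):
        for j in range(3):
            # shifted view of the padded input at dilated tap (i, j)
            patch = xp[:, i * d : i * d + H, j * d : j * d + W]
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out

def multiscale_neighborhood(x, weights, rates=(1, 3, 5, 7)):
    """Concatenate the per-rate responses along channels -> (4*C_out, H, W)."""
    feats = [leaky_relu(dilated_conv3x3(x, w, d)) for w, d in zip(weights, rates)]
    return np.concatenate(feats, axis=0)
```

With C_out = C per branch, concatenating the four branches yields the 4C-channel map A described above.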
Simultaneously, to gather regional feature information for all regions, the feature map B is subjected to a regional average pooling operation to create a multiscale region-level descriptor C ∈ R^(C×H_p×W_p) of the input remote sensing image:

B = LeakyReLU( BN( Conv_{1,1,0}(A) ) ), C = Avgpool_{4,4}(B)

where Conv_{d,k,pad}(·) denotes the dimension-reduction convolution layer with atrous rate d = 1, kernel size k = 1, and padding pad = 0, with a dimension reduction rate of 0.25; BN(·) denotes the BN layer; Avgpool_{k,s}(·) denotes the average pooling layer with kernel size k = 4 and stride s = 4; and H_p = 0.25H_f, W_p = 0.25W_f. Furthermore, to capture the correlations between distinct feature regions of the remote sensing image, an SA method is applied to characterize the correlations between all region descriptors in the region-level feature map. First, the region-level features are input into the SA module, and the feature map C is encoded and reshaped three times by convolutional layers with a kernel size of one to produce three encoded feature matrices: V ∈ R^(M×C), G ∈ R^(C×M), and I ∈ R^(C×M), where M = H_p × W_p. Then, the spatial attention score matrix Z is obtained by multiplying the two feature matrices V and G and applying the sigmoid(·) activation function. Finally, the matrices I and Z are multiplied and reshaped to obtain the feature enhancement map K ∈ R^(C×H_p×W_p) with inter-regional correlation:

Z = Sigmoid(V ⊗ G), K = Reshape(I ⊗ Z)

where ⊗ denotes matrix multiplication. Then, the inter-regional correlation feature enhancement map K ∈ R^(C×H_p×W_p) is input into the LW module, and the region-level correlation weighting map is projected onto the original feature map F_in. After up-sampling the enhancement map K, the region relevance weighting map H is obtained.
Finally, the region attention feature map Q ∈ R^(C×H_f×W_f) is constructed by multiplying H by F_in pixel by pixel:

H = Upsample_{4,nearest}(K), Q = H ⊙ F_in

where Upsample_{scale,mode}(·) is the up-sampling layer with scale factor scale = 4 and the up-sampling mode set to nearest-neighbor sampling, and ⊙ denotes pixel-wise multiplication. Finally, to incorporate the global context semantic information and the spatial correlation of local regions, we fuse the MRA feature map Q and the global attention feature map T ∈ R^(C×H_f×W_f) from the preceding steps by pixel-wise addition. The fused result is then passed through the Sigmoid(·) activation function and multiplied pixel by pixel with the original feature F_in to construct the multiscale regional attention output of the MRA:

F_out = Sigmoid(Q ⊕ T) ⊙ F_in

where ⊕ denotes pixel-wise addition. To improve the smoothness of the segmentation results, a multiscale neighborhood consistency supervision loss function is proposed based on the assumption that adjacent pixels within a local region tend to be assigned the same label. This local-region consistency constraint results in better smoothing within the segmented local regions. The purpose of the neighborhood consistency supervision loss is to quantitatively evaluate the error between the ReA-Net segmentation results and the true labels of remote sensing images. The structure of the proposed neighborhood consistency loss is shown in Fig. 3.
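The region-level path described above (regional average pooling, sigmoid-scored self-attention over region descriptors, nearest-neighbor up-sampling, and pixel-wise weighting) can be sketched as follows; the 1 × 1 encoding convolutions are replaced by identity maps for brevity, so this is only an illustrative approximation, not the authors' exact implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def region_attention(f_in, t, p=4):
    """Sketch of the MRA region path: pool -> SA -> upsample -> weight -> fuse.
    f_in, t: (C, H, W) input and global-attention feature maps; p: pool size."""
    C, H, W = f_in.shape
    Hp, Wp = H // p, W // p
    # regional average pooling: one descriptor per p x p region
    c_map = f_in.reshape(C, Hp, p, Wp, p).mean(axis=(2, 4))   # (C, Hp, Wp)
    M = Hp * Wp
    V = c_map.reshape(C, M).T                                  # (M, C)
    G = c_map.reshape(C, M)                                    # (C, M)
    I = c_map.reshape(C, M)                                    # (C, M)
    Z = sigmoid(V @ G)                                         # (M, M) region scores
    K = (I @ Z).reshape(C, Hp, Wp)                             # correlation-enhanced map
    # nearest-neighbour up-sampling by p, then pixel-wise weighting
    h_map = np.repeat(np.repeat(K, p, axis=1), p, axis=2)      # (C, H, W)
    Q = h_map * f_in
    return sigmoid(Q + t) * f_in                               # fuse with global branch T
```

Because the final gate is a sigmoid in (0, 1), the output is a per-pixel attenuation of F_in, which is the local-weighting behavior the LW module is meant to provide.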

B. Neighborhood Consistency Supervision Loss Function
The proposed neighborhood consistency supervision loss function is defined in (9), where y_{i,j} ∈ Y denotes the jth label of the ith remote sensing training image, u_{i,j} ∈ U the predicted probability of the jth label of the ith training image, and w_{i,j} ∈ W the jth neighborhood consistency penalty weight of the ith training image. The weight w_{i,j} is defined as

w_{i,j} = (1 / |∂j|) Σ_{s ∈ ∂j\j} [ (ŷ_{i,s} − ŷ_{i,j})² + (y_{i,s} − ŷ_{i,s})² ] / |s − j|        (10)

where ŷ_{i,j} denotes the jth predicted label value of the ith predicted label map ŷ_i; ŷ_{i,s} the sth predicted label value of the ith predicted label map; ∂j\j the set of neighborhood lattice points of the jth node, excluding node j itself; and |∂j| the cardinality of the neighborhood of node j; y_{i,j} is the training label corresponding one-to-one to ŷ_{i,j}. The denominator term |s − j| denotes the Euclidean distance between the spatial locations of the jth label and its neighbor label: the farther the neighbor is from the center label, the smaller the weight w_{i,j}. The numerator term (ŷ_{i,s} − ŷ_{i,j})² + (y_{i,s} − ŷ_{i,s})² sums the error between the center prediction ŷ_{i,j} in the local region and its neighborhood prediction ŷ_{i,s}, and the error between ŷ_{i,s} and the neighborhood training label y_{i,s}: the larger the error, the larger the weight w_{i,j}.
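One possible reading of the penalty weight described above — averaging, over the neighbors of each pixel, the squared prediction-consistency and prediction-label errors scaled by the inverse Euclidean distance — can be sketched as follows (the neighborhood size `ns` and the exact normalization are assumptions):

```python
import numpy as np

def neighborhood_weight(pred, label, ns=3):
    """Per-pixel consistency weight: for each pixel j, average over neighbours
    s within an ns-neighbourhood (excluding j itself) of
    [(pred_s - pred_j)^2 + (label_s - pred_s)^2] / ||s - j||."""
    H, W = pred.shape
    w = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            acc, cnt = 0.0, 0
            for dr in range(-ns, ns + 1):
                for dc in range(-ns, ns + 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) == (0, 0) or not (0 <= rr < H and 0 <= cc < W):
                        continue
                    dist = (dr * dr + dc * dc) ** 0.5
                    acc += ((pred[rr, cc] - pred[r, c]) ** 2
                            + (label[rr, cc] - pred[rr, cc]) ** 2) / dist
                    cnt += 1
            w[r, c] = acc / cnt if cnt else 0.0
    return w
```

A perfectly consistent and correct prediction gets zero weight everywhere; an isolated wrong pixel raises the weight at and around the inconsistency, which is the behavior the loss is designed to penalize.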
Substituting (10) into (9), the proposed neighborhood consistency loss function can be written as

L(y, u, ŷ) = L_Smooth(y, ŷ) + L_Data(y, u)

where L_Smooth(y, ŷ) represents the consistency constraint term of the pixel-label assignment process in local regions of the remote sensing image x, which makes the predicted labels in a local region tend to be assigned the same label, i.e., it constrains the segmentation result to be smoother. L_Data(y, u) represents the likelihood penalty term between the training label map y and the predicted label probability map u, which constrains each pixel of the remote sensing image to be assigned a label according to the maximum likelihood probability.

III. EXPERIMENTAL RESULTS AND ANALYSIS

A. Experimental Settings
The experimental workstation is configured with an Intel Xeon E5-2650 processor, 376 GB of memory, and four NVIDIA 2080Ti 11-GB graphics cards. The deep learning framework uses PyTorch 1.8, NVIDIA's CUDA 11.2 GPU runtime platform, and the cuDNN 8.0 deep learning GPU acceleration package.
During the ReA-Net training stage, the size of the input training remote sensing images is set to 512 × 512 or 256 × 256. To augment the remote sensing image set, random horizontal flipping and normalization strategies are employed. The training batch size is set to 12, the total number of training epochs to 151, and the initial learning rate to 1 × 10^-3; the AdamW optimizer [47] is utilized for optimization. A cosine learning rate adjustment approach is employed during training, with a 20-epoch adjustment cycle and a minimum learning rate of 1 × 10^-6.
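Assuming the standard cosine annealing form with restarts every 20 epochs (the restart behavior is an assumption; the paper states only the cycle length and the two learning rate bounds), the schedule described above can be sketched as:

```python
import math

def cosine_lr(epoch, lr_max=1e-3, lr_min=1e-6, cycle=20):
    """Cosine learning rate schedule with a fixed cycle:
    lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)),
    where t is the position within the current cycle."""
    t = epoch % cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle))
```

The rate starts at 1 × 10^-3, decays toward 1 × 10^-6 over each 20-epoch cycle, and resets at each cycle boundary.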

B. Experimental Dataset and Evaluation Metrics
The experiments employ two remote sensing building image datasets to evaluate the effectiveness of the proposed algorithm: the aerial imagery dataset [34], which is a subdataset of the WHU building dataset, and the Massachusetts building dataset [35]. The aerial imagery dataset consists of aerial and satellite imagery of Christchurch, New Zealand, comprising approximately 22 000 individual buildings with an original ground resolution of 0.075 m. After cropping, it contains a total of 8187 remote sensing images at 512 × 512 resolution, including 4735 training images, 1036 evaluation images, and 2416 test images. The Massachusetts building dataset consists of 151 aerial images of the Boston area, each of 1500 × 1500 pixels, divided into 137 training images, four evaluation images, and ten test images. To facilitate training and evaluation, each original image is divided into 36 subgraphs at 256 × 256 resolution using an edge-overlap method. After this cropping, 5436 training images, 144 evaluation images, and 360 test images are obtained.
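A hypothetical sketch of the edge-overlap tiling: placing six evenly spaced 256-pixel tiles per axis across a 1500-pixel image (6 × 6 = 36 subgraphs) forces a small overlap between adjacent tiles. The exact overlap used in the paper is not stated, so the even spacing here is an assumption:

```python
def tile_origins(size=1500, tile=256, n=6):
    """Hypothetical edge-overlap tiling: n tile origins per axis, evenly
    spaced so the last tile ends exactly at the image border."""
    step = (size - tile) / (n - 1)   # < tile, so adjacent tiles overlap
    return [round(i * step) for i in range(n)]
```

With these defaults the origins are [0, 249, 498, 746, 995, 1244], i.e., adjacent tiles overlap by a few pixels and the final tile ends exactly at pixel 1500.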
To verify the efficiency of the proposed algorithm, evaluation metrics such as precision [36], recall [36], intersection over union (IoU) [36], and F1-Score [37] are used for a comparative study of the method's effectiveness, defined as follows:

Precision = TP / (TP + FP)    (12)
Recall = TP / (TP + FN)    (13)
IoU = TP / (TP + FP + FN)    (14)
F1-Score = (2 × Precision × Recall) / (Precision + Recall)    (15)

where TP is the total number of pixels correctly segmented as the building category; TN the total number of pixels correctly segmented as the background category; FP the total number of pixels incorrectly segmented as the building category; and FN the total number of pixels incorrectly segmented as the background category.
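The four metrics follow directly from the pixel counts defined above:

```python
def segmentation_metrics(tp, fp, fn):
    """Standard pixel-level metrics from the confusion counts above
    (TN is not needed for these four)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou = tp / (tp + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, iou, f1
```

For example, 8 true-positive, 2 false-positive, and 2 false-negative pixels give precision = recall = F1 = 0.8 and IoU = 8/12 ≈ 0.667.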
To evaluate the complexity of the proposed model, multiply-accumulate operations (MACs) [38] and the size of the parameters (Params) are utilized as metrics:

MACs = H_in × W_in × C_out × (C_in × K × K + bias)    (16)

where H_in and W_in are the height and width of the input feature map, respectively; C_in and C_out are the numbers of channels of the input and output feature maps, respectively; K is the height and width of the kernel; and bias is the bias term of the kernel (1 if present, 0 otherwise).
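Under one common counting convention — each output position performs C_in × K × K multiply-accumulates per output channel, plus one for the bias — the per-layer costs can be computed as follows (MAC counting conventions vary between tools, so this form is an assumption):

```python
def conv_macs_params(h, w, c_in, c_out, k, bias=True):
    """Per-layer MACs and parameter count for a 'same'-size convolution:
    each of the h*w output positions computes c_out channels, each costing
    c_in*k*k multiply-accumulates (plus one for the bias)."""
    b = 1 if bias else 0
    macs = h * w * c_out * (c_in * k * k + b)
    params = c_out * (c_in * k * k + b)
    return macs, params
```

For example, a bias-free 3 × 3 convolution with one input and one output channel on a 1 × 1 map costs 9 MACs and 9 parameters.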

C. Ablation Study
An ablation study is performed on the aerial imagery dataset to evaluate the effectiveness of each module of the proposed ReA-Net. The precision, recall, IoU, F1-Score, MACs, and Params metrics of several variant networks are compared. All variant models use the same parameter settings and training strategies, and the semantic segmentation network from the literature [39] is used as the baseline network. Table III shows the configuration and description of each network variant, and Table IV shows the comparison results of the ablation experiments (the input image resolution is set to 512 × 512).
The following can be seen from the analysis of Table IV.

1) Compared with the Baseline network, adding the regional attention modules improves performance significantly, while the computational complexity and model size increase only slightly. Baseline + RA1 improved over the Baseline by 2.35% (IoU), 0.65% (precision), 1.48% (recall), and 1.33% (F1-Score); Baseline + RA1 + RA2 by 2.53%, 0.73%, 1.68%, and 1.43%, respectively; Baseline + RA1 + RA2 + RA3 by 2.70%, 0.77%, 1.69%, and 1.46%, respectively; and Baseline + RA1 + RA2 + RA3 + RA4 by 2.76%, 0.83%, 2.02%, and 1.66%. These results indicate that the region attention module enhances the attention to the relevance and consistency of local regions and improves the network's ability to focus on the region and boundary information of the target buildings (see the visualization experiments in Section III-D for details).

2) Compared with Baseline + RA1 + RA2 + RA3 + RA4, Baseline + RA1 + RA2 + RA3 + RA4 + MS introduces the multiscale high-order neighborhood extraction and fusion strategy; its IoU, precision, and F1-Score improve by 0.37%, 0.34%, and 0.1%, respectively. These improvements show that the multiscale neighborhood extraction and fusion strategy effectively enhances the ability to attend to the relevance and consistency of high-order neighborhoods and thus to learn the regional feature information of the target buildings. At the same time, the computational complexity and the parameter size increase slightly, by 9.2% and 9.1%, respectively.
However, considering the gains in IoU, precision, F1-score, and other accuracy metrics, the proposed network still holds an advantage overall. 3) Comparing Baseline + RA1 + RA2 + RA3 + RA4 + MS with Baseline + RA1 + RA2 + RA3 + RA4 + MS + RL: the latter adds a region consistency constraint among the pixels of remote sensing images by introducing the proposed neighborhood consistency loss function, which enhances the network's ability to focus on building regions and boundaries (see Section III-D for details of the visualization experiments). Thus, the latter produces more accurate segmentation at building boundaries; its IoU, precision, and F1-score improve by 0.37%, 0.34%, and 0.1%, respectively. Fig. 4 shows the variation of F1-score and IoU at different neighborhood sizes (NS). Without the neighborhood consistency supervision loss (NS = 0), the network reaches only 95.53% F1-score and 91.4% IoU. After introducing the neighborhood consistency supervision loss with NS = 2, the F1-score and IoU improve to 95.58% and 91.5%, respectively. The best performance is achieved at NS = 3, with 95.64% and 91.62%, respectively. The ablation experiments in Fig. 4 show that increasing NS beyond 3 does not significantly improve the F1-score or IoU, while it implies a significant increase in computational complexity. Therefore, after comprehensive consideration, NS = 3 is chosen in this article for the subsequent experiments.
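The NS ablation above can be made concrete with a toy version of neighborhood consistency supervision. The NumPy sketch below is only an illustration under our own simplifying assumptions (a box-window mean and a squared-deviation penalty); the paper's actual loss formulation may differ:

```python
import numpy as np

def box_mean(x, ns):
    """Mean of each (2*ns+1) x (2*ns+1) window, edge-padded.

    A simple O(H*W*ns^2) sketch; ns plays the role of NS in the ablation.
    """
    h, w = x.shape
    pad = np.pad(x, ns, mode="edge")
    out = np.zeros((h, w), dtype=float)
    k = 2 * ns + 1
    for i in range(h):
        for j in range(w):
            out[i, j] = pad[i:i + k, j:j + k].mean()
    return out

def neighborhood_consistency_loss(prob, ns=3):
    """Penalize pixels whose predicted building probability deviates
    from the mean probability of their local (2*ns+1)-sized neighborhood."""
    local_mean = box_mean(prob, ns)
    return float(np.mean((prob - local_mean) ** 2))
```

A spatially uniform prediction incurs (numerically) zero penalty, while a noisy checkerboard prediction is penalized; this is the kind of local label-assignment consistency the NS term is intended to enforce, and it also shows why the cost grows quadratically with NS.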

D. Visualization Experiments
To further prove the effectiveness of the proposed ReA-Net, the feature maps of the RA1 and RA3 stages of ReA-Net are visualized on the aerial imagery dataset, and the visualization results are shown in Fig. 5. To demonstrate the effectiveness of the proposed region consistency supervision loss function, the penalty weights w from training are visualized and analyzed, and the results are shown in Fig. 6.
The selected test images contain scattered groups of buildings, as shown in the first and third rows of Fig. 5, and large regions of buildings, as shown in the second and fourth rows of Fig. 5. Scattered building clusters tend to suffer from a lack of attention to the regions and edges of small targets at the larger scale of the remote sensing image, so scattered small target regions are frequently undersegmented. For large-region buildings, oversegmentation tends to occur when the color of the background is similar to that of the building regions to be segmented; moreover, because the building regions to be segmented are large, pixel classification within local regions is often inconsistent due to noise, texture, and other interference factors, resulting in regional mis-segmentation and blurred edges in the segmentation results.
As shown in Fig. 5, RA1 in ReA-Net directly processes the deep-level features; since RA1 has the larger regional receptive field, it can extract richer semantic information from remote sensing images. Compared with the original Unet, RA1 in ReA-Net can extract multiscale region-level features of remote sensing images, expand the network's receptive field over the target building regions, and establish the correlation representation between region descriptors in the region-level feature map via the self-attention (SA) mechanism, which effectively captures the intraregional correlation and the correlation between different local regions in remote sensing images. This improves the saliency of RA1 in extracting the features of the building regions to be segmented, and reduces the loss of building-region features caused by background interference in the deep feature map during the resolution recovery stage. It thereby enhances the network's ability to characterize the correlation between different regions of the remote sensing image and to focus on the building regions to be segmented, so that the proposed network not only suppresses the features of the background regions but also effectively enhances the features of the building regions. RA3 in ReA-Net processes shallower features; because RA3 has a smaller local receptive field, it is better at extracting rich spatial detail information from remote sensing images. Compared with the original Unet, RA3 in ReA-Net can focus precisely on the shape and edge features of the building regions to be segmented while suppressing background interference. Therefore, as shown in Fig. 5, the visualization results demonstrate that the proposed region attention mechanism can better capture the region and edge information of the buildings to be segmented across a variety of building distributions.
Fig. 6 depicts a heatmap of the penalty weights w in the training stage, with blue regions representing small weights and red regions representing large weights. As can be observed from the visualization results in Fig. 6, the high-weight activations are predominantly distributed along the edges of the buildings and in the inaccurately segmented regions. The proposed strategy amplifies the loss at building edges and inaccurately segmented regions through dot-product weighting during training, and thus enhances the network's attention to these problem areas. Therefore, the proposed region consistency supervision loss function can effectively improve the edge extraction ability for the buildings to be segmented, while strengthening the network's ability to constrain the regional consistency of remote sensing images. In summary, the proposed region attention mechanism and region consistency supervision loss not only strengthen the network's attention to the regions to be segmented under different typical building distributions, but also further suppress the salience of the background regions of the remote sensing image, so as to effectively extract the regional and edge features of the buildings to be segmented and improve the accuracy of the building segmentation results.
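The region-level self-attention discussed in the visualization analysis (pooling region descriptors and relating them by weighted correlation mapping) can be sketched roughly as follows. The pooling grid, single-head attention, and residual fusion here are our simplifying assumptions, not the exact RA module:

```python
import numpy as np

def region_self_attention(feat, grid=4):
    """Sketch of region-level self-attention (not the authors' exact module).

    Pool an (H, W, C) feature map into a grid x grid set of region
    descriptors, relate them with scaled dot-product self-attention, and
    redistribute the attended descriptors to pixels as a residual.
    """
    h, w, c = feat.shape
    rh, rw = h // grid, w // grid
    # Region descriptors: mean-pool each cell -> (grid*grid, C).
    desc = feat.reshape(grid, rh, grid, rw, c).mean(axis=(1, 3)).reshape(-1, c)
    # Scaled dot-product self-attention over region descriptors.
    scores = desc @ desc.T / np.sqrt(c)
    scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    attended = (attn @ desc).reshape(grid, grid, c)
    # Upsample attended descriptors to pixel resolution; add as residual.
    up = np.repeat(np.repeat(attended, rh, axis=0), rw, axis=1)
    return feat + up
```

Each region descriptor is re-expressed as an attention-weighted mixture of all descriptors, so correlated regions reinforce each other before the features are redistributed to pixels; this is the intuition behind the weighted regional-level correlation mapping described above.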
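The dot-product weighting of the loss visualized in Fig. 6 can be illustrated with a hypothetical weight map; `penalty_weights` below is an assumed stand-in (weights grow with the prediction error, so edges and mis-segmented pixels dominate the loss), not the paper's exact w:

```python
import numpy as np

def weighted_bce(prob, gt, w):
    """Per-pixel binary cross-entropy, element-wise (dot-product) weighted by w."""
    eps = 1e-7
    prob = np.clip(prob, eps, 1 - eps)
    bce = -(gt * np.log(prob) + (1 - gt) * np.log(1 - prob))
    return float(np.mean(w * bce))

def penalty_weights(prob, gt, alpha=1.0):
    """Hypothetical weight map: larger where the prediction disagrees with
    the ground truth, mimicking the high-weight activations along building
    edges and mis-segmented regions seen in the Fig. 6 heatmap."""
    return 1.0 + alpha * np.abs(prob - gt)
```

Because every weight is at least 1, the weighting never discards correctly classified pixels; it only amplifies the penalty where the network is wrong, which matches the described behavior of concentrating supervision on edges and inaccurately segmented regions.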
As illustrated in Fig. 7(a), compared with FCN, PSPNet, and Res-Unet, the proposed ReA-Net not only effectively extracts the spatial detail information of buildings, but also makes use of the spatial consistency constraint in the local regions of remote sensing images. ReA-Net can therefore effectively solve the inaccurate segmentation caused by building adhesion, separating adhering buildings more clearly and producing sharper segmentation edges. For the problem of very similar building and background colors, as shown in Fig. 7(a), (b), (d), (e), and (g), compared with networks such as FCN, PSPNet, and Res-Unet, the proposed region attention mechanism can focus more accurately on the regions of inconspicuous buildings and their contours, thus accurately segmenting buildings whose colors are similar to the background. Because the boundaries of buildings in remote sensing images are often disturbed by illumination, shadows, and complex foreground colors, as shown in Fig. 7(c) and (f), the proposed ReA-Net, compared with the FCN, PSPNet, and Res-Unet networks, can effectively attend to building boundaries under various types of disturbances and obtain better segmentation results; hence, it is more robust to the effects of illumination, shadows, and complex foreground colors. As shown in Fig. 7(h), the proposed ReA-Net adopts the MRA and region consistency supervision mechanisms, so that the network also attends to the regional and boundary information of small targets, which enhances its ability to extract semantic features of buildings at different scales.
Compared with FCN, PSPNet, Res-Unet, HR-Net, and other networks, the proposed ReA-Net can further improve the segmentation results for small target buildings.
In summary, the comparative experimental results on the aerial imagery dataset show that, compared with the comparison networks, the proposed ReA-Net not only obtains higher quality segmentation results for remote sensing building images in complex scenes, but also has strong robustness to challenging interference, such as adhesion, light-shadow interference, complex foreground-background color interference, and small targets.
From the qualitative evaluation, the proposed network not only achieves good segmentation results on remote sensing building images in complex scenes, but also shows satisfactory robustness to the various strong interference factors that often exist in remote sensing images. From the quantitative evaluation, the comparison results of the segmentation experiments on the aerial imagery dataset show that, although the computational complexity and parameter size of the proposed ReA-Net are slightly higher than or equal to those of the comparison algorithms, the proposed algorithm performs better on the more important effectiveness indicators, such as precision, recall, IoU, and F1-score, obtaining the best results in the quantitative evaluation. Therefore, compared with the state-of-the-art networks, the proposed ReA-Net can achieve more accurate segmentation of remote sensing building images in complex scenes.

E. Massachusetts Dataset Experimental Results and Analysis
To further demonstrate the generalizability of the proposed ReA-Net, we also conducted comparison experiments between ReA-Net and state-of-the-art segmentation networks on the Massachusetts dataset. The comparison networks include FCN [40], SegNet [41], DeeplabV3 [42], PSPNet [43], Unet [39], Res-Unet [44], and HR-Net [45]. A comparison of partial segmentation results of the various algorithms is shown in Fig. 8. From left to right in Fig. 8, the first column is the input test remote sensing building image, and the second to eighth columns are the semantic segmentation results of FCN [40], SegNet [41], DeeplabV3 [42], PSPNet [43], Unet [39], Res-Unet [44], and HR-Net [45], respectively. The ninth column is the semantic segmentation result of the proposed ReA-Net, and the ground truth is shown in the last column.
Qualitative analysis. The overall imaging quality of the Massachusetts building dataset is poor. There are mainly the following four cases: 1) buildings that are small and adherent, as shown in Fig. 8(a); 2) inconspicuous building boundaries and colors due to lighting and shadows, as shown in Fig. 8(b); 3) densely distributed small buildings in blurred images, as shown in Fig. 8(c); and 4) relatively similar foreground and background colors, as shown in Fig. 8(d). The segmentation comparison results of the different networks are shown in Fig. 8. The proposed ReA-Net obtained the best segmentation results compared with the comparative networks, such as FCN, PSPNet, and Res-Unet. As shown in the comparison results marked by the red boxes in Fig. 8, the proposed network not only obtains smoother building segmentation regions, but also clearer building contours, which are closer to the ground truth. This is because the proposed MRA mechanism and region consistency supervision strategy better enhance the network's ability to extract semantic features from remote sensing images; by constraining and supervising the building regions and boundaries, they enrich the network's attention to the local regions and contours of buildings, improving the segmentation results of the proposed network.
To sum up, the qualitative comparison shows that the proposed algorithm achieves the best segmentation results on the Massachusetts dataset among the comparison networks, and performs higher quality segmentation for remote sensing images with poor imaging quality. In terms of quantitative metrics, the proposed ReA-Net obtained the best results on both the aerial imagery dataset and the Massachusetts building dataset. Therefore, the comparative experimental results on the two datasets show that the proposed algorithm has satisfactory generalizability.

IV. CONCLUSION

Aimed at the challenge of low building segmentation accuracy caused by poor continuity of remote sensing image regions and blurred boundaries, a semantic segmentation algorithm for remote sensing buildings based on MRA and neighborhood consistency supervision is proposed. The proposed algorithm employs the encoder to extract features, such as the texture, boundary, and deep semantics of buildings in remote sensing images, and restores the resolution of the extracted feature maps through the progressive upsampling strategy of the decoder. The proposed multiscale neighborhood extraction and fusion strategy can effectively enhance the correlation and consistency modeling of higher order neighborhoods, and thus strengthens the network's ability to extract the region and boundary feature information of the target to be segmented. Furthermore, by introducing the region consistency supervision loss, the network's attention to region smoothness is strengthened, and its sensitivity to building boundaries and the accuracy of pixel classification are enhanced. The effectiveness and robustness of the proposed algorithm are demonstrated by quantitative, qualitative, and ablation experiments on two publicly available datasets: the aerial imagery dataset and the Massachusetts dataset.