LFEMAP-Net: Low-Level Feature Enhancement and Multiscale Attention Pyramid Aggregation Network for Building Extraction From High-Resolution Remote Sensing Images

With the rapid development of Earth observation technology and deep learning, building extraction from remotely sensed imagery based on deep convolutional neural networks has attracted wide attention in recent years. However, owing to the heterogeneity of building shapes and sizes and the complexity of surrounding objects, current building extraction methods still struggle with boundary accuracy and extraction completeness. To address these challenges, we propose a low-level feature enhancement and multiscale attention pyramid aggregation network (LFEMAP-Net) that considers building boundary information and multiscale feature expression to obtain more accurate building extraction. First, a low-level feature enhancement model is proposed based on prior edge information to enhance the representation of spatial details, addressing information loss and fuzzy boundaries. Additionally, a multiscale attention pyramid aggregation model is developed in the decoding stage to facilitate the fusion of features from different scales, thereby enhancing the extraction of building features. Experimental results on two publicly available datasets show that LFEMAP-Net can overcome interrupted extraction and boundary blur in complex scenes, achieve boundary optimization and complete segmentation of buildings, and outperform other advanced semantic segmentation models.


I. INTRODUCTION
BUILDINGS, as primary features in high-resolution remote sensing images (HRSI), are closely related to human activities, urban development, and societal functions. Their footprint information plays a fundamental role in comprehending the complex interactions between human endeavors and environmental influences and is an important component in applications including urban planning [1], land use [2], and disaster response [3]. Over the past few decades, extracting accurate building footprint information from HRSI has attracted wide attention and has made great progress in various applications. With the rapid advancement of Earth observation technology, it has become possible to extract very detailed building footprint information, thanks to HRSI with rich spatial and structural information. However, in most urban areas, ground objects are artificial surfaces that exhibit complex composition in HRSI. Specifically, some buildings and other impervious surfaces, such as city roads and building roofs, share similar spectral and spatial features, making it difficult to capture their unique features. Furthermore, the variability in building sizes, the diverse distribution of buildings, and the complex surrounding environment make accurate building extraction from HRSI a significant challenge [4]. Thus, there is a pressing need to develop precise and efficient building extraction methods that effectively leverage the features of HRSI to enhance the quality of building footprint information obtained from remotely sensed imagery in urban areas.
The authors are with the School of Geography, Geomatics and Planning, Jiangsu Normal University, Xuzhou 221116, China (e-mail: liuy101004@163.com; lierzhu2008@126.com; liuw@jsnu.edu.cn; lixing@jsnu.edu.cn; zhuyuxuannjust@163.com).
Digital Object Identifier 10.1109/JSTARS.2023.3346454
Since the emergence of high-resolution remote sensing technology, considerable effort has been devoted to developing building extraction methods based on HRSI. Traditional approaches applied early to building extraction rely on manually designed features, characterizing buildings through representative features derived from spectrum, texture, and geometry [5], [6]. However, because of occlusion by trees, shadows, and other factors, these methods cannot fully utilize the information available in buildings, which limits their feature extraction capabilities. Additionally, some researchers have developed template libraries using prior knowledge of building shapes and subsequently incorporated them into active contour models to guide the evolution of segmentation curves [7]. Nevertheless, this approach has limitations in dealing with a wide range of complex and diverse building shapes. To this end, some works have integrated multiple sources of GIS and auxiliary data to enrich building features [8], [9], significantly improving the robustness of building recognition, but these often come with high data costs and complex algorithms. Therefore, these methods still struggle to meet research requirements [10], [11].
In recent years, owing to continuous progress in deep convolutional neural networks (DCNNs), a series of deep-learning-based approaches have gained significant traction in the remote sensing community [12], [13], [14]. Compared with traditional methods, deep learning makes full use of multilayer structures to extract high-level abstract features from spatial data, thus enhancing classification and detection accuracy [15], [16], [17]. Such end-to-end deep networks, which automatically adapt their parameters to capture features, prove more efficient than manual feature design. Benefiting from fully convolutional networks (FCNs) [18], data-driven DCNNs can automatically identify distinct objects within remotely sensed imagery through extensive training on labeled samples. This breakthrough enables dense predictions on large-scale remote sensing images and provides an efficient solution for extracting building features. For instance, Shrestha and Vanneschi [19] improved FCN with conditional random fields for boundary refinement, Deng et al. [20] used an encoder-decoder with attention gates and spatial pyramids for multiscale feature capture, and Chen et al. [21] combined DeepLabV3+ with dense connections and ResNet for enhanced performance.
However, challenges remain [22], [23], [24] in extracting buildings based on DCNNs. On the one hand, network architectures tend to prioritize high-level semantic features at the expense of finer edge and shape details, so local detail and edge information is lost and boundaries become blurred [25]. On the other hand, high-level semantic features might be less responsive to background information and target regions [10], and common downsampling operations result in significant information loss and limit contextual information integration.
To overcome the above shortcomings, some building extraction methods based on encoder-decoder architectures have been proposed [26], [27], [28]. They have effectively reduced network parameters and promoted the fusion of multiscale features by incorporating residual concepts and pyramid pooling. However, although simple skip-layer connections in encoder-decoder models can simultaneously increase contextual information and low-level feature transfer [29], they may lead to inadequate feature representation. Furthermore, the use of dilated convolutions with different dilation rates makes it difficult to detect smaller objects in extremely high-resolution images. Some other approaches [30], [31], [32] have aimed to enhance network extraction performance through the integration of multiscale input architectures. However, these methods significantly increase computational complexity and are difficult to apply in practice [11]. To this end, a series of attention methods have been developed [33], [34], [35]. These methods optimize features from both spatial and channel perspectives, leveraging intraclass similarity to improve overall feature integrity [36], [37], [38], [39]. This not only enhances the network's ability to handle complex scenes but also reduces the computational complexity associated with multiscale input architectures and feature fusion techniques, making them more suitable for practical applications. Meanwhile, multimodal approaches for cross-city semantic segmentation have opened up new avenues for building extraction [40].
The fusion of contextual information is acknowledged as indispensable in building extraction based on HRSI. However, the incorporation of boundary information is also important in semantic segmentation. Owing to complex shapes and diverse lighting conditions, the boundaries of semantic objects often exhibit considerable ambiguity, which poses a major challenge to accurate segmentation. To address this issue, several enhanced building extraction networks that integrate edge detection mechanisms have been proposed [41], [42]. These approaches enhance the capacity to handle intricate building edges while maintaining relatively smooth building footprint boundaries through the introduction of constraint terms. Nevertheless, the inclusion of supplementary edge networks often leads to a heavy computational burden. Recent studies [43], [44], [45] have integrated structural information about buildings into the workflow by leveraging prior knowledge of building shapes and implementing postprocessing techniques, which have yielded promising outcomes [46]. Moreover, structural prior information modules have been used to refine building boundaries [47] and have been combined with feature map refinement during training [48], contributing to more robust edge detection results.
Although existing methods have improved the determination of building boundaries and the segmentation of building types, they still struggle to resolve internal inconsistencies and discontinuities in building extraction from HRSI because buildings are discretely distributed, have complex characteristics, and vary in scale. In addition, because the difference between foreground and background is blurred in some complex scenes, pixels with similar colors and spatial distances can easily be misjudged as homogeneous pixels, which leads to blurred boundaries. To solve these problems, we incorporate prior edge information and propose a low-level feature enhancement and multiscale attention pyramid aggregation network (LFEMAP-Net), based on the low-level feature enhancement model (LFEM) and the multiscale attention pyramid aggregation model (MAPM), for detailed building footprint extraction from HRSI.
The main contributions of this work include the following.
1) This work proposes a novel segmentation architecture, named LFEMAP-Net, characterized by multiscale integration and edge fusion, to achieve the refined extraction of buildings in HRSI.
2) We develop the LFEM by designing a bilateral fusion method that effectively combines prior edges to enhance the network's expression of spatial details and provide more details for the decoding results.
3) We propose MAPM, which focuses on building feature representations across different scales through a multiscale mixing attention (MMA) mechanism and enhances the model's ability to aggregate information across levels.
The rest of this article is organized as follows. Section II presents the proposed LFEMAP-Net and its detailed architecture. Section III describes the datasets, experimental setups, and evaluation metrics, and provides a detailed analysis and discussion of the experimental results. Finally, Section IV concludes this article.

A. Overall Framework
In DCNN models, features extracted from deeper layers have a higher level of abstraction, while shallow features contain rich spatial details. Semantic segmentation networks usually use convolutional network models as encoders. As the number of network layers increases, spatial detail information is inevitably lost, resulting in incomplete content expression or inaccurate segmentation edges during the decoding stage. Most semantic segmentation networks either directly connect multiscale feature maps and decode them into prediction maps or use skip connections to supplement scale features. Although these methods can enhance the feature expression of content, they cannot effectively retain accurate edge information. Therefore, LFEMAP-Net, based on low-level feature enhancement and multiscale attention pyramid aggregation, is designed in this work, as shown in Fig. 1 and Table I. It integrates both high-level and low-level feature information to construct contextual semantic features and leverages prior edge information to enhance the low-level features associated with object edges. First, MAPM is proposed to maximize the utilization of features at various levels and enhance the model's ability to aggregate information. Moreover, we develop the LFEM to further refine boundaries by using prior edge information, enabling the extraction of more discriminative spatial details. Finally, the bidirectional aggregation method [49] is employed to fuse the feature representations from both parts. This guided aggregation enables efficient communication between the two branches, integrating rich high-level semantic information for buildings with spatial detail features to obtain robust building extraction results.

B. Multiscale Attention Pyramid Aggregation Model
In semantic segmentation, encoders are usually used to generate feature maps at different scales. Their amalgamation facilitates a more comprehensive capture of global and local context information [50], [51]. Nevertheless, simply concatenating the low-level and high-level features may lead to the underutilization of features at each scale. The lowest resolution branch output features of some popular backbone networks [52], [53], [54] contain the strongest semantic representation. However, the currently popular approach is to construct feature maps of different dimensions from the lowest scale upward and then fuse them together [55], [56]. This process may not effectively propagate semantic information into higher resolution branches. In addition, generating high-resolution prediction maps through commonly used bilinear upsampling methods may result in the loss of irregular edge detail information. To this end, we propose MAPM, which is illustrated in Fig. 2. The module exploits the spatial and channel dependencies within features across multiple scales to enhance semantic expression. It simultaneously integrates multiple-scale feature maps to form a robust and comprehensive feature representation. The MAPM process is expressed in terms of the following quantities: $F_i$ denotes the output of the $i$th layer, $FM_i$ denotes the output processed by the MMA mechanism, $\phi_{1\times1}$ represents a convolution operation using a 1×1 kernel, and $\mathrm{Cat}_d$ denotes the concatenation operation. The MMA module combines multiple-scale spatial and channel attention mechanisms to extract features more comprehensively from HRSI, thus improving the model's feature representation capabilities.
To better aggregate semantic features layer by layer from high to low, we have devised an MMA module for optimal utilization of contextual information. Furthermore, it allows the integration of building features with information at different scales. As indicated in Fig. 3, for an input $F \in \mathbb{R}^{C\times H\times W}$, we first construct a multiscale channel attention mechanism by utilizing pooling windows of different sizes, followed by concatenation and fusion, resulting in a multiscale channel attention result $F_C$ with dimensions of $C/2\times H\times W$. Subsequently, multiscale spatial attention is designed using convolution kernels of different sizes, yielding the multiscale spatial attention output $F_S$. To enrich the feature representation further, we integrate the input feature map features into $F_S$. Finally, the original input is added elementwise to $F_S$ for fusion, resulting in a feature map with dimensions of $C/2\times H\times W$. The combination of multiscale channel attention and multiscale spatial attention effectively integrates spatial and channel features, resulting in improved semantic segmentation accuracy. In this formulation, $\phi$ represents the convolution operation, $P$ is the channel attention operation with different pooling windows, $S$ represents the spatial attention operations at different convolution scales, and $F_{\mathrm{MMA}}$ stands for the feature map output from the MMA module.
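The MMA description can be summarized compactly. The following is a hedged sketch rather than the authors' exact formulation: the index ranges $m$ and $n$, the sequential order (spatial attention applied to $F_C$), and the projection $\phi(F)$ that reduces the input to $C/2$ channels before the elementwise addition $\oplus$ are assumptions chosen only to agree with the stated dimensions.

```latex
\begin{aligned}
F_C &= \phi\big(\mathrm{Cat}\,(P_1(F), \dots, P_m(F))\big), & F_C &\in \mathbb{R}^{C/2 \times H \times W},\\
F_S &= \phi\big(\mathrm{Cat}\,(S_1(F_C), \dots, S_n(F_C))\big),\\
F_{\mathrm{MMA}} &= F_S \oplus \phi(F), & F_{\mathrm{MMA}} &\in \mathbb{R}^{C/2 \times H \times W}.
\end{aligned}
```

Here $P_1, \dots, P_m$ stand for channel attention with $m$ pooling window sizes and $S_1, \dots, S_n$ for spatial attention with $n$ kernel sizes; neither count is stated in the text.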

C. LFEM
We employ the structured forest (SF) combined with adaptive morphological reconstruction (AMR) for prior edge extraction. SFs [57] use a structured learning method that can fully learn edge features by continuously predicting local segmentation masks for image patches. Specifically, the decision trees are trained to classify image patches as edge or nonedge. For each decision tree, the optimal segmentation parameters are determined based on the principle of maximum information gain.
Given an image $x \in X$ and its corresponding classification result $y \in Y$, the optimization objective is defined in terms of the following quantities: $\theta_j$ is the optimal separation parameter, $k$ represents the quantization feature of $x$, and $\gamma$ stands for the threshold value of the quantization feature. At the output stage, the classification result $y \in Y$ of the decision forest is mapped into labels, and the Euclidean distance is used to measure whether image patches with similar labels belong to the same segmentation. The measured results serve as a benchmark for both training and testing.
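The maximum-information-gain principle used to choose a split can be illustrated on toy data. The sketch below is not the structured forest implementation: it assumes a single scalar feature per patch, binary edge/nonedge labels, Shannon entropy as the impurity measure, and midpoints between distinct feature values as candidate thresholds. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting samples on `value < threshold`."""
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def best_threshold(values, labels):
    """Return the (threshold, gain) pair with maximum information gain.

    Candidates are midpoints between consecutive distinct feature values,
    so the input must contain at least two distinct values.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(
        ((t, information_gain(values, labels, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Toy data: one scalar feature per patch; label 1 = edge, 0 = nonedge.
feature = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
is_edge = [0, 0, 0, 1, 1, 1]
gamma, gain = best_threshold(feature, is_edge)
```

On this toy input the chosen threshold separates the two classes perfectly, so the gain equals the full entropy of the labels. In a real forest the same search would run over many candidate features as well as thresholds.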
To mitigate the impact of potentially noisy pixels in the edge information and optimize subsequent processing, a multiscale and multistructural AMR method [58] is employed to eliminate redundant information in the image and enhance the quality of the prior edges.
Given an SF result $g$, AMR is applied using the following quantities: $C_R$ is the morphological closing reconstruction, $S_i$ represents the multiple groups of structural elements, and $i$ ($1 \le i \le n$, $i \in \mathbb{N}^+$) is the scale of the structural elements. $\sigma$ is the AMR operator, which increases with the size of the structural element, and the start and end of the structural element scale selection are represented by $m$ and $n$. In this study, we set $m$ and $n$ to 1 and 10, respectively. Fig. 4 presents the results of edge extraction from remote sensing images. Compared with other methods, SF combined with AMR comprehensively represents the information on object boundaries, leading to more robust edge detection results.
To fully leverage the extracted prior edge information, we develop a dual-branch fusion strategy to enhance the network's spatial information representation and improve its boundary discrimination capability. Fig. 5 shows the specific implementation details of the proposed dual-branch fusion strategy. For an input image $X_L$ with prior edges $X_E$, the final detail feature representation is obtained from the following components: $F_L$ and $F_E$ correspond to the sequences of residual basic blocks with low resolution and prior edges, $T_{E-L}$ and $T_{L-E}$ refer to the low-to-edge and edge-to-low transformer, and $X_d$ represents the final detail branch output.

D. Loss Function
The loss function is composed of four components: the final output loss ($L_1$), the pyramid aggregation output loss ($L_2$), the low-level feature enhancement loss ($L_3$), and the edge loss ($L_4$). The total loss is a weighted combination of these components, where $\alpha$ and $\beta$ are the hyperparameters that control the weighting between the losses. In this article, they are set to 0.4. We employ cross-entropy loss with online hard example mining ($L_{\mathrm{OHEM}}$) for $L_1$, $L_2$, and $L_3$, while binary cross-entropy loss ($L_{\mathrm{BCE}}$) is used for the edge loss $L_4$. For $N$ samples, where $y_i$ denotes the actual category label and $p_i$ denotes the probability predicted by the model for the $i$th sample, the OHEM cross-entropy loss is computed over the mined hard examples: $K$ is the number of difficult examples to mine, which are usually the samples with wrong predictions or the largest loss, and $\mathrm{CE}(p_i, y_i)$ represents the cross-entropy loss.
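The hard-example mining step can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' training code: it takes, for each sample, the probability the model assigns to the true class, selects the $K$ samples with the largest cross-entropy, and averages them. Averaging over the $K$ selected samples and the function names are assumptions.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy of one sample, given the probability of its true class."""
    eps = 1e-12  # guards against log(0)
    return -math.log(max(p_true, eps))


def ohem_cross_entropy(p_true_per_sample, k):
    """Mean cross-entropy over the k samples with the largest loss."""
    if k <= 0:
        raise ValueError("k must be positive")
    losses = sorted((cross_entropy(p) for p in p_true_per_sample), reverse=True)
    hardest = losses[:k]  # if k exceeds the sample count, all samples are used
    return sum(hardest) / len(hardest)


# Four samples; the model is confident on two and poor on the other two.
probs = [0.9, 0.5, 0.1, 0.8]
loss_hard = ohem_cross_entropy(probs, k=2)
loss_all = ohem_cross_entropy(probs, k=len(probs))
```

Because only the hardest samples contribute, `loss_hard` is larger than the plain mean `loss_all`, which is what focuses the gradient on difficult pixels such as building boundaries.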
The L4 loss can be expressed as follows:
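A sketch of the standard binary cross-entropy form is given here, on the assumption that the edge loss follows the usual definition. In this sketch $y_i \in \{0, 1\}$ is taken to be the edge label of the $i$th sample and $p_i$ the predicted edge probability; reusing these symbols for the edge branch is an assumption.

```latex
L_4 = L_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\big]
```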

A. Dataset
In this study, two publicly available datasets, the WHU building dataset [59] and the Massachusetts building dataset [60], were chosen as the experimental data to test the proposed method.
The WHU building dataset consists of both aerial and satellite data. For our study, we specifically use the aerial imagery, which contains approximately 220 000 independent buildings in Christchurch, New Zealand, and offers a ground resolution of 0.3 m over an area of 450 km². The dataset is divided into a training set, a validation set, and a test set, containing 4736, 1036, and 2416 images, respectively. Each image and its corresponding label are cropped to 512 × 512 pixels. We employed the default dataset division for our experiments. Fig. 6 shows sample images with their corresponding building labels for the WHU and Massachusetts datasets. The WHU dataset contains high-resolution aerial images that reveal more detailed representations, while the Massachusetts dataset presents challenges owing to its lower resolution. Moreover, both datasets exhibit significant variations in building scales, which helps illustrate the practicality and efficacy of our approach in accurately delineating fine-grained building boundaries across various scales.

B. Implementation Details
All experiments are conducted on Windows 10 with the PyTorch 1.12 framework based on Python 3.8. Models are trained on an NVIDIA GeForce RTX 3060 GPU for 100 epochs on both datasets. In addition, the Adam optimizer is employed with an initial learning rate of 0.01, which is dynamically adjusted based on the validation accuracy. All compared approaches use the same batch size of 2 and the same data augmentation, including random scaling, rotation, and flipping.

C. Evaluation Metrics
To effectively assess and compare the models' performance, four commonly used semantic segmentation metrics were employed: Precision (P), Recall (R), Intersection over Union (IoU), and F1-score (F1). Precision measures the proportion of pixels predicted as building that are truly building. Recall measures the proportion of true building pixels that are correctly predicted. The F1-score combines recall and precision, providing a balanced assessment of the model's segmentation performance. Meanwhile, IoU indicates the proportion of pixel overlap between the predicted and ground truth masks. The mathematical expressions are given as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where TP, TN, FP, and FN represent the numbers of true positive, true negative, false positive, and false negative pixels, respectively.
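The four metrics can be computed directly from pixel counts using their standard definitions. The following is a minimal sketch on a toy pair of flattened binary masks; the function names and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN over two flat sequences of 0/1 pixel labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn


def segmentation_metrics(tp, fp, fn):
    """Return precision, recall, F1, and IoU from pixel counts.

    A zero denominator yields 0.0 for that metric (an assumed convention).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    iou = ratio(tp, tp + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Toy masks: 1 = building, 0 = background.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 1, 0, 1, 0]
scores = segmentation_metrics(*confusion_counts(pred, truth))
```

A useful sanity check follows from the definitions: F1 = 2·IoU / (1 + IoU). The reported pair of 91.09% IoU and 95.34% F1 on the WHU dataset is consistent with this identity.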

1) Ablation Experiments:
To demonstrate the effectiveness of the different components within LFEMAP-Net, we conducted ablation experiments on both the WHU and Massachusetts building datasets, using ConvNext-B combined with feature pyramid networks [55] for scale fusion as the baseline. In these experiments, we employed P, R, IoU, and F1 scores to evaluate the distinct effects of the modules within LFEMAP-Net through selective exclusion or deactivation. Fig. 7 presents representative visual results, illustrating variations in building extraction outcomes across different scenarios when the base network is combined with different modules.

TABLE II ABLATION EXPERIMENTS OF THE NETWORK STRUCTURE BASED ON WHU DATASET
While the base network exhibits good architectural segmentation capabilities, there is room for improvement in boundary delineation and extraction completeness. In the fourth column, Base + MAPM demonstrates more complete building shapes through multiscale learning. The addition of LFEM to the base network emphasizes building boundaries. Finally, in the sixth column, LFEMAP-Net synthesizes the advantages of both modules, resulting in more complete and accurately delineated building extractions.
Tables II and III indicate that the modules we designed significantly improved the performance of the network model, although to varying degrees. First, optimizing the multiscale attention pyramid aggregation architecture based on ConvNext-B, denoted as ConvNext-B + MAPM, led to increases in IoU of 1.54% and 2.18%, F1 of 0.86% and 1.5%, Precision of 0.94% and 0.49%, and Recall of 0.99% and 2.39% on the WHU and Massachusetts datasets, respectively. This demonstrates that MAPM can effectively leverage feature representations at different scales to enhance semantic expression, resulting in more robust extraction results.
ConvNext-B + LFEM, which incorporates prior edge-enhanced low-level features, increased IoU by 1.05% and 1.8%, F1 by 0.59% and 1.24%, Precision by 0.71% and 1.21%, and Recall by 0.45% and 1.26%. The optimization effect was most significant on the Massachusetts dataset, indicating that our proposed model exhibits powerful extraction capabilities for small, densely distributed buildings.
We also conducted comparative experiments using different backbone networks [54], [61] on the Massachusetts dataset. In Table IV, our method consistently demonstrated superior performance with various backbone networks, achieving average improvements of 2.41% in IoU, 1.65% in F1, 0.98% in Precision, and 2.24% in Recall.
Fig. 8 presents the visual results of our LFEMAP-Net and other common semantic segmentation methods on the WHU dataset. In this comparison, eight representative images were selected to conduct experiments with UNet, DeepLabV3+, PSPNet, HRNetv2, Mask2former, and our LFEMAP-Net. As shown in Fig. 8, when dealing with buildings in complex scenes, LFEMAP-Net outperforms the other tested methods. It produces more complete building boundaries and is more sensitive to the scale of both small and large buildings. In the first four rows of Fig. 8, LFEMAP-Net significantly reduces misclassifications and omissions caused by interfering objects, providing more accurate and detailed descriptions of buildings with complex boundary contours. It can accurately distinguish between buildings and background, even in cases where building boundaries are challenging to delineate. The last four rows depict buildings with lower foreground-background contrast and significant scale variations, and LFEMAP-Net can still effectively distinguish buildings from the background. Leveraging its multiscale advantages, it comprehensively captures information about buildings of different sizes and accurately infers their complete shapes.
Quantitative evaluation results are presented in Table V. For a more comprehensive evaluation of the proposed method, this study further compared building segmentation performance with the latest research, including MAP-Net and MSL-Net. LFEMAP-Net achieved the highest accuracy, with 91.09% IoU, 95.34% F1, 95.81% Precision, and 94.86% Recall. Additionally, when we employed a backbone with a larger number of channels, ConvNext-XL, it achieved 91.48% IoU, 95.55% F1, 95.65% Precision, and 95.45% Recall after training for 100 epochs, demonstrating the effectiveness of the proposed method.
To further assess LFEMAP-Net's generalization performance in extracting buildings across different datasets, multiple comparative experiments were also carried out on the challenging Massachusetts dataset. As shown in Fig. 9, owing to the limited spatial resolution, building boundary delineation is often unclear; thus, segmenting small buildings is challenging. Variations in building roof materials and the presence of partial shadows contribute significantly to the difficulty of achieving complete building extraction. Compared with other semantic segmentation methods, LFEMAP-Net achieves more complete building extraction while preserving well-defined boundaries. Fig. 9 presents results for large-scale and small-scale buildings in complex environments. The segmentation results demonstrate that LFEMAP-Net performs better in challenging scenes, producing fewer omissions and misclassifications for both large and small buildings and accurately extracting the outlines of small buildings. Moreover, recently reported results for MBR-HRNet, MAFF-HRNet, and D-LinkNet are also included for quantitative evaluation in Table VI. Rows represent evaluation metric results for different methods, while columns represent evaluation metrics. In quantitative comparisons, LFEMAP-Net continues to achieve the best performance, demonstrating the advantages of our proposed method in scale learning and building edge optimization and achieving fine-grained building segmentation.
3) Parameter Analysis: In this study, we analyzed the parameter counts of the main modules to assess the complexity of our model. Specifically, we evaluated model complexity by calculating the parameters of LFEMAP-Net under various configurations, including the base network, base network + LFEM, base network + MAPM, and base network + LFEM + MAPM (LFEMAP-Net). Table VII reveals that incorporating LFEM into the base network results in a marginal parameter increase of 5.404 M. In contrast, introducing MAPM leads to a more substantial increase of 27.015 M. LFEM therefore contributes relatively few parameters to LFEMAP-Net, whereas the MAPM module, which emphasizes scale characteristics and integrates mixing attention across multiple scales, significantly increases the model's parameter count.

IV. CONCLUSION
In this study, an improved building extraction approach (LFEMAP-Net) for HRSI has been presented to address the limitations of current methods in boundary accuracy and complete building extraction. Specifically, to obtain more accurate contours, we proposed LFEM, which incorporates prior edge information through bilateral fusion and thus captures more spatial detail, thereby refining building boundary details. Moreover, we developed MAPM. Through the designed MMA mechanism, MAPM can effectively capture multiscale and multilevel features and alleviate the problems of incomplete building extraction and missed detection of small buildings. Experimental results on two publicly available datasets validate the effectiveness of LFEMAP-Net, showcasing its capacity to improve building boundaries and multiscale feature integration. Even in challenging scenes, LFEMAP-Net can make full use of prior edges and multiscale information, extract more accurate building boundaries, and achieve more complete building extraction results.

Index Terms-Building extraction, deep learning, edge extraction, feature enhancement, multiscale attention.

Manuscript received 29 October 2023; revised 10 December 2023; accepted 20 December 2023. Date of publication 25 December 2023; date of current version 10 January 2024. This work was supported in part by the National Natural Science Foundation of China under Grant 42371465, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20231353, and in part by the Natural Science Research of Jiangsu Higher Education Institutions of China under Grant 23KJB420002. (Corresponding author: Erzhu Li.)

Fig. 6. Images and labels from the WHU dataset and Massachusetts dataset. The two columns on the left are from the WHU dataset, and the two columns on the right are from the Massachusetts dataset.
Yu Liu, Erzhu Li, Wei Liu, Xing Li, and Yuxuan Zhu

TABLE I ARCHITECTURE OF BACKBONE NETWORK IN ENCODING PATH

TABLE III ABLATION EXPERIMENTS OF THE NETWORK STRUCTURE BASED ON MASSACHUSETTS DATASET
TABLE IV COMPARATIVE EXPERIMENTS OF THE NETWORK BASED ON A DIFFERENT BACKBONE

TABLE V QUANTITATIVE EVALUATION ON THE WHU AERIAL BUILDING DATASET
Fig. 9. Visualized results on the Massachusetts building dataset.