Improved Swin Transformer-Based Semantic Segmentation of Postearthquake Dense Buildings in Urban Areas Using Remote Sensing Images

Timely acquisition of earthquake-induced building damage information is crucial for emergency assessment and post-disaster rescue. Optical remote sensing is a typical method for obtaining seismic data due to its wide coverage and fast response speed. Convolutional neural networks (CNNs) are widely applied for remote sensing image recognition. However, their insufficient ability to extract and express global correlations between local image patches limits the performance of dense building segmentation. This paper proposes an improved Swin Transformer to segment dense urban buildings from remote sensing images with complex backgrounds. The original Swin Transformer is used as the backbone of the encoder, and a convolutional block attention module is employed in the linear embedding and patch merging stages to focus on significant features. Hierarchical feature maps are then fused to strengthen the feature extraction process and fed into the UPerNet (as the decoder) to obtain the final segmentation map. Collapsed and non-collapsed buildings are labeled from remote sensing images of the Yushu and Beichuan earthquakes. Data augmentations of horizontal and vertical flipping, brightness adjustment, uniform fogging, and non-uniform fogging are performed to simulate actual situations. The effectiveness and superiority of the proposed method over the original Swin Transformer and several mature CNN-based segmentation models are validated by ablation experiments and comparative studies. The results show that the mean intersection-over-union of the improved Swin Transformer reaches 88.53%, achieving an improvement of 1.3% compared to the original model. The stability, robustness, and generalization ability of dense building recognition under complex weather disturbances are also validated.


I. INTRODUCTION
Earthquakes are among the most severe natural disasters, and due to the recent acceleration of urbanization, earthquake-induced building damage has become one of the most severe threats to human beings [1]. Therefore, after an earthquake occurs, it is crucial to rapidly recognize the number, location, and damage level of urban buildings to ensure postearthquake rescue and reconstruction [2]. Seismic damage-related data have mainly been collected via field investigation, which is labor-intensive, time-consuming, and inefficient. In addition, particular circumstances, such as power facility destruction and communication system interruption caused by earthquakes, can bring additional challenges to conducting immediate field investigation. Therefore, an efficient and effective method that can meet the practical requirements of postearthquake rapid assessment and emergency rescue is urgently needed.
In recent years, with the development of satellite systems, remote sensing techniques have become increasingly popular in the field of natural disaster assessment [3]. The commonly used remote sensing data [4], [5] can be roughly divided into three categories: synthetic aperture radar images [6], [7]; optical images [8]; and light detection and ranging data [9]. Among them, high-resolution optical images, which are easy to obtain and can provide rich information on postearthquake building attributes such as color, texture, and shape, have been the most widely used [10]. Remote sensing images are wide-ranging, all-weather, unaffected by earthquakes, and accessible without onsite human inspection. In early-stage research, remote sensing image interpretation primarily relied on preset thresholds and handcrafted parameters and thus was highly affected by subjective judgment in various application scenarios. In addition, the recognition speed and reliability highly depended on the engineering experience and prior knowledge of image analysts. However, automatic extraction and autonomous recognition of seismic damage from remote sensing images have rapidly developed with advanced computer vision techniques, including image processing, machine learning, and deep learning. Compared to image processing and machine learning, deep learning has relatively better learning ability and stronger robustness against interference and variations in object size, position, shape, and geometry and thus can provide more accurate localization and damage information on dense seismic buildings [11]. Convolutional neural networks (CNNs) have been the most widely used deep learning-based model for seismic damage data extraction from high-resolution remote sensing optical images.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Currently, CNNs are widely applied to seismic damage identification from postearthquake remote sensing images. Cooner et al. [12] adopted CNNs to classify high-resolution seismic remote sensing imagery and quickly detect damaged buildings, achieving an accuracy of 55% for the 2010 Haiti earthquake with the 7.0 magnitude. Ma et al. [13] combined remote sensing images with block vector data and improved the Inception V3 architecture; a test accuracy of 90.07% on postearthquake aerial imagery of Yushu was achieved. Furthermore, Ji et al. [14] used the pretrained VGG model to recognize collapsed buildings in remote sensing images before and after the 2010 Haiti earthquake, concluding that the fine-tuned VGGNet model outperformed the original VGGNet model trained from scratch with an overall accuracy increasing from 83.38% to 85.19%. Xiao et al. [15] proposed a dynamic cross-fusion network to enable each task to share features from different CNN layers adaptively and achieved state-of-the-art performance. Zhan et al. [16] used the Mask R-CNN to extract information on damaged buildings from postearthquake remote sensing images and identify the damage level. An improved feature pyramid network (FPN) was designed, and a detection accuracy of 92% was achieved for the most severely damaged buildings (the overall classification accuracy for four damage classes was 88%).
However, conventional CNNs can focus only on a small range of pixel-level features. They therefore provide insufficient information on global correlations between local pixels and lack the capacity to model global relationships between objects within an image and nonlocal relationships between pixels. In addition, the limited receptive field cannot provide sufficient contextual features, which may significantly affect the damage assessment accuracy of dense seismic buildings [17]. Transformer-based models using global self-attention mechanisms can compensate for the abovementioned shortcomings of conventional CNNs that focus only on local receptive fields without considering global features [18], [19], [20], [21], allowing each pixel to contain global correlations and thus improving generalization ability and interference robustness [22], [23], [24], [25], [26].
Dosovitskiy et al. [27] first presented the vision transformer (ViT) models and utilized the transformer as the backbone network for image classification tasks. The ViT models tokenized the input image into fixed-size patches, which were then flattened as vectors and fed to the transformer backbone. Experimental results demonstrated that the ViT models pretrained on large-scale datasets could achieve better performance than the CNNs when transferred to classification tasks on small-size and medium-size datasets. In recent years, several transformer-based vision models have been proposed for different computer vision tasks, such as target classification [28], object detection [29], and semantic segmentation [30], [31], [32]. Despite the successful application of the transformer in the natural language processing field, there are two main challenges in transferring it from the original language domain to the visual domain: the significant differences in visual entity size among images, and the much higher resolution of images compared to texts, which leads to an intensive computational cost.
To solve the above problems, the Swin Transformer [33] was proposed with two principal improvements over conventional ViTs. 1) A hierarchical structure similar to the CNN structure is designed. This structure is very flexible in multiscale modeling and reduces the growth of computational complexity with the image size from quadratic to linear. 2) The shifted window multihead self-attention (SW-MSA) block is proposed to reduce the computational cost while considering the information transferred between different windows. Although transformer-based models have made a splash in computer vision, they are still in their infancy for large-scale seismic disaster evaluation in urban areas. Da et al. [34] developed a two-stage damage assessment framework named the SDAFormer, which feeds pre-disaster and postdisaster images to the network separately for damage assessment. The SDAFormer won first place on the xBD (a large-scale building damage assessment dataset) and achieved a mean intersection-over-union (mIoU) improvement of 1.5% compared to the second-place method. Chen et al. [35] proposed a transformer-based damage assessment architecture consisting of a Siamese transformer encoder and a lightweight dual-task decoder, which outperformed traditional CNN models such as the Mask R-CNN and Siamese-UNet.
Although the CNN models have been extensively investigated for computer vision tasks, the feature extraction process of conventional CNN is always performed at a local region, and modelling the global correlation is challenging. Considering the characteristics of the investigated remote sensing images for postearthquake buildings in a city area, the buildings are densely distributed, and the structure style and damage type are often similar, which suggests that the small-region features are closely related and the global correlations should be significant for the recognition accuracy. Therefore, this article designs an integrated model using the improved Swin Transformer for global correlation modeling and CNN for local feature extraction to further enhance the recognition capacity of building damage states and location semantics, respectively.
Meanwhile, statistical analyses of previous studies have demonstrated that clouds approximately cover 70% of the Earth, which suggests that weather interferences of cloud or fog obscuration and illumination variances inevitably exist in remote sensing optical images [36]. In addition, postearthquake remote sensing images can suffer from light overexposure and darkness due to various illumination conditions. Therefore, accurate recognition of dense seismic buildings in images collected under strong weather disturbances represents a great challenge in semantic segmentation. However, research on semantic segmentation of postearthquake remote sensing images of dense urban buildings with complex backgrounds and strong interferences is rather limited.
To address the abovementioned limitations, this article proposes a semantic segmentation method for seismic damage of dense buildings in large-scale urban areas with complex backgrounds and strong weather interferences. In addition, the opportunity of combining the transformer and CNN for seismic damage recognition from remote sensing images is analyzed.
The main contributions of this article can be summarized as follows.
1) An effective semantic segmentation method is proposed for high-resolution remote sensing optical images of dense buildings with complex backgrounds and strong weather interferences; this method can accurately and simultaneously extract the building damage state and location semantics.
2) An improved Swin Transformer with the encoder-decoder structure is proposed to simultaneously exploit multilevel local features and global correlations, which performs the multilevel feature fusion at each stage of the encoder, inserts convolutional block attention module (CBAM) in the linear embedding and patch merging modules, and uses the UPerNet as a decoder.
3) Two actual seismic scenarios of Yushu city and Beichuan city with different weather disturbances are used to simulate possible light overexposure, darkness, and fog occlusions and validate the effectiveness of the proposed method.
4) Ablation experiments are performed to demonstrate the efficacy and necessity of the proposed modules in the improved Swin Transformer. In addition, comparative studies are conducted to verify the superiority of the improved Swin Transformer over the original Swin Transformer and various mature CNN-based segmentation models.
The rest of the article is organized as follows. Section II describes the architecture of the improved Swin Transformer. Section III introduces the dataset and implementation details. Section IV presents the test results under two real-world seismic scenarios, ablation experiments, and comparative studies. Section V concludes the article.

A. Overall Architecture
An improved Swin Transformer based on the encoder-decoder framework is proposed to realize accurate semantic segmentation of postearthquake dense buildings from remote sensing images with complex backgrounds and strong weather interferences. The overall architecture, which uses the original Swin Transformer as the backbone of the encoder, is presented in Fig. 1. As shown in Fig. 1, a feature fusion module is added to the end of the encoder to fully exploit the extracted features at various levels. In the proposed structure, hierarchical feature maps are concatenated using convolutions to enrich the transferable local features of different stages by multilevel feature fusion. In addition, the CBAM is inserted into the linear embedding and patch merging modules to alleviate feature leakage during the patch downsampling process in the encoding stage. This enables the proposed model to distinguish different building damage states and location semantics, thus improving multiclass segmentation accuracy. Finally, the UPerNet incorporating multilevel features is used as a decoder. Details on the feature fusion and CBAM modules are described in the following sections.

B. Swin Transformer Backbone
The Swin Transformer backbone includes an initial patch partition module and four different stages denoted by stages 1-4. Stage 1 consists of a linear embedding layer and two consecutive Swin Transformer blocks. Stage 2 consists of a patch merging module and two Swin Transformer blocks. Stage 3 consists of a patch merging module and 18 Swin Transformer blocks. Finally, Stage 4 consists of a patch merging module and two Swin Transformer blocks.
For the patch partition module, the input image with a size of H × W × 3 is split into non-overlapping 4 × 4 patches in the spatial directions, and each patch is flattened in the channel direction, generating a feature map with a size of H/4 × W/4 × 48. Then, the linear embedding layer projects the channel dimension to an arbitrary number denoted by C (in this article, C = 128) through a 1 × 1 convolution, generating a feature map with a size of H/4 × W/4 × C. The feature map of each stage is input into the patch merging module, where the spatial size is halved by neighborhood sampling every two points, and thus the channel number quadruples. Then, a 1 × 1 convolution is utilized to reduce the channel number to twice that of the input. The overall schematic of the patch partition module is presented in Fig. 2.
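The shape bookkeeping described above can be sketched in a few lines of plain Python. This is only an illustrative calculation, not the authors' implementation; the function name `swin_stage_shapes` and its parameters are hypothetical, and the input size of 512 × 512 is taken from the patch size used later in the article.

```python
def swin_stage_shapes(height, width, embed_dim=128, num_stages=4, patch_size=4):
    """Return (H_i, W_i, C_i) for each stage of a Swin-style hierarchical backbone."""
    total_stride = patch_size * 2 ** (num_stages - 1)
    if height % total_stride or width % total_stride:
        raise ValueError("height and width must be divisible by the total stride")
    shapes = []
    # Patch partition + linear embedding: H/4 x W/4 x C.
    h, w, c = height // patch_size, width // patch_size, embed_dim
    for stage in range(num_stages):
        if stage > 0:
            # Patch merging: spatial size halves, channels go x4 and are
            # then projected down to x2 by the 1 x 1 convolution.
            h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes


# Each 4 x 4 RGB patch is flattened into 4 * 4 * 3 = 48 values.
patch_dim = 4 * 4 * 3

for index, shape in enumerate(swin_stage_shapes(512, 512), start=1):
    print(f"stage {index}: {shape}")
```

For a 512 × 512 input with C = 128, the four stages produce feature maps of 128 × 128 × 128, 64 × 64 × 256, 32 × 32 × 512, and 16 × 16 × 1024.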
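The schematic diagram of the Swin Transformer block, which is the fundamental component of the Swin Transformer, is presented in Fig. 3. Each Swin Transformer block includes a regular window and a shift window. The regular window consists of a layer-normalization (LN) layer, a window multihead self-attention (W-MSA) module, a residual connection, an LN layer, a multilayer perceptron (MLP), and a residual connection. The shift window has a structure similar to that of the regular window; the only difference is that an SW-MSA module is used instead of the W-MSA. The mathematical formula of the Swin Transformer block is expressed as follows:
$$
\begin{aligned}
\hat{Z}^{l} &= \text{W-MSA}\big(\text{LN}(Z^{l-1})\big) + Z^{l-1} \\
Z^{l} &= \text{MLP}\big(\text{LN}(\hat{Z}^{l})\big) + \hat{Z}^{l} \\
\hat{Z}^{l+1} &= \text{SW-MSA}\big(\text{LN}(Z^{l})\big) + Z^{l} \\
Z^{l+1} &= \text{MLP}\big(\text{LN}(\hat{Z}^{l+1})\big) + \hat{Z}^{l+1}
\end{aligned}
\tag{1}
$$
where $Z^{l-1}$ and $Z^{l+1}$ denote the input and output of the Swin Transformer block, respectively, and $\hat{Z}^{l}$, $Z^{l}$, and $\hat{Z}^{l+1}$ denote the intermediate outputs after each residual connection. A detailed description of W-MSA, MLP, and SW-MSA can be found in the study of Han et al. [32].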

C. Feature Fusion Module
Compared with the traditional semantic segmentation task, the dataset investigated in this article consists of remote sensing images with complex backgrounds, and its unique characteristics are reflected in two aspects: images contain complex backgrounds, including several types of strong distractions, such as illumination variations and fog obscurations; and buildings in remote sensing images are in different geometries; particularly, shapes and sizes of collapsed and not-collapsed buildings are different.
A previous study has shown that using different convolution operators in the transformer architecture can provide information on both local and global features of the input image, significantly improving the semantic segmentation performance [37]. Inspired by this idea, a multilevel feature fusion module is designed after each Swin Transformer block to convolve the feature maps output by the previous levels and further enhance the extraction capability of local features and global correlations. Although the Swin Transformer has a hierarchical structure, there are no interactions between the feature maps of different stages. Therefore, enriching the extracted features is essential considering that remote sensing images of postearthquake dense buildings contain various types of background distractions, including illumination overexposure, darkness, uniform fog, and non-uniform fog, and have a high diversity of geometric shapes and sizes.
The schematic diagram of the feature fusion module is shown in Fig. 4, where four feature maps from the corresponding stages of the Swin Transformer backbone are illustrated. From one stage to the next, the flat dimension is halved and the channel dimension is doubled. The feature map of each stage is downsampled by a 2 × 2 convolutional kernel with a sliding stride of two and concatenated with that of the next stage in the channel direction. Then, the channel number of the concatenated feature map is halved by a 1 × 1 convolution. Finally, feature maps from all stages are fused in the channel direction, and the channel size is quartered using a 1 × 1 convolutional kernel.

D. Convolutional Block Attention Module
The attention mechanism is a typical way to achieve adaptive attention inside a neural network, and the commonly used attention mechanisms include channel attention and spatial attention. The channel attention aims to enable the network to focus on the category information inside an image by keeping the channel dimension unchanged and compressing the spatial dimension into a scalar. In turn, the spatial attention assists the network in paying more attention to the location information of targets inside an image by keeping the spatial dimension unchanged and compressing the multiple-channel dimension into one single channel. This article utilizes the CBAM, which combines channel attention and spatial attention and can therefore distinguish significant feature maps of building damage states and location semantics. The schematic diagram of the CBAM, which is a lightweight attention mechanism module proposed by Woo et al. [38] and consisting of a channel attention part and a spatial attention part, is presented in Fig. 5. Details of the CBAM have been presented in [38] and are omitted here.
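The two compression steps described above can be illustrated with a deliberately simplified sketch in plain Python. It is not the CBAM of Woo et al. [38]: the learned shared MLP of the channel branch and the learned convolution of the spatial branch are replaced here by a plain sum of the average-pooled and max-pooled descriptors, which is an assumption made only to keep the example free of trainable weights. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_descriptors(feature):
    """feature: list of C channels, each an H x W nested list.
    Compress the spatial dimension: one (average, maximum) pair per channel."""
    pairs = []
    for channel in feature:
        values = [v for row in channel for v in row]
        pairs.append((sum(values) / len(values), max(values)))
    return pairs


def spatial_descriptors(feature):
    """Compress the channel dimension: one average map and one maximum map."""
    depth, height, width = len(feature), len(feature[0]), len(feature[0][0])
    avg_map = [[sum(feature[c][i][j] for c in range(depth)) / depth
                for j in range(width)] for i in range(height)]
    max_map = [[max(feature[c][i][j] for c in range(depth))
                for j in range(width)] for i in range(height)]
    return avg_map, max_map


def simplified_attention(feature):
    """Channel gating followed by spatial gating (no learned layers)."""
    # Channel branch: one gate in (0, 1) per channel.
    gates = [sigmoid(avg + mx) for avg, mx in channel_descriptors(feature)]
    gated = [[[v * gates[c] for v in row] for row in channel]
             for c, channel in enumerate(feature)]
    # Spatial branch: one gate in (0, 1) per pixel location.
    avg_map, max_map = spatial_descriptors(gated)
    return [[[v * sigmoid(avg_map[i][j] + max_map[i][j])
              for j, v in enumerate(row)] for i, row in enumerate(channel)]
            for channel in gated]


feature = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
refined = simplified_attention(feature)
```

The output keeps the C × H × W shape of the input, and each value is scaled by a channel gate and a spatial gate, which mirrors the sequential channel-then-spatial ordering of the CBAM.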
The process of inserting the CBAM into the linear embedding module is illustrated in Fig. 6. The dimension of the feature map generated by the patch partition module is transformed to C by a 1 × 1 convolution block, and the CBAM module is inserted before the LN layer.
Conventional downsampling operations often use convolution, average pooling, and maximum pooling in a local region, which will inevitably cause feature leakage. Patch merging selects the neighborhood of every two pixels, reassembles them into a series of patches (the spatial size of patches is halved), and concatenates the patches in the channel dimension (the channel dimension is quadrupled), which is finally followed by a 1 × 1 convolution to adjust the channel dimension. Therefore, all the input information can be preserved, and no feature leakage occurs in patch merging. The process of inserting the CBAM into the patch merging module is presented in Fig. 7. In each channel, neighborhood areas of every two points are reassembled into a patch (i.e., the flat size is halved). The reconstructed patches are fed into the CBAM module individually, and the output feature maps of the CBAM module are fused in the channel direction. The CBAM module is followed by an LN layer and a fully connected linear layer.
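A minimal sketch of the neighborhood sampling in patch merging is given below for a single channel, using plain nested lists. The function name `patch_merge` and the ordering of the four sub-grids are illustrative choices, not taken from the article. The example shows that the four half-size grids together contain every input value exactly once, so the rearrangement itself discards nothing.

```python
def patch_merge(channel):
    """Split one H x W channel into four (H/2) x (W/2) grids by taking
    every second pixel, one grid per position in the 2 x 2 neighbourhood."""
    height, width = len(channel), len(channel[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    merged = []
    # (row offset, column offset) of each pixel in a 2 x 2 neighbourhood.
    for row_offset, col_offset in ((0, 0), (1, 0), (0, 1), (1, 1)):
        merged.append([row[col_offset::2] for row in channel[row_offset::2]])
    return merged


# A 4 x 4 channel holding the values 0..15 in row-major order.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
parts = patch_merge(grid)

# Every input value appears exactly once across the four sub-grids.
flattened = sorted(v for part in parts for row in part for v in row)
print(flattened == list(range(16)))
```

Stacking the four sub-grids along the channel direction gives the quadrupled channel count mentioned above; the subsequent 1 × 1 convolution is a learned projection and is not modeled here.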

E. UPerNet Decoder
For remote sensing images with complex backgrounds and small dense buildings, a multilevel segmentation predictor, the UPerNet [39], is employed to achieve full-scale coverage from low-level concrete features to high-level abstract features. The design of the UPerNet is based on the pyramid pooling module (PPM) [40] and FPN, which fully integrates extracted features from different stages of the encoder. The architecture of the UPerNet decoder is shown in Fig. 8.
The PPM block utilizes pooling kernels covering different portions of the input feature map to generate multiscale correlations among different subregions. In this article, a four-level pyramid pooling is designed to individually perform the pooling operation for the whole, half of, a third of, and a sixth of the input feature map. Then, the channel dimensions are adjusted using 1 × 1 convolution, and the spatial dimensions are unified by bilinear interpolation upsampling. Finally, they are fused as the global prior and concatenated with the original feature map at the channel dimension, as shown in Fig. 9.
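The four-level pooling described above can be sketched with a small adaptive average pooling routine in plain Python. The bin sizes (1, 2, 3, 6) are an interpretation of "the whole, half of, a third of, and a sixth of" the feature map; the bin boundary rule (floor for the start index, ceiling for the end index) is an assumption chosen to match common adaptive pooling behavior. The 1 × 1 convolutions and bilinear upsampling are omitted, and the function names are hypothetical.

```python
import math


def adaptive_avg_pool(feature_map, bins):
    """Average-pool an H x W nested list into a bins x bins grid."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(bins):
        row_start = (i * height) // bins
        row_end = math.ceil((i + 1) * height / bins)
        pooled_row = []
        for j in range(bins):
            col_start = (j * width) // bins
            col_end = math.ceil((j + 1) * width / bins)
            cells = [feature_map[r][c]
                     for r in range(row_start, row_end)
                     for c in range(col_start, col_end)]
            pooled_row.append(sum(cells) / len(cells))
        pooled.append(pooled_row)
    return pooled


def pyramid_pool(feature_map, bin_sizes=(1, 2, 3, 6)):
    """Return one pooled grid per pyramid level, keyed by bin size."""
    return {bins: adaptive_avg_pool(feature_map, bins) for bins in bin_sizes}


# A 6 x 6 single-channel map holding 0..35 in row-major order.
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]
levels = pyramid_pool(feature_map)
print({bins: len(grid) for bins, grid in levels.items()})
```

The coarsest level summarizes the entire map in a single value, while the finest level keeps a 6 × 6 grid; in the decoder these levels would then be resized to a common resolution and concatenated with the original feature map.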

A. Dataset
In this article, 24 remote sensing city-scale images of the Yushu city and Beichuan city after Yushu and Wenchuan earthquakes with a resolution of 4608 × 2560 were used. The original images were downloaded from the Internet and manually pixel-wise labeled using "labelme" [41] to classify buildings into collapsed and non-collapsed buildings. Buildings with destructive shapes, severely-damaged roofs, columns, and beams were classified as collapsed, and other buildings were labeled as non-collapsed.
Data augmentation operations, including random flipping in the horizontal and vertical directions, brightness transformation, uniform fogging, and nonuniform fogging, were performed to expand the dataset and simulate possible light overexposure and darkness and fog occlusions in remote sensing images.
The brightness transformation is realized by rescaling the pixel intensity as follows:
$$\hat{I}(h, w) = \mathrm{median}\{0,\ \alpha I(h, w),\ 255\} \tag{2}$$
where $I(h, w)$ and $\hat{I}(h, w)$ denote the image intensity at the pixel location $(h, w)$ before and after brightness transformation, respectively; $\alpha$ denotes the rescaling coefficient controlling the light exposure and darkness; and the median operator ensures that the transformed pixel intensity stays within the range of 0-255.
Based on the dark channel prior theory [42], dark pixels have very low intensity in at least one color channel of the RGB for most local regions that do not cover the sky; therefore, the non-uniform fogging operation is expressed by
$$\hat{J}(h, w) = J(h, w)\, t(h, w) + A\big(1 - t(h, w)\big), \qquad t(h, w) = e^{-\beta d(h, w)} \tag{3}$$
where $J(h, w)$ and $\hat{J}(h, w)$ denote the image intensity before and after fogging transformation; $h$ and $w$ are the pixel indexes in the height and width directions, respectively; $A$ denotes the fog brightness parameter, and its value is in the range of 0-255, corresponding to the grayscale intensity of fog changing from black to white; $t(h, w)$ represents the light transmittance; $\beta$ denotes the fogging concentration factor; $\gamma$ denotes the constant influence factor, and in this article $\gamma = 0.04$; $d(h, w)$ denotes the scene depth; and $H$ and $W$ denote the height and width of the input image.
Considering that remote sensing images could be completely covered by a large area of clouds or fog, the uniform fogging operation in (4) is used to simulate possible scenarios and enhance the dataset. Equation (4) is a particular case of (3) with a constant light transmittance at all pixel locations, where $\hat{J}(h, w)$ denotes the image intensity after uniform fogging transformation, $J(h, w)$ denotes the original image, and $L(h, w)$ denotes a new image with the identical pixel value of 170 on three channels of RGB. Fig. 10 shows some representative postearthquake remote sensing images with dense buildings after brightness, uniform, and non-uniform fogging transformations with different configurations.
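The augmentations above can be sketched per pixel in plain Python. The brightness function follows the description directly (rescale by α, then clamp with a three-value median). The fogging functions assume the standard atmospheric scattering form associated with the dark channel prior, in which the fogged intensity blends the original intensity and the fog brightness A according to the transmittance t = exp(−βd). The depth function `radial_depth` is a purely hypothetical placeholder (depth decreasing with distance from the image center) and should not be read as the article's definition of d(h, w); all function names are illustrative.

```python
import math
from statistics import median


def adjust_brightness(intensity, alpha):
    """Rescale one pixel intensity and keep it within 0-255 via a median."""
    return median([0, alpha * intensity, 255])


def transmittance(depth, beta):
    """Light transmittance for a given scene depth and fog concentration."""
    return math.exp(-beta * depth)


def add_fog(intensity, fog_brightness, t):
    """Blend the pixel with the fog brightness according to transmittance t."""
    return intensity * t + fog_brightness * (1.0 - t)


def radial_depth(h, w, height, width, gamma=0.04):
    """Hypothetical depth: largest at the image center, floored at zero."""
    distance = math.hypot(h - height / 2, w - width / 2)
    return max(0.0, math.sqrt(max(height, width)) - gamma * distance)


# Overexposure saturates at 255; darkening halves the intensity.
print(adjust_brightness(200, 1.5), adjust_brightness(100, 0.5))

# Non-uniform fog: transmittance varies with the (assumed) depth map.
t_center = transmittance(radial_depth(256, 256, 512, 512), beta=0.05)
t_corner = transmittance(radial_depth(0, 0, 512, 512), beta=0.05)
print(add_fog(100, 255, t_center), add_fog(100, 255, t_corner))

# Uniform fog: a constant transmittance and a fog image of value 170.
print(add_fog(100, 170, 0.5))
```

With this placeholder depth, the fog is densest at the image center and thins toward the corners, whereas the uniform case applies the same blend everywhere.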
After data augmentation, the original images were cropped to 512 × 512 patches with an overlap ratio of 50%. Finally, 8262 patches were obtained, 80% of which were randomly assigned to training, and the remaining 20% were used for validation.
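The cropping arithmetic can be checked with a short sliding-window count in plain Python. The sketch assumes that a 50% overlap corresponds to a stride of 256 pixels and that no padding is applied at the image borders; the function names are hypothetical.

```python
def window_origins(length, patch, overlap):
    """Start offsets of sliding windows along one image side."""
    stride = int(patch * (1 - overlap))
    if stride <= 0 or length < patch:
        raise ValueError("invalid patch size or overlap")
    return list(range(0, length - patch + 1, stride))


def count_patches(width, height, patch=512, overlap=0.5):
    """Number of patch x patch crops obtained from one image."""
    columns = len(window_origins(width, patch, overlap))
    rows = len(window_origins(height, patch, overlap))
    return columns * rows


per_image = count_patches(4608, 2560)
print(per_image)          # 17 columns x 9 rows
print(8262 / per_image)   # full-size images implied by the patch total
```

Under these assumptions, each 4608 × 2560 image yields 17 × 9 = 153 patches, and 8262 patches would correspond to 54 full-size images after augmentation; the article does not state this intermediate count, so it is only an inferred consistency check.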

B. Implementation Settings
The proposed method was implemented in PyTorch 1.7.0 on a workstation equipped with an i9-10900K CPU and a GeForce RTX 3090 GPU. The AdamW optimization algorithm was employed to update the model parameters with a learning rate of 0.0001, a batch size of 8, and 50 training epochs. The mIoU between the predicted and ground-truth buildings was used as the evaluation metric of the proposed method, and the model was initialized with weights pretrained on the ADE20K [43] dataset.
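For reference, the mIoU metric can be sketched in plain Python from a confusion matrix, using the usual definition IoU = TP / (TP + FP + FN) per class followed by an unweighted mean. The three class indexes (0 for background, 1 for non-collapsed, 2 for collapsed) and the toy label sequences are illustrative assumptions, and the function names are hypothetical.

```python
def confusion_matrix(truth, prediction, num_classes):
    """matrix[t][p] counts pixels of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, prediction):
        matrix[t][p] += 1
    return matrix


def class_iou(matrix):
    """Per-class IoU = TP / (TP + FP + FN)."""
    ious = []
    for k in range(len(matrix)):
        true_positive = matrix[k][k]
        false_negative = sum(matrix[k]) - true_positive
        false_positive = sum(row[k] for row in matrix) - true_positive
        union = true_positive + false_positive + false_negative
        ious.append(true_positive / union if union else float("nan"))
    return ious


def mean_iou(ious):
    """Unweighted mean over classes that occur (NaN entries are skipped)."""
    valid = [v for v in ious if v == v]
    return sum(valid) / len(valid)


# Toy flattened label maps: 0 background, 1 non-collapsed, 2 collapsed.
truth = [0, 0, 1, 1, 2, 2]
prediction = [0, 1, 1, 1, 2, 0]
ious = class_iou(confusion_matrix(truth, prediction, num_classes=3))
print(ious, mean_iou(ious))
```

The same averaging can be applied to reported per-class values: for example, the UNet IoUs of 93.86%, 80.85%, and 79.05% quoted in the comparative study average to 84.59%, which matches the mIoU reported for that model.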

A. Test Results of Yushu City
Remote sensing seismic images of Yushu city, including various weather disturbances, were used to demonstrate the recognition accuracy of the proposed method for postearthquake dense buildings. The test results obtained by the original and improved Swin Transformers on 512 × 512 patches of Yushu city collected under different weather disturbances are presented in Fig. 11. The results show that the proposed improved Swin Transformer achieved higher accuracy and better robustness against light overexposure, darkness, and fog occlusions than the original Swin Transformer, with an average mIoU improvement of 0.83% for 512 × 512 patches. In Fig. 11, white circles in the sub-figures mark local details of predicted building corners and edges, indicating that the improved Swin Transformer could maintain better recognition ability under various weather disturbances than the original Swin Transformer. In addition, the improved Swin Transformer achieved better recognition on the fogging test images where the buildings were already difficult to distinguish, and the mIoU value improved by 1.18% compared to the original Swin Transformer. The test results on the large-scale image with a resolution of 4608 × 2560 are presented in Fig. 12, which shows that the improved Swin Transformer performed better than the original Swin Transformer.

B. Test Results of Beichuan City
Remote sensing seismic images of Beichuan city, which included various weather disturbances, were used to further demonstrate the recognition accuracy for postearthquake dense buildings. The test results obtained by the original Swin Transformer and improved Swin Transformer on the 512 × 512 patches of Beichuan city are presented in Fig. 13. The results in Fig. 13 show that the improved Swin Transformer achieved higher accuracy and better robustness against light overexposure, darkness, and fog occlusions than the original Swin Transformer, with an average mIoU improvement of 1.05% for 512 × 512 patches. In addition, the improved Swin Transformer still achieved better recognition on the fogging test images, and the mIoU value improved by 1.77% compared to the original Swin Transformer. Additional test results on the 512 × 512 patches are given in Fig. 20. The test results of the two transformers on the large-scale image with a resolution of 4608 × 2560 are presented in Fig. 14, which shows that the improved Swin Transformer performed better than the original Swin Transformer.

C. Discussion of Test Results
For all test images of Yushu city and Beichuan city, the original Swin Transformer had more local misrecognition and larger prediction errors for building edges than the improved Swin Transformer, which resulted in the distinct shape variance of dense building regions. The original Swin Transformer tended to ignore unconnected pixels inside the building region and classified them into the same class. Moreover, the improved Swin Transformer achieved higher recognition accuracy than the original Swin Transformer for collapsed buildings with more irregular geometrical shapes. The recognition results of negative objects for wild regions with trees, tents, and rivers are shown in Fig. 15. The results show that negative objects are successfully classified into the background, and misrecognition rarely occurs. Under weather disturbances of fogging and brightness transformation, the misrecognition of the background of collapsed buildings and incomplete recognition of non-collapsed buildings often occurred. A possible reason may be that the fogging and brightness transformation introduced severe occlusion in certain areas, thus increasing the difficulty of accurate segmentation.
The comparison results of category-wise intersection-over-union (IoU) of the two models for different weather disturbances are presented in Fig. 16. As shown in Fig. 16(a), the proposed Swin Transformer improved the average segmentation IoU for each category with a lower volatility than the original Swin Transformer, suggesting the robustness and stability of the proposed method. Fig. 16(b) shows that the model performance decreased for each category when weather disturbances existed. Among the considered types of weather disturbances, the non-uniform fogging affected the model performance of the improved Swin Transformer the most, and the proposed model was less sensitive to brightness transformation than fogging occlusion.
It should be noted that the remote sensing images of Yushu city and Beichuan city had unique characteristics. In Yushu city, buildings were more densely distributed; intensities in the color space were similar to the background and plenty of tents and vehicles existed in the images, which increased difficulty in recognition. Although these factors could cause a slight decrease in average IoU, the improved Swin Transformer still achieved good recognition accuracy for each category. The results also indicated that the proposed method efficiently addressed the deficiencies of the original Swin Transformer and enhanced the edge smoothness and completeness of the results of geometrical shapes for postearthquake dense buildings. Therefore, the improved Swin Transformer had stronger robustness and resistance to different types of severe interferences under real-world scenarios than the original Swin Transformer.

D. Ablation Experiments and Comparative Studies
Ablation experiments were performed to demonstrate the effectiveness and necessity of the feature fusion and CBAM modules in the improved Swin Transformer. Besides the proposed model (including both the feature fusion module and the CBAM module), three additional models, namely the original Swin Transformer, the original Swin Transformer + feature fusion module, and the original Swin Transformer + CBAM module, were trained using the same dataset, optimization algorithm, and training hyperparameters. Table I gives the comparison results of model performances in the ablation experiments. The results showed that both the feature fusion and the CBAM modules contributed to the model performance improvement, but the effect of the feature fusion module was more significant. The full model achieved the highest improvements in segmentation IoU of background, collapsed buildings, and non-collapsed buildings by 0.38%, 1.57%, and 1.95%, respectively. The overall mIoU improvement was 1.3%, demonstrating that the improved Swin Transformer successfully integrated the advantages of the feature fusion and CBAM modules. It further indicated the effectiveness of multilevel feature fusion in alleviating feature leakage and of the CBAM in focusing on small dense objects.
To verify the effectiveness of the improved Swin Transformer over conventional CNNs, several mature CNN-based semantic segmentation models, including the PSPNet [43], DeepLabV3+ [44], and UNet [45], were used for comparison. The dataset, optimization algorithm, and training hyperparameters were the same as those of the improved Swin Transformer. Table II gives a comparison of the performances of the improved Swin Transformer and several CNN-based models. The results showed that the UNet performed the best among the three CNN-based segmentation models but worse than the proposed Swin Transformer. Although the background IoU, noncollapsed IoU, collapsed IoU, and mIoU of the UNet reached 93.86%, 80.85%, 79.05%, and 84.59%, the improved Swin Transformer performed better in terms of all metrics by 2.49%, 5.22%, 5.11%, and 3.94%, respectively. This indicated that the proposed method integrating the Swin Transformer and CNN together enhanced the semantic segmentation accuracy of dense buildings in postearthquake remote sensing images compared to conventional CNN-based models.
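The UNet figures quoted above can be cross-checked with a short calculation. This sketch assumes that the reported gains are percentage points rather than relative changes, which is consistent with the mIoU of 88.53% stated in the abstract; the variable names are illustrative.

```python
# Per-class IoU of the UNet in percent, as quoted in the text.
unet_iou = {"background": 93.86, "noncollapsed": 80.85, "collapsed": 79.05}
unet_miou_reported = 84.59

# mIoU is the unweighted mean of the per-class IoU values.
unet_miou_from_classes = sum(unet_iou.values()) / len(unet_iou)

# Assumption: the 3.94% mIoU gain is expressed in percentage points.
miou_gain_points = 3.94
improved_miou = unet_miou_reported + miou_gain_points

print(round(unet_miou_from_classes, 2))
print(round(improved_miou, 2))
```

Under this reading, the mean of the three UNet class scores rounds to the reported UNet mIoU, and adding the mIoU gain recovers the overall mIoU reported for the improved Swin Transformer.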
The feature fusion module is designed to alleviate possible feature leakage and enhance multistage feature extraction. Even if some features at a particular stage are ignored, the feature fusion module can ensure that the information on missed features is retained and can be fed into the subsequent decoder. The authors admit that it is indeed challenging to determine which feature stage is essential and should be enhanced in the feature fusion module. Therefore, the feature fusion module is designed in a two-step manner: first, the adjacent stages are fused to alleviate the feature leakage at the previous stage; second, all the stages are fused at the final stage to take full advantage of the multistage features.
In addition, two comparative studies are performed to demonstrate the effectiveness of the proposed feature fusion module. First, the feature fusion module is adopted only at the final stage and omitted for the adjacent stages in the encoder, denoted as feature fusion-1 in Table III. Second, the feature fusion module is adopted in both the encoder and the decoder, denoted as feature fusion-2 in Table III. The encoder part is the same as in Fig. 4; for the decoder part, feature maps of the first and second stages are downsampled by 2 × 2 convolution and concatenated in the channel dimension with those of the next stage. Afterward, the number of channels is halved by 1 × 1 convolution, and the residuals are finally added together. Table III gives the comparison results of these three feature fusion modules, indicating that both insufficient (feature fusion-1) and excessive (feature fusion-2) feature fusion have negative impacts on recognition accuracy.
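The adjacent-stage fusion step described above (2 × 2 convolution for downsampling, channel concatenation, 1 × 1 convolution to halve the channels, residual addition) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the 2 × 2 convolution is assumed to use stride 2 and to output as many channels as the next stage, the residual is assumed to be added to the next-stage feature map, the weights are random, and the toy shapes only mimic the channel doubling and resolution halving between Swin Transformer stages.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv2x2_stride2(x, weight):
    """2x2 convolution with stride 2, no padding, no bias.

    x: (C_in, H, W) with even H and W; weight: (C_out, C_in, 2, 2).
    """
    c_in, h, w = x.shape
    # Split the map into non-overlapping 2x2 patches: (C_in, H/2, 2, W/2, 2).
    patches = x.reshape(c_in, h // 2, 2, w // 2, 2)
    # Contract over input channels and the two in-patch offsets.
    return np.einsum("ciajb,ocab->oij", patches, weight)


def conv1x1(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("chw,oc->ohw", x, weight)


def fuse_adjacent(prev_stage, next_stage, w_down, w_reduce):
    """Fuse one stage into the next (assumed form of the described step)."""
    down = conv2x2_stride2(prev_stage, w_down)            # match next-stage resolution
    stacked = np.concatenate([down, next_stage], axis=0)  # concatenate along channels
    reduced = conv1x1(stacked, w_reduce)                  # halve the channel count
    return reduced + next_stage                           # residual addition (assumption)


# Toy shapes: stage 1 has C channels, stage 2 has 2C channels at half resolution.
c, h, w = 4, 8, 8
stage1 = rng.standard_normal((c, h, w))
stage2 = rng.standard_normal((2 * c, h // 2, w // 2))
w_down = rng.standard_normal((2 * c, c, 2, 2)) * 0.1   # hypothetical random weights
w_reduce = rng.standard_normal((2 * c, 4 * c)) * 0.1   # hypothetical random weights

fused = fuse_adjacent(stage1, stage2, w_down, w_reduce)
print(fused.shape)  # (8, 4, 4)
```

The fused map keeps the shape of the next stage, so it can replace that stage's features without changing the downstream layers.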
To explore the applicable range of the controlling parameters under each weather condition, more experiments are performed, as shown in Figs. 17 and 18. Fig. 17 shows representative test results under various lightness conditions. It suggests that the controlling parameter α could be recommended in the range of 0.4-1.3, where the mIoU remains above 0.8. When α is set to 1.9, a significant drop of about 19.85% in the prediction accuracy occurs. Fig. 18 shows representative test results under various fogging conditions. It suggests that the controlling parameter β could be recommended in the range of 0-0.04, where the mIoU remains above 0.8. When β is set to 0.05, a significant drop of about 20% in the prediction accuracy occurs.
Fig. 19 compares two strategies for inserting the CBAM into the patch merging block: "CBAM before concat" denotes that all feature maps were first input into the CBAM module and then concatenated in the proposed patch merging block; "CBAM after concat" denotes that all the related feature maps were concatenated before being input into the CBAM module in the patch merging block. The results indicated that inserting the CBAM before concatenation achieved higher training accuracy and lower diversity than inserting the CBAM after concatenation.
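To make the roles of the controlling parameters α and β in the weather experiments concrete, the sketch below shows one common way such parameters can act on an image. The formulas are assumptions for illustration only and are not taken from this article: brightness is modeled as a linear intensity scaling by α, and uniform fog as a blend with a constant airlight using a transmission of exp(-β·depth), where `depth` and `airlight` are hypothetical settings. Only the recommended ranges are taken from the text.

```python
import numpy as np

# Recommended ranges quoted in the text (mIoU stays above 0.8 inside them).
ALPHA_RANGE = (0.4, 1.3)
BETA_RANGE = (0.0, 0.04)


def within(value, bounds):
    """Return True if value lies inside the closed interval bounds."""
    low, high = bounds
    return low <= value <= high


def adjust_brightness(image, alpha):
    """Assumed model: scale intensities in [0, 1] by alpha, then clip."""
    return np.clip(alpha * image, 0.0, 1.0)


def add_uniform_fog(image, beta, depth=10.0, airlight=1.0):
    """Assumed model: blend with a constant airlight.

    The transmission is t = exp(-beta * depth); depth and airlight are
    hypothetical values chosen only for this illustration.
    """
    t = np.exp(-beta * depth)
    return image * t + airlight * (1.0 - t)


# Toy 2x2 RGB image with intensities in [0, 1]; not data from this study.
image = np.linspace(0.0, 1.0, 12).reshape(2, 2, 3)

darker = adjust_brightness(image, 0.4)   # lower end of the recommended alpha range
foggy = add_uniform_fog(image, 0.04)     # upper end of the recommended beta range

print(within(1.9, ALPHA_RANGE), within(0.05, BETA_RANGE))  # False False
```

With α = 1 or β = 0 both functions return the input unchanged, which gives a convenient sanity check for any augmentation pipeline built along these lines.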

V. CONCLUSION
This article proposed an improved Swin Transformer for remote sensing segmentation of postearthquake dense buildings in urban areas. The main contributions of this article are summarized as follows.
1) An improved Swin Transformer following the encoder-decoder framework was proposed to achieve accurate semantic segmentation of postearthquake dense buildings from remote sensing images under complex backgrounds and strong weather interferences. The proposed structure performed multilevel feature fusion at each stage of the encoder, inserted the CBAM into the linear embedding and patch merging modules based on the original Swin Transformer backbone, and used the UPerNet as a decoder.
2) A total of 24 high-resolution remote sensing city-scale images were used to train and validate the proposed model. Different weather disturbances were considered by performing brightness transformation, uniform fogging, and nonuniform fogging to expand the dataset and simulate possible light overexposure, darkness, and fog occlusions under actual situations. The results showed that the improved Swin Transformer achieved higher recognition accuracy than the original Swin Transformer, especially for collapsed buildings with highly irregular geometrical shapes.
3) Ablation experiments were performed to demonstrate the effectiveness and necessity of the proposed modules in the improved Swin Transformer. The comparison results showed that the full model (i.e., the proposed model with feature fusion and CBAM) obtained the best segmentation IoU for background, collapsed buildings, and noncollapsed buildings among all models, which further indicated the advantages of the multilevel feature fusion in alleviating feature leakage and of the CBAM in focusing on small dense objects.
4) The comparison results showed that the improved Swin Transformer had distinct superiority over the original Swin Transformer and some mature CNN-based segmentation models, including the PSPNet, DeepLabV3+, and UNet.
This indicated that the proposed method could enhance the semantic segmentation accuracy of dense buildings in postearthquake remote sensing images owing to its comprehensive capability to extract local features and global correlations by organically integrating transformer and CNN structures. In future work, the multiscale recognition of seismic disasters will be investigated using multisource data based on vision transformers (ViTs).