Landslide Mapping Using Multilevel-Feature-Enhancement Change Detection Network

Landslide mapping (LM) from bitemporal remote sensing images is essential for disaster prevention and mitigation. Although bitemporal change detection technology has been applied for LM, there remains room for improvement in its accuracy and automation. In this article, a multilevel feature enhancement network (MFENet) is proposed for LM, built from convolutional neural network (CNN) modules such as CNN-based attention. MFENet mainly consists of three modules: the postevent feature enhancement module (PFEM), the bifeature difference enhancement module (BFDEM), and the flow direction calibration module (FDCM). Specifically, the main role of PFEM is to selectively fuse postevent multilayer features to provide discriminative postevent features. BFDEM fuses the multilayer differences of both pre-event and postevent features to generate high-quality change detection features, which are sufficiently powerful to distinguish foreground from background. FDCM uses a digital elevation model to calibrate the flow direction of each pixel of the landslide detection results to complete the LM task. Experiments were conducted to test the effectiveness of MFENet on two real-world regions, Lantau Island and Sharp Peak, Hong Kong, where landslides occurred after rainstorms. Compared with other state-of-the-art general change detection methods and landslide-specific change detection methods, the proposed method performs better on all metrics, with its intersection over union reaching 87.23%. The effectiveness of additional features and the generalization performance of MFENet are demonstrated experimentally. It is anticipated that the proposed network will further contribute to disaster prevention and mitigation.


I. INTRODUCTION
LANDSLIDES are one of the major natural disasters, causing huge economic losses and casualties worldwide [1], [2]. Against the background of frequent severe weather around the world caused by environmental problems, landslide disasters appear to be occurring more frequently [3], [4]. Therefore, quickly and accurately obtaining the location, boundary, and flow direction of landslides through landslide mapping (LM) is important for postevent rescue, disaster investigation, and disaster prevention and mitigation [5]. In recent years, the development of remote sensing has provided new opportunities for LM [6], [7]. Many LM methods based on remote sensing images have been developed, and such methods can be divided into two types: object detection based on postevent images and change detection based on pre-event and postevent bitemporal images [8]. Deep learning has proven highly effective in remote sensing image processing [9], [10], making it possible to achieve more accurate and automated LM [11].
LM using postevent images is usually based on textural, spectral, and morphological features. It can also be assisted by additional data such as a digital elevation model (DEM) [7], [11]. Yu et al. [12] proposed a method for LM from Landsat-8 images, which introduced saliency enhancement to recognize landslides and used additional DEM data to remove the ground objects of plain areas in which landslides are less likely to occur. Cheng et al. [13] proposed a classification method based on the k-nearest neighbor, combining the bag-of-visual-words and the probabilistic latent semantic analysis model, which automatically divides a given image into landslide areas and nonlandslide areas. Yu et al. [14] used a deep convolutional neural network (CNN) to initially determine the landslide and then used an improved region-growing algorithm to extract the region and boundary of the landslide. Qi et al. [15] combined the characteristics of U-Net and residual network (ResNet) to design a deep learning model to complete automatic LM. The specific difficulty of this type of method is "different objects with the same spectral features" [16]: exposed rocks, roads, and other ground objects that are spectrally similar to landslides must be removed, either manually or by using additional data. This confusion leads to poorer LM results, and the required postprocessing reduces the degree of automation.
The accumulation of historical data has created new possibilities for LM in the use of bitemporal remote sensing images [17]. Landslides mostly occur in vegetation-covered areas, in which the texture and spectral features of the ground before and after a landslide are generally different. However, interfering objects such as rocks still maintain their original features. Therefore, LM using bitemporal image change detection has become one of the mainstream approaches [6]. Such methods are generally divided into two steps: the generation of change detection images (CDIs) [18], [19], [20], and the segmentation of CDIs for LM. Mondini et al. [21] used the normalized differential vegetation index (NDVI), image spectral angle, principal component analysis (PCA), and independent component analysis (ICA) to generate CDIs and then employed multivariate classification techniques, such as logistic regression, linear discriminant analysis, and quadratic discriminant analysis, to detect landslides. Li et al. [22] applied change vector analysis to generate CDIs and employed a more robust threshold method to generate the landslide mask. Then, a change-detection-based Markov random field (CDMRF) method was proposed for LM using the spectral and spatial contextual features of landslides. Lu et al. [23] first generated CDIs through NDVI, PCA, and ICA and then combined the CDMRF method for LM. Lv et al. [8] proposed an approach based on adaptive region shape similarity, which used the neighborhood features of pixels to obtain richer context information to generate CDIs, and finally used the threshold segmentation method to segment the CDIs to complete LM. These methods rely on the quality of CDIs and also require empirical manual selection of optimal features and thresholds, resulting in a low degree of automation. In addition, bitemporal SAR images are also used for LM, and they perform better than optical remote sensing images when there is cloud cover [24], [25]. However, SAR-based LM relies on additional data such as DEM.
Deep learning methods can automatically extract more effective features through deep convolutional layers to complete end-to-end LM [11], [26]. This makes deep learning a promising approach to LM, with a high degree of automation and accuracy equal to or better than that of traditional methods. Zhang et al. [27] used a deep CNN to learn landslide features from historical images, then proposed a change detection CNN to detect landslides from bitemporal images, and postprocessed landslide objects to obtain attribute information such as their trail, flow direction, and source points. Similarly, fully convolutional network (FCN)-based methods were proposed to extract features from differential images or directly superimpose depth features from bitemporal images for LM [26], [27]. Fang et al. [30] combined a generative adversarial network and a Siamese neural network to generate bitemporal feature images and used pixelwise Euclidean distance for LM. Based on U-Net and ResNet, Su et al. [31] proposed the LanD-CNN model to detect landslides, which superimposed bitemporal images and DEM as the input of the model. Amankwah et al. [32] tested change detection methods originally proposed on nonlandslide datasets, such as the spatial-temporal attention neural network and the Siamese nested U-Net (SNUNet), on a landslide dataset and achieved good results. These methods can be summarized as using three feature extraction and fusion strategies: 1) using differential images as the model input; 2) using the direct superimposition of bitemporal images as model inputs to extract features; and 3) using the same encoder to extract features for bitemporal images and then superimposing these features into the decoding stage. In all three strategies, the pre-event and postevent features are equally weighted in the model.
When doing LM, more attention is paid to new landslides. New landslides exist in the postevent image and generally occur in vegetation-covered mountainous areas. Their spectral, textural, and morphological features are obvious in the postevent image and are well differentiated from the surrounding vegetation. Taking advantage of this property combined with change detection techniques, in this article, a network based on multilevel feature enhancement is proposed for LM. First, the postevent feature enhancement module (PFEM) is designed to enhance the postevent image features to make them more discriminative. Here, the additional DEM or partial landslide segmentation results of the postevent image can be input into the network through concatenation or side supervision to further enhance the postevent features. Second, the pre-event image features and the enhanced postevent image features are input to the bifeature difference enhancement module (BFDEM), which highlights the differential features of bitemporal images to obtain reliable landslide change detection results. Finally, the flow direction calibration module (FDCM) uses an eight-direction pour point model (D8) [33] to calibrate the landslide flow direction. Our main contributions are as follows.
1) We propose an end-to-end change detection network named multilevel feature enhancement network (MFENet) for LM. MFENet enhances features from two levels and can use side supervision and concatenation to input additional features.

II. METHODOLOGY
In this section, we introduce the overview of our proposed network and then introduce the details of the modules in the network.

A. Overview
MFENet is an end-to-end network, and its architecture is shown in Fig. 1. It consists of the following four parts: 1) a feature extractor based on a Siamese network; 2) a PFEM for enhancing postevent features; 3) a BFDEM for enhancing bitemporal difference features; and 4) an FDCM to calibrate the flow direction. Let $I_{pre}$ and $I_{post}$ represent the pre- and postevent images, respectively. The flow of MFENet is as follows.
Step 1: The Siamese network is a neural network framework with two branches, and its "Siamese" property is realized by sharing weights. $I_{pre}$ and $I_{post}$ are input into the two weight-sharing branches of the Siamese network to obtain the multilayer pre- and postevent features, represented as $F_{pre}$ and $F_{post}$, respectively.
Step 2: $F_{post}$ is input into PFEM, which enhances it in three stages.
1) The core block of PFEM, PFEB, selectively fuses the postevent multilayer features.
2) $F_{post}$ is corrected and refined using the feature repay (FR) mechanism before it is fed to Step 3.
3) As optional inputs, DEM and partial landslide segmentation results are used to further enhance $F_{post}$.
Step 3: $F_{pre}$ and $F_{post}$ are input into BFDEM to obtain the output change detection map. The core block of BFDEM, the bifeature difference enhancement block (BFDEB), fuses $F_{pre}$ and $F_{post}$ to obtain the change detection feature. Convolutions are employed to reduce the feature dimension of $f^{1}_{dout}$ to 1, that is, the output change detection map. A combined loss function that integrates binary cross entropy (BCE) [34], structure similarity (SSIM) [35], and intersection over union (IoU) [36] losses is used to calculate the distance between the change detection map and the ground truth to complete the training. It can be defined as follows:
$$L = L_{BCE} + L_{SSIM} + L_{IoU} \qquad (1)$$
The BCE loss function calculates the loss value for each pixel of the image as follows:
$$L_{BCE} = -\sum_{i=1}^{H}\sum_{j=1}^{W}\left[gt_{ij}\log p_{ij} + (1 - gt_{ij})\log(1 - p_{ij})\right] \qquad (2)$$
where $gt_{ij}$ and $p_{ij}$ represent the true value at the pixel $(i, j)$ and the predicted value, respectively. $H$ and $W$ represent the height and width of the image, respectively. The SSIM loss function is integrated to focus on the integrity of the local area. Its calculation formula is given as follows:
$$L_{SSIM} = 1 - \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \qquad (3)$$
where $x$ and $y$ represent the reference image and the predicted image, respectively. $\mu_x$, $\mu_y$ and $\sigma_x$, $\sigma_y$ represent the mean and standard deviation of $x$ and $y$, and $\sigma_{xy}$ represents their covariance. $C_1$ and $C_2$ are two constants that prevent the denominator from being 0; in this study, $C_1 = 0.01^2$ and $C_2 = 0.03^2$. The IoU loss function focuses on the global structural information. Its calculation formula is given as follows:
$$L_{IoU} = 1 - \frac{\sum_{i=1}^{H}\sum_{j=1}^{W} g_{ij}\,p_{ij}}{\sum_{i=1}^{H}\sum_{j=1}^{W}\left(g_{ij} + p_{ij} - g_{ij}\,p_{ij}\right)} \qquad (4)$$
where $g_{ij}$ and $p_{ij}$ represent the true value and the predicted value, respectively.
Step 4: The proposed FDCM uses DEM and D8 to calibrate the flow direction of landslides.
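The combined loss used in Step 3 can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the three terms are simply summed with equal weights, the BCE term is averaged over pixels, and the SSIM term is computed from global image statistics rather than from local sliding windows, all of which are assumptions made to keep the example short. The function names are hypothetical.

```python
import math

C1 = 0.01 ** 2  # SSIM stabilizing constants quoted in the text
C2 = 0.03 ** 2


def flatten(image):
    """Flatten an H x W list of lists into one list of pixel values."""
    return [value for row in image for value in row]


def bce_loss(gt, pred, eps=1e-7):
    """Binary cross entropy, averaged over pixels (normalization assumed)."""
    g_values, p_values = flatten(gt), flatten(pred)
    total = 0.0
    for g, p in zip(g_values, p_values):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    return total / len(g_values)


def ssim_loss(gt, pred):
    """1 - SSIM, using global statistics instead of local windows."""
    x, y = flatten(gt), flatten(pred)
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    denominator = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return 1.0 - numerator / denominator


def iou_loss(gt, pred, eps=1e-7):
    """1 - soft IoU between the ground truth and the predicted map."""
    x, y = flatten(gt), flatten(pred)
    intersection = sum(g * p for g, p in zip(x, y))
    union = sum(g + p - g * p for g, p in zip(x, y))
    return 1.0 - intersection / (union + eps)


def hybrid_loss(gt, pred):
    """Equal-weight sum of the three terms (weighting is an assumption)."""
    return bce_loss(gt, pred) + ssim_loss(gt, pred) + iou_loss(gt, pred)
```

A perfect prediction drives all three terms toward zero, whereas an inverted prediction is penalized by each of them, which is the behavior a training loop would rely on.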

B. Feature Extractor
The feature extractor is a Siamese network with two weight-sharing branches, as shown in Fig. 1(a). First, in each branch, convolutions with a stride of 1 and a kernel size of 3 (Conv3×3_str1) are used to increase the feature dimension to 64. Second, the first four layers of ResNet34 [37], i.e., layer1-layer4, each of which contains a varying number of ResNet BasicBlocks, are used as the backbone for feature extraction. The BasicBlocks of each layer use Conv3×3_str1 convolutions with as many filters as the layer has channels, together with a skip connection known as a "shortcut." Skip connections alleviate the degradation problem and the vanishing and exploding gradients that arise when training deep neural networks. Feature extraction by using skip connections enables the network to learn features with multiscale information and various receptive fields. A BasicBlock is used as layer5 to obtain deeper features for feature enhancement. Finally, the features of layer1-layer5 are obtained, with 64, 128, 256, 512, and 512 channels, respectively.

C. Postevent Feature Enhancement Module
The PFEM contains three stages of enhancement as shown in Fig. 1(b), i.e., PFEB, FR, and additional features.
Stage 1: The PFEB is the core block of the PFEM, which selectively fuses postevent multilayer features to provide discriminative postevent features. Due to the difference in the receptive field, low-layer features retain more complete details, such as localization information and clearer boundaries, but they suffer from background noise, whereas high-layer features have a clear background and semantic information. PFEB reduces feature variance by fusing high-level and low-level features that contain different information. As shown in Fig. 1, PFEB is a bi-input and bi-output structure. The postevent feature of the ith layer and the high-layer feature of the (i+1)th layer are used as input. The enhanced postevent feature and the high-layer feature of the ith layer are the output. An overview of PFEB can be expressed as follows:
$$(f^{i}_{post}, f^{i}_{high}) = \mathrm{PFEB}(F^{i}_{post}, f^{i+1}_{high}) \qquad (5)$$
where $i \in \{4, 3, 2, 1\}$ represents the layer of the network. The feature $f_{post}$ output by each layer of PFEB contains more comprehensive information, but it also retains the differences in the features of each layer. The specific process of PFEB is as follows.
Step 1: The consistent part of the multilayer features of the bi-input is obtained by elementwise multiplication and is denoted as $F_{con}$. This process can be expressed as follows:
$$F_{con} = F_{post} \otimes F_{high} \qquad (6)$$
where $\otimes$ represents elementwise multiplication.
Step 2: The Squeeze-and-Excitation network (SE-Net) [38] is used for the "feature recalibration" of $F_{con}$. As shown in Fig. 2, first, the Squeeze uses global average pooling (GAP) to compress the spatial dimension in each channel into a global feature constant. Second, the Excitation is used to capture the dependencies between feature channels to generate weights for each feature channel, which is realized through the layers FC + Relu + FC + Sigmoid applied to the squeezed $F_{con}$, that is
$$Weights = \sigma(FC_2(\mathrm{Relu}(FC_1(F_{con})))) \qquad (7)$$
where $\sigma(\cdot)$ represents a Sigmoid activation function. $FC_1$ and $FC_2$ represent fully connected layers with dimensions $c/r$ and $c$, respectively. $c$ represents the feature dimension of $F_{con}$, and $r$ is the dimension reduction coefficient, generally $r = 16$.
Finally, the Scale weights the features channel by channel by multiplication to complete the "feature recalibration" in the channel dimension; that is, each channel of $F_{con}$ is multiplied by its corresponding entry in $Weights$. Scale makes the model more discriminative to the characteristics of each channel, which is similar to the attention mechanism.
Step 3: The recalibrated consistent features are applied to enhance saliency cues in $F_{post}$ and $F_{high}$, thereby yielding the fused features. The fusion uses elementwise addition ($\oplus$) together with Conv, BN, and Relu, which represent Conv3×3_str1, batch normalization, and the rectified linear unit commonly used in deep learning networks, respectively. Compared with methods using direct addition or concatenation to fuse the features of different layers, the proposed PFEB can suppress the information generated during the fusion process that may muddy the original features. Through upward propagation, $F_{post}$ continuously learns useful information from $F_{high}$ so that the enhanced $F_{post}$ contains rich information, such as clear boundaries, accurate localization, and rich semantics.
Stage 2: The $f^{1}_{high}$ output by the layer1 PFEB is relatively complete. FR is a mechanism that downsamples $f^{1}_{high}$ to the same dimension as each layer in $F_{post}$ to further correct and refine $F_{post}$.
Stage 3: Additional features can be used to further enhance $F_{post}$. The first kind of additional information is DEM; because DEM features and optical image features are independent, concatenation, denoted as $C$, is directly used to combine the two feature matrices. The second kind is the labeling of some landslide segmentation results on the postevent image. The network can upsample the output $f^{1}_{high}$ to the original image size as a rough landslide segmentation result. Labeled landslides are used to side-supervise the corresponding image regions and update the model parameters through back-propagation.
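The consistent-part extraction and the squeeze-and-excitation recalibration described in Steps 1 and 2 of PFEB can be sketched in plain Python as follows. This is an illustrative sketch only: features are represented as nested lists (channels × height × width), the fully connected layers have no bias, and the weight matrices `fc1` and `fc2` are hypothetical stand-ins for parameters that would be learned during training.

```python
import math


def consistent_part(f_post, f_high):
    """Elementwise product of two C x H x W features (the consistent part)."""
    return [
        [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(f_post, f_high)
    ]


def global_average_pool(feature):
    """Squeeze: reduce each H x W channel to a single mean value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]


def fully_connected(weights, vector):
    """Bias-free dense layer: one output per row of the weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def se_recalibrate(feature, fc1, fc2):
    """Squeeze, excite (FC + Relu + FC + Sigmoid), then scale each channel.

    fc1 has shape (c / r) x c and fc2 has shape c x (c / r).
    Returns the recalibrated feature and the per-channel weights.
    """
    squeezed = global_average_pool(feature)
    hidden = [max(0.0, h) for h in fully_connected(fc1, squeezed)]
    weights = [1.0 / (1.0 + math.exp(-z)) for z in fully_connected(fc2, hidden)]
    scaled = [
        [[w * value for value in row] for row in channel]
        for w, channel in zip(weights, feature)
    ]
    return scaled, weights
```

With zero-valued `fc2` weights every channel receives a weight of 0.5, which makes the scaling step easy to check by hand before learned weights are substituted.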

D. Bifeature Difference Enhancement Module
The architecture of the BFDEM is shown in Fig. 1(c). The BFDEM uses its core block BFDEB as the backbone of the decoder to obtain change detection results by continuous upsampling and upward propagation. The structure of BFDEB is shown in Fig. 1; it generates a differentiated feature map by enhancing the difference between pre- and postevent features. The process of BFDEB is as follows.
First, the difference feature matrix is obtained by elementwise subtraction of the pre- and postevent features and is denoted as $F_{diff}$. Second, similar to the PFEB, elementwise multiplication is used to get the consistent part of $F_{diff}$ and $F_{high}$. Third, similar to (7) and (8), SE-Net is used to achieve "feature recalibration" to get $F_{diff-con}$. The calculation process will not be repeated.
Finally, $F_{diff-con}$ is combined with the difference feature $F_{diff}$ by elementwise addition to obtain the feature-enhanced change detection features. Compared with methods using direct subtraction or concatenation to fuse pre- and postevent features, the designed BFDEB can highlight spectral, textural, and morphological differences between features to obtain more accurate results.
After the BFDEB of layer1, Conv3×3_str1 is applied to $f^{1}_{dout}$ to reduce its dimension and obtain the final change detection result. In addition, a hybrid loss function is applied to the supervised training process, as suggested in [39].

E. Flow Direction Calibration Module
FDCM uses D8 to calibrate the landslide flow direction as shown in Fig. 1(d). D8 is a single-flow-direction algorithm using DEM for flow direction analysis. The idea of the single-flow-direction algorithm is that the central grid has only one outflow grid, and all the "water" in the central grid is transferred to the outflow grid after the flow direction is determined. D8 assumes that the "water flow" in a single grid can only flow into the eight adjacent grids and uses the steepest slope method to determine the direction of the flow.
Specifically, D8 calculates the distance-weighted elevation drop between the central grid and each adjacent grid on the 3 × 3 DEM grid and takes the grid with the largest weighted drop as the outflow grid of the central grid. On the final flow direction prediction map, we use eight different values to represent the different flow directions, as shown in Fig. 3. D8 is a simple and effective method for detecting the landslide flow direction.
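The D8 rule described above can be sketched in plain Python as follows. The sketch assumes a DEM stored as a list of rows with row 0 at the north edge, a square cell size, and the common power-of-two direction codes (1 = east, increasing clockwise); the eight values actually used in Fig. 3 may differ. The optional `mask` argument is a hypothetical way to restrict the computation to detected landslide pixels.

```python
import math

# (row offset, column offset, direction code). The power-of-two codes are an
# assumed convention: 1 = E, 2 = SE, 4 = S, 8 = SW, 16 = W, 32 = NW, 64 = N,
# 128 = NE.
NEIGHBORS = [
    (0, 1, 1), (1, 1, 2), (1, 0, 4), (1, -1, 8),
    (0, -1, 16), (-1, -1, 32), (-1, 0, 64), (-1, 1, 128),
]


def d8_flow_direction(dem, mask=None, cell_size=1.0):
    """Return a grid of D8 codes; 0 marks pits, flats, and masked-out cells."""
    rows, cols = len(dem), len(dem[0])
    directions = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask is not None and not mask[r][c]:
                continue
            best_drop, best_code = 0.0, 0
            for dr, dc, code in NEIGHBORS:
                nr, nc = r + dr, c + dc
                if not (0 <= nr < rows and 0 <= nc < cols):
                    continue
                # Diagonal neighbors are sqrt(2) cells away, so their
                # elevation drop is weighted down accordingly.
                distance = cell_size * math.hypot(dr, dc)
                drop = (dem[r][c] - dem[nr][nc]) / distance
                if drop > best_drop:
                    best_drop, best_code = drop, code
            directions[r][c] = best_code
    return directions
```

Passing the binary change detection map as `mask` yields flow directions only inside the mapped landslides, which mirrors how the module combines the detection result with the DEM.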

A. Study Area and Dataset
A torrential rainstorm on June 7, 2008, triggered thousands of landslides with different sizes, shapes, and spatial distributions in Hong Kong, causing huge losses to human life and property. The Hong Kong government attaches great importance to the prevention of and response to such disasters [40]. Historical landslide data are recorded through the Enhanced Natural Terrain Landslide Inventory, which provides a wealth of prior knowledge for landslide identification. Our study area consists of four areas A-D, as shown in Fig. 4. Areas A-C are located on Lantau Island, the largest island in Hong Kong, and Area D is located at Sharp Peak in Sai Kung East Country Park. The pre-event and postevent aerial photos of areas A-D were collected by an aerial survey system equipped with a Zeiss RMK TOP 15 aerial survey camera at a flight altitude of 2400 m. The spatial resolution of the photos is 5 m, and the size is 2698 × 2698 pixels. The postevent DEM data were also obtained from the relevant departments in Hong Kong. Study area A is illustrated in Fig. 5. A more detailed introduction to the study area can be found in [27].
Due to the limitations of the GPU and the need for continuous downsampling of the model, we cropped four large-format photos into several 256 × 256 pixel images with a 20% overlap rate. Flip, rotate, blur, and GridMask [41] are randomly adopted to enhance the data, and finally, 1296 sets of data were obtained. We use the following two data division methods. 1) Randomly divide the training set, validation set, and test set according to the ratio of 3:1:1. This division method is used for comparative experiments to verify the accuracy of the model. 2) Divide the data according to the A-C areas as the training set and the D area as the test set to verify the stability and generalization performance of the model.
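The cropping of a large photo into overlapping 256 × 256 tiles can be sketched as follows. The text does not state how the 20% overlap is rounded or how the image border is handled, so this sketch assumes a stride of round(256 × 0.8) = 205 pixels and shifts the last tile inward so that it ends exactly at the border; both choices, and the function names, are assumptions.

```python
def tile_origins(size, tile=256, overlap=0.2):
    """Start offsets of tiles along one axis of length `size`."""
    if size < tile:
        raise ValueError("image is smaller than the tile size")
    stride = max(1, round(tile * (1.0 - overlap)))
    origins = list(range(0, size - tile + 1, stride))
    # Add a final tile flush with the border if the regular grid stops short.
    if origins[-1] != size - tile:
        origins.append(size - tile)
    return origins


def crop_tiles(image, tile=256, overlap=0.2):
    """Crop a 2-D image (list of rows) into overlapping square tiles."""
    row_origins = tile_origins(len(image), tile, overlap)
    col_origins = tile_origins(len(image[0]), tile, overlap)
    return [
        [row[c:c + tile] for row in image[r:r + tile]]
        for r in row_origins
        for c in col_origins
    ]
```

The same origins would be applied to the pre-event photo, the postevent photo, the DEM, and the label map so that each tile set stays spatially aligned.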

B. Comparative Methods
To demonstrate the superiority of MFENet, six state-of-the-art (SOTA) deep-learning-based change detection methods were selected for comparison. Methods 1)-3) are general change detection methods proposed on nonlandslide datasets, and methods 4)-6) are landslide-specific methods proposed on landslide datasets. These six methods are briefly introduced as follows.
1) FC-Siam-diff [42]: This network is based on the U-Net architecture and uses an encoder composed of a Siamese network to extract bitemporal features in parallel as the input to the decoder.
2) DSIFN [43]: This network uses a pretrained VGG16 as the encoder, spatial and channel attention mechanisms in the feature fusion stage, and deep supervised learning to supervise feature fusion at different scales.
3) SNUNet [44]: This network is a densely connected Siamese network that maintains high-resolution and fine-grained features through dense skip connections. Meanwhile, an ensemble channel attention module is used for deep supervision to refine the most representative features of different semantic levels.
4) FCN-PP [28]: This network uses FCN as the encoder in order to construct a U-shaped network. The network is improved by adding a pyramid pooling (PP) module at the bottom of the U-shaped network. PP consists of three convolutions of different sizes and different strides to obtain features with different receptive fields.
5) DP-FCN [29]: This network uses Siamese FCNs to encode and decode pre- and postevent images, respectively, and fuses pre- and postevent features by concatenation in the intermediate stage of decoding.
6) LanDCNN [31]: This network also uses U-Net as the main structure, replacing the encoder with ResNet50. The DEM and the image are concatenated as the input of the network, which proves the effectiveness of the DEM as an additional feature.

C. Implementation Details and Evaluation Metrics
The proposed method is implemented in PyTorch using Python 3.8 and CUDA 10.2. The Adam optimizer is used with an initial learning rate of 0.0001, and the training batch size is set to 8. Among the comparison methods, only SNUNet reports its optimizer in its paper, which is also Adam; therefore, Adam is used uniformly for all methods during training. The number of epochs is set to 100 so that the results obtained by all networks converge to their optimal values. All of our experiments are conducted on an NVIDIA GeForce RTX 3090 with 24 GB of memory.
To evaluate the performance of the proposed method, we used four evaluation metrics: Precision (P), Recall (R), F1 score (F1), and IoU. P represents the proportion of correctly detected changed pixels among the model-predicted changed pixels. A higher P indicates that fewer false pixels are detected. R represents the proportion of correctly detected changed pixels among the true changed pixels. The higher the R, the fewer changed pixels are missed. The F1 score is the harmonic mean of the model's P and R. IoU represents the ratio of the intersection to the union of the predicted changed pixels and the real changed pixels. They are calculated as follows:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times P \times R}{P + R}, \quad IoU = \frac{TP}{TP + FP + FN}$$
where TP represents the number of pixels correctly detected as changed, FN and FP represent the numbers of missed and falsely detected changed pixels, and TN represents the number of pixels correctly detected as unchanged.
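The four metrics can be computed from a predicted binary change map and its ground truth as in the following sketch. The function name and the convention of returning 0 when a denominator is empty are assumptions made for illustration.

```python
def change_detection_metrics(pred, truth):
    """Compute P, R, F1, and IoU from two binary maps (lists of rows)."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # changed pixel correctly detected
            elif p and not t:
                fp += 1  # unchanged pixel falsely detected as changed
            elif t:
                fn += 1  # changed pixel that was missed
            else:
                tn += 1  # unchanged pixel correctly left unchanged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"P": precision, "R": recall, "F1": f1, "IoU": iou}
```

TN is counted but does not enter any of the four metrics, which is why they remain informative even when unchanged pixels dominate the image.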

D. Comparisons and Analysis
It is worth noting that MFENet does not use additional features in this section. According to the first data division method, the results of the comparative experiments are shown in Table I. The method proposed in this article achieves the best P, IoU, and F1, improving on the second-best method by 0.79%, 4.84%, and 2.85%, respectively. In addition, MFENet achieves the highest accuracy with a modest number of parameters (Params). This shows that MFENet strikes a good balance between computational complexity and accuracy. To evaluate the performance more intuitively, details of the experimental results are presented in Fig. 6. It can be found that the proposed method improves the results mainly in the following aspects.

E. Ablation Study
To evaluate the effectiveness of PFEM and BFDEM, four ablation experiments are set up to further examine the modules and network structures. The results are shown in Table II and Fig. 7. BFDEM and PFEM significantly improve the P and R of the model, respectively. Moreover, the overall performance of PFEM is higher than that of BFDEM, which shows that PFEM, specially designed for this type of postevent landslide change, is very effective. The MFENet model, which combines PFEM and BFDEM, achieves the best performance, indicating that the two modules are highly compatible. In summary, MFENet can be effectively applied to LM.

F. Flow Direction
First, the landslide boundary is calibrated on the postevent image using the landslide detection results, which provide information for calculating the perimeter and area of the landslide. Similarly, the landslide boundaries are demarcated on the DEM. Then, the flow direction of each pixel of the landslide is calibrated by using the D8 method combined with the DEM and the landslide boundary, and the flow direction map of the landslide is obtained. Finally, the grayscale flow direction image is converted to an RGB image for easy visualization.
The results obtained with additional features are shown in Table III. Inputting additional features can significantly improve the accuracy of the model. When two additional features are input simultaneously, R, IoU, and F1 are all optimal, but the P value is slightly lower than when either feature is input alone, which may be caused by a mutual disturbance between features.
The visualization of the results is shown in Fig. 9. As a commonly used data source for landslide detection or postprocessing, the effect of adding DEM to the network is significant. It can be seen from Fig. 9 that the DEM effectively eliminates the internal cavity of the landslide caused by disturbance factors and the false changes that may be identified as landslides on the spectrum. Segmentation results increase the supervision information of the network, which is equivalent to expanding the training data so that the network can learn more effective landslide spectral features to improve detection accuracy.

B. Generalization Performance
According to the second data division method, the generalization performance of the model is verified by comparative experiments. The experimental metric results are shown in Table IV. MFENet still outperforms other networks on R, IoU, and F1. The visualization of the results is shown in Fig. 10, in which the detected changes are white pixels and the unchanged parts are black pixels. Combining the metrics analysis and the visualization results, it is found that poor generalization performance is mainly reflected in the following two aspects: conservative predictions lead to a large number of missed pixels, and aggressive predictions lead to a large number of false pixels. MFENet is neither so conservative that landslides are incompletely detected nor so aggressive that many false landslides are detected. MFENet maintains high precision while achieving high recall. Overall, it can be seen that the generalization performance of MFENet is better than that of other models. It is also worth mentioning that DSIFN obtained the largest P under one data division method and the largest R under the other, showing that being conservative or aggressive may not be a fixed characteristic of a model and may be related to the dataset.
However, when compared with the first data partitioning method, all models show a significant decrease in accuracy. This appears to be due to the similarity of the geographical environments of the A-C areas, which leads to similar types of landslides afterward. The models only learn the features of landslides in this specific geographical environment, and the decrease in model accuracy is predictable. If a fully supervised model is to have better generalization performance, model training in different scenarios is essential.

V. CONCLUSION
In this article, an end-to-end change detection network based on multilevel feature enhancement is proposed for landslide detection. The postevent feature recalibration and bitemporal difference feature recalibration are completed by means of two feature-enhancement modules, PFEM and BFDEM. Experiments show that MFENet outperforms both the SOTA general change detection methods and landslide-specific change detection methods for LM. The landslides detected by MFENet maintain clear boundaries and internal logical consistency and also reduce false detections and missed detections caused by disturbance factors. Finally, the landslide flow direction is calibrated by using the D8 method combined with the landslide detection results and DEM to complete the LM. On the basis of MFENet, additional features are input through side supervision and concatenation, which further improves the accuracy of the network in detecting landslides. In addition, it is experimentally demonstrated that the generalization performance of MFENet exceeds that of other methods. The LM results of this study have high accuracy, require less manual intervention, and provide richer landslide information; they can be used for landslide sensitivity analysis and are of great significance for postdisaster rescue and for disaster prevention and mitigation. In the future, it will be worthwhile to focus on developing LM networks with stronger generalization performance by combining weakly supervised learning or transfer learning with the limited landslide remote sensing data currently available worldwide.