AMIO-Net: An Attention-Based Multiscale Input–Output Network for Building Change Detection in High-Resolution Remote Sensing Images

Building change detection (CD) from remote sensing images (RSI) has great significance in exploring the utilization of land resources and determining building damage after a disaster. This article proposes an attention-based multiscale input–output network, named AMIO-Net, for building CD in high-resolution RSI. It overcomes several drawbacks of existing CD methods, such as insufficient utilization of the information (e.g., details of building edges) in the original images and poor detection of small targets (small-scale buildings or small-area changed buildings that are disturbed by other buildings). In AMIO-Net, the input image is scaled down to different sizes, and convolution is performed to extract features. Then, the feature maps are fed into the encoding stage so that the network can fully utilize the feature information (FI) of the original image. More importantly, we design two attention mechanism modules: the pyramid pooling attention module (PPAM) and the Siamese attention mechanism module (SAMM). PPAM combines a pyramid pooling module and an attention mechanism to fully consider the global information and focus on the FI of changed pixels in the image. The inputs of SAMM are the parallel multiscale output maps of the decoding portion and the deep feature maps of the network, so that AMIO-Net can utilize the global contextual semantic FI and strengthen its detection ability for small targets. Experiments on three datasets show that the proposed method achieves higher detection accuracy and F1 score compared with state-of-the-art methods.


I. INTRODUCTION
CHANGE detection (CD) based on remote sensing images (RSI) is a fundamental research issue, as it can facilitate the utilization of land resources, disaster damage estimation, and urban space planning [1]. RSI CD methods can be broadly divided into traditional methods and deep learning-based methods. The accuracy of traditional methods mainly depends on the difference maps: the lower the information loss of the difference maps, the higher the detection accuracy. Liu et al. [2] first used log-ratio and mean-ratio methods to obtain the difference map, then extracted feature vectors by principal component analysis, and finally used fuzzy C-means to classify pixels, which improved the accuracy of CD. However, the method needs to adjust many parameters, which is time consuming. Xin et al. [3] combined the double Gaussian mixture model with the wavelet transform and employed a hidden Markov chain model to obtain the CD graph, which solved the problem of the low matching rate of the single-function model. Most traditional methods rely on hand-crafted features, which fail to effectively model complex changed information, resulting in a poor classification effect.
Over the years, with the development of remote sensing technology, a large number of remote sensing images have become available. These images can be used for crop production forecasting [4], mapping human activity [5], and surface observation [6]. Deep learning methods have been proved superior to traditional methods in feature extraction [7], [8] and are widely applied in many fields, such as scene classification [9], object detection [10], and change detection [11]. Feature extraction and CD can be performed by building a series of models such as ResNet [12], UNet [13], and FPN [14]. However, most existing models suffer from insufficient utilization of the information (such as the edge details of buildings) in the original image. In most models, the original image is used only once, when the image is input, so the edge information of buildings is not fully utilized. Moreover, whether a single-stream or double-stream network framework is used, features are basically extracted by multilayer convolution and pooling operations. Daudt et al. [15] implemented a Siamese fully convolutional network, which had better CD performance than previously proposed methods. Ding et al. [16] designed a cross-layer addition and skip-connection module to aggregate multilevel information. In this way, the information of the feature maps generated by the encoding stage can be exploited. However, because of the layer-by-layer pooling operations, some of this information is lost, and the more comprehensive edge information of buildings in the original images cannot be used. Lei et al. [17] extracted features of different scales through a multiscale convolution model and improved its learning ability. Yang et al. [18] proposed an improved UNet Siamese network to extract the difference information from a processed image and performed feature fusion to obtain a changed binary image. Song et al. [19] constructed a Siamese network with a U-shaped structure to realize feature extraction with multiscale change. Although the abovementioned methods can aggregate global context semantic features, they still cannot make full use of the edge details of the original images.
As a result, the obtained CD map is relatively rough and cannot completely restore the changed area, particularly at building edge locations. Therefore, this article designs a multiscale input module, which downsamples the original images to different scales and connects them with the feature maps of each encoding block. In this way, each encoding block has richer semantic information, and the feature extraction ability of the network is also enhanced; the obtained CD maps contain relatively complete buildings. The introduction of attention in the multiscale input–output (MIO) structure can also filter out some interference information and reduce information redundancy.
In addition, traditional convolutional neural networks often ignore the features of small targets when extracting features through multilayer convolution operations, such as small-scale buildings and small-area changed buildings that are disturbed by other buildings (especially when small-scale buildings are next to large-area buildings, the small buildings can be mistaken for part of the large ones), resulting in false and missed detections. Models without attention mechanisms tend to focus on learning large-scale buildings with significant changes while ignoring buildings with minor changes. To solve this problem, researchers have adopted attention mechanisms to increase the network's ability to detect small objects. Chen and Shi [20] introduced an integrated attention module into a convolutional neural network (CNN) with a Siamese structure, which could calculate the attention weight between any two pixels and capture the spatial-temporal dependence. Zhang et al. [21] used the channel attention mechanism in the skip-connection part of the UNet model to enhance the detection ability for small targets, and their experiments indicate that the detection effect is improved. Zhu et al. [22] adopted an attention mechanism with a parallel branch structure and linked it to the skip-connections of the network. However, these attention mechanisms cannot fully utilize the global information of the feature maps. They cannot distinguish buildings of different scales, and the changed pixels of small objects in complex backgrounds are lost. As a result, even models that introduce attention may still miss small building objects. To address this drawback, we propose an attention mechanism that first refines the input feature map into different scales, then calculates the attention weights separately, and finally aggregates them. While the network focuses on the changed areas of small targets, the global information of the feature maps is preserved to the greatest extent.
For the abovementioned issues, our main contributions are as follows.
1) AMIO-Net adopts a multiscale input structure. The input image is reduced to different scales and input to the encoder, which enables each encoding convolutional layer to make full use of the deep semantic features of the original image.
2) To better utilize the decoding feature maps, we add a multiscale output structure to form a parallel branch structure, which makes full use of the FI of the decoding stage and strengthens the network's ability to capture global FI.
3) Two attention modules are proposed, embedded at the outputs of the encoding and decoding stages, respectively. By using the feature parameters of the feature matrix to update and redistribute the matrix, the attention on the changed areas of small targets is strengthened.
The rest of this article is organized as follows. Section II is devoted to the details and structure of the proposed model. Section III introduces the related datasets, some comparison algorithms, and the evaluating metrics. Section IV compares and analyzes the experimental results. Finally, Section V concludes this article.

II. PROPOSED METHOD
Fig. 1 shows the AMIO-Net structure, which consists of four parts: the multiscale input, the Siamese encoding structure, the multiscale output, and the decoding output. In the multiscale input part, the T1 and T2 images are reduced to 1/2, 1/4, 1/8, and 1/16 of the original scale by downsampling, and the results are fed into the network to extract features through convolution and a lightweight attention operation. The Siamese encoding part adopts a Siamese structure consisting of five encoding blocks, each of which contains convolution, max-pooling, and batch normalization (BN) layers, as shown in Fig. 2(a). In the multiscale output part, to combine shallow and deep FI, decoding feature maps of different sizes are expanded to the same scale as the T1 image. Each decoding block of the decoding output part includes upsampling (implemented by transposed convolution), channel stacking, convolution, and so on, as shown in Fig. 2(b).

A. Model Structure
When the network model works, the sizes of the T1 and T2 images are decreased to 128 × 128, 64 × 64, 32 × 32, and 16 × 16 by downsampling with bilinear interpolation. Since each encoding block uses the original image, the network can make better use of the FI than network structures that use the original image only once. The first encoding block outputs a feature map by performing two convolutional layers, two ReLu layers, and two BN layers. Then, the scale of the feature map is reduced to half of the original by max-pooling. After that, this feature map and the feature map of the same scale generated by the multiscale input part are stacked and input into the second encoding block. Thus, as shown in Fig. 2(c), each encoding block outputs one feature map. In the Siamese encoding part, the network outputs 10 feature maps in pairs. We concatenate the corresponding feature maps in pairs.
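For reference, a minimal PyTorch sketch of this multiscale input path is given below; the module name MultiScaleInput, the channel widths, and the omission of the lightweight attention step are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleInput(nn.Module):
    """Sketch of the multiscale input path: the 256 x 256 input image is
    bilinearly downsampled to 1/2, 1/4, 1/8, and 1/16 of its size, convolved,
    and later concatenated with the encoder feature map of the matching scale.
    Channel widths are illustrative; the lightweight attention step described
    in the text is omitted here for brevity."""

    def __init__(self, in_ch=3, out_chs=(64, 128, 256, 512)):
        super().__init__()
        self.scales = (0.5, 0.25, 0.125, 0.0625)
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, c, kernel_size=3, padding=1),
                nn.BatchNorm2d(c),
                nn.ReLU(inplace=True),
            )
            for c in out_chs
        )

    def forward(self, img):
        feats = []
        for s, conv in zip(self.scales, self.convs):
            small = F.interpolate(img, scale_factor=s, mode="bilinear",
                                  align_corners=False)
            feats.append(conv(small))
        return feats  # one feature map per scale: 128, 64, 32, 16

# Usage: stack with the pooled output of the previous encoding block, e.g.,
# x = torch.cat([pooled_enc1, ms_feats[0]], dim=1)
```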
Since the last encoding block has the deepest FI, which is difficult to fully exploit, we design a pyramid pooling attention module (PPAM). This module reduces the fifth encoding feature map to different scales by adaptive average pooling to preserve global information. And, it can filter background information through an attention mechanism, reducing the interference of irrelevant information on CD [23], [24]. Thus, we obtain the fifth output feature graph.
Through Siamese encoding and the multiscale input, we finally obtain five feature maps. We double the scale of the fifth feature map and concatenate it with the fourth output feature map along the channel dimension. Then we feed the result into convolution layers to extract features. After that, BN layers and ReLu layers are used to accelerate network training. The dropout operation (with a dropout rate of 0.3) is also adopted to enhance the generalization of the network. After four more such operations, the feature map C1 is obtained. To better use the decoding feature maps at different levels, a multiscale output structure is designed. It expands the scale of each feature map to 256 × 256. Thus, as shown in Fig. 1, four feature maps of the same scale and different channel dimensions can be obtained. We merge them, feed them into a convolutional layer with a kernel size of 1 × 1 to adjust their channels, and obtain the feature map C2. To make the model focus on the changed information of small targets in C1 and C2, C1 and C2 are input into the Siamese attention mechanism module (SAMM). Finally, a convolutional layer with a kernel size of 1 × 1 is used to obtain the final CD map Cv(S(C1, C2)), where Cv represents the convolution operation and S represents SAMM. Two attention modules play a key role in AMIO-Net, and they will be introduced in detail next.
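A corresponding sketch of the multiscale output branch and of how the final CD map is assembled is shown below; multiscale_output, dec_feats, fuse_conv, and final_conv are hypothetical names, and SAMM refers to the module described later in this section.

```python
import torch
import torch.nn.functional as F

def multiscale_output(dec_feats, fuse_conv):
    """Sketch of the multiscale output branch: each decoding feature map in
    dec_feats is upsampled to 256 x 256, the maps are stacked on the channel
    axis, and a 1 x 1 convolution (fuse_conv) produces C2."""
    ups = [F.interpolate(f, size=(256, 256), mode="bilinear", align_corners=False)
           for f in dec_feats]
    return fuse_conv(torch.cat(ups, dim=1))  # C2

# Final change map (sketch): C1 comes from the main decoder, C2 from this branch;
# y = final_conv(samm(c1, c2))  # SAMM followed by a 1 x 1 convolution
```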

B. Pyramid Pooling Attention Module
The PPAM structure is shown in Fig. 3(a). We refer to the pyramid pooling module (PPM) [25] to design PPAM in the AMIO-Net. This module can combine and utilize the contextual information from different regions. Compared to a single pooling operation, PPAM can enhance the network's ability to use global information. To make the model focus on the changes of small objects, we add a lightweight attention mechanism (Lam), which can enhance the model's learning ability of changed information of small targets.
The Lam structure is shown in Fig. 4, and the output characteristic matrix is calculated as

f_out(i, j) = f_in(i, j) ⊗ Sig(ReLu(L(AvgPool(f_in(i, j)))))

where f_in(i, j) represents the input feature map, (i, j) denotes the pixel at row i and column j, Sig represents the sigmoid function, L represents the full connection operation, ReLu is the activation function, and AvgPool represents adaptive average pooling.
In Lam, the input feature matrix f_in(i, j) first undergoes an adaptive average pooling operation; the feature weights are then updated by a full connection operation and activated by ReLu. After that, the weights are passed through the sigmoid function and multiplied by the input feature map f_in(i, j) to obtain the characteristic matrix f_out(i, j) with attention.
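A minimal sketch of Lam consistent with this description follows; the text does not specify the number of full connection layers, so the squeeze-and-excitation-style two-layer bottleneck and the reduction ratio used here are assumptions.

```python
import torch
import torch.nn as nn

class Lam(nn.Module):
    """Sketch of the lightweight attention mechanism (Lam): adaptive average
    pooling, full connection, ReLu, and sigmoid produce channel weights that
    rescale the input feature map. The two-layer bottleneck and the reduction
    ratio are assumptions."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, f_in):
        b, c, _, _ = f_in.shape
        w = self.pool(f_in).view(b, c)       # global context per channel
        w = self.fc(w).view(b, c, 1, 1)      # attention weights in (0, 1)
        return f_in * w                      # f_out = f_in ⊗ weights
```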
In PPAM, feature matrices A1–A4 of different scales (1 × 1, 2 × 2, 4 × 4, and 8 × 8) are obtained after adaptive average pooling. We perform a convolution operation on them to adjust the number of channels and apply Lam to obtain the feature matrices S1–S5. These are calculated as

A_i = AvgPool_i(X),    S_i = Lam(Cv(A_i)), i = 1, 2, 3, 4,    S_5 = Lam(Cv(X))

where X is the H × W input feature matrix (H and W are the length and width of the eigenmatrix), AvgPool_i denotes adaptive average pooling to the ith scale, Cv denotes the channel-adjusting convolution, and Lam represents the lightweight attention mechanism. The feature matrices S1–S5 are expanded to the same scale as the input feature map X by upsampling. They are stacked along the channel dimension to aggregate the different changed FI and obtain the feature map U. After adjusting the number of channels through convolution, the final feature matrix Y is obtained. The spatial scale and channel dimension of Y are the same as those of X.
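Under these assumptions, PPAM could be implemented as sketched below (reusing the Lam sketch above); the branch channel width and the handling of the full-resolution branch S5 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPAM(nn.Module):
    """Sketch of the pyramid pooling attention module, reusing the Lam sketch
    above. X is adaptively average-pooled to 1x1, 2x2, 4x4, and 8x8 (A1-A4);
    each pooled map, and X itself, is convolved and passed through Lam (S1-S5),
    upsampled to X's size, stacked, and fused by a 1x1 convolution.
    The branch channel width is an assumption."""

    def __init__(self, in_ch, branch_ch=None):
        super().__init__()
        branch_ch = branch_ch or in_ch // 4
        self.sizes = (1, 2, 4, 8)
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1), Lam(branch_ch))
            for _ in range(len(self.sizes) + 1)  # +1 for the full-resolution branch
        )
        self.fuse = nn.Conv2d((len(self.sizes) + 1) * branch_ch, in_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        outs = []
        for size, branch in zip(self.sizes, self.branches):
            a = F.adaptive_avg_pool2d(x, size)          # A_i
            s = branch(a)                               # S_i, i = 1..4
            outs.append(F.interpolate(s, size=(h, w), mode="bilinear",
                                      align_corners=False))
        outs.append(self.branches[-1](x))               # S_5 from X itself
        u = torch.cat(outs, dim=1)                      # U
        return self.fuse(u)                             # Y, same size/channels as X
```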

C. Siamese Attention Mechanism Module
The structure of SAMM is shown in Fig. 3(b). The feature maps of the different decoding layers all contain rich FI, which cannot be fully utilized by a single-stream decoding structure alone. Therefore, as shown in Fig. 1, we design a parallel multiscale output branch structure. Thus, the decoding part is divided into two branches, each of which finally outputs a feature map with the same scale as the T1 image. To better use the output feature maps of the two branches, we design the SAMM. It has two branch structures with the same operations. The input feature maps X1 and X2 undergo a convolution operation with a kernel size of 1 × 1 and an adaptive average pooling operation, which adjust their numbers of channels and sizes. Then, the matrices are updated through a full connection operation, and the results are added. After the results are passed through the ReLu and softmax activation functions, the weights of the matrix are adjusted.
We multiply them with the input feature maps X1 and X2 to obtain the characteristic matrices Y1 and Y2. After adjusting the number of channels of the feature matrix by convolution, the final feature matrix Y is obtained. The matrix Y fully integrates the FI of the changed class from the two input feature maps. This improves the network's attention to the changed information of small targets and the CD accuracy. The calculation of the feature maps is discussed below.
1) In the model structure of Fig. 3(b), H and W represent the length and width of the input feature maps X1 and X2, and C is the number of their channels. Each feature matrix undergoes the convolution and adaptive average pooling operations, and the scales of the feature maps X1 and X2 are reduced to 1 × 1 and 2 × 2; for the 1 × 1 case, the pooled value is

P = (1 / (H × W)) Σ_{n=1}^{H} Σ_{m=1}^{W} X(m, n)

where X(m, n) is the pixel value at the mth column and nth row of the eigenmatrix. In this way, the feature weights P1–P4 are obtained.
2) After obtaining the feature weights P1–P4, the full connection operation is used to compress them to reduce parameter calculations. Thus, we obtain the eigenmatrices M1–M4, and then add M1 and M2, and M3 and M4, respectively.
The results of the addition are fed into the ReLu activation function, which simplifies the calculation, accelerates network training, and reduces gradient disappearance. In this way, we acquire the feature matrices R1 and R2. They are processed by the softmax function to redistribute the pixel weights and obtain the matrices S1 and S2 with updated weights

S1 = ϕ(R1) = ϕ(ReLu(L(P1) + L(P2))),    S2 = ϕ(R2) = ϕ(ReLu(L(P3) + L(P4)))

where ϕ is the softmax activation function and L represents the full connection operation.
3) We multiply S1 and S2 by the input feature matrices X1 and X2, respectively, to get the eigenmatrices Y1 and Y2, and concatenate them along the channel dimension to obtain a matrix Y. After adjusting the number of channels by a convolution layer with a 1 × 1 kernel, the final characteristic matrix Y with the same spatial scale as X1 and X2 is obtained as

Y = Cv(Conc(Y1, Y2)) = Cv(Conc(S1 ⊗ X1, S2 ⊗ X2))

where Conc represents the stack operation and ⊗ denotes element-wise multiplication.
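A PyTorch sketch consistent with these three steps is given below; the hidden width of the full connection layers and the flattening of the 2 × 2 pooled vectors are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAMM(nn.Module):
    """Sketch of the Siamese attention mechanism module. Each input (X1 from
    the main decoder, X2 from the multiscale output branch) is reduced by a
    1x1 convolution, pooled to 1x1 and 2x2 (P1-P4), compressed by full
    connection layers (M1-M4); the two pooled scales of each branch are added,
    passed through ReLu and softmax (S1, S2), and used to rescale X1 and X2.
    The rescaled maps are concatenated and fused by a 1x1 convolution."""

    def __init__(self, channels, hidden=None):
        super().__init__()
        hidden = hidden or channels // 2
        self.conv1 = nn.Conv2d(channels, channels, 1)
        self.conv2 = nn.Conv2d(channels, channels, 1)
        self.fc_p1 = nn.Linear(channels, hidden)        # M1 from the 1x1-pooled vector
        self.fc_p2 = nn.Linear(channels * 4, hidden)    # M2 from the 2x2-pooled vector
        self.fc_p3 = nn.Linear(channels, hidden)        # M3
        self.fc_p4 = nn.Linear(channels * 4, hidden)    # M4
        self.expand1 = nn.Linear(hidden, channels)
        self.expand2 = nn.Linear(hidden, channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    @staticmethod
    def _pool(x):
        b, c = x.shape[:2]
        return (F.adaptive_avg_pool2d(x, 1).view(b, c),          # 1x1 pooling
                F.adaptive_avg_pool2d(x, 2).reshape(b, c * 4))   # 2x2 pooling, flattened

    def forward(self, x1, x2):
        b, c = x1.shape[:2]
        p1, p2 = self._pool(self.conv1(x1))
        p3, p4 = self._pool(self.conv2(x2))
        r1 = F.relu(self.fc_p1(p1) + self.fc_p2(p2))    # R1 = ReLu(M1 + M2)
        r2 = F.relu(self.fc_p3(p3) + self.fc_p4(p4))    # R2 = ReLu(M3 + M4)
        s1 = torch.softmax(self.expand1(r1), dim=1).view(b, c, 1, 1)  # S1
        s2 = torch.softmax(self.expand2(r2), dim=1).view(b, c, 1, 1)  # S2
        y = torch.cat([x1 * s1, x2 * s2], dim=1)        # Conc(Y1, Y2)
        return self.fuse(y)                             # Y
```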

D. Loss Function
In CD, the cross-entropy loss is often used during the network model training process. It is defined as

loss_ce = −(1/N) Σ_{i=1}^{N} [label_i × log(x_i) + (1 − label_i) × log(1 − x_i)]    (7)

where x is the network output, label is the label image, and N is the total number of pixels in the image. Building CD in RSI is a binary classification problem, and the pixel values of the label image are only 0 and 255. When constructing the dataset, we carried out a normalization operation on the label image, so that its pixel values become 0 and 1, where 1 represents changed pixels and 0 indicates unchanged ones. According to formula (7), the loss is very large when the label is 0, and very small when the label is 1. It is expected that the loss is as small as possible. However, in some datasets, changed pixels only occupy a small part of the entire image, i.e., the pixels with label 0 account for a large proportion. If the cross-entropy loss is directly used to train the model, the loss will be very large and the training effect will be very poor. Therefore, to reduce the influence of the imbalance of pixel labels on the CD accuracy, the Dice loss is introduced:

loss_dice = 1 − (2 Σ_j p_j t_j) / (Σ_j p_j + Σ_j t_j)    (8)

where p_j and t_j, respectively, represent the predicted value and true value of the changed pixel j. For binary classification problems, the Dice loss can effectively clear all pixels in the prediction map that are not activated in the label map. For activated pixels, it mainly penalizes low-confidence predictions, and higher predicted values yield a better Dice coefficient. That is, the function makes the loss become smaller during training, and the network converges faster. Therefore, the loss function in this article is

loss = loss_ce + β loss_dice    (9)

where β is used to balance loss_ce and loss_dice [26].
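A short sketch of this combined loss, assuming pred holds change probabilities in (0, 1) and with a small eps constant added for numerical stability in this example:

```python
import torch.nn.functional as F

def cd_loss(pred, label, beta=1.0, eps=1e-7):
    """Combined loss (9): cross-entropy (7) plus beta * Dice loss (8).
    pred holds change probabilities in (0, 1); label holds 0/1 targets.
    eps is a stabilizing constant added for this sketch."""
    label = label.float()
    loss_ce = F.binary_cross_entropy(pred, label)                             # (7)
    inter = (pred * label).sum()
    loss_dice = 1.0 - (2.0 * inter + eps) / (pred.sum() + label.sum() + eps)  # (8)
    return loss_ce + beta * loss_dice                                         # (9)
```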

III. EXPERIMENTAL SETTINGS

A. Datasets
In the experiments, three datasets were used to train and test AMIO-Net. LEVIR-CD [20] is a large dataset for binary CD in remote sensing, with 637 image pairs at a resolution of up to 0.5 m/pixel. The images are annotated with binary labels (1 represents changed, 0 unchanged), and the image size is 1024 × 1024. The Google dataset [27] collects 19 season-varying VHR RGB image pairs, whose spatial resolution is 0.55 m and whose sizes range from 1006 × 1168 to 4936 × 5224 pixels; the annotation focuses on buildings. The S2Looking dataset [28] contains 5000 bitemporal image pairs of global rural areas and over 65,920 annotated change instances. The image size is 1024 × 1024 pixels and the spatial resolution is 0.5–0.8 m/pixel. It also includes side-looking satellite images taken from various angles. The large illumination differences and complex rural scenes are challenging and increase the difficulty of CD. The image sizes contained in the three datasets range from 256 × 256 to 4936 × 5224. Due to memory limitations, large images must be cropped before being sent to the network. Sliding cropping is carried out on large images in random-window mode, and the images are cropped to 256 × 256. After cropping, some label images do not contain any changed pixels, which makes it hard for the model to learn useful features; therefore, these label images are removed. After preprocessing, the three datasets are randomly divided into training, validation, and testing sets, respectively. The specific divisions are shown in Table I. The three datasets are then augmented by image flipping and rotation, which can increase the network's ability to learn and recognize complex situations and reduce overfitting during network training. Fig. 5 shows image samples of the three datasets after cropping, and Fig. 6 shows some image samples of the augmented data.
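The cropping and augmentation steps could be sketched as follows; the function names, the number of crops per image, and the use of NumPy arrays are illustrative assumptions.

```python
import random
import numpy as np

def random_crops(img1, img2, label, size=256, n_crops=8):
    """Random-window cropping of a large bitemporal pair to 256 x 256 patches,
    discarding crops whose label contains no changed pixels. Arrays are
    H x W (x C); n_crops is an illustrative choice."""
    h, w = label.shape[:2]
    crops = []
    for _ in range(n_crops):
        top = random.randint(0, h - size)
        left = random.randint(0, w - size)
        lab = label[top:top + size, left:left + size]
        if lab.max() == 0:                     # no changed pixels: skip this crop
            continue
        crops.append((img1[top:top + size, left:left + size],
                      img2[top:top + size, left:left + size],
                      lab))
    return crops

def augment(img1, img2, label):
    """Flip/rotation augmentation applied identically to both images and the label."""
    if random.random() < 0.5:
        img1, img2, label = (np.flip(a, axis=1).copy() for a in (img1, img2, label))
    k = random.randint(0, 3)                   # rotate by 0/90/180/270 degrees
    img1, img2, label = (np.rot90(a, k).copy() for a in (img1, img2, label))
    return img1, img2, label
```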

B. Comparison Methods
AMIO-Net was evaluated by comparing the change detection results with the following methods on three different datasets.

1) FCN [29]: This is a fully convolutional network in which the fully connected layers are replaced by convolutional layers. It can adapt to input images of various sizes, and its deconvolutional layers increase the data size to ensure that the output results are sufficiently refined.
2) SegNet [30]: It is a semantic segmentation network. Its shallow feature extraction structure comes from the VGG16 convolutional layers, and the output of the deep feature extraction structure is sent into a softmax classifier to obtain the pixel-wise classification result.
Two subnetworks output the corresponding feature maps, which are connected and input to the decoding structure.
5) SNUNet [31]: It is an improved UNet++ fully convolutional Siamese network. It adds a channel attention module and ensemble channel attention to enhance the CD accuracy.
6) STANet [20]: This is a Siamese spatial-temporal attention network that can explore spatial-temporal relationships. It extracts features by using a weight-sharing CNN and measures the distance between feature maps to detect the changed areas.
7) DTCDSCN [32]: It consists of three subnetworks, one of which is used to perform change detection and the other two for semantic segmentation. It also builds a dual attention module to improve the feature representation, so it can better detect the changed regions.
8) DSAMNET [33]: It can generate more useful features by introducing a deeply supervised module, and it uses convolutional block attention modules to fuse different levels of features.
9) IDET [34]: It contains three transformers, two of which are used to extract long-range information and the third to enhance differential features.

C. Parameter Settings
In model training, the Adam optimizer is used in AMIO-Net. Adam combines the advantages of gradient descent with momentum (GDM) and RMSprop [35]. Compared with other optimizers, Adam is characterized by efficient calculation, simple implementation, and a good learning effect. The ReLu activation function [36] is used, which can overcome gradient disappearance and accelerate model training. The specific parameter settings are shown in Table II.
The objective function may drop into a local minimum during training, which worsens the training effect. Therefore, cosine annealing is used to adjust the learning rate:

δ_t = δ_min^j + (1/2)(δ_max^j − δ_min^j)(1 + cos((T_cur / T_j) π))

where j is the run index, δ_min^j and δ_max^j are the minimum and maximum values of the learning rate, respectively, T_cur is the current epoch, and T_j is the number of epochs in the jth execution. The algorithm makes the learning rate decrease to a certain value and then immediately return to the initial value. After each epoch of training, it decreases slightly and the process repeats, so that the model can break out of local minima and achieve a better training effect.
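In PyTorch, this schedule corresponds to cosine annealing with warm restarts; the following sketch uses a dummy stand-in model and illustrative hyperparameters rather than the exact settings in Table II.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# Dummy stand-in for AMIO-Net; lr, T_0, and eta_min are illustrative values.
model = torch.nn.Conv2d(3, 1, kernel_size=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=1, eta_min=1e-5)

for epoch in range(100):
    # ... one training epoch over the change-detection dataset ...
    scheduler.step()  # learning rate follows a cosine decay and restarts at its maximum
```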

D. Metrics
The evaluation metrics used in the experiments were the recall rate (Re), precision rate (Pr), overall accuracy (OA), and F1 score:

Re = TP / (TP + FN)
Pr = TP / (TP + FP)
OA = (TP + TN) / (TP + TN + FP + FN)
F1 = (2 × Pr × Re) / (Pr + Re)

where TP is the number of positive samples classified as positive, FN is the number of positive samples classified as negative, FP is the number of negative samples classified as positive, and TN is the number of negative samples classified as negative.
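These metrics can be computed directly from binary prediction and label maps, as in the following NumPy sketch (an eps guard is added here to avoid division by zero in degenerate cases):

```python
import numpy as np

def cd_metrics(pred, label, eps=1e-7):
    """Compute Re, Pr, OA, and F1 from binary (0/1) prediction and label maps,
    following the TP/FN/FP/TN definitions above."""
    pred, label = pred.astype(bool), label.astype(bool)
    tp = np.logical_and(pred, label).sum()
    fp = np.logical_and(pred, ~label).sum()
    fn = np.logical_and(~pred, label).sum()
    tn = np.logical_and(~pred, ~label).sum()
    re = tp / (tp + fn + eps)
    pr = tp / (tp + fp + eps)
    oa = (tp + tn) / (tp + tn + fp + fn + eps)
    f1 = 2 * pr * re / (pr + re + eps)
    return re, pr, oa, f1
```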

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. Experimental Environment
The experiments were performed on an NVIDIA Quadro P6000 graphics card with 24 GB of graphics memory. The deep learning framework was PyTorch 1.5, the development environment was PyCharm, and the programming language was Python 3.7.

B. Ablation Experiment of AMIO-Net on LEVIR-CD Dataset
To confirm the validity of the MIO structure and the PPAM and SAMM modules in the network, we perform ablation studies on the LEVIR-CD dataset. The basic network is a Siamese convolutional network without the MIO structure, PPAM, and SAMM modules. At the same time, to verify the superiority of the PPAM module, we compare it with PPM. The results are shown in Table III and Fig. 7.
From Fig. 7, compared with the detection results of the basic network, the detection accuracy is improved after the introduction of the MIO, PPAM, and SAMM modules. With the introduction of each module, the utilization of the input image information becomes more comprehensive and the detection accuracy becomes higher. The attention modules also enable the model to better capture the changed information of small targets, as in maps (g) and (h). From maps (f) and (g) of the first and fourth rows, we can see that the PPM module produces more false detections than the PPAM module. In addition, PPAM can use the global information of the input feature map to the greatest extent and reduce the false detection rate. Therefore, these three modules reduce misclassification and omission, and effectively detect the changed areas in the CD task.

C. Comparison and Analysis of Experimental Results
Figs. 8-10 show the accuracy, precision, recall rate, and F1 curves of different algorithms during the training processes. Figs. 11-13 illustrate the predicted change maps obtained using different algorithms. Tables IV-VI represent the quantitative performances of different CD algorithms.
1) LEVIR-CD Dataset: As shown in Fig. 8, although the convergence speed of AMIO-Net is not the fastest, its OA, Re, and F1 values exceed those of the other models. Fig. 11 shows the predicted results on the LEVIR-CD dataset. We can see some differences between the label maps and the predicted results of the different methods; the partial differences are marked with red boxes. It can be seen that AMIO-Net has a better detection effect. For example, in maps (a) and (b), AMIO-Net has more complete details at the subtle edges of buildings. From the detection results of maps (d) and (e), only AMIO-Net and DSAMNET can detect the small-scale buildings well, but DSAMNET has more false detections. Maps (e) show that the other methods are poor at detecting small changed objects because the background information is complicated and there are more interference factors. STANet, DTCDSCN, and IDET can detect most of the changed pixels, but also make mistakes in detecting the building edges, whereas AMIO-Net can detect building edges more accurately. On the whole, AMIO-Net has better performance for buildings of different shapes and scales. Table IV presents the quantitative test results. The OA of AMIO-Net reaches 0.9828, which is 0.34%–3.17% higher than that of the other methods. The Pr is 0.906, which is 0.83% lower than that of STANet, but the Re is 7.37% higher. This means that the precision of AMIO-Net is slightly lower than that of STANet, but AMIO-Net detects more changed pixels. The F1 value of the proposed method reaches 0.8997, which is 2.53%–15.32% higher than that of the other methods. In summary, AMIO-Net has better detection capability.
2) Google Dataset: Fig. 9 shows the training process on the Google dataset. It can be seen that the Pr and Re curves fluctuate considerably. We think this is because the Google dataset has fewer image samples, and the model needs more time to learn the features of the images. From Fig. 9, after about 40 epochs, the proposed method tends to converge, at a slightly lower speed than the other methods, but its OA, Pr, and F1 values far exceed those of the other methods. Fig. 12 shows the CD results on the Google dataset, where the differences between the predicted results and the label images are marked. The quantitative results are listed in Table V. From Table V, the OA of AMIO-Net is 0.9326, which is 3.27% higher than that of DTCDSCN. The Pr is 0.8055, which is 5.67%–26.8% higher than that of the other methods, showing that the detection results of AMIO-Net have fewer false alarms. The Re is 0.8589, which is 2.48% higher than that of STANet. The comprehensive evaluation index F1 of AMIO-Net is 0.8254, 6.83% higher than that of STANet. From these results, AMIO-Net outperforms the other methods on the Google dataset by a large margin.
3) S2Looking Dataset: Fig. 10 shows the training process of each method. It can be seen that the oscillation of curves is large.
This shows that the network needs more time to learn due to complex buildings, large illumination differences, and so on. From about 20 epochs, the proposed method exceeded the other methods on all indicators and tended to converge after more than 40 epochs. Fig. 13 shows the CD results on the S2Looking dataset. Maps (a) and (b) show changes in large-area buildings, maps (c) and (d) show changes in small-area buildings, and maps (e) contain both large-scale and small-scale buildings. From maps (a) and (b), most methods can detect the changed regions, but their performance is poor in detecting the edge details of buildings. The reason may be that the differences between background and buildings are small, and the small color deviation between the environment and building edges also leads to false alarms for these methods. The detection results of AMIO-Net are closer to the labels. From Table VI, all metrics of AMIO-Net are higher than those of the other methods. The OA of AMIO-Net is 0.9686, which is 0.43% higher than that of STANet. The Pr of AMIO-Net is 0.6394, which is 3.5%–27.92% higher than that of the other methods. The Re of AMIO-Net increases significantly and is 4.28%–24.08% higher than that of the other methods. The F1 of AMIO-Net is 0.5334, 6.81% higher than that of STANet. AMIO-Net significantly outperforms the other methods on the S2Looking dataset.

D. Comparison of Inference Efficiency
To test the inference speed of the different models, we use 935 images of the LEVIR-CD dataset to perform the experiments. The comparison results are shown in Table VII.
From Table VII, we can see that AMIO-Net takes 55s to predict 935 images. Compared with SegNet, it takes 19s more. It takes 95s less compared with IDET. Overall, the inference speed of AMIO-Net is quite fast.
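For reference, a rough timing harness of the kind used for such a comparison might look like the following; the model and data loader are placeholders.

```python
import time
import torch

@torch.no_grad()
def time_inference(model, loader, device="cuda"):
    """Total seconds to predict every bitemporal pair in loader with model."""
    model.eval().to(device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for img1, img2 in loader:
        model(img1.to(device), img2.to(device))
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start
```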

E. Discussion
From the abovementioned experiments, we can see that the proposed method outperforms the other models. When the complexity of the dataset is relatively low and the color deviation between background and buildings is obvious, some networks with simple structures can also achieve good results; for example, the F1 of SegNet is 0.8469 on LEVIR-CD. However, they cannot capture the changes in the edge details of buildings well, whereas AMIO-Net has a stronger ability to capture subtle changes. When the complexity of the dataset increases and the differences between the environment and buildings become small, the performance of networks with simple structures becomes poor, as on S2Looking. In particular, these networks cannot detect changes in small targets. However, the experiments on S2Looking show that AMIO-Net is more sensitive to changes in small targets and can effectively detect them under the interference of many factors. This shows that AMIO-Net has a good perception of color changes caused by climate and light. Although the proposed algorithm improves more over the other methods on the S2Looking dataset, its absolute performance is still not outstanding, which indicates that it may not adapt well to complex architectural changes. The reason may be insufficient utilization of architectural features at different levels.
Of course, CD can be further improved by processing the dataset. For example, the spatial resolution of the regions of interest can be improved by super-resolution reconstruction algorithms, such as SRCNN [37] and DRCNN [38]. Alternatively, the image noise can be reduced and the signal-to-noise ratio increased by deep neural network methods, such as DNCNN [39] and FFDNET [40]. However, these methods are usually time-consuming. When natural disasters occur, we need to respond immediately, and there is almost no time to prepare. Therefore, our next research direction is to improve the generalization ability and convergence speed of the model, so that it can better adapt to CD and achieve faster detection.

V. CONCLUSION
In this article, we presented an attention-based multiscale input–output network, AMIO-Net. To improve the ability to detect changes in buildings, we introduced the MIO structure and the PPAM and SAMM modules. The multiscale input structure was beneficial for fully extracting and utilizing the FI of the original image. To fully utilize the information of the decoder, we designed a multiscale output structure. Unlike a single-output decoder network, it has a parallel decoding branch structure, which greatly enhanced the ability to use global information and improved the robustness of the network. The PPAM increased the attention to the information of the changed pixels in the deep encoding layer. The SAMM module was used at the output of the double-branch decoder to implement the fusion and utilization of multiscale FI. Experimental results showed that AMIO-Net was superior on the LEVIR-CD, Google, and S2Looking datasets. In particular, on the most challenging S2Looking dataset, the F1 score was almost 6% higher than those of the other algorithms.
Wei Gao received the B.Sc. degree in measurement and instrumentation and the M.Sc. degree in control science and control engineering from Wuhan University, Wuhan, China, in 2006 and 2008, respectively.
Since 2008, he has been a Lecturer with the Department of Measurement and Instrumentation, School of Physics and Electronics, Henan University, Kaifeng, China. His research interests include deep learning-based image processing methods and fractional order control theory.
Yu Sun received the B.E. degree in measurement and control technology and instrument from Henan University, Kaifeng, China, in 2020, where he is currently working toward the master's degree in electronic information with the School of Physics and Electronics.
His research interests include deep learning, building change detection, and image processing methods based on deep learning.