CBW-MSSANet: A CNN Framework With Compact Band Weighting and Multiscale Spatial Attention for Hyperspectral Image Change Detection

Change detection (CD) aims to detect the changed areas of the same scene at different times and is an important application of remote sensing images. As a key data source for CD, the hyperspectral image (HSI) is widely used in CD technology because of its rich spectral–spatial information. However, how to mine the multilevel spatial information of dual-temporal HSIs and focus on the features of the individual pixels to be classified remains an open problem for the spatial attention mechanism (SAM). To make full use of the spectral–spatial information of HSIs, in this article we propose a CNN framework with compact band weighting and multiscale spatial attention (CBW-MSSANet) for HSI pixel-level CD. The main contributions of this article are as follows: 1) a new method of pseudolabel training sample selection based on $k$-means (KM) centroid distance is designed; 2) the CBW module is applied to HSI CD to take full advantage of the spectral information of HSIs; and 3) an MSSA module is developed for pixel-level CD, which can mine multilevel spatial information, pay more attention to the features of the pixels to be classified, and combine the spatial information of adjacent pixels, making it more conducive to pixel-level CD. Experimental results on four real HSI datasets demonstrate that MSSA surpasses the classical single-scale SAM and that CBW-MSSANet is superior to several representative CD methods.


I. INTRODUCTION
H YPERSPECTRAL image (HSI) data are widely used due to their high spectral and spatial resolution. With the continuous enrichment of the HSI dataset, there are more and more applications based on HSIs, such as change detection (CD) [1], image classification [2], [3], [4], target detection (TD) [5], and crop mapping [6]. HSI CD is the information processing process of determining the surface changes in the same area based on HSIs at different times [7]. The collection and analysis of surface change information are of great significance to environmental protection, natural resource management, and the study of the relationship between human social development and the natural environment [8]. At present, HSI CD technology is used in disaster assessment [9], land-use CD [10], urban planning [11], military reconnaissance [12], and other related fields. However, at this stage, for the detection of hyperspectral changes to be more widely used, there is a need to further improve the degree of automation and precision of the CD technology [7].
Over the course of the development of CD technology, many types of traditional methods have appeared. Classical methods are based on algebraic analysis, e.g., change vector analysis (CVA) [13]. Some methods are based on image matching, e.g., Euclidean distance (ED) [14], spatial-spectral cross-correlation (CCO) [15], and image difference (ID) [16]. Methods based on image transformation include independent component analysis (ICA) [17], principal component analysis (PCA) [18], multivariate alteration detection (MAD) [19], iteratively reweighted MAD (IR-MAD) [20], and so on. Mainstream methods based on direct classification include support vector machines (SVMs) [21], maximum likelihood classification (MLC) [22], fuzzy c-means (FCM) clustering [23], and so on. These traditional CD methods place high demands on image preprocessing and remain limited to shallow calculations or operations on image pixels, so their accuracy is not always satisfying. In addition, a hierarchy-based unsupervised HSI CD method has been proposed, which discriminates the types of pixel changes from the perspective of spectral change [24]. Subsequently, [25] developed an HSI CD method based on sequential spectral CVA, which employs an iterative hierarchical scheme to discover and identify a subset of changes in each iteration. These methods improve CD performance on HSIs, but there is still much room for improvement, and the performance of unsupervised CD methods across different HSI datasets lacks stability. HSI CD methods based on deep learning can directly learn robust change features from dual-temporal images and perform binary classification of pixels using the learned features [26].
Common deep learning frameworks include convolutional neural networks (CNNs) [27], recurrent neural networks (RNNs) [28], deep belief networks (DBNs) [29], and so on. The deep slow feature analysis (DSFA) [30] method combines a deep network with SFA [31] theory to highlight change information. The simplified 3-D convolutional autoencoder (S3DCAECD) [32] is based on a deep unsupervised autoencoder, which can extract spatial-spectral features from dual-temporal images without prior information while using 3-D convolution to extract change features, achieving good detection results. The three-direction spectral-spatial convolutional neural network (TDSSC) [33] is a new three-direction decomposition method for the hyperspectral change tensor, which decomposes the change tensor along the spectral direction and two spatial directions, obtaining tensors in three different directions for CD. Bilinear CNNs (BCNNs) [34] use two symmetrical CNNs for end-to-end training, extract features from the dual-temporal images through a two-branch network, and then use a classifier to perform binary classification. In HSI CD, the above-mentioned classic and innovative deep learning frameworks are more automated than traditional methods, but their detection robustness and accuracy still need to be improved.
Supervised CD methods all require ground-truth labels to train the network, and manual annotation requires extensive labor; therefore, we use weakly supervised methods instead. However, weakly supervised spectral-spatial HSI CD methods are still in an early stage of development, and the main challenges are summarized as follows.
1) Improving the confidence of pseudolabels: The confidence of the training samples selected from pseudolabels is critical for weakly supervised CD methods. However, current methods for picking high-confidence training samples of unchanged and changed pixels are mostly based on the absolute distance (AD) or ED between pixel pairs in the dual-temporal images, which can easily result in selected training samples comprising noisy or aberrant pixels.
2) Reducing the complexity of spectral information utilization: The hundreds of bands in HSIs can provide rich spectral information for CD technology; hence, one goal of this research is a method that effectively uses the spectral information of HSIs with low parameter complexity.
3) Mining the multilevel spatial information of HSIs: A single-scale spatial attention module (SAM) cannot mine multilevel spatial information and thus cannot fully exploit the spatial advantages of HSIs. In addition, CD is a pixel-level classification technique, and a single-scale SAM cannot pay particular attention to the central pixel to be classified in the selected spatial patch, which may cause the weights of adjacent pixels to be much larger than that of the pixel to be classified, increasing the instability of pixel-level CD technology.
To solve the above-mentioned problems in HSI CD research, we carry out the following work, which also constitutes our main contributions.
1) A new method for pseudolabel training sample selection based on k-means (KM) centroid distance is designed. After the CVA algebraic operation and KM clustering generate pseudolabels for binary classification, we propose using the distance between each class of pixels and its centroid to select training samples with high-confidence labels, obtaining the pseudotraining set most conducive to CD.
2) A compact band weighting (CBW) [35] module is adopted, which gives full play to the advantages of the HSI bands and uses the correlation between adjacent HSI bands and spectral statistical information to implement band weighting. In addition, CBW is a lightweight module with only 20 parameters, which reduces the time cost of band processing.
3) A multiscale spatial attention (MSSA) module is developed. To acquire multilevel spatial information from HSIs, we create spatial patches of various scales centered on the pixels to be classified, allowing us to fully use the spatial advantages of HSIs. Simultaneously, we run spatial attention mechanism (SAM) operations on the spatial patches of various scales and weight them to generate reconstructed patches in which the weight of the pixel to be classified is the largest and the surrounding pixels provide spatial information support. MSSA pays greater attention to the features of the pixel to be classified and combines the spatial information of the surrounding pixels, making it more conducive to improving the performance of pixel-level CD technology.
The rest of this article is organized as follows. Section II summarizes related work in the field of CD. Section III introduces the details of CBW-MSSANet. Section IV presents experiments verifying the effectiveness of the proposed method. Section V concludes this article.

II. RELATED WORK
Recently, many CD methods have used HSIs as data sources, and more and more methods have begun to pay attention to the spectral and spatial information in HSIs. The existing CD methods that exploit the spectral or spatial information of HSIs fall into three main categories: spectral information methods, spatial information methods, and spectral-spatial information methods.

A. Spectral Information Methods
Making full use of the band/spectral information of the HSIs can improve the CD technology. Liu et al. [36] proposed a dimensionality reduction technique based on band selection, which analyzes and evaluates the performance of CD by selecting the band with the largest amount of information from the original high-dimensional data space, to solve the challenging problem of multitemporal HSI CD. Ma et al. [37] use band selection and iterative weighting of bands, first select the band with more change information, and then iteratively weights a single band to better suppress noise and background information, thereby obtaining higher band correlations and facilitating the extraction of change information. Wang et al. [38] proposed a general end-to-end 2-D CNN (GET-NET) framework, which constructs a mixed-affinity matrix through spectral linear and nonlinear unmixing to mine the cross-channel gradient features of dual-temporal HSIs for CD. Lei et al. [39] proposed a new unsupervised HSI CD (UHCD) framework, which uses unsupervised spectral mapping to exploit underlying spectral features.

B. Spatial Information Methods
The spatial information of the HSI helps to combine the neighboring pixels of the pixel to be detected, to obtain information that is beneficial to CD from the surrounding pixels. To make full use of the spatial structure information of HSIs, Hou et al. [40] proposed a novel patch tensor-based CD method (PTCD). In the case of in-depth exploration of remote sensing image CD, Chen et al. [41] introduced a CD method based on spatial neighborhood analysis. Chen and Shi [42] proposed a novel Siamese-based spatial-temporal attention neural network, by designing a CD self-attention mechanism to model spatiotemporal relationships and encode dual-temporal images individually, dividing the image into multiscale subregions to capture spatial information. Chen et al. [43] proposed a novel and general deep Siamese convolutional multiple-layers RNN (SiamCRNN), which maps the spatial features extracted by a deep Siamese CNN (DSCNN) to new latent feature space for CD.

C. Spectral-Spatial Information Methods
Combining the spectral-spatial information methods can maximize the advantages of HSIs in the field of CD. Liu et al. [44] developed a novel spectral-spatial joint multiscale method, which is based on a multiscale morphological compressed CVA, which expands the compressed CVA while preserving more geometric details of the change target. Ran et al. [45] proposed a spectral-spatial one-class sparse representation classifier (OCSRC) method by applying spectral-spatial features to the one class of sparse representation processes instead of the original spectral bands. To solve the challenge of CD caused by artificial objects, such as clouds and shadows, Negri et al. [46] proposed a novel spectral-spatial-aware unsupervised CD framework. Zhan et al. [47] proposed a spectral-spatial convolution neural network with a Siamese architecture (SSCNN-S) for HSI CD, which extracts the spectral-spatial vector from dual-temporal images, and then uses a Siamese network based on contrast loss to train and optimize the network. To find a feature space that can best express spectral-spatial features, Song et al. [48] proposed a bidirectional reconstruction coding network and enhanced residual network (BRCN-ERN) for HSI CD. Zhang et al. [49] proposed a deeply supervised image fusion network (IFN) for dual-temporal remote sensing image CD. To increase the accuracy of HSI CD, Wang et al. [50] proposed an end-to-end Siamese CNN (SiamNet) with a spectral-spatial-wise attention (SSA-SiamNet) mechanism.
Most of the CD methods that use the SAM combine single-scale spatial information. Single-scale SAM cannot mine multilevel spatial information, and may cause the weight of surrounding pixels to be much larger than the central pixel, that is, adjacent pixels are more important than the pixels to be classified. Therefore, it is necessary to develop a SAM suitable for the pixel-level CD.

III. METHODOLOGY
The overview of the proposed CBW-MSSANet method is shown in Fig. 1. First, the difference between the HSIs collected at the two times is computed and its absolute value is taken to obtain the distance spectrum of the dual-temporal images. Then, the global average pooling operation is performed on the distance spectrum, and the refined distance spectrum is obtained through the CBW module. Next, MSSA is used to combine multiscale spatial information to obtain a patch after scale fusion. Finally, a classic CNN with softmax as the classifier produces the binary CD map.
In this section, we will introduce our method from four aspects: pseudolabel sample selection for CD, CBW for the fusion of spectral information, MSSA for the fusion of spatial information, and classification and loss function.

A. Pseudolabel Sample Selection for CD
Weakly supervised CD methods usually adopt CVA algebraic analysis and KM clustering to generate binary pseudolabels and select high-confidence labels from the pseudolabels for network training, avoiding manual labeling. However, not all bands in dual-temporal HSIs are valid, and using all bands for generating pseudolabeled training samples reduces the effectiveness of the selected training samples and the efficiency of the algorithm. Therefore, we applied slow-fast band selection (SFBS) [26] before CVA, which can select more effective bands for HSI CD. In addition, current methods for selecting high-confidence pseudolabel training samples for CD are primarily based on the AD or ED between pixel pairs in dual-temporal HSIs, which can easily result in the selected training samples containing noisy or abnormal pixels. To that end, we devise a high-confidence pseudolabel training set selection approach based on KM centroid distance, which, using the KM clustering principle, picks the pseudotraining set with the highest confidence and the most favorable CD.
Given a set of dual-temporal HSIs, a new set of bands is obtained after applying SFBS band selection. Assume that the first-phase and second-phase HSIs after dimensionality reduction are R and Q, respectively. Applying CVA [13] to perform the algebraic analysis, the intensity of the change vector is defined as

$$|V| = \sqrt{\sum_{i=1}^{n} (R_i - Q_i)^2} \tag{1}$$

where $|V|$ is the intensity of the change vector, $i$ is the $i$th band, and $n$ is the total number of bands of the HSI after applying SFBS dimensionality reduction. After processing the dual-temporal HSIs by algebraic analysis, we choose the commonly used KM as the method to generate binary pseudolabels. For a given sample set, KM divides the samples into $k$ clusters according to the distances between them, so that the points within a cluster are as close as possible and the clusters are as far apart as possible. Suppose the sample set is divided into clusters $C = \{C_1, C_2, \ldots, C_k\}$, where $u_i$ is the mean vector of cluster $C_i$, also called the centroid, expressed as

$$u_i = \frac{1}{|C_i|} \sum_{x \in C_i} x \tag{2}$$

As shown in Fig. 2, KM performs a binary classification on the $|V|$ obtained in (1), yielding the predetection classification results. The red "+" marks in Fig. 2 are the two class centroids. After the above-mentioned operations, the pseudolabel of the dual-temporal HSIs is obtained, in which pseudounchanged pixels are labeled "0" and pseudochanged pixels are labeled "1." These steps are common; the most critical step is the selection of the pseudotraining set. How high-confidence pseudotraining samples beneficial to CD are selected from the predetection results can greatly affect the performance of the CD method.
The classic selection of high-confidence pseudotraining samples from predetection results is usually based on the AD or the ED between pixel pairs in the dual-temporal HSIs. Because of sensor noise, the pixel with the smallest AD or ED among the pixels with pseudolabel "0" is not necessarily the most likely to be an unchanged pixel; the minimum value may be an extreme case. Likewise, due to sensor noise and differences in acquisition time, the pixel with the largest AD or ED among the pixels with pseudolabel "1" is not necessarily the most likely to be a changed pixel; the maximum value is more likely to be an abnormal or noise value. As shown in Fig. 2, the minimum or maximum distances between pixel pairs in the dual-temporal HSIs may correspond to the black triangle pixels in the figure, which do not represent high-confidence unchanged or changed pixels. In view of these shortcomings of the classic high-confidence training sample selection, we propose a high-confidence pseudotraining sample selection method based on the KM centroid distance. From (2) and the principle of KM clustering, the pixels closest to a centroid have the highest confidence among the pseudolabels generated by KM. As shown in Fig. 2, we therefore take the several pixels closest to each KM centroid as the high-confidence training samples for the weakly supervised learning-based CD method.
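The selection strategy above can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the CVA intensity of (1) is clustered by a plain 2-means, and the training samples are the pixels nearest each centroid rather than those with extreme distances. All function and parameter names here are our own.

```python
import numpy as np

def select_pseudo_training_samples(hsi_t1, hsi_t2, n_unchanged, n_changed, n_iter=50):
    """High-confidence pseudolabel selection via KM centroid distance.

    hsi_t1, hsi_t2: (H, W, C) dual-temporal HSIs (after band selection).
    Returns flat pixel indices of the selected unchanged/changed samples.
    """
    # CVA: intensity of the change vector per pixel, |V| in (1)
    v = np.sqrt(((hsi_t1 - hsi_t2) ** 2).sum(axis=-1)).ravel()

    # Plain 2-means (Lloyd's algorithm) on the 1-D intensities;
    # centroid 0 starts at the minimum, so cluster 0 = "unchanged".
    centroids = np.array([v.min(), v.max()], dtype=float)
    for _ in range(n_iter):
        labels = np.abs(v[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centroids[k] = v[labels == k].mean()

    # Key idea: pick the pixels CLOSEST to each centroid, not the pixels
    # with extreme AD/ED, which are often noise or outliers.
    picks = []
    for k, n_k in ((0, n_unchanged), (1, n_changed)):
        idx = np.flatnonzero(labels == k)
        order = np.argsort(np.abs(v[idx] - centroids[k]))
        picks.append(idx[order[:n_k]])
    return picks[0], picks[1]
```

The ranking by distance to the centroid, rather than by raw AD/ED, is what keeps extreme (often noisy) pixels out of the pseudotraining set.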

B. CBW for Fusion of Spectral Information
CBW [35] can combine the correlation and spectral statistics between adjacent bands and has extremely low model complexity, as shown in the CBW module in Fig. 1. The distance spectrum with width $W$, height $H$, and band number $C$ is obtained by calculating the AD between the dual-temporal HSIs. The global information of each band of the distance spectrum is defined as $I = [I_1, I_2, \ldots, I_c, \ldots, I_C]$, with $I_c \in \mathbb{R}^{H \times W \times 1}$. Then, the global average pooling of the distance spectrum can be expressed as

$$B_f^c = u(I_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} I_c(i, j) \tag{3}$$

where $B_f$ is the band statistics vector, $B_f^c$ is the value of the $c$th band in $B_f$, $u(\cdot)$ is the global average pooling function, and $I_c(i, j)$ is the pixel value of the $c$th band with coordinates $(i, j)$.
Then, a 1-D convolution with kernel size $K$ is used to capture the local dependency relationship $B_d$ between adjacent bands. The local convolution is

$$B_d^c = \mathrm{relu}\left(\sum_{i=1}^{K} w_i K_c^i\right) \tag{4}$$

where $B_d^c$ is the $c$th element of the dependency vector $B_d$, $\mathrm{relu}(\cdot)$ is the ReLU activation function, $w_i$ is the $i$th network parameter of the 1-D convolution, and $K_c$ is the collection of $K$ band statistics centered on the $c$th element of $B_f$. Therefore, the complete dependency vector can be expressed as

$$B_d = \mathrm{Conv1D}(B_f) \tag{5}$$

where $\mathrm{Conv1D}(\cdot)$ is the 1-D convolution function. $B_{d1}$ and $B_{d2}$ in the CBW module in Fig. 1 are the output vectors of the two cascaded 1-D convolutions, i.e.,

$$B_{d1} = \mathrm{Conv1D}(B_f), \qquad B_{d2} = \mathrm{Conv1D}(B_{d1}) \tag{6}$$

Next, the outputs of the different layers are aggregated to enrich the band feature information:

$$B_w = \mathrm{sigmoid}(B_{d1} \oplus B_{d2}) \tag{7}$$

where $B_w$ is the final band weight, $\mathrm{sigmoid}(\cdot)$ is the sigmoid activation function, and $\oplus$ is the addition operation between network layers. Finally, $B_w$ in (7) is used to obtain the refined spectrum by the reconstruction formula

$$\tilde{I} = I \otimes B_w \tag{8}$$

where $\tilde{I}$ is the reconstructed distance spectrum and $\otimes$ is the multiplication operation. The network framework details of the CBW module are shown in Table I.
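The CBW data flow above can be sketched compactly in numpy, standing in for the trained TensorFlow layers; the kernels `w1` and `w2` are placeholders for the learned 1-D convolution weights and are assumptions of this example.

```python
import numpy as np

def cbw(distance_spectrum, w1, w2):
    """Compact band weighting sketch: GAP -> two cascaded 1-D convs ->
    add -> sigmoid -> band reweighting.

    distance_spectrum: (H, W, C) absolute difference of dual-temporal HSIs.
    w1, w2: 1-D kernels of odd length (the real module is tiny, ~20 params).
    """
    relu = lambda x: np.maximum(x, 0.0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Global average pooling per band -> band statistics vector B_f, shape (C,)
    b_f = distance_spectrum.mean(axis=(0, 1))

    # Two cascaded 1-D convolutions over the band axis ('same' padding)
    b_d1 = relu(np.convolve(b_f, w1, mode="same"))
    b_d2 = relu(np.convolve(b_d1, w2, mode="same"))

    # Aggregate both layers and squash to band weights B_w in (0, 1)
    b_w = sigmoid(b_d1 + b_d2)

    # Reweight every band of the distance spectrum
    return distance_spectrum * b_w[None, None, :]
```

Because the weights pass through a sigmoid, each band of the (nonnegative) distance spectrum is attenuated by a factor in (0, 1) rather than amplified.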

C. MSSA for Fusion of Spatial Information
HSI CD is a pixel-level detection task, and a single-scale SAM cannot mine multiscale spatial information and may cause the weights of pixels at other positions in the spatial patch to be much larger than the weight of the pixel to be classified. As shown in Fig. 3, in a spatial patch of size 9 × 9, the weights are focused on a row of pixels far from the center; that is, the model considers the information of this row of pixels far more important than the information of the pixel to be classified, which is obviously unreasonable. As shown in Fig. 1, for the distance spectrum recalibrated by the CBW module, to mine the multilevel spatial information in the dual-temporal HSIs and focus on the features of the pixels to be classified themselves, we develop the MSSA module. In the MSSA module, multilevel spatial information in the dual-temporal HSIs is mined by creating spatial patches of various scales centered on the pixels to be classified. In addition, the MSSA module performs equal-weighted reconstruction on the multiscale spatial patches in the manner of Fig. 4; the resulting reconstructed patches not only ensure that the pixel to be classified has the maximum weight but also retain the spatial information supplement provided by the surrounding pixels. From a practical point of view, adjacent pixels that are closer to the pixel to be classified are more likely to belong to the same object and can thus provide more effective spatial information. Therefore, we perform an equal weighting operation on the spatial patches of five scales; that is, the closer a pixel is to the target pixel, the more times it is weighted, in order to increase the model's attention to the pixel to be classified. The proposed MSSA module gives full play to the spatial advantages of dual-temporal HSIs while being more conducive to improving the performance of pixel-level CD.
To capture the spatial information of the pixel to be classified, we need the patch formed by the center pixel and its surrounding pixels. MSSA takes patches at five scales from the spectrum refined by CBW, with sizes $S_1 \times S_1 \times C, \ldots, S_5 \times S_5 \times C$, where $S_1, \ldots, S_5$ are the lengths and widths of the five scale patches and $C$ is the number of bands of the refined spectrum. Let

$$\tilde{I} = \{x(i, j) \mid 1 \le i \le H,\; 1 \le j \le W\}$$

where $H$ and $W$ are the spatial dimensions and $x(i, j)$ is the pixel whose coordinate in the spatial dimension is $(i, j)$. The equation used to obtain patches of different scales in $\tilde{I}$ is defined as

$$P^{S_k} = \left\{x(i, j) \;\middle|\; m - \left\lfloor \tfrac{S_k}{2} \right\rfloor \le i \le m + \left\lfloor \tfrac{S_k}{2} \right\rfloor,\; n - \left\lfloor \tfrac{S_k}{2} \right\rfloor \le j \le n + \left\lfloor \tfrac{S_k}{2} \right\rfloor\right\} \tag{9}$$

where $P^{S_k} \in \mathbb{R}^{S_k \times S_k \times C}$ represents patches of different scales, and $x(m, n)$ is the center of the patch.
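The multiscale patch extraction described above can be illustrated as follows (a numpy sketch with hypothetical names; border handling is left out, so the center is assumed far enough from the image edge for every scale):

```python
import numpy as np

def extract_patches(img, m, n, scales=(1, 3, 5, 7, 9)):
    """Cut patches of several scales centered on pixel (m, n).

    img: (H, W, C) refined distance spectrum.
    Returns a list of (S_k, S_k, C) arrays, one per scale.
    """
    patches = []
    for s in scales:
        r = s // 2  # half-width of the patch
        patches.append(img[m - r:m + r + 1, n - r:n + r + 1, :])
    return patches
```

Every patch in the returned list shares the same center pixel, which is the pixel to be classified.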
In MSSA, we set the values of $S_1$, $S_2$, $S_3$, $S_4$, and $S_5$ to 1, 3, 5, 7, and 9, respectively. Patch $P^{S_1}$ has size 1 × 1 × C and contains no spatial information, so the spatial feature extraction operates mainly on $P^{S_2}$, $P^{S_3}$, $P^{S_4}$, and $P^{S_5}$. Along the spectral axis, the max-pooling and average-pooling operations are performed on $P^{S_2}$, $P^{S_3}$, $P^{S_4}$, and $P^{S_5}$, respectively:

$$P^{S_k}_{\mathrm{concat}} = \left[\mathrm{MaxPool}(P^{S_k});\; \mathrm{AvgPool}(P^{S_k})\right] \tag{10}$$

where $P^{S_k}_{\mathrm{concat}} \in \mathbb{R}^{S_k \times S_k \times 2}$ is the concatenation of the max-pooled and average-pooled maps, $\mathrm{MaxPool}(\cdot)$ is the max-pooling function, and $\mathrm{AvgPool}(\cdot)$ is the average-pooling function.
Perform 2-D convolution operations on the outputs of (10):

$$P^{S_k}_{C_1} = f^{N \times N}\!\left(P^{S_k}_{\mathrm{concat}}\right) \tag{11}$$

where $P^{S_k}_{C_1} \in \mathbb{R}^{S_k \times S_k \times 2}$ is the feature after the 2-D convolution operation, $C_1$ denotes the first convolution operation in MSSA, and $f^{N \times N}$ is a convolution operation with a kernel size of $N \times N$.
Then, the concatenated layer and the 2-D convolution layer are added:

$$P^{S_k}_{A_1} = P^{S_k}_{\mathrm{concat}} \oplus P^{S_k}_{C_1} \tag{12}$$

where $P^{S_k}_{A_1} \in \mathbb{R}^{S_k \times S_k \times 2}$ is the output after the layer addition operation, $A_1$ denotes the first addition operation in MSSA, and $\oplus$ is the layer addition operation.
Next, a 2-D convolution is applied to the output of (12):

$$P^{S_k}_{C_2} = f^{N \times N}\!\left(P^{S_k}_{A_1}\right) \tag{13}$$

where $P^{S_k}_{C_2} \in \mathbb{R}^{S_k \times S_k \times 1}$ is the feature after the 2-D convolution operation, $C_2$ denotes the second convolution operation in MSSA, and $f^{N \times N}$ is a convolution operation with a kernel size of $N \times N$.
Then, the spatial attention map is generated by

$$P^{S_k}_{w} = \mathrm{sigmoid}\!\left(P^{S_k}_{C_2}\right) \tag{14}$$

where $P^{S_k}_{w} \in \mathbb{R}^{S_k \times S_k \times 1}$ is the attention weight map of the patch at each scale and $\mathrm{sigmoid}(\cdot)$ is the sigmoid activation function.
The reconstructed patches of $P^{S_2}$, $P^{S_3}$, $P^{S_4}$, and $P^{S_5}$ are obtained by

$$\tilde{P}^{S_k} = P^{S_k} \otimes P^{S_k}_{w} \tag{15}$$

where $\tilde{P}^{S_k} \in \mathbb{R}^{S_k \times S_k \times C}$ denotes the reconstructed patch at each scale and $\otimes$ is the layer multiplication operation.
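One branch of the attention pipeline just described (pooling in (10) through the sigmoid-weighted reconstruction) can be sketched in numpy. This mirrors only the data flow: the learned N × N filters are replaced by a single placeholder averaging kernel shared across channels, which is an assumption of this example.

```python
import numpy as np

def spatial_attention(patch, kernel=None):
    """Single-scale spatial attention on one (S, S, C) patch (numpy sketch)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    def conv2d_same(x, k):
        # naive 'same' 2-D convolution of a single-channel map
        r = k.shape[0] // 2
        xp = np.pad(x, r)
        out = np.zeros_like(x)
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                out[i, j] = (xp[i:i + k.shape[0], j:j + k.shape[1]] * k).sum()
        return out

    if kernel is None:
        kernel = np.full((3, 3), 1.0 / 9.0)  # placeholder for learned weights

    # Max- and average-pool along the spectral axis, then concatenate -> (S, S, 2)
    concat = np.stack([patch.max(axis=-1), patch.mean(axis=-1)], axis=-1)

    # conv -> residual add -> conv -> sigmoid attention map
    c1 = np.stack([conv2d_same(concat[..., k], kernel) for k in range(2)], axis=-1)
    a1 = concat + c1
    c2 = conv2d_same(a1[..., 0], kernel) + conv2d_same(a1[..., 1], kernel)
    w = sigmoid(c2)  # attention map, shape (S, S)

    # reconstructed patch: every band reweighted by the spatial map
    return patch * w[..., None]
```

The residual addition keeps the pooled statistics in the signal path, so the attention map is conditioned on both the raw and the convolved spatial summaries.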
Since the spatial scales of $P^{S_1}$, $\tilde{P}^{S_2}$, $\tilde{P}^{S_3}$, $\tilde{P}^{S_4}$, and $\tilde{P}^{S_5}$ are inconsistent, it is necessary to pad $P^{S_1}$, $\tilde{P}^{S_2}$, $\tilde{P}^{S_3}$, and $\tilde{P}^{S_4}$ with zeros to obtain $P^{S_1}_{\mathrm{pad}}$, $\tilde{P}^{S_2}_{\mathrm{pad}}$, $\tilde{P}^{S_3}_{\mathrm{pad}}$, and $\tilde{P}^{S_4}_{\mathrm{pad}}$, so that the scales of all patches are $S_5 \times S_5 \times C$. The scale-fused patch is then obtained by layer addition:

$$\tilde{P} = P^{S_1}_{\mathrm{pad}} \oplus \tilde{P}^{S_2}_{\mathrm{pad}} \oplus \tilde{P}^{S_3}_{\mathrm{pad}} \oplus \tilde{P}^{S_4}_{\mathrm{pad}} \oplus \tilde{P}^{S_5} \tag{16}$$

where $\tilde{P} \in \mathbb{R}^{S_5 \times S_5 \times C}$ is the output of the MSSA module. The network framework details of the MSSA module are shown in Table II, in which all padding parameters are set to "same."
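The padding-and-fusion step above can be sketched as follows (numpy, hypothetical names). Equal-weighted addition means the center pixel is covered by every scale while the outer ring is covered only by the largest patch, which is exactly how MSSA concentrates attention on the pixel to be classified.

```python
import numpy as np

def fuse_patches(patches, target=9):
    """Zero-pad each reconstructed patch to target x target and sum them.

    patches: list of (S_k, S_k, C) arrays with odd, increasing S_k <= target,
    all sharing the same center pixel.
    """
    fused = np.zeros((target, target, patches[0].shape[-1]))
    for p in patches:
        r = (target - p.shape[0]) // 2  # symmetric zero-padding margin
        fused[r:target - r, r:target - r, :] += p
    return fused
```

With scales 1, 3, 5, 7, 9 the center pixel is accumulated five times, its immediate neighbors four times, and so on outward.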

D. Classification and Loss Function
Classification: The final step of the CBW-MSSANet CD method is classification. After obtaining $\tilde{P}$ in MSSA, we use a classic CNN structure to extract effective features.
The feature map of one effective (valid) convolution is obtained by

$$\tilde{P}' = f^{N \times N}(\tilde{P}) \tag{17}$$

where $\tilde{P}' \in \mathbb{R}^{(S_5 - 2) \times (S_5 - 2) \times C'}$ is the feature after one effective convolution, $C'$ is the number of filters of the 2-D convolution, $f^{N \times N}$ is a convolution operation with a kernel size of $N \times N$, and the padding parameter is set to "valid." Then, three 2-D convolutional layers are cascaded, defined as

$$\tilde{P}'' = f^{N \times N}(\tilde{P}'), \qquad \tilde{P}''' = f^{N \times N}(\tilde{P}''), \qquad \tilde{P}'''' = f^{N \times N}(\tilde{P}''') \tag{18-20}$$

where $\tilde{P}'''' \in \mathbb{R}^{(S_5 - 8) \times (S_5 - 8) \times C''''}$ is the effective output feature of the last convolutional layer, $C''''$ is the number of filters in the last convolutional layer, and the padding parameters of (18)-(20) are all set to "valid." Finally, the obtained feature map is flattened into a feature tensor, which is connected to the fully connected layer and classified with the softmax classifier. The network framework of the classification part is shown in Table III.
Loss function: In CBW-MSSANet, we use the categorical cross-entropy loss function for CD classification. It evaluates the degree of difference between the probability distribution of the current training output and the true distribution: the smaller the cross-entropy, the smaller the difference between the actual output of the model and the label, and the closer the probability distributions. Categorical cross-entropy combined with softmax is used to achieve the binary classification for CD. The categorical cross-entropy equation is

$$C(x, y) = -\sum_{i} y_i \log x_i \tag{21}$$

where $x$ represents the actual output of the current model and $y$ represents the expected output. The softmax used in combination with categorical cross-entropy is defined as

$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}} \tag{22}$$

where $\mathrm{softmax}(z)$ is uniformly denoted as $s$. After combining categorical cross-entropy and softmax, the loss function is defined as

$$L = -\sum_{i} y_i \log s_i \tag{23}$$
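The softmax-plus-cross-entropy combination can be written out in numpy (a sketch; in practice this is TensorFlow's built-in categorical cross-entropy):

```python
import numpy as np

def softmax(z):
    # subtract the row max for numerical stability
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(y_true, logits):
    """Mean cross-entropy between one-hot labels and softmax(logits)."""
    s = softmax(logits)
    return -(y_true * np.log(s + 1e-12)).sum(axis=-1).mean()
```

For two classes with equal logits the loss is ln 2 per sample, and it approaches zero as the true-class logit dominates.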

IV. EXPERIMENTS
To evaluate the performance of CBW-MSSANet, in this section we use TensorFlow as the platform to implement experiments on four real HSI datasets. In the CBW-MSSANet experiments, we adopted SGD as the optimizer with learning rate lr = 0.0001, decay = $10^{-5}$, and momentum = 0.9. In addition, the number of epochs is 60 and the batch size is 64. The first dataset, "River," was collected from a river in Jiangsu province, China [38]. The second dataset, "Farmland," was collected in Yancheng city, Jiangsu, covering some farmland around the city [51]. The third dataset, "Hermiston," was collected in Hermiston, OR, USA [52]. The fourth dataset, "Bay Area," was collected in Paterson, CA, USA [53]. The two algebraic-analysis-based comparison methods adopted in the experiments are CVA [13] and PCA-CVA [54]. The four deep-learning-based comparison methods are 2D-CNN [55], Diff-ResNet [56], GETNET [38], and SSA-SiamNet [50]. We also design ablation experiments that apply a single-scale SAM to spatial patches of different scales in place of the MSSA module, to verify the effectiveness of MSSA. Since patches of size 1 × 1 contain no spatial information, four contrast methods are designed for patches of size 3 × 3, 5 × 5, 7 × 7, and 9 × 9 in the ablation experiments, named CBW-SAM (3 × 3), CBW-SAM (5 × 5), CBW-SAM (7 × 7), and CBW-SAM (9 × 9), respectively. In addition, we conduct five repeated runs of each deep-learning-based CD method and report the average and standard deviation as the final experimental results.
To better evaluate the performance of all experimental methods, we adopt the missed detection rate (MDR), false alarm rate (FAR), overall accuracy (OA), kappa, and F1 score to establish a comprehensive evaluation system. Bold entries indicate that the corresponding method has the best performance under the corresponding evaluation index. Table IV is the confusion matrix needed to calculate each evaluation index, where TN represents true negatives, FN false negatives, FP false positives, and TP true positives. The calculation equations for MDR, FAR, and OA are

$$\mathrm{MDR} = \frac{FN}{TP + FN} \tag{24}$$

$$\mathrm{FAR} = \frac{FP}{FP + TN} \tag{25}$$

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN} \tag{26}$$

The kappa coefficient is used for consistency testing and is an important indicator of detection accuracy. It is calculated as

$$\mathrm{kappa} = \frac{\mathrm{OA} - P_e}{1 - P_e} \tag{27}$$

$$P_e = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{(TP + TN + FP + FN)^2} \tag{28}$$

The F1 score measures the accuracy of the binary model while taking into account both the precision and the recall of the classification model:

$$F1 = \frac{2\, p_r\, r_e}{p_r + r_e} \tag{29}$$

$$p_r = \frac{TP}{TP + FP} \tag{30}$$

$$r_e = \frac{TP}{TP + FN} \tag{31}$$

where $p_r$ in (30) is the precision and $r_e$ in (31) is the recall.
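All five indices follow directly from the confusion matrix counts of Table IV; a plain Python sketch (function name is our own):

```python
def cd_metrics(tp, tn, fp, fn):
    """MDR, FAR, OA, kappa, and F1 from confusion matrix counts."""
    n = tp + tn + fp + fn
    mdr = fn / (tp + fn)                  # missed detection rate
    far = fp / (fp + tn)                  # false alarm rate
    oa = (tp + tn) / n                    # overall accuracy
    # expected chance agreement for the kappa coefficient
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    p_r = tp / (tp + fp)                  # precision
    r_e = tp / (tp + fn)                  # recall
    f1 = 2 * p_r * r_e / (p_r + r_e)
    return mdr, far, oa, kappa, f1
```

Note that MDR = 1 − recall, so a method can trade MDR against FAR while OA, kappa, and F1 summarize the overall balance.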

A. Experiment on River Dataset
The two HSIs in the River dataset were acquired on May 3, 2013 and December 31, 2013. The sensor used is Earth Observing-1 (EO-1) Hyperion, with a spectral range of 0.4-2.5 µm, a spectral resolution of 10 nm, and a spatial resolution of 30 m. The image scene size of the River dataset is 463 × 241, containing 198 bands for CD after preprocessing. The main type of change in the image is the reduction of river channels. Fig. 5 shows pseudocolor maps of the two images collected at different times and the ground-truth map. In Fig. 5(c), changed pixels are marked in white and unchanged pixels in black. The River dataset contains 9698 changed pixels and 101 885 unchanged pixels. In the experiment, we selected 1860 high-confidence changed pixels and 3720 high-confidence unchanged pixels as training samples using the distance between each pixel and the KM centroid, accounting for 5% of the total number of pixels. Table V shows the CD results of all experimental methods on the River dataset. PCA-CVA has the lowest MDR, indicating that PCA-CVA performs best in detecting changed pixels, but its FAR is the second highest, so its overall performance is not the best. GETNET has the lowest FAR, but its MDR reaches 50.54%, which shows that GETNET performs excellently in detecting unchanged pixels but only moderately overall. The detection performance of SSA-SiamNet ranks second, slightly below CBW-MSSANet. Compared with all methods, CBW-MSSANet is the fourth best in MDR and FAR, and its OA, kappa, and F1 are the best. In terms of standard deviation, the robustness of CBW-MSSANet is also excellent. In the ablation comparison experiments, CBW-SAM (3 × 3) has the lowest MDR, but its FAR is the highest at 3.91%; because the number of unchanged pixels in the River dataset is more than ten times that of changed pixels, this leads to the worst overall detection performance of CBW-SAM (3 × 3).
The FAR of CBW-SAM (7 × 7) is the lowest among the ablation methods, but its MDR is 35.43%, so its overall detection performance is close to that of CBW-SAM (3 × 3). Among the four ablation methods, CBW-SAM (3 × 3) and CBW-SAM (9 × 9) have poor detection performance because CBW-SAM (3 × 3) contains less spatial information and CBW-SAM (9 × 9) contains more invalid spatial information. MSSA can make up for the lack of spatial information of CBW-SAM (3 × 3) and can dilute the weight of invalid spatial information to effectively fuse multilevel spatial information. The OA, kappa, and F1 of CBW-MSSANet are all significantly higher than those of the four ablation methods, which shows that the MSSA module is more suitable for pixel-level CD than the single-scale SAM. Fig. 6 shows the CD binary maps of all experimental methods on the River dataset. It can be seen from Fig. 6 that the detection effect of CVA is the worst and its FAR is very high. The differences between the detection results of all methods and the ground-truth map are mainly reflected in three areas, which are marked with red lines of different shapes in the binary maps. For the area marked by the ellipse, the FAR of CVA is very high, the MDR of 2D-CNN, Diff-ResNet, and GETNET is very high, and SSA-SiamNet and CBW-MSSANet are relatively close to the ground-truth map. For the rectangular area in the middle, CVA has many false alarm pixels; 2D-CNN, Diff-ResNet, GETNET, and SSA-SiamNet all miss a large number of pixels; CBW-MSSANet misses a small number of pixels; and PCA-CVA is most consistent with the ground-truth map. For the bottom area, CVA and PCA-CVA have a large number of false alarm pixels, while GETNET leaves a large number of changed pixels undetected. Globally, the binary CD map of CBW-MSSANet has the highest consistency with the ground-truth map among all comparison methods.
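To make the scale-weighting idea concrete, the toy sketch below (our own simplification, not the article's actual MSSA layer) builds a center-similarity attention map for one pixel at window sizes 3, 5, 7, and 9, then fuses the four responses with softmax scale weights, so an uninformative scale can be down-weighted:

```python
import numpy as np

def multiscale_spatial_fusion(patch, scale_logits):
    """Illustrative fusion of spatial attention at several window sizes.

    `patch` is a (9, 9) single-band neighborhood of the pixel to classify.
    For each window size, a simple attention map (similarity of each
    neighbor to the center pixel) produces one aggregated response, and
    the learned `scale_logits` (4 values, one per scale) are softmaxed
    into weights that dilute the contribution of uninformative scales.
    """
    center = patch[4, 4]
    sizes = [3, 5, 7, 9]
    responses = []
    for s in sizes:
        r = s // 2
        win = patch[4 - r:5 + r, 4 - r:5 + r]
        # attention: neighbors more similar to the center get higher weight
        att = np.exp(-(win - center) ** 2)
        att /= att.sum()
        responses.append((att * win).sum())
    w = np.exp(scale_logits - scale_logits.max())
    w /= w.sum()                     # softmax over the four scales
    return float(np.dot(w, responses))
```

In the real network the attention maps and scale weights are learned end to end; this sketch only shows how multiple receptive fields around the pixel to be classified can be combined into one weighted response.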
The CD binary maps of the ablation experiments are shown in Fig. 6(g)-(k). For the elliptical area in the middle, many changed pixels of CBW-SAM (7 × 7) and CBW-SAM (9 × 9) are not detected, while the other methods are close to the ground-truth map. For the rectangular area on the right, a large number of changed pixels are not detected by CBW-SAM (7 × 7) and CBW-SAM (9 × 9), while a small number of changed pixels are not detected by CBW-SAM

B. Experiment on Farmland Dataset
The two HSIs in the Farmland dataset were acquired on May 3, 2006 and April 23, 2007. The sensor used is Earth Observing-1 (EO-1) Hyperion, with a spectral range of 0.4–2.5 µm, a spectral resolution of 10 nm, and a spatial resolution of 30 m. The image scene size of the Farmland dataset is 450 × 140, which contains 155 bands for CD after preprocessing. The main type of change in the image is the area covered by farmland. Fig. 7 shows pseudocolor maps of the two images collected at different times and the ground-truth map. In Fig. 7(c), changed pixels are marked in white and unchanged pixels in black. The Farmland dataset contains 18 277 changed pixels and 44 723 unchanged pixels. In the experiment, we selected 2100 high-confidence changed pixels and 4200 high-confidence unchanged pixels as training samples by using the distance between each pixel and the KM centroid, accounting for 10% of the total number of pixels. Table VI shows the CD results of all experimental methods on the Farmland dataset. CVA has the lowest MDR, indicating that CVA performs best in detecting changed pixels in the Farmland dataset, but its FAR is higher, so its OA only ranks sixth. Both 2D-CNN and Diff-ResNet have low FAR, but their MDRs are 11.22% and 11.39%, respectively, so neither has a high OA value. The MDR and FAR of GETNET and SSA-SiamNet are relatively close, so their OAs are also close, ranking third and second, respectively. CBW-MSSANet has an MDR of 4.13%, and its FAR is the lowest among all comparison methods, so CBW-MSSANet has the highest OA, kappa, and F1 scores. From the standpoint of standard deviation, CBW-MSSANet is also the most robust.
Among the four ablation methods, CBW-SAM (7 × 7) has the lowest MDR, but its FAR is the highest, so its OA is only the second highest. CBW-SAM (9 × 9) has the lowest FAR, but its MDR is 12.00%, so its OA is the worst among the four ablation methods. The overall CD performance of CBW-SAM (3 × 3) is poor, which is caused by the limited spatial information it contains, and the overall CD performance of CBW-SAM (9 × 9) is the worst, which is due to the interference of the more invalid spatial information it contains. MSSA has rich multilevel spatial information, and its unique scale-weighting method can dilute the weight of invalid spatial information, which has a positive effect on the CD task. All five metrics of CBW-MSSANet are the best among the ablation experimental methods, indicating that the MSSA module has more advantages than the single-scale SAM in pixel-level HSI CD. Fig. 8 shows the CD binary maps of all experimental methods on the Farmland dataset. The differences between the binary detection maps of all methods and the ground-truth map are mainly concentrated in four areas, which are marked in red in Fig. 8. CVA, PCA-CVA, GETNET, and SSA-SiamNet detect more false alarm pixels in the top-right corner area. In the small square area in the middle, CVA, PCA-CVA, 2D-CNN, and Diff-ResNet all have more false alarm pixels, GETNET leaves a large number of changed pixels undetected, and SSA-SiamNet and CBW-MSSANet leave a small number of changed pixels undetected. In the bottom-right corner area, all methods except GETNET detect more false alarm pixels. In the bottom-left area, the CD results of SSA-SiamNet and CBW-MSSANet are highly consistent with the ground-truth map. In terms of overall visual effect, the CD binary map of CBW-MSSANet has the highest consistency with the ground-truth map.
The CD binary maps of the ablation experiments are shown in Fig. 8(g)-(k). For the rectangular area in the top-right corner, each method has a small number of false alarm pixels. For the square area in the middle, CBW-SAM (3 × 3), CBW-SAM (5 × 5), CBW-SAM (7 × 7), and CBW-SAM (9 × 9) have a large number of false alarm pixels, while CBW-MSSANet leaves a small number of changed pixels undetected. For the rectangular region in the bottom-right corner, CBW-MSSANet has only a small number of false alarm pixels, while all four ablation contrast methods have a large number of false alarm pixels. For the square area in the bottom-left corner, the detection results of CBW-MSSANet are almost consistent with the ground-truth map, CBW-SAM (9 × 9) has a small number of false alarm pixels, and the other ablation contrast methods have a large number of false alarm pixels. In terms of overall visual effect, the CD binary map of CBW-MSSANet is closer to the ground-truth map than those of the four ablation contrast methods.

C. Experiment on Hermiston Dataset
The two HSIs in the Hermiston dataset were taken in 2004 and 2007. The sensor used is Earth Observing-1 (EO-1) Hyperion, with a spectral range of 0.4–2.5 µm, a spectral resolution of 10 nm, and a spatial resolution of 30 m. The image scene size of the Hermiston dataset is 390 × 200, which contains 224 bands for CD. The main type of change in the image is the area of the city. Fig. 9 shows pseudocolor maps of the two images collected at different times and the ground-truth map. In Fig. 9(c), changed pixels are marked in white and unchanged pixels in black. The Hermiston dataset contains 9986 changed pixels and 68 014 unchanged pixels. In the experiment, we selected 1560 high-confidence changed pixels and 3120 high-confidence unchanged pixels as training samples by using the distance between each pixel and the KM centroid, accounting for 6% of the total number of pixels. Table VII shows the CD results of all experimental methods on the Hermiston dataset. CVA has the lowest FAR, but its MDR is as high as 46.84%, indicating that its low FAR comes at the cost of an increased MDR, so its OA is only 93.92%. The FARs of 2D-CNN, Diff-ResNet, GETNET, and SSA-SiamNet are all low and fairly close, but their MDRs are all high, so their overall detection results are not the best. SSA-SiamNet is the best performer among all comparison methods; compared with CBW-MSSANet, its MDR is 6.21% higher and its FAR is 0.35% lower, so its OA is 0.51% lower. The OA, kappa, and F1 scores of CBW-MSSANet are the best among all experimental methods in this article, indicating the excellent CD performance of MSSA. From the perspective of standard deviation, the robustness of CBW-MSSANet is also the best.
In the ablation contrast experiments, CBW-SAM (9 × 9) has the lowest MDR, but its FAR is the highest among the ablation contrast methods, so its OA is tied for the worst with CBW-SAM (3 × 3). CBW-SAM (5 × 5) has the lowest FAR, although its MDR of 5.6% is the highest; because the Hermiston dataset has more than six times as many unchanged pixels as changed pixels, CBW-SAM (5 × 5) still has the highest OA among the four ablation contrast methods. Among the four ablation methods, CBW-SAM (3 × 3) and CBW-SAM (9 × 9) have poor CD performance because CBW-SAM (3 × 3) contains less spatial information and CBW-SAM (9 × 9) contains more invalid spatial information. MSSA can effectively fuse multilevel spatial information, so the OA, kappa, and F1 scores of CBW-MSSANet are significantly better than those of the four ablation contrast methods, which indicates that MSSA is more effective than the single-scale SAM for pixel-level CD tasks. In the marked regions of the CD binary maps, the ablation contrast methods, including CBW-SAM (9 × 9), all have a large number of false alarm pixels, while CBW-MSSANet performs almost perfectly. Beyond the three marked regions, other regions of the CD binary map of each ablation contrast method also contain false alarm pixels, and CBW-MSSANet has the fewest false alarm pixels. Overall, the visual effect of the CD binary map of CBW-MSSANet is significantly better than that of the four ablation contrast methods.

D. Experiment on Bay Area Dataset
The two HSIs in the Bay Area dataset were collected in 2013 and 2015. The sensor used is AVIRIS, with a spectral range of 0.4–2.5 µm, a spectral resolution of 10 nm, and a spatial resolution of 20 m. The image scene size of the Bay Area dataset is 600 × 500, which contains 224 bands for CD. Fig. 11 shows pseudocolor maps of the two images collected at different times and the ground-truth map. In Fig. 11(c), changed pixels are marked in white and unchanged pixels in black. The Bay Area dataset has a total of 73 481 labeled pixels, including 39 270 changed pixels and 34 211 unchanged pixels; the rest of the pixels in the dataset are unlabeled unknown pixels. In the experiment, we selected 2450 high-confidence changed pixels and 4900 high-confidence unchanged pixels as training samples by using the distance between each pixel and the KM centroid, accounting for 10% of the total number of pixels. Table VIII shows the CD results of all experimental methods on the Bay Area dataset. Among the two traditional CD methods and the four deep learning-based comparison methods, SSA-SiamNet has the lowest MDR and FAR, so its OA reaches 97.91%. Because the scene of this dataset is more complex, the detection performance of the traditional methods CVA and PCA-CVA is not ideal. All metrics of CBW-MSSANet are the best among all comparison methods, which shows that it performs well in detecting both changed and unchanged pixels in the Bay Area dataset. The robustness of CBW-MSSANet is also excellent in terms of standard deviation.
Among the four ablation contrast methods, CBW-SAM (5 × 5) has the lowest MDR and FAR, so its OA, kappa, and F1 scores are the highest. The MDR of CBW-SAM (3 × 3) is the highest among the ablation contrast methods, so its OA is only 94.94%, because CBW-SAM (3 × 3) contains less spatial information. The FAR of CBW-SAM (9 × 9) is the highest among the ablation contrast methods, so its OA is only 95.07%, which is caused by the interference of the more invalid spatial information that CBW-SAM (9 × 9) contains. MSSA has rich multilevel spatial information, and its unique scale-weighting process can dilute the weight of invalid spatial information, so all metrics of CBW-MSSANet are the best. The CD binary maps of the ablation experiments are shown in Fig. 12(g)-(k). In the top-right square area, CBW-SAM (5 × 5) and CBW-SAM (7 × 7) have a large number of undetected changed pixels, while CBW-SAM (3 × 3), CBW-SAM (9 × 9), and CBW-MSSANet have only a few changed pixels undetected. For the circular area in the top-left corner, all methods but CBW-MSSANet have false alarm pixels. For the rectangular region on the right, all ablation methods have a few changed pixels that are not detected. For the bottom oval area, CBW-SAM (3 × 3) and CBW-SAM (9 × 9) have a large number of false alarm pixels, CBW-SAM (5 × 5) and CBW-SAM (7 × 7) have a few false alarm pixels, and only CBW-MSSANet has excellent detection performance. In terms of overall visual effect, CBW-MSSANet significantly outperforms the four ablation contrast methods.

E. Pseudolabel Sample Selection Experiment for CD
To examine the effectiveness of the proposed pseudolabel training sample selection method based on the distance between pixels and KM cluster centroids, we design experiments and compare it with the traditional AD-based and ED-based methods. Since the steps of the entire pseudolabel training sample selection are SFBS, CVA, KM, and selection operations, we name the two comparison methods SFBS+CVA+KM+AD and SFBS+CVA+KM+ED, respectively. In this experiment, we use the MDR, FAR, and OA of the selected pseudolabel training samples as the evaluation system to comprehensively evaluate the confidence and reliability of the pseudolabel training samples selected by the three selection methods. Table IX shows the confidence performance of the three pseudolabel training sample selection methods on the four dual-temporal HSI datasets, where the number of training samples is set in accordance with Sections IV-A–IV-D, respectively. For the River and Farmland datasets, our proposed method has the best MDR, FAR, and OA. For the Hermiston dataset, SFBS+CVA+KM+ED achieves the lowest FAR, but its MDR is higher than that of our method, so our method slightly leads SFBS+CVA+KM+ED in OA. For the Bay Area dataset, SFBS+CVA+KM+AD achieves the lowest MDR, but its FAR is higher than that of our method, so our method leads SFBS+CVA+KM+AD in OA. From the global perspective of Table IX, our proposed pseudolabel training sample selection method has the highest confidence and reliability.
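The core of the selection rule can be sketched in a few lines. The sketch below is our own 1-D simplification: a minimal k-means (k = 2) runs on a scalar CVA change magnitude, and the pixels closest to each converged centroid are kept as the most reliable pseudolabeled samples. The function name, counts, and 1-D feature are illustrative, not the article's implementation:

```python
import numpy as np

def select_pseudolabels(magnitude, n_changed, n_unchanged, iters=20, seed=0):
    """Pick high-confidence pseudolabel samples by KM centroid distance.

    `magnitude` is a 1-D array of per-pixel CVA change magnitudes. Two
    cluster centers are fit with a minimal k-means; the pixels closest to
    each centroid are taken as the most reliable changed / unchanged
    training samples.
    """
    rng = np.random.default_rng(seed)
    c = rng.choice(magnitude, size=2, replace=False)  # initial centers
    for _ in range(iters):
        # assign each pixel to its nearest center, then update the centers
        lbl = np.abs(magnitude[:, None] - c[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(lbl == k):
                c[k] = magnitude[lbl == k].mean()
    changed_cl = int(np.argmax(c))        # larger magnitude => changed class
    d = np.abs(magnitude[:, None] - c[None, :])
    idx = np.arange(len(magnitude))
    ch = idx[lbl == changed_cl]
    un = idx[lbl != changed_cl]
    # keep only the samples nearest to their centroid (highest confidence)
    ch = ch[np.argsort(d[ch, changed_cl])][:n_changed]
    un = un[np.argsort(d[un, 1 - changed_cl])][:n_unchanged]
    return ch, un
```

Selecting by centroid distance, rather than by a fixed threshold on the magnitude itself, is what keeps low-confidence pixels near the decision boundary out of the training set.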

F. Experiment on Training Sample Ratios for CBW-MSSANet
To test how sensitive CBW-MSSANet is to the number of pseudolabeled training samples, we conduct experiments on CBW-MSSANet and its ablation contrast methods on the four real-HSI datasets using different percentages of training samples. The relationship between the percentage of pseudolabeled training samples and OA on the four HSI datasets is shown in Fig. 13. For River, when the percentage of pseudolabeled training samples is 5%, the curve of CBW-MSSANet reaches its peak and gradually stabilizes; however, when the percentage reaches 18%, the curve starts to decline slowly because the confidence of the pseudolabels decreases. For Farmland, the CBW-MSSANet curve rises steadily, reaching its highest point at around 10%. For Hermiston, the curve of CBW-MSSANet fluctuates and finally reaches an inflection point at 6%, which indicates that 6% of pseudolabel training samples allows CBW-MSSANet to obtain the best CD performance. For the Bay Area, the curve of CBW-MSSANet gradually rises until it flattens at around 10%. Furthermore, it can be seen from Fig. 13 that the CD performance of CBW-SAM (5 × 5) is the best among the four ablation contrast methods on all four real-HSI datasets.

G. Computational Cost Analysis
To compare the computational cost of CBW-MSSANet and other deep learning-based methods more intuitively, we fix and unify the hyperparameters and then record the training time, testing time, and total number of parameters of these methods. Table X shows the computational cost of all deep learning-based CD methods on the four HSI datasets, where the number of training samples is set in accordance with Sections IV-A–IV-D, respectively. The training time is related to the number of training samples, the number of bands of the HSI dataset, and the complexity of the model. The testing time is related to the total number of pixels in the HSI dataset, the number of bands, and the complexity of the model. The number of parameters is related only to the number of bands in the HSI dataset and the complexity of the model. In Table X, GETNET has the largest number of total parameters because it uses a mixed affinity matrix to mine cross-channel gradient information, which expands the amount of data. The number of parameters of CBW-MSSANet is slightly smaller than that of SSA-SiamNet, but its training and testing times are slightly longer than those of SSA-SiamNet, because CBW-MSSANet has more calculation steps and thus a slightly higher memory-access cost. Compared with the single-scale ablation methods, CBW-MSSANet has the longest training time, the longest testing time, and the largest number of parameters, because CBW-MSSANet integrates multiple scales. Overall, the computational cost of CBW-MSSANet is acceptable.
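For intuition about how the parameter counts in such a table arise, a single k × k convolution layer contributes (k·k·C_in + 1)·C_out weights when a bias is used, which is why the band count of the input HSI directly affects model size. The helper below is a generic tally; the three-layer stack on a 198-band input (the band count of the River dataset) is illustrative, not the exact CBW-MSSANet architecture:

```python
def conv2d_params(k, c_in, c_out, bias=True):
    """Parameter count of one k x k conv layer: (k*k*c_in + bias) * c_out."""
    return (k * k * c_in + (1 if bias else 0)) * c_out

def total_params(layers):
    """Sum parameter counts over a list of (k, c_in, c_out) conv layers."""
    return sum(conv2d_params(*layer) for layer in layers)

# Hypothetical stack on a 198-band HSI (band count as in the River dataset):
stack = [(3, 198, 64), (3, 64, 32), (1, 32, 2)]
print(total_params(stack))  # 132642
```

The first layer dominates the total here, which matches the observation that the number of parameters depends on the number of bands of the HSI dataset.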

H. Experiment Summary
Experiments were carried out on four HSI datasets: River, Farmland, Hermiston, and Bay Area. The two algebraic analysis methods are CVA and PCA-CVA. The four deep learning-based CD comparison methods are 2D-CNN, Diff-ResNet, GETNET, and SSA-SiamNet. To better test the performance of the MSSA module and the single-scale SAM in pixel-level HSI CD, we also design four contrastive ablation methods. Fig. 14 shows the comparison chart of the OA of CD for all experimental methods on the four datasets. On the River dataset, CBW-MSSANet performs the best, CVA performs the worst, and SSA-SiamNet performs second best, very close to and second only to CBW-MSSANet. On the Farmland dataset, CBW-MSSANet performs the best, while 2D-CNN and Diff-ResNet perform poorly; PCA-CVA, GETNET, and SSA-SiamNet perform relatively close to one another, slightly below CBW-MSSANet. On the Hermiston dataset, the best-performing method is still CBW-MSSANet, the worst is CVA, and PCA-CVA's performance is very close to that of SSA-SiamNet, slightly below CBW-MSSANet. On the Bay Area dataset, both CVA and PCA-CVA have poor detection performance. Among the ablation methods, CBW-SAM (3 × 3) contains less spatial information, while CBW-SAM (9 × 9) contains more invalid spatial information; MSSA can make up for the lack of spatial information of CBW-SAM (3 × 3) and can dilute the weight of invalid spatial information to effectively fuse multilevel spatial information. Experiments have shown that CBW-MSSANet is superior to most existing CD methods and that MSSA is more effective than the single-scale SAM in pixel-level CD.

V. CONCLUSION
In this article, we propose an HSI CD method named CBW-MSSANet, which is based on weakly supervised training. It does not rely on ground-truth labels and can save labor and time costs. To mine multilevel spatial information, we develop the MSSA module, which pays more attention to the characteristics of the pixels to be classified and combines the spatial information of adjacent pixels, making it more conducive to pixel-level CD. The MSSA module can make up for the lack of single-scale spatial information and can dilute the weight of invalid spatial information in the weighted reconstruction step to better integrate multilevel spatial information. In addition, we propose a new way to select pseudolabel training samples based on the KM centroid distance and design experiments to test the confidence of the selected pseudolabel training samples. To efficiently utilize the spectral information in dual-temporal HSIs, we apply the CBW module to exploit the statistics of the spectral information. The proposed method was evaluated on four real-HSI datasets. The experimental results show that CBW-MSSANet achieves the best OA, kappa, and F1, indicating that it outperforms most existing representative CD methods. Furthermore, CBW-MSSANet also has excellent robustness compared with all deep learning-based CD methods. To test the effectiveness of the MSSA module, we design four single-scale ablation contrast methods to more intuitively reflect the advantages of MSSA in pixel-level HSI CD tasks.