Spectral Token Guidance Transformer for Multisource Images Change Detection

With the development of Earth observation technology, increasing numbers of multisource remote sensing images are obtained from various satellite sensors, significantly enriching the data sources for change detection (CD). However, using multisource bitemporal images frequently introduces challenges in representing the various physical mechanisms of the observed landscapes and makes it more difficult to develop a general model that handles homogeneous and heterogeneous CD adaptively. In this article, we propose an adaptive spatial-spectral transformer CD network based on spectral token guidance, named STCD-Former. Specifically, a dual-branch spectral transformer first encodes the diverse spectral sequences in a spectral-wise manner to generate a corresponding spectral token. The spectral token is then used as guidance to interact with the patch token to learn the change rules. More significantly, to optimize the learning of difference information, we design a difference amplification module that highlights discriminative features by adaptively integrating the difference information into the feature embedding. Finally, the binary CD result is obtained by a multilayer perceptron. Experimental results on three homogeneous datasets and one heterogeneous dataset demonstrate that the proposed STCD-Former outperforms other state-of-the-art methods both quantitatively and visually.


I. INTRODUCTION
Change detection (CD) dynamically identifies changed areas by comparing multitemporal images captured over a fixed geographical area at different times [1], and has been utilized in many significant fields, e.g., urban building CD, dynamic forest CD, and natural disaster detection [2], [3], [4], [5], [6], [7]. With the advancement of remote sensing (RS), multisource RS images are frequently acquired, which has enriched the data sources for CD. According to the types of the given RS images, CD can be categorized into homogeneous CD and heterogeneous CD. Traditionally, the bitemporal images of homogeneous CD are captured by the same sensor over a specific area at different times, while for heterogeneous CD, the two input RS images are normally acquired by different sensors with possibly different resolutions, dynamic ranges, or noise. In practice, heterogeneous CD is more challenging than homogeneous CD, and works on developing general models for both homogeneous and heterogeneous CD have attracted extensive attention from researchers.
Generally speaking, CD is a fine-grained task that assigns a pixel-wise binary value from the bitemporal images: changed or unchanged. In the past decades, many traditional CD methods have been proposed, which can be divided into three categories: algebra-based methods, transformation-based methods, and classification-based methods. First, the algebra-based methods directly calculate the difference or ratio between multitemporal images to detect the changed area. As one of the most commonly applied algebra-based CD methods [8], change vector analysis detects change targets by calculating the change intensity and change direction of bitemporal images, but it still has difficulty selecting suitable thresholds. Second, the transformation-based methods convert the bitemporal images into a common feature space and utilize the resulting feature maps to generate the CD result. Principal component analysis (PCA) [9], slow feature analysis (SFA) [10], and multivariate alteration detection [11] are typical transformation-based CD models demonstrating high detection performance. These transformation-based methods are effective for high-dimensional spectral processing and noise reduction, but their prominent time consumption cannot be neglected when calculating the new components for large changed areas [12]. Third, the classification-based methods compare classified bitemporal images pixel by pixel to realize CD. Support vector machine (SVM) [13] and extreme learning machine are common models in this category. These methods are able to deal with heterogeneous images for CD, but they depend heavily on the performance of the classifier [14].
In recent years, deep learning (DL) methods have demonstrated superior performance in many computer vision (CV) tasks [15], such as classification [16], [17], [18], visual question answering [19], spectral super-resolution [20], and anomaly detection [21], [22]. Similarly, benefiting from the powerful representation capability of DL networks, many widely renowned backbones have been successfully applied to CD tasks [23]. For example, Du et al. [24] proposed a deep slow feature analysis network to learn nonlinear features with a deep neural network. Afterward, many well-known networks were introduced into CD tasks, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and graph convolutional networks (GCNs). Many DL-based methods have achieved good CD results; however, several limitations remain. First, CNNs are not skilled at capturing sequence information, especially long-term dependencies, which decreases CD performance when the change signal involves pronounced long-range relations. In addition, CNNs cannot accurately learn global information in the spatial dimension due to the fixed size of the receptive field. Although RNNs or GCNs can learn spectral sequence features from RS images, they usually suffer from vanishing gradients over long-term dependencies or perform only moderately when modeling spectral sequences [25].
Very recently, vision transformer (ViT) has been applied to CV tasks owing to its powerful ability to process sequence data, including image classification [26], [27], object detection [28], and spectral image reconstruction [29]. Inspired by the superior performance of ViT, Hong et al. [30] rethought hyperspectral image (HSI) classification from a spectral sequence perspective and proposed a transformer-based network named SpectralFormer. Thereafter, many ViT-based models were proposed for homogeneous CD and achieved acceptable performance. However, as the physical mechanisms of ground objects in multisource images are distinct, these DL- or ViT-based CD methods have not adequately or accurately modeled the diverse spectral sequences of multisource images in a spectral-wise manner, and this defect poses significant challenges for adaptive homogeneous and heterogeneous CD. To address these challenges, in this article, we propose a novel adaptive spatial-spectral transformer CD network that can accurately detect changed areas from either homogeneous or heterogeneous RS images. The proposed spatial-spectral CD transformer is essentially based on spectral token guidance and is thus named STCD-Former. Specifically, the STCD-Former mainly contains a spectral token transformer (ST-Former) and a spectral token guidance spatial transformer (STS-Former). The dual-branch ST-Former adaptively encodes the diverse spectral sequences in a spectral-wise manner to generate a corresponding spectral token as a representation of the physical mechanisms. The STS-Former is designed to efficiently learn the change rules by letting the spectral token interact with the patch token from the bitemporal images, from which the CD segmentation image is obtained.
The main contributions of the proposed STCD-Former can be summarized as follows:
1) This article presents an adaptive CD network architecture, STCD-Former, which can adaptively realize CD for both homogeneous and heterogeneous images. We evaluated STCD-Former on three homogeneous datasets and one challenging heterogeneous dataset. The experimental results show that STCD-Former achieves excellent performance on both homogeneous and heterogeneous data.
2) We design a flexibly structured ST-Former to adaptively encode the spectral signals of homogeneous or heterogeneous bitemporal images and generate a corresponding spectral token, which serves as guidance in the STS-Former for learning the change rules.
3) A difference amplification module (DAM) is embedded in the STS-Former to highlight discriminative features between the difference embedding and the feature embedding, which significantly optimizes the learning of difference information.

II. RELATED WORK
The related works are considered in three groups: DL-based CD methods, ViT-based methods, and adaptive CD network for both homogeneous and heterogeneous data.

A. DL-Based CD Methods
In the past few years, many DL networks and their variants have been applied to CD. As a typical branch of the DL framework, RNNs accumulatively learn spectral sequence features from spectral images in a band-ordered fashion. Lyu et al. [31] used an end-to-end RNN with an improved long short-term memory (LSTM) module to process the long-term information of changed targets. Considering that CNNs can extract spatial-spectral features well using convolutional kernels, Wang et al. [32] developed an end-to-end 2D-CNN network, named GETNET, to realize CD for HSIs. In addition, Qu et al. [33] developed a dual-branch difference amplification GCN (D2AGCN) with a graph structure to extract non-Euclidean information from bitemporal images and introduced a DAM to improve the learning of change information. Generally, most DL-based CD methods achieve good CD results by extracting local spatial-spectral features, but they struggle to acquire the long-term dependencies needed to further improve CD performance.

B. ViT-Based CD Methods
Recently, many ViT-based methods have been proposed to extract robust and discriminative features from RS images for CD tasks. Liu et al. [34] proposed a CNN-transformer network for efficient cropland CD, combining the merits of CNN and transformer to fuse multiscale context information. Zhang et al. [35] designed a pure transformer network named SwinSUNet with a Siamese U-shaped structure to realize CD. Wang et al. [36] proposed SST-Former to extract bitemporal spatial-spectral sequence features, where the extracted temporal sequence information is utilized to learn the change rules for generating CD results. From these related works, we observe that most ViT-based methods achieve good performance in homogeneous CD, but heterogeneous CD remains a challenge for them. Given the powerful ability of ViT in sequence data processing, it is natural to design an enhanced ViT-based network that models the physical mechanisms of multisource RS images by extracting their diverse spatial-spectral sequence information, thereby accurately fulfilling homogeneous and heterogeneous CD adaptively.
Fig. 1. Network structure of the proposed STCD-Former, in which the left part shows the CD process for homogeneous images and the right part shows the CD process for heterogeneous images.

C. Adaptive CD Network for Both Homogeneous and Heterogeneous Images
With the rapid development of remote sensors for Earth observation, many CD methods for homogeneous or heterogeneous data have been proposed [37], [38], but most current CD methods only consider CD with multitemporal homogeneous images. As the imaging mechanisms of heterogeneous images are distinct, it is difficult to learn a representation of the physical characteristics shared between heterogeneous images. It is even more challenging to realize homogeneous and heterogeneous CD adaptively with a single fixed network. Among the works on heterogeneous CD, Zheng et al. [39] proposed a difference learning method to achieve heterogeneous CD between bitemporal images of different resolutions. Chen et al. [40] proposed a deep Siamese convolutional multiple-layers recurrent neural network (SiamCRNN) to realize CD for both homogeneous and heterogeneous very-high-resolution images, where an LSTM module is utilized to model the temporal information between the bitemporal images.
Overall, the above-mentioned methods can achieve homogeneous and heterogeneous CD, but most of them do not consider modeling the physical mechanisms of multisource RS images. Extracting the diverse spectral information in a spectral-wise manner is significant for improving CD performance. Therefore, in this article, a novel adaptive CD network for both homogeneous and heterogeneous images is proposed, which adaptively models the diverse spectral sequences of multisource images to generate a corresponding spectral token as guidance and lets the spectral token interact with the patch token across the two branches to realize CD.

III. METHODOLOGY
The framework of our proposed STCD-Former is shown in Fig. 1. It can be elaborated from three aspects: the acquisition of spectral token in ST-Former, the change rule capture from the two branches in STS-Former with the differential information learning of DAM, and the CD result generation by predicting the differential token with multilayer perceptron (MLP). As shown in Fig. 1, we note that the bitemporal images are divided into many patch pairs before entering the CD network. Within the STCD-Former, the feature extraction module ST-Former first encodes the bitemporal patch pairs in the way of spectral-wise to generate the spectral token. Then, the STS-Former integrates the spectral token and the patch token to learn the change rule; meanwhile, the DAM is embedded into STS-Former to capture the difference information between the two temporal sequences. Finally, the MLP utilizes the differential token to predict the CD results.

A. Spectral Token Transformer
To adaptively encode the diverse spectral sequences of multisource images, we embed the ST-Former in our proposed STCD-Former to generate the spectral token in the two branches.
As CD is a pixel-level detection task, we first divide the bitemporal images into many patch pairs. For a patch x ∈ R^(c×p×p), p denotes the width and height in pixels, and c represents the number of channels. We reshape the patch into a matrix x_r ∈ R^(c×p^2) and use a linear projection to map the sequence data into a set of d-dimensional vectors, which can be formulated as

X = x_r ω_1

where ω_1 is the trainable weight matrix of size p^2 × d, and X ∈ R^(c×d) is the feature embedding of temporal-one.
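This projection can be sketched in a few lines of NumPy. The dimensions below (c = 4 channels, p = 7, d = 64) are illustrative values taken from the experimental settings; the random weight initialization is ours, not the authors':

```python
import numpy as np

rng = np.random.default_rng(0)

c, p, d = 4, 7, 64                   # channels, patch side, embedding dim
x = rng.standard_normal((c, p, p))   # one bitemporal patch

# Flatten each band's p x p window into a length-p^2 row vector.
x_flat = x.reshape(c, p * p)         # shape (c, p^2)

# Trainable linear projection omega_1 of size (p^2, d).
omega_1 = rng.standard_normal((p * p, d)) * 0.02

X = x_flat @ omega_1                 # (c, d): one d-dim token per spectral band
print(X.shape)                       # (4, 64)
```

The result is one d-dimensional token per spectral band, i.e., a length-c sequence ready for spectral-wise encoding.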
Considering that position embedding is important for capturing the order of an input sequence [41], we add a position embedding PE ∈ R^(c×d) to X, where PE is a randomly initialized matrix learned during model training. The updated embedding features can be expressed as

X_0 = X + PE.

After adding the position embedding, X_0 ∈ R^(c×d) is sent to the ST-Former. As shown in Fig. 2, the spectral transformer encoder consists of alternating layers of multihead self-attention (MHSA) and feed-forward (FF) blocks. MHSA is utilized to learn multiple dependencies from the spectral sequence, which can be described as

MHSA(X) = Concat(head_1, ..., head_h) W^O
head_i = softmax(Q_i K_i^T / √d) V_i

where h is the number of heads and W^O is a learnable transformation parameter. Here

Q_i = X_i W_i^q,  K_i = X_i W_i^k,  V_i = X_i W_i^v

denote the query, key, and value, respectively, where W_i^q, W_i^k, and W_i^v are three trainable matrix parameters and X_i represents the feature embedding of the ith layer. FF is also a key block in the transformer structure, which holds two linear transformations with a Gaussian error linear unit (GELU) activation. The FF can be formulated as

FF(X) = GELU(X W_1 + b_1) W_2 + b_2

where W_1, W_2 are parameter matrices and b_1, b_2 are biases, respectively. The features in the ST-Former are updated as

X'_L = LN(MHSA(X_{L-1}) + X_{L-1})
X_L = LN(FF(X'_L) + X'_L)

where LN is the layer norm and L ∈ {1, 2, 3} indexes the transformer encoders. In the end, the output of the last transformer encoder, X_3, is used to generate the spectral token ST ∈ R^(1×d) by

ST = Mean(X_3)

where Mean(·) computes the average value in a spectral-wise manner. Thus, the spectral token of one of the bitemporal images is obtained; the other spectral token is calculated from the other image in the same way.

Fig. 2. Illustration of the ST-Former and STS-Former. The ST-Former encodes the spectral sequence data in the spectral domain to generate the spectral token, while the STS-Former integrates the spectral token of the temporal-one features and the patch token of the other temporal features to obtain the change rule; the DAM is embedded to capture the differential information between the two temporal sequences.
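The encoder stack and the spectral-token pooling can be sketched as below. This is a minimal single-head, NumPy-only illustration under simplifying assumptions of ours (the paper uses 8 heads and GELU; ReLU stands in here), not the authors' implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(a):
    e = np.exp(a - a.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_layer(X, Wq, Wk, Wv, W1, W2):
    # single-head self-attention over the c spectral tokens
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(X.shape[-1])) @ V
    X = layer_norm(A + X)                     # residual + LN
    F = np.maximum(X @ W1, 0) @ W2            # FF block (ReLU in place of GELU)
    return layer_norm(F + X)                  # residual + LN

rng = np.random.default_rng(1)
c, d = 4, 64
X = rng.standard_normal((c, d))
Ws = [rng.standard_normal((d, d)) * 0.02 for _ in range(5)]
for _ in range(3):                            # L = 3 encoder layers
    X = encoder_layer(X, *Ws)

spectral_token = X.mean(axis=0, keepdims=True)  # ST = Mean over bands
print(spectral_token.shape)                     # (1, 64)
```

Running the same stack (with its own weights) on the other temporal image yields the second spectral token.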

B. Spatial Transformer Based on Spectral Token Guidance
Difference and concatenation are commonly used to learn the change rule for generating CD results. Different from those methods, which only compare the characteristics of the bitemporal images to realize CD, we design the STS-Former to learn the change rule by using the spectral token as guidance to interact with the patch token of the other branch. The STS-Former mainly contains cross-attention layers, DAM modules, and differential token modules. The key cross-attention layer includes a multihead cross-attention (MCA) module and an FF layer.
As shown in Fig. 2, we first rearrange the input patch and then utilize a linear projection to map the rearranged patch P into a set of d_t-dimensional vectors; the corresponding ST is likewise mapped into d_t dimensions. This can be formulated as

F_1 = P_1 ω_2,  ST^0_1 = ST_1 ω_3

where ω_2 and ω_3 are trainable weight matrices (the temporal-two quantities F_2 and ST^0_2 are obtained in the same way). We then exchange the spectral tokens and patch tokens between the two branches, so that F^0_1 and F^0_2 can be represented by

F^0_1 = [ST^0_2; F_1],  F^0_2 = [ST^0_1; F_2]

where ST^0_1, ST^0_2 and F_1, F_2 represent the spectral tokens and the patch tokens, respectively.
As shown in Fig. 3, we feed F^0_1 and F^0_2 into the MCA to interactively learn the change rule, which can be mathematically expressed as

MCA(F^0_1, F^0_2) = Concat(head_1, ..., head_r)
head_j = softmax(Q_1 K_2^T / √d_t) V_2
Q_1 = F^0_1 W^q,  K_2 = F^0_2 W^k,  V_2 = F^0_2 W^v

in which W^q, W^k, and W^v are three learnable parameter matrices and r is the number of heads in the MCA.

Fig. 4. Diagram of the DAM module. The dual-branch output of the MCA in the corresponding layer is fed into the DAM to calculate the differential matrix, which is then fused to obtain the differential information.
After the MCA calculation, we apply an FF layer to the output of each MCA layer. For the two branches, the general calculation is defined as

F'^l_i = LN(MCA(F^{l-1}_i, F^{l-1}_j) + F^{l-1}_i)
F^l_i = LN(FF(F'^l_i) + F'^l_i),  i, j ∈ {1, 2}, i ≠ j, l = 1, ..., z

where z is the number of MCA layers. The MCA can dynamically process homogeneous or heterogeneous images and learn the implicit change rules between the two branches.
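The cross-attention at the heart of the MCA can be sketched as a minimal single-head NumPy version (the paper uses r = 8 heads; the token count and d_t = 512 below are illustrative assumptions of ours):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cross_attention(F1, F2, Wq, Wk, Wv):
    """Queries come from one branch; keys and values from the other."""
    Q = F1 @ Wq
    K = F2 @ Wk
    V = F2 @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(2)
n, dt = 26, 512                 # e.g., 25 patch tokens + 1 exchanged spectral token
F1 = rng.standard_normal((n, dt))
F2 = rng.standard_normal((n, dt))
Wq, Wk, Wv = (rng.standard_normal((dt, dt)) * 0.02 for _ in range(3))

out1 = cross_attention(F1, F2, Wq, Wk, Wv)   # branch 1 attends to branch 2
out2 = cross_attention(F2, F1, Wq, Wk, Wv)   # and vice versa
print(out1.shape, out2.shape)
```

Swapping the roles of the two branches gives the symmetric update, which is what lets the two temporal sequences interact.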

C. DAM and CD Result Generation
To further improve the learning of difference information, we embed DAM modules after the MCA layers. The DAM adaptively integrates the difference information into the feature embedding to improve CD performance, which can be simply expressed as

D = (F^{z-1}_1 − F^{z-1}_2) ω_4

where ω_4 is a learnable network parameter for differential feature learning. As shown in Fig. 4, the dual-branch features of the (z − 1)th layer are input to calculate the differential matrix D. Then, each vector of the differential matrix is concatenated with the corresponding embedding features of one branch. Finally, the differential information and sequence information are fused by a 1 × 2 convolution kernel.
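Since only the outline of the DAM is given here, the following is a hedged NumPy sketch of the idea: compute a differential matrix from the two branches and fuse it back into each branch's embedding. A 1 × 2 convolution over each (feature, difference) pair reduces to a learned two-tap weighted sum per position, which is how it is modeled below (the fusion weights and identity ω_4 are illustrative, not trained values):

```python
import numpy as np

def dam(F1, F2, omega_4, w_fuse):
    """Difference amplification sketch: inject the inter-branch difference
    back into each branch via a 1x2 fusion kernel (a 2-tap weighted sum)."""
    D = (F1 - F2) @ omega_4          # differential matrix
    F1_new = w_fuse[0] * F1 + w_fuse[1] * D
    F2_new = w_fuse[0] * F2 + w_fuse[1] * D
    return F1_new, F2_new

rng = np.random.default_rng(3)
n, dt = 26, 64
F1, F2 = rng.standard_normal((2, n, dt))
omega_4 = np.eye(dt)                 # identity projection, for illustration only
F1_new, F2_new = dam(F1, F2, omega_4, w_fuse=(1.0, 0.5))
print(F1_new.shape)
```

With these toy weights, each branch keeps its own embedding and receives half of the amplified difference, which is the "highlight discriminative features" effect the module is after.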
After the last MCA and DAM layers, we add a differential token module to generate the final differential token t_d, and the CD result is obtained by an MLP head with one LN layer and one linear layer, which can be formulated as

y = LN(t_d) ω_6

where y ∈ R^(1×2) is the probability of being changed or unchanged, and ω_6 is the weight of the linear layer. If y(0, 1) < y(0, 0), the CD result of the patch pair is unchanged; otherwise, it is changed.
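The decision rule can be illustrated with a toy sketch; the 2-D token and the weights below are hand-picked for illustration, not trained values:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def predict(diff_token, omega_6):
    """LN + linear head -> two logits; index 0 = unchanged, index 1 = changed."""
    y = layer_norm(diff_token) @ omega_6
    # decision rule from the text: unchanged iff y(0, 1) < y(0, 0)
    return "unchanged" if y[0, 1] < y[0, 0] else "changed"

# toy 2-D differential token and hand-picked weights (illustrative only)
diff_token = np.array([[1.0, -1.0]])
omega_6 = np.array([[0.0, 1.0],
                    [1.0, 0.0]])
result = predict(diff_token, omega_6)
print(result)   # changed
```

Applying this head to every patch pair yields the binary change map.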

IV. EXPERIMENT IN THE HOMOGENEOUS IMAGES
In the homogeneous CD experiments, we evaluate the performance of STCD-Former on three homogeneous datasets. First, three well-known HSI datasets are described in detail. Second, we provide a description of the experimental settings, including the evaluation criteria, comparative experiments, and hyperparameters of STCD-Former. Third, we compare the performance of our proposed method against other state-of-the-art methods for homogeneous CD. Finally, we conduct an ablation analysis of the ST-Former, STS-Former, and DAM modules.

A. Experimental Datasets
The first dataset used in the experiment is the Farmland dataset, which was acquired by the Earth Observing mission. The CD ground truth of the three above-mentioned datasets is shown in Figs. 8(f), 9(f), and 10(f), respectively. Within these ground-truth CD results, the white, black, and gray pixels represent the changed region, unchanged region, and undetermined region, respectively.
For the purpose of obtaining better CD performance with limited samples, we selected one percent of the changed pixels and of the unchanged pixels to train STCD-Former. The details are listed in Table I.
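This per-class 1% sampling can be sketched in pure Python; `labels` below is a hypothetical flattened ground-truth mask (1 = changed, 0 = unchanged), not the actual dataset:

```python
import random

def sample_training_pixels(labels, fraction=0.01, seed=0):
    """Pick `fraction` of the changed and of the unchanged pixels (by index),
    so the training set keeps both classes represented."""
    rng = random.Random(seed)
    changed = [i for i, v in enumerate(labels) if v == 1]
    unchanged = [i for i, v in enumerate(labels) if v == 0]
    train = (rng.sample(changed, max(1, int(len(changed) * fraction)))
             + rng.sample(unchanged, max(1, int(len(unchanged) * fraction))))
    return sorted(train)

labels = [1] * 300 + [0] * 700        # toy mask: 300 changed, 700 unchanged pixels
train_idx = sample_training_pixels(labels)
print(len(train_idx))                 # 3 changed + 7 unchanged = 10
```

Sampling per class rather than globally keeps the rare changed class from being drowned out in the tiny training set.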

B. Experimental Setup
We evaluate the CD performance of each method quantitatively in terms of two commonly used indices, the overall accuracy (OA) and the kappa coefficient:

OA = (TP + TN) / (TP + TN + FP + FN)
Kappa = (OA − PRE) / (1 − PRE)
PRE = [(TP + FP)(TP + FN) + (TN + FN)(TN + FP)] / (TP + TN + FP + FN)^2

where true positive (TP) is the number of correctly detected changed pixels, true negative (TN) is the number of correctly detected unchanged pixels, false positive (FP) is the number of unchanged pixels incorrectly detected as changed, and false negative (FN) is the number of changed pixels incorrectly detected as unchanged.
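Both indices follow directly from the binary confusion-matrix counts; the kappa below uses the standard chance-agreement form:

```python
def oa_kappa(tp, tn, fp, fn):
    """Overall accuracy and Cohen's kappa from binary confusion-matrix counts."""
    n = tp + tn + fp + fn
    oa = (tp + tn) / n
    # expected chance agreement: sum over classes of marginal-proportion products
    pre = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (oa - pre) / (1 - pre)
    return oa, kappa

oa, kappa = oa_kappa(tp=40, tn=40, fp=10, fn=10)
print(round(oa, 3), round(kappa, 3))   # 0.8 0.6
```

Kappa discounts the agreement expected by chance, which is why it is more informative than OA when the changed class is rare.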
To verify the superiority of STCD-Former, we choose four representative methods for comparison: SVM, the multiscale 3D deep convolutional neural network (M3D-DCNN) [42], Re3FCN [43], and SST-Former [36]. For fairness, we use the same training samples to train all of these networks, and the patch size is 7 × 7. The other details of the comparison experiments are as follows.
When applying SVM for CD, we first used PCA to extract the principal components of the difference map and then used SVM to detect changes, implemented with the sklearn.svm module in Python.

TABLE I
NUMBERS OF PIXEL PAIRS IN TRAINING AND TESTING SETS FOR THE THREE DATASETS

TABLE II
ACCURACY COMPARISON OF EXCELLENT HOMOGENEOUS CD METHODS ON THREE DATASETS
The M3D-DCNN with ten convolution layers and one fully connected layer utilized the multiscale 3D convolution block to realize classification. We input the difference map directly to M3D-DCNN for binary classification. The other detailed parameters are the same as the default settings.
Re3FCN consists of two convolutional layers and recurrent layers with LSTM module to extract spatial-temporal information for CD. The detailed setting can be found in their article.
Within the transformer-based method, SST-Former can extract the spatial-spectral-temporal sequence information for generating CD results. The other parameters are the same as described in the work [36].
Our proposed STCD-Former was implemented in PyTorch 1.10.1 on an Intel Xeon Silver 4210R CPU and an NVIDIA GeForce RTX 3090 24-GB GPU. The learning rate of the Adam optimizer was set to 0.0005, and we used torch.optim.lr_scheduler.ExponentialLR to update the learning rate by multiplying it by 0.997 after each epoch. We set the number of epochs to 200 and used Sigmoid as the activation function. The loss function was the weighted binary cross entropy, with the weight calculation introduced from [44], which we used to alleviate the imbalanced-sample problem. The patch size and batch size were set to 7 × 7 and 64, respectively. The dimensions d and d_t were set to 64 and 512. The number of heads in the ST-Former and STS-Former is 8.
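The learning-rate schedule amounts to exponential decay; the pure-Python mirror below reproduces what ExponentialLR with gamma = 0.997 does to a base rate of 0.0005, without requiring PyTorch:

```python
def lr_schedule(lr0=0.0005, gamma=0.997, epochs=200):
    """Pure-Python mirror of torch.optim.lr_scheduler.ExponentialLR:
    after each epoch, the learning rate is multiplied by gamma."""
    lrs = [lr0]
    for _ in range(epochs):
        lrs.append(lrs[-1] * gamma)
    return lrs

lrs = lr_schedule()
print(f"epoch 1: {lrs[1]:.7f}, epoch 200: {lrs[200]:.7f}")
```

Over 200 epochs, the rate decays to roughly 55% of its initial value (0.997^200 ≈ 0.548), a gentle decay suited to the small training set.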

C. Comparison With Other Methods
To evaluate the performance of our proposed method, the four excellent methods above are compared with STCD-Former both quantitatively and visually.
All these methods were tested five times on each dataset, and the quantitative results for the three datasets are shown in Table II. Figs. 8-10 show the corresponding visualization results.

1) Analysis for the Farmland Dataset: As shown in Table II, the traditional machine learning SVM algorithm achieved the lowest OA and Kappa, and Fig. 8(a) also shows that the visualization result of SVM was very unsatisfactory. The OA and Kappa of the CNN-based methods, M3D-DCNN and Re3FCN, were higher than those of SVM, but their results were still not very accurate, as shown in Fig. 8(b) and (c). The transformer-based methods, SST-Former and STCD-Former, achieved higher accuracy in the experiment. The results of our proposed STCD-Former were similar to but slightly higher than those of SST-Former. As depicted in Fig. 8(d) and (e), the visualization result of SST-Former introduced a small area of false detection in the upper right of the image, while some small changed areas were mixed with the unchanged areas in the visualization result of STCD-Former. In the red box of Fig. 8(e), affected by noise, there were still some small false detections in the unchanged area.

2) Analysis for the Santa Barbara Dataset: The lowest OA and Kappa were again produced by SVM, but they were only 2% lower than those of the CNN-based methods. As shown in Fig. 9(b) and (c), M3D-DCNN obtained better visual effects than Re3FCN in the lower part of the image and achieved the third-highest OA and Kappa. Between the two transformer-based methods, the Kappa value of STCD-Former was 1% higher than that of SST-Former, and STCD-Former also obtained the best visual results.

3) Analysis for the Bay Area Dataset: In Table II, SVM similarly had low accuracy. Re3FCN performed approximately 2% better than SVM. Different from the Farmland and Santa Barbara datasets, the CNN-based method M3D-DCNN showed a better result than SST-Former.
By contrast, STCD-Former still achieved the best CD performance among all the compared methods quantitatively and qualitatively. However, in the red box of Fig. 10(e), there was local misdetection in the unchanged area.

D. Ablation Analysis
Our proposed STCD-Former mainly consists of two parts, the ST-Former and the STS-Former, together with one key module, the DAM. To validate the effectiveness of these components, we carried out an ablation analysis.
For the ablation analysis of the ST-Former, we use the dual-branch ST-Former to generate the corresponding spectral tokens and apply an MLP head to predict the difference result. For the STS-Former, we randomly generate a class token to replace the spectral token. From the results in Table III, the Kappa values of ST-Former and STS-Former alone indicate that encoding the image only in the spectral or only in the spatial dimension is unsatisfactory. To validate the effectiveness of the DAM module, we directly remove the DAM modules from the STS-Former, since the DAM module is plug-and-play. As shown in Table III, the Kappa of ST-Former + STS-Former (without DAM) is higher than that of ST-Former or STS-Former alone, which demonstrates that the joint spatial-spectral sequence is helpful to CD performance. Moreover, the Kappa of STCD-Former is higher than that of ST-Former + STS-Former (without DAM), especially on Santa Barbara and Bay Area, which confirms that the DAM modules can effectively improve the learning of change rules.

E. Effect of Parameters
In our proposed STCD-Former, several hyper-parameters, e.g., the proportion of training samples, number of transformer encoders, patch size, and batch size, can affect the training process and CD performance. Thus, we investigated the influence of these hyperparameters in this section. In this process, we fixed the other parameters when analyzing the impact of one certain parameter.

2) The number of transformer encoders (spectral): We experimented with the number of spectral transformer encoders over four values {2, 3, 4, 5}. As shown in Fig. 11(b), we found that 3 spectral transformer encoders were the most beneficial to the CD performance.
3) The number of MCA and DAM: In this experiment, we chose five different numbers of MCA and DAM layers to find the optimum, as shown in Fig. 11(c).
5) The dimension of MCA: As shown in Fig. 11(e), we chose 256 as the dimension of the MCA.
6) Batch size: To evaluate the influence of the batch size on STCD-Former, a set of batch sizes {16, 32, 64, 128} was considered. As shown in Fig. 11(f), we found that the Kappa values corresponding to the four different batch sizes were similar; thus, 64 was set as the batch size.

7) Patch size: We examined four patch sizes {3×3, 5×5, 7×7, 9×9} to analyze the influence of the patch size. As shown in Fig. 11(g), the patch size of 7×7 achieved almost the best performance; thus, 7×7 was set as the input patch size.

V. EXPERIMENT IN THE HETEROGENEOUS IMAGES
Similar to the homogeneous CD experiments, we evaluate the performance of STCD-Former for heterogeneous CD in this section. First, we describe the heterogeneous dataset in detail. Second, we introduce the experimental setup, including the comparative methods and hyperparameters. Third, we compare the performance of STCD-Former against other excellent methods for heterogeneous CD. Finally, we conduct an ablation analysis of the ST-Former, STS-Former, and DAM modules, and also analyze the parameters of STCD-Former in heterogeneous CD.

A. Experimental Datasets
In this experiment, one challenging heterogeneous dataset of multispectral images (MSIs) is selected for evaluation. This heterogeneous dataset is named Bastrop, in which the pre-event image was sensed by Landsat-5 on August 26, 2011, and the postevent image was sensed by the Advanced Land Imager of the Earth Observing mission (EO-1 ALI) on September 12, 2011. Notably, the spectral ranges of the bitemporal images differ: the pre-event image has 7 bands covering 0.45-2.35 and 10.40-12.50 μm, while the postevent image has 10 bands covering 0.4-2.4 μm. The Bastrop dataset is mainly used to observe the disasters caused by forest fires in Bastrop County, Texas (USA). The pseudo-color images are shown in Fig. 12. The size of the bitemporal images is 1534 × 808 pixels with 30-m spatial resolution.

B. Experimental Setup
Because heterogeneous MSIs usually cover different spectral ranges, it is necessary to adapt the structure of the ST-Former to generate the corresponding spectral token for each image type. We used two independent, weight-unshared ST-Formers for heterogeneous CD. The number of (spectral) transformer encoders for the temporal-one image is 3, and the number of encoders for the other temporal image is set to 5. As the postevent image has ten bands with richer spectral information in the spectral domain, we set a deeper stack to further extract its physical-mechanism information.
Except for the encoder numbers, the other hyperparameters are the same in the two branches. The number of MCA and DAM layers is 6. The dimensions d and d_t are set to 32 and 64, respectively. We also used Sigmoid as the activation function, and the loss function is again the weighted binary cross entropy. The learning rate of the Adam optimizer was set to 0.0005, and we used torch.optim.lr_scheduler.ExponentialLR to update the learning rate by multiplying it by 0.998 after each epoch. The patch size is 5 × 5, the total number of epochs was 200, and the batch size was set to 64. The number of heads in the ST-Former and STS-Former is 8. Specifically, we chose only 6197 image patches (0.5% of all image patches) to train our network for heterogeneous CD; the details are listed in Table IV. As in the homogeneous experiments, OA and Kappa are adopted as the evaluation criteria.
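The asymmetric dual branch can be sketched as two independently initialized encoder stacks of different depth. This is a simplified NumPy illustration of the weight-unshared design (layer parameters reduced to the attention matrices only; d = 32 from the settings above):

```python
import numpy as np

def make_encoder_stack(depth, d, seed):
    """Independent (weight-unshared) stack of `depth` encoder layers,
    each holding its own attention projection matrices."""
    rng = np.random.default_rng(seed)
    return [{"Wq": rng.standard_normal((d, d)) * 0.02,
             "Wk": rng.standard_normal((d, d)) * 0.02,
             "Wv": rng.standard_normal((d, d)) * 0.02}
            for _ in range(depth)]

d = 32
pre_branch = make_encoder_stack(depth=3, d=d, seed=10)   # 7-band pre-event image
post_branch = make_encoder_stack(depth=5, d=d, seed=11)  # 10-band postevent image
print(len(pre_branch), len(post_branch))                 # 3 5
```

Because the two stacks never share parameters, each branch is free to specialize to its own sensor's spectral statistics, which is the point of the weight-unshared design.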
For the purpose of demonstrating the superiority of STCD-Former, we choose five representative methods for comparison: SVM, M3D-DCNN [42], ViT-spectral, ViT-spatial, and SST-Former [36]. We likewise used 0.5% of the samples to train these networks, and the patch size is 5 × 5. The other details of the comparison experiments are listed as follows.
When utilizing SVM for heterogeneous CD, we employed PCA to extract the principal components of the map concatenated from the two heterogeneous images. Then, SVM was used to distinguish the changed samples from the unchanged samples.
In the heterogeneous CD experiment of M3D-DCNN, we input the concatenated map directly to M3D-DCNN to realize binary classification. The detailed parameters are the same as the original settings in their article.
For ViT-spectral, we adopt a spectral-wise dual-branch ViT structure and concatenate the class tokens for heterogeneous CD. Similarly, we convert the spectral encoding of ViT-spectral to spatial encoding to obtain ViT-spatial for heterogeneous CD.
For SST-Former in heterogeneous CD, we randomly select 7 bands from the post-event image, and input the pre-event image

C. Comparison With Other Methods
All the compared methods were tested five times, and the quantitative results are shown in Table V. Fig. 13(a)-(f) shows the corresponding visualization results.
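The two criteria reported in Table V can be computed as below; the helper `oa_kappa` and the toy arrays are our own illustration, not the authors' evaluation code.

```python
import numpy as np

# Overall accuracy (OA) and Cohen's kappa for a binary change map against
# the ground truth; arrays here are illustrative.
def oa_kappa(pred, gt):
    pred, gt = np.asarray(pred).ravel(), np.asarray(gt).ravel()
    oa = np.mean(pred == gt)  # overall accuracy: fraction of agreeing pixels
    # expected chance agreement from the marginal class frequencies
    pe = sum(np.mean(pred == c) * np.mean(gt == c) for c in (0, 1))
    kappa = (oa - pe) / (1 - pe)  # agreement corrected for chance
    return oa, kappa

oa, kappa = oa_kappa([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
```

Kappa discounts the agreement that two maps would show by chance, which is why it separates methods more sharply than OA when the unchanged class dominates.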
As shown in Table V, the traditional SVM algorithm still achieved the lowest OA and Kappa in heterogeneous CD, and Fig. 13(a) shows that SVM was not good at dealing with the prominent noise in the image. The CNN-based method, M3D-DCNN, obtained better OA and Kappa than SVM, but its visualization result was also unsatisfactory within the noisy area of the image. The transformer-based methods, including ViT-spectral, ViT-spatial, SST-Former, and our proposed STCD-Former, achieved excellent performance in the experiment. The results of ViT-spectral were similar to those of ViT-spatial but with slightly higher OA and Kappa values. Our proposed STCD-Former performed best in both OA and Kappa, with OA about 0.4% higher and Kappa about 2% higher than the second-best method. In Fig. 13(e), the visualization result of SST-Former shows some false detections in the upper part of the image, and SST-Former was also affected by image noise. Generally, from the quantitative and visual results, the proposed STCD-Former achieved outstanding CD performance. From the result in Fig. 13(f), we found that it accurately detected most changed areas and was hardly affected by image noise in unchanged areas.

D. Ablation Analysis
We verified the influence of each key module (including the ST-Former, STS-Former, and DAM modules) on STCD-Former for heterogeneous CD. The ablation experiments in heterogeneous CD were similar to those in homogeneous CD. As shown in Table VI, the ST-Former alone already achieved good performance, while the STS-Former using a random class token produced an unsatisfactory CD result; the Kappa of ST-Former is about 1% higher than that of STS-Former. From the two STCD-Former networks, it can be seen that the DAM module provides a small improvement on the CD results, with Kappa rising by only about 0.5% when the DAM module is applied.

E. Effect of Parameters
Similar to the experiment for homogeneous CD, we also verify the influence of several parameters on heterogeneous CD. We fix the other parameters when analyzing the impact of a certain parameter. The Kappa results of the parameter experiments on the Bastrop dataset are shown in Fig. 14. 1) The proportion of training samples: The proportions {0.1%, 0.5%, 1%, 2%, 5%, 7%, 10%} were selected in the heterogeneous CD experiment. As shown in Fig. 14(a), the more training samples we input, the higher the accuracy we obtain.
2) The number of encoders in the pre-event branch: We tested the number of encoders in the pre-event branch with five values {3, 4, 5, 6, 7}. In Fig. 14(b), we found that 3 spectral encoders can already realize efficient generation of the spectral token. From Fig. 14(h), we can see that the patch size of 5 × 5 achieves the best performance, and 3 × 3 differs only slightly in Kappa.

VI. DISCUSSION
Adaptive CD for both homogeneous and heterogeneous data is an important but arduous and challenging task. In particular, as the imaging physical mechanisms of multisource images are different, it is hard to directly detect changes in the same geographic area. In this article, a novel spatial-spectral transformer based on spectral token guidance (STCD-Former) is proposed to address this challenge. Experiments are carried out on three commonly used homogeneous datasets and one challenging heterogeneous dataset. The results show that the proposed framework achieves state-of-the-art performance in both homogeneous and heterogeneous CD. However, some key issues still need to be discussed for the further application of STCD-Former.
1) Further research on heterogeneous CD is necessary. Although the heterogeneous dataset used in the experiment is obtained by different sensors with different spectral ranges, the bitemporal images are still multispectral images. We will extend our method to a wider variety of heterogeneous images in future work. 2) Manual data labeling is required. Although STCD-Former achieves good performance in homogeneous and heterogeneous CD, it still requires ground truth labeled manually by experienced researchers, which is time-consuming. Unsupervised and self-supervised learning are current research hotspots, and we will study new training strategies with no labeled or few labeled datasets in our future work.

VII. CONCLUSION
As the imaging mechanisms of heterogeneous images are different, the spectral signals of multitemporal images of the same geographical area are different. It is therefore significant to adaptively model the diverse spectral information for heterogeneous CD. In this article, we proposed an effective end-to-end network called STCD-Former to adaptively realize CD for both homogeneous and heterogeneous images. STCD-Former mainly contains the ST-Former, STS-Former, and DAM modules. The ST-Former adaptively models the physical mechanisms as spectral sequence features and generates the corresponding spectral token. The STS-Former effectively learns the change rules through the guidance of the spectral token and the differential amplification of the DAM. In the experiments, we employed very limited training data to validate the proposed STCD-Former on three homogeneous datasets and one challenging heterogeneous dataset. The experimental results demonstrate that STCD-Former outperforms the other state-of-the-art methods quantitatively and visually. Although STCD-Former achieves state-of-the-art performance with very limited training samples, it still requires ground truth labeled manually by experienced researchers. Therefore, we will study training strategies with no labeled or few labeled datasets. At the same time, we observe that adaptively realizing CD with diverse bitemporal RS images remains a challenging task worth studying in the future.